Hello, I encountered this issue today. This looks like a pretty complete solution ready to go. Did this ever make it into Guile/Guix?
(Ironically I was also reaching for a "make-custom-textual-output-port" the other day!) -- Sarah Mark H Weaver <m...@netris.org> writes: > Hi Ludovic, > > Ludovic Courtès <l...@gnu.org> writes: > >> Mark H Weaver <m...@netris.org> skribis: >> >>> Ludovic Courtès <l...@gnu.org> writes: >> >> [...] >> >>>> So there are two things. To fix the issue you reported (build output >>>> that goes through), I think we must simply turn off UTF-8 decoding from >>>> ‘process-stderr’ and leave that entirely to ‘build-event-output-port’. >>> >>> Can we assume that UTF-8 is the appropriate encoding for >>> (current-build-output-port)? My interpretation of the Guix manual entry >>> for 'current-build-output-port' suggests that the answer should be "no". >> >> What goes to ‘current-build-output-port’ comes from builds processes. >> It’s usually UTF-8 but it can be anything, including binary garbage, >> which should be gracefully handled. >> >> That’s why ‘process-stderr’ currently uses ‘read-maybe-utf8-string’. > > I agree that we should (permissively) interpret the build process output > as UTF-8, regardless of locale settings. However, the encoding of > 'current-build-output-port' is orthogonal, and I see no reason to assume > that it's UTF-8. > > As 'process-stderr' is currently implemented, it makes no assumptions > about the encoding of 'current-build-output-port'. That's because it > uses only textual I/O on it. The end result is that the UTF-8 build > output is effectively converted into the port encoding of > 'current-build-output-port', whatever it might be. I think that's how > it should be, no? > >>> Also, in your previous message you wrote: >>> >>> The problem is the first layer of UTF-8 decoding that happens in >>> ‘process-stderr’, in the ‘%stderr-next’ case. We would need to >>> disable it, but only if the build output port is >>> ‘build-event-output-port’ (i.e., it’s capable of interpreting >>> “multiplexed build output” correctly.) >>> >>> It sounds like you're suggesting that 'process-stderr' should look to >>> see if (current-build-output-port) is a 'build-event-output-port', and >>> in that case it should use binary I/O primitives to write raw binary >>> data to it, otherwise it should use text I/O primitives and write >>> characters to it. Do I understand correctly? >> >> Yes. (Actually, rather than guessing if (current-build-output-port) is >> a ‘build-event-output-port’, there could be a fluid to ask for the use >> of raw binary primitives.) >> >>> IMO, it would be cleaner to treat 'build-event-output-port' uniformly, >>> and specifically as a textual port of unknown encoding. >> >> (You mean ‘current-build-output-port’, right?) > > Yes, indeed. > >> I think you’re right. I’m not yet entirely sure what the implications >> are. There’s a couple of tests in tests/store.scm for UTF-8 >> interpretation that describe behavior that I think we should preserve. > > I certainly agree that we should preserve those tests. I would go > further and add two more tests that bind 'current-build-output-port' to > a port with a non-UTF-8 encoding (e.g. UTF-16) and verify that the λ > gets converted correctly. The test build process would output the λ as > UTF-8, but it should be written to 'current-build-output-port' as > e.g. UTF-16. > > What do you think? > >>> I would suggest changing 'build-event-output-port' to create an R6RS >>> custom *textual* output port, so that it wouldn't have to worry about >>> encodings at all, and it would only be given whole characters. >>> Internally, it would be doing exactly what you suggest above, but those >>> details would be encapsulated within the custom textual port. >>> >>> However, I don't think we can use Guile's current implementation of R6RS >>> custom textual output ports, which are currently built on Guile's legacy >>> soft ports, which I suspect have a similar bug with multibyte characters >>> sometimes being split (see 'soft_port_write' in vports.c). >>> >>> Having said all of this, my suggestions would ultimately entail having >>> two separate places along the stderr pipeline where 'utf8->string!' >>> would be used, and maybe that's too much until we have a more optimized >>> C implementation of it. >> >> Yeah it looks like we don’t yet have custom textual output ports that we >> could rely on, do we? >> >> I support your work to add that in Guile proper! > > For now, I can offer a new implementation of custom textual output ports > built upon custom binary ports and the 'utf8->string!' that I previously > sent. See attached. > > Thanks, > Mark > > GNU Guile 2.2.4 > Copyright (C) 1995-2017 Free Software Foundation, Inc. > > Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'. > This program is free software, and you are welcome to redistribute it > under certain conditions; type `,show c' for details. > > Enter `,help' for help. > scheme@(guile-user)> (load "utf8-decoder.scm") > scheme@(guile-user)> (load "guile-new-custom-textual-ports.scm") > scheme@(guile-user)> (define (my-write! str start count) > (pk 'my-write! (substring str start (+ start count))) > count) > scheme@(guile-user)> (define port (make-custom-textual-output-port "test1" > my-write! #f #f #f)) > scheme@(guile-user)> (display "Hello λ world!" port) > scheme@(guile-user)> (force-output port) > > ;;; (my-write! "Hello λ world!") > scheme@(guile-user)> (string->utf8 "λ") > $2 = #vu8(206 187) > scheme@(guile-user)> (string->utf8 "Hello λ world!") > $3 = #vu8(72 101 108 108 111 32 206 187 32 119 111 114 108 100 33) > scheme@(guile-user)> (put-bytevector port #vu8(72 101 108 108 111 32 206)) > scheme@(guile-user)> (force-output port) > > ;;; (my-write! "Hello ") > scheme@(guile-user)> (put-bytevector port #vu8(187 32 119 111 114 108 100 33)) > scheme@(guile-user)> (force-output port) > > ;;; (my-write! "λ world!") > scheme@(guile-user)> > > ;;; Copyright © 2019 Mark H Weaver <m...@netris.org> > ;;; > ;;; This program is free software: you can redistribute it and/or modify > ;;; it under the terms of the GNU General Public License as published by > ;;; the Free Software Foundation, either version 3 of the License, or > ;;; (at your option) any later version. > ;;; > ;;; This program is distributed in the hope that it will be useful, > ;;; but WITHOUT ANY WARRANTY; without even the implied warranty of > ;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > ;;; GNU General Public License for more details. > ;;; > ;;; You should have received a copy of the GNU General Public License > ;;; along with this program. If not, see <http://www.gnu.org/licenses/>. > > (use-modules (rnrs io ports)) > > (define (make-custom-textual-output-port id > write! > get-position > set-position! > close) > (let (;; Allocate a per-port string buffer which will be used as a > ;; temporary buffer for decoding, to avoid heap allocation > ;; during normal operation. > (buffer (make-string 4096)) > ;; 'state' is the UTF-8 decoder state, which represents a > ;; proper prefix of a well-formed UTF-8 byte sequence. These > ;; are bytes that 'binary-write!' has accepted and reported as > ;; having been written, although we are not able to decode > ;; them into a character to pass to (textual) 'write!' until > ;; more bytes arrive. > (state 0)) > (define (binary-write! bv start count) > (call-with-values (lambda () > ;; XXX FIXME: Consider performing this > ;; decoding strictly. > (utf8->string! state bv start (+ start count) > buffer 0 (string-length buffer))) > (lambda (new-state bv-pos char-count) > (let* (;; Avoid calling write! with (char-count = 0) unless > ;; (count = 0) was passed to us, because calling > ;; 'write!' with count=0 has a special meaning: it > ;; means to pass an EOF object to the byte/character > ;; sink. > (chars-accepted (if (and (zero? char-count) > (not (zero? count))) > 0 > (write! buffer 0 char-count))) > ;; Compute 'bytes-accepted' in such a way that the > ;; bytes from STATE are not included, because they > ;; were passed to us in previous calls, and are not > ;; part of the bytevector range that we are now being > ;; asked to write. However, it's important to note > ;; that if 'write!' did not accept the bytes from > ;; STATE, 'bytes-accepted' will be negative. We must > ;; handle that case specially below. > (bytes-accepted (- count (string-utf8-length > (substring buffer > chars-accepted > char-count))))) > ;; If 'bytes-accepted' is negative, that means the bytes > ;; from STATE were not written. This can only happen if > ;; 'chars-accepted' is 0, because 'write!' can only accept > ;; whole code points, and the bytes from STATE are part of > ;; at most a single code point. In this case, we must > ;; leave STATE unchanged and return 0. > (if (negative? bytes-accepted) > 0 > (begin > (set! state new-state) > bytes-accepted)))))) > (define (binary-close) > (set! buffer #f) > (when close (close))) > (define port > (make-custom-binary-output-port id > binary-write! > get-position > set-position! > binary-close)) > ;; Always use UTF-8 as the encoding for custom textual ports, as > ;; an internal implementation detail, to ensure that all Unicode > ;; characters will pass through regardless of the current locale. > (set-port-encoding! port "UTF-8") > port))