l...@gnu.org (Ludovic Courtès) writes:

> This has been addressed in two ways:

No, it hasn't.

>   1. In 2.0, (srfi srfi-6) uses Unicode-capable string ports (commit
>      ecb48dc.)

This issue report is not about adding more optional functionality on
top.  It is about _removing_ unwarranted redirection and complication
from existing core functionality.

The artifacts of making with-input-from-string and with-output-to-string
go through an additional character->bytevector->character
encoding/recoding layer are not invisible.

>   2. In 2.2, string ports are always Unicode-capable, and
>      ‘%default-port-encoding’ is ignored (commit 6dce942.)

String ports should not be "Unicode capable" but transparent.
Characters in, characters out.  ftell/fseek should be based on character
position in strings rather than offsets in a magically created
bytestream of some particular encoding.

> So for 2.0, the workaround is to either use (srfi srfi-6), or force
> ‘%default-port-encoding’ to "UTF-8".

And forcing the encoding is _all_ that (srfi srfi-6) does.  The port
still interprets set-port-encoding! in terms of a byte stream, and it
still calculates positions in terms of that byte stream rather than in
terms of string positions:

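(use-modules (srfi srfi-6)
             (ice-9 format))   ; ~d needs the full format, not simple-format

;; Four characters taking 1, 2, 2 and 3 bytes in UTF-8.
(define s (list->string (map integer->char '(20 200 2000 20000))))
(let ((port (open-input-string s)))
  (let loop ((ch (read-char port)))
    (if (not (eof-object? ch))
        (begin
          ;; Print each code point with the port position after reading it.
          (format #t "~d, pos=~d\n" (char->integer ch) (ftell port))
          (loop (read-char port))))))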

20, pos=1
200, pos=3
2000, pos=5
20000, pos=8
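
These positions match the cumulative UTF-8 byte lengths of the
characters read so far, not string indices.  A minimal sketch to check
that, assuming the port encodes the string as UTF-8 (utf8-offsets is a
name made up for this illustration, not a GUILE procedure):

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper: for each character of S, return the pair
;; (string-index . utf8-byte-offset) reached after reading it.
(define (utf8-offsets s)
  (let loop ((chars (string->list s)) (index 0) (bytes 0) (acc '()))
    (if (null? chars)
        (reverse acc)
        (let ((index (+ index 1))
              (bytes (+ bytes (bytevector-length
                               (string->utf8 (string (car chars)))))))
          (loop (cdr chars) index bytes (cons (cons index bytes) acc))))))

(define s (list->string (map integer->char '(20 200 2000 20000))))
(utf8-offsets s)
;; => ((1 . 1) (2 . 3) (3 . 5) (4 . 8))
```

The second element of each pair is what ftell reported; the first is
the string position one would expect from a transparent string port.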

Tying string ports to an artificial bytevector representation that
bleeds through like this makes it impossible to synchronize string
positions and stream positions when parts of the source string are
_not_ processed through the stream.

Which is precisely the problem I am currently dealing with while porting
LilyPond: it has its own lexer working on a (UTF-8 encoded) byte stream
which is at the same time available as a string port.  Whenever embedded
Scheme is interpreted, the string port is moved to the proper position,
and GUILE reads an expression and is told what to do with it.  Then the
string port position is picked off, and the LilyPond lexer is moved to
the corresponding position to continue.

In
<URL:http://git.savannah.gnu.org/cgit/lilypond.git/tree/scm/parser-ly-from-scheme.scm>,
ftell on a string port is used for correlating the positions of parsed
subexpressions with the original data.  Re-encoding strings in UTF-8 is
not going to make this work with string indexing, since ftell does not
bear a useful relation to string positions.
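
To illustrate what the caller would have to do instead, here is a rough
sketch of translating a byte offset back into a string index.  It
assumes the port encodes in UTF-8 and that the offset falls on a
character boundary; byte-offset->string-index is a hypothetical name,
not something GUILE or LilyPond provides:

```scheme
(use-modules (rnrs bytevectors))

;; Hypothetical helper: translate a UTF-8 byte offset (as reported by
;; ftell in the example above) into a character index into S.
;; Returns #f if OFFSET is past the end or not on a character boundary.
(define (byte-offset->string-index s offset)
  (let loop ((index 0) (bytes 0))
    (cond ((= bytes offset) index)
          ((or (> bytes offset) (>= index (string-length s))) #f)
          (else
           (loop (+ index 1)
                 (+ bytes (bytevector-length
                           (string->utf8
                            (string (string-ref s index))))))))))
```

For the four-character string from the example, an offset of 5 maps to
string index 3.  The scan is linear in the offset and has to be repeated
for every lookup, which is the kind of indirection a transparent string
port would make unnecessary.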

The behavior of ftell and port-encoding is perfectly fine for reading
from bytevectors or files, and reading from bytevectors or files also
does not incur an encode-when-open action governed by
%default-port-encoding in GUILE-2.0 and by hardwired UTF-8 in GUILE-2.2.

But strings are already decoded characters.  Reencoding makes no sense
and detaches things like ftell and fseek from the actual input into the
port.

-- 
David Kastrup


