From: Eli Zaretskii
Sent: Sunday, 7 July 2024 13:05
To: Jean Abou Samra
Cc: r...@defaultvalue.org; guile-devel@gnu.org
Subject: Re: Improving the handling of system data (env, users, paths, ...)

> From: Jean Abou Samra <j...@abou-samra.fr>
> Cc: guile-devel@gnu.org
> Date: Sun, 07 Jul 2024 12:03:06 +0200
> 
> Le dimanche 07 juillet 2024 à 08:33 +0300, Eli Zaretskii a écrit :
> > 
> >     - The internal representation is a superset of UTF-8, in that it
> >       is capable of representing characters for which there are no
> >       Unicode codepoints (such as GB 18030, some of whose characters
> >       don't have Unicode counterparts; and raw bytes, used to
> >       represent byte sequences that cannot be decoded).  It uses
> >       5-byte UTF-8-like sequences for these extensions.
> 
> 
>> Guile is a Scheme implementation, bound by Scheme standards and compatibility
>> with other Scheme implementations (and backwards compatibility too).
>
>Yes, I understand that.

Going by what you are saying below, I think you don’t.

>> I just tried (aref (cadr command-line-args) 0) in a lisp-interaction-mode
>> Emacs buffer after launching "emacs $'\xb5'". It gave 4194229 = 0x3fffb5,
>> which quite logically is outside the Unicode code point range 0 - 0x110000.
>That's not how you get a raw byte from a multibyte string in Emacs.
>IOW, you code is wrong, if what you wanted was to get the 0xb5 byte.
>I guess you assumed something about 'aref' in Emacs that is not true
>with multibyte strings that include raw bytes.  So what you got
>instead is the internal Emacs "codepoint" for raw bytes, which are in
>the 0x3fff00..0x3fffff range.

I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, they 
were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and demonstrating that 
the result is bogus in Scheme.  In Scheme, ‘(string-ref ...)’ needs to return a 
character, and there exists no (Unicode) character with codepoint 4194229, so 
what Emacs returns here would be bogus for (Guile) Scheme.

From the Emacs manual:

>For example, you can access individual characters in a string using the 
>function aref (see Functions that Operate on Arrays).

Thus, (aref the-string index) is the equivalent of (string-ref the-string 
index). I do not see any indication they were trying to extract the byte 
itself; rather, they were extracting the _character_ corresponding to the byte, 
and demonstrating that this ‘character’ is, in fact, not actually a character 
in Scheme (or in other words, no such character exists in Scheme).
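
A minimal sketch of that point, assuming a Guile 3.x REPL in which 
integer->char signals an error for integers that are not valid Unicode code 
points. The helper name valid-char-code? is hypothetical, introduced only for 
this illustration:

```scheme
;; Hypothetical helper: #t when N can be converted to a Scheme character.
;; Assumption: integer->char raises an error for integers that are not
;; valid Unicode code points; false-if-exception turns that error into #f.
(define (valid-char-code? n)
  (and (false-if-exception (integer->char n)) #t))

;; U+00B5 (MICRO SIGN) is an ordinary Unicode character.
(valid-char-code? #xb5)

;; 4194229 = #x3fffb5 is the value that 'aref' returned in Emacs.
(valid-char-code? #x3fffb5)
```

Under that assumption, the first call should succeed and the second should 
not, since #x3fffb5 lies beyond the last Unicode code point (#x10ffff).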

>> This doesn't work for Guile, since a character is a Unicode code point
>> in the Scheme semantics.
>See above: the problem doesn't exist if one uses the correct APIs.

AFAICT, there are no correct APIs. Fundamentally (whether for compatibility or 
by choice), characters in (Guile) Scheme are _Unicode_ characters and (Scheme) 
strings consist of _only_ such _Unicode characters_. Elisp strings, however, 
can contain more than that: characters from Emacs’ extended set, or a mixture 
of Unicode characters and raw bytes. In both cases, the Elisp APIs that would 
return characters return things that aren’t _Unicode_ characters, and hence 
aren’t appropriate APIs for Guile.

This doesn’t mean that Emacs’ model can’t be adopted – rather, it could perhaps 
be partially adopted, but whenever the resulting ‘string’ contains things that 
aren’t (Unicode) characters, the result may not be called a ‘string’, and some 
of the things in the not-string may not be called ‘characters’.

> >     - Emacs has its own code for code-conversion, for moving by
> >       characters through multibyte sequences, for producing a Unicode
> >       codepoint from a byte sequence in the super-UTF-8 representation
> >       and back, etc., so it doesn't use libc routines for that, and
> >       thus doesn't depend on the current locale for these operations.
> 
> Guile's encoding conversions don't rely on the libc locale. They use
> GNU libiconv.

>That's okay, but what about other APIs, like conversion between
>characters and their multibyte representations,

This is not an _other_ API; this is precisely the (ice-9 iconv) API. See 
string->bytevector and bytevector->string (well, you need to turn the single 
character into a string consisting of a single character first, but this is 
trivial, simply do (string [insert-character-here])).
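
As a rough sketch of that conversion (the helper names char->bytes and 
bytes->char are hypothetical, and "UTF-8" is just one encoding picked for the 
example):

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector, bytevector->string
             (rnrs bytevectors))  ; bytevector-length

;; Character -> its multibyte representation in ENCODING.
(define (char->bytes c encoding)
  (string->bytevector (string c) encoding))

;; Multibyte representation -> character.
;; Assumes BV encodes exactly one character.
(define (bytes->char bv encoding)
  (string-ref (bytevector->string bv encoding) 0))

;; U+03BB GREEK SMALL LETTER LAMDA, written numerically so the example
;; does not depend on the source file's encoding.
(define lambda-char (integer->char #x3bb))
(define lambda-bytes (char->bytes lambda-char "UTF-8"))

(bytevector-length lambda-bytes)    ; 2, since U+03BB is CE BB in UTF-8
(bytes->char lambda-bytes "UTF-8")  ; round-trips to lambda-char
```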

> returning the length of a string in characters, etc.?  AFAIK, libiconv
> doesn't provide these facilities.

This is a basic string API, just do string-length like in (all?) Schemes. In 
Scheme, strings consist of characters, so string-length returns the length of 
a string in characters.
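
For illustration, a small sketch contrasting the length in characters with the 
length in bytes of one particular encoding (the sample string and the choice of 
UTF-8 are arbitrary, made up for this example):

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector
             (rnrs bytevectors))  ; bytevector-length

;; Three characters: U+0061, U+03BB and U+1F600, built numerically so the
;; example does not depend on the source file's encoding.
(define sample
  (string #\a (integer->char #x3bb) (integer->char #x1f600)))

;; Length in characters: 3.
(string-length sample)

;; Length in bytes once encoded as UTF-8: 1 + 2 + 4 = 7.
(bytevector-length (string->bytevector sample "UTF-8"))
```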

Best regards,
Maxime Devos.
