RE: Improving the handling of system data (env, users, paths, ...)

Maxime Devos Sun, 07 Jul 2024 07:59:37 -0700

>> >> Guile is a Scheme implementation, bound by Scheme standards and 
>> >> compatibility
>> >> with other Scheme implementations (and backwards compatibility too).
>> >
>> >Yes, I understand that.
>> 
>> Going by what you are saying below, I think you don’t.
>
>Thank you for your vote of confidence.


That was not a vote of confidence; if anything, it’s the contrary.

> I’m pretty sure that they weren’t intending to get the 0xb5 byte. Rather, 
> they were using the equivalent of ‘string-ref’ (i.e., ‘aref’) and 
> demonstrating that the result is bogus in Scheme.  In Scheme, ‘(string-ref 
> ...)’ needs to return a character, and there exists no (Unicode) character 
> with codepoint 4194229, so what Emacs returns here would be bogus for (Guile) 
> Scheme.

>aref in Emacs and string-ref in Guile are not the same, and if Guile
>needs to produce a raw byte in this scenario, it can be easily
>arranged.  In Emacs we have other goals.

It is the opposite. In Guile, string-ref does not need to produce bytes, but 
characters – just like aref (modulo difference in how Scheme and Emacs define 
‘byte’).

>IOW, I think this argument is pointless, since it is easy to adapt the
>mechanism to what Guile needs.

No – the argument is about how it is impossible to adapt the mechanism to 
Guile, since bytes aren’t characters in Unicode.
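To make the range problem concrete, here is a minimal sketch. It is written in
Python only because its `str` type, like Scheme strings as described in this
thread, holds Unicode code points; it is not Guile or Emacs code, and the
helper names are made up for illustration. The value 4194229 is the one quoted
earlier in the thread.

```python
import sys

# Value quoted earlier in this thread as the result of Emacs's aref
# on a string containing a raw 0xb5 byte.
EMACS_AREF_RESULT = 4194229


def is_unicode_code_point(n: int) -> bool:
    """Return True if n lies in the Unicode code space (0 .. 0x10FFFF)."""
    return 0 <= n <= sys.maxunicode


def to_character(n: int):
    """Return the one-character string for code point n, or None if n
    is outside the Unicode code space (hypothetical helper)."""
    try:
        return chr(n)
    except ValueError:  # chr() rejects values outside 0 .. 0x10FFFF
        return None
```

Under these assumptions, `to_character(4194229)` yields `None`, whereas
`to_character(0xB5)` yields the single character `µ`. A procedure that must
return a character has nothing valid to return for the former.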

> From the Emacs manual:
> 
> >For example, you can access individual characters in a string using the 
> >function aref (see Functions that Operate on Arrays).
> 
> Thus, (aref the-string index) is the equivalent of (string-ref the-string 
> index).

>No, because a raw byte is not a character.

Yes, because characters are characters. Both string-ref and aref return 
characters. This is documented in both the Emacs and Guile manuals:

Again, from the Emacs manual:

> A string is a fixed sequence of characters. [...] Since strings are arrays, 
> and therefore sequences as well, you can operate on them with the general 
> array and sequence functions documented in Sequences, Arrays, and Vectors. 
> For example, you can access individual characters in a string using the 
> function aref (see Functions that Operate on Arrays).

Hence, (aref the-string index) returns (Emacs) characters.

Likewise, from the Guile manual:

> Scheme Procedure: string-ref str k
> C Function: scm_string_ref (str, k)
> Return character k of str using zero-origin indexing. k must be a valid index 
> of str.

Clearly, these are equivalent (modulo difference in the meaning of 
‘characters’).

>If Guile restricts itself to Unicode characters and only them, it will
>lack important features.  So my suggestion is not to have this
>restriction.

Guile restricting strings to Unicode _is_ an important feature (simplicity, and 
compatibility).

Guile extending strings beyond Unicode is a _limitation_ (compatibility and 
other trickiness for applications).

I could imagine that in the far future there might be too few code points left 
in Unicode, in which case the range of what Guile (and more generally, Scheme 
and Unicode) considers characters would need to be extended (even if that has 
some compatibility implications), but that time hasn’t arrived yet.

The important feature in this thread is supporting file names (and getenv 
results, etc.) that don’t fit properly in the ‘string’ model. As mentioned 
earlier (in the initial message, even), there are solutions to that which do 
not impose the ‘let characters go beyond Unicode’ limitation.

>I think the fact that this discussion is held, and that Rob suggested
>to use Latin-1 for the purpose of supporting raw bytes is a clear
>indication that Guile, too, needs to deal with "character-like" data
>that does not fit the Unicode framework.

True, and I never claimed otherwise.

> So I think saying that strings in Guile can only hold Unicode characters will 
> not give you what this discussion attempts to give.

Sure, and I wasn’t trying to. What I (and, IIUC, the other person as well) was 
doing was pointing out that Emacs’s approach is not a solution either. (Whether 
because of backwards compatibility, or because of not _wanting_ to conflate 
bytes with characters (and not wanting to go beyond Unicode), with all the 
consequences this conflation would imply for applications.)

> In particular, how will you
> handle the situations described by Rob where a file has a name that is
> not a valid UTF-8 sequence (thus not "characters" as long as you
> interpret text as UTF-8)?

Scheme does not interpret text as UTF-8; that’s an internal implementation 
detail and a matter of things like locales. Instead, to Scheme, text is 
(Unicode) characters.
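A small sketch of that distinction, again in Python as a stand-in (an
assumption for illustration, not Guile's implementation): the sequence of
characters stays the same, while the bytes depend entirely on which encoding
is chosen at the boundary. The function names are hypothetical.

```python
# Two characters: MICRO SIGN (U+00B5) followed by "m".
TEXT = "\u00b5m"


def code_points(s: str) -> list:
    """Return the code point of each character; no encoding is involved."""
    return [ord(ch) for ch in s]


def encoded_forms(s: str) -> dict:
    """Encode the same characters under several encodings."""
    return {enc: s.encode(enc) for enc in ("utf-8", "latin-1", "utf-16-be")}
```

Here `code_points(TEXT)` is `[181, 109]` whatever the encoding, while
`encoded_forms(TEXT)` gives three different byte sequences for the same two
characters.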

I have outlined a solution (that does not conflate characters with bytes) in 
another response. IIRC, it was in a response to Rob. I would propose actually, 
you know, reading it. I’m not sure, but IIRC Rob also mentioned another 
solution (i.e., just accept bytevectors in some locations, or do Latin-1).
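The two options attributed to Rob above can be sketched as follows. This is
Python used purely as an illustration, with Python `bytes` standing in for
bytevectors; the file name and function names are invented, and nothing here
is Guile's actual interface or the solution outlined in the other response.

```python
# Hypothetical file name as the system would hand it over; not valid UTF-8.
RAW_NAME = b"caf\xe9-\xb5.txt"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def bytes_to_latin1(data: bytes) -> str:
    """Map each byte 0..255 to the character with the same code point."""
    return data.decode("latin-1")


def latin1_to_bytes(text: str) -> bytes:
    """Inverse mapping; fails if text has a character above U+00FF."""
    return text.encode("latin-1")
```

Option one keeps `RAW_NAME` as bytes and never pretends it is text. Option two
carries it in an ordinary string of characters below U+0100, which round-trips
to the same bytes. In neither case does a character outside Unicode appear,
though under Latin-1 the resulting string is a carrier for bytes rather than
necessarily the text a user would expect to see.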

Also, this structure makes no sense. Even if I did not provide an alternative 
solution of my own, that wouldn’t mean Emacs’s thing is the answer. (Negative) 
criticism can be valid without providing alternatives.

Best regards,
Maxime Devos.
