On Sat, Jul 06, 2024 at 03:32:17PM -0500, Rob Browning wrote:
> 
> 
> * Problem
> 
> System data like environment variables, user names, group names, file
> paths, extended attributes (xattrs), etc. are binary data on some
> systems (like Linux), and may not be encodable as a string in the
> current locale.

Since this might get lost in the ensuing discussion, yes: in Linux (and
its relatives), file names are byte arrays, not strings.

> It's perhaps worth noting that, while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:

Exactly: it's the creating process's locale that calls the shots. So
if you are in a multi-locale environment (e.g. users with different
locale encodings sharing a file system), this will happen.
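
For the archive, here's roughly what that looks like from Guile, using
the (ice-9 iconv) module (a sketch; "año" is just an example name):

    (use-modules (ice-9 iconv))

    ;; "año" as a process running in a Latin-1 locale would write it:
    (string->bytevector "año" "ISO-8859-1")
    ;; => #vu8(97 241 111)

    ;; ... and as a process running in a UTF-8 locale would write it:
    (string->bytevector "año" "UTF-8")
    ;; => #vu8(97 195 177 111)

Same name, two different byte sequences in the same directory, and
nothing on disk records which is which.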

> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.

Yes, perhaps.
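
If I'm not mistaken, (ice-9 iconv) already has the needed knob: a
conversion strategy of 'error makes the decode fail loudly. A rough
sketch:

    (use-modules (ice-9 iconv))

    ;; #vu8(97 241 111) is "año" in Latin-1; byte 241 would need UTF-8
    ;; continuation bytes after it, so the sequence is invalid UTF-8.
    (define latin1-bytes #vu8(97 241 111))

    ;; A strict decode signals an error instead of quietly producing
    ;; the wrong string:
    (bytevector->string latin1-bytes "UTF-8" 'error)
    ;; => raises a decoding error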

[iso-8859-1]

> There are disadvantages to this approach, but it's a fairly easy
> improvement.

I'm not a fan of this one: watching Emacs's development, I've seen
people end up using Latin-1 as a poor substitute for "byte array" :-)
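
The temptation is understandable, because Latin-1 is the one encoding
for which the round trip can't fail. A sketch, again with (ice-9 iconv):

    (use-modules (ice-9 iconv))

    ;; Every byte 0-255 maps to exactly one Latin-1 code point, so
    ;; arbitrary bytes survive the round trip...
    (define bytes #vu8(0 127 128 241 255))
    (equal? bytes
            (string->bytevector
             (bytevector->string bytes "ISO-8859-1")
             "ISO-8859-1"))
    ;; => #t

    ;; ...but the resulting "string" isn't text in any meaningful
    ;; sense, which is exactly the trap.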

> The most direct (and compact, if we do convert to UTF-8) representation
> would be bytevectors, but then you would have a much more limited set of
> operations available (i.e. strings have all of srfi-13, srfi-14, etc.)
> unless we expanded them (likely re-using the existing code paths).  Of
> course you could still convert to Latin-1, perform the operation, and
> convert back, but that's not ideal.

It would be the right one: it would make users deal with explicit
conversions to and from strings, so they'd see the issues as they
happen. But alas, you are right: it's very inconvenient.
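
For the record, the dance looks something like this (a sketch;
name-bytes stands in for a raw file name):

    (use-modules (ice-9 iconv)
                 (srfi srfi-13))

    ;; A file name written by a Latin-1 process: raw ñ, then ".txt".
    (define name-bytes #vu8(241 46 116 120 116))

    ;; Decode losslessly to Latin-1 just to borrow srfi-13, encoding
    ;; back afterwards if the result must return to the system:
    (string-suffix? ".txt"
                    (bytevector->string name-bytes "ISO-8859-1"))
    ;; => #t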

> Finally, while I'm not sure how I feel about it, one notable precedent
> is Python's "surrogateescape" approach[5], which shifts any unencodable
> bytes into "lone Unicode surrogates", a process which can (and of course
> must) be safely reversed before handing the data back to the system.  It
> has its own trade-offs/(security)-concerns, as mentioned in the PEP.

FWIW, that's more or less what Emacs's internal encoding does: it is
roughly UTF-8, but reserves some code points for odd bytes (which it
then displays as backslash sequences). It's round-trip safe, but it has
its own set of sharp edges, and naive [1] users get caught in them from
time to time.
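
(For reference, the mapping itself is tiny; the complexity is in making
sure escaped strings never leak back out unescaped. A sketch on raw
code point integers, since, if I recall correctly, Guile's
integer->char rejects lone surrogates:)

    ;; Python's surrogateescape: undecodable byte b (#x80-#xFF) maps
    ;; to the lone surrogate U+DC00+b, i.e. U+DC80..U+DCFF.
    (define (escape-byte b)
      (+ #xDC00 b))

    ;; The reverse, applied before handing data back to the system:
    (define (unescape-code-point cp)
      (if (<= #xDC80 cp #xDCFF)
          (- cp #xDC00)
          (error "not an escaped byte:" cp)))

    (unescape-code-point (escape-byte #xF1))  ; => #xF1, round-trip safe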

What's my point? Basically, that we shouldn't try to get it 100% right,
because there's possibly no way to, and we'd pile up a lot of
complexity that is very difficult to get rid of (most languages have
painful transition stories to tell).

I think it's ok to try some guesswork to make users' lives easier, but
it's perhaps better to fail noisily (by default) at the least suspicion
than to carry on happily with wrong results.

Guessing UTF-8 seems a safe bet: for one, everybody (except JavaScript)
is moving in that direction; for another, you notice quickly when the
data isn't UTF-8 (as opposed to ISO-8859-x, which will trundle along,
producing funny content).
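
Concretely, with the same (ice-9 iconv) sketch as above:

    (use-modules (ice-9 iconv))

    ;; Guessing UTF-8 blows up at the first bad byte:
    (bytevector->string #vu8(97 241 111) "UTF-8" 'error)
    ;; => raises a decoding error

    ;; Guessing ISO-8859-1 can't fail, so mistakes go unnoticed:
    (bytevector->string #vu8(97 195 177 111) "ISO-8859-1")
    ;; => "aÃ±o", the classic mojibake for UTF-8 "año"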

Cheers
-- 
t
