Re: Improving the handling of system data (env, users, paths, ...)

Rob Browning Tue, 09 Dec 2025 16:52:42 -0800

Mike Gran writes:

> To halfway follow Emacs's lead, Guile could use some of Unicode's
> Private Use Area characters to represent raw bytes.


So when I started this thread, I wasn't aware of a relevant r7rs
proposal:
https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling
(cf. https://codeberg.org/scheme/r7rs/issues/51) which has similarities
with Python's "surrogateescape" approach[1], and perhaps similarities
with Emacs' approach, though I think it's *much* more narrowly construed
and more limited.
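
For anyone who hasn't used it, here is a minimal sketch of the Python
behavior being compared against.  It only uses the standard
"surrogateescape" error handler; the variable names are just for
illustration.

```python
# Round-trip an undecodable byte through a str using "surrogateescape".
raw = b"foo\xb5"  # 0xb5 on its own is not valid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
# The undecodable byte is smuggled in as the lone surrogate U+DCB5
# (U+DC00 + 0xb5) instead of being replaced or rejected.
assert text == "foo\udcb5"

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```

The point is just that nothing is lost: code that treats the value as an
ordinary string still works, and the original bytes come back on output.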

It may be that if Guile were to encode/decode all "system data" using
that 'noncharacters strategy, then as (eventually) with Python, many
Guile programs, both existing and future, would "just work" without the
authors needing to be aware of all of the related concerns/complexity,
and those who do care would have options.

For example, you'd end up with "foo\ufddb\ufdd5" for a $'foo\xb5' on the
command line for a UTF-8 locale, instead of just losing the information
and receiving "foo?" via 'substitute, as you do now.  And were you to
write that argument to stdout, 'noncharacters would just automatically
reverse it back to $'foo\xb5'.
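
To make that round trip concrete, here is a rough Python sketch.  The
mapping is an assumption inferred only from the example above (each
undecodable byte becomes two noncharacters, U+FDD0 plus the high nibble
and U+FDD0 plus the low nibble); the actual proposal may differ in
detail.  The names decode_sysdata, encode_sysdata, and the
"noncharacters-sketch" handler are hypothetical, and the sketch ignores
what should happen when the input already contains those noncharacters.

```python
import codecs

NONCHAR_BASE = 0xFDD0  # first of the noncharacters U+FDD0..U+FDEF


def _escape_bytes(err):
    """Decode error handler: turn each bad byte into two noncharacters."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    escaped = "".join(
        chr(NONCHAR_BASE + (b >> 4)) + chr(NONCHAR_BASE + (b & 0xF))
        for b in bad
    )
    return escaped, err.end  # resume decoding after the bad bytes


# Hypothetical handler name; not part of Python or Guile.
codecs.register_error("noncharacters-sketch", _escape_bytes)


def decode_sysdata(raw):
    """Bytes -> str, preserving undecodable bytes as noncharacter pairs."""
    return raw.decode("utf-8", errors="noncharacters-sketch")


def encode_sysdata(text):
    """str -> bytes, turning noncharacter pairs back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        hi = ord(text[i]) - NONCHAR_BASE
        lo = ord(text[i + 1]) - NONCHAR_BASE if i + 1 < len(text) else -1
        if 0 <= hi < 16 and 0 <= lo < 16:
            out.append((hi << 4) | lo)  # rebuild the original byte
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


arg = decode_sysdata(b"foo\xb5")
assert arg == "foo\ufddb\ufdd5"
assert encode_sysdata(arg) == b"foo\xb5"
```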

But even if 'noncharacters were plausible, it raises questions.  For
example, ports currently have a %default-port-encoding and
%default-port-conversion-strategy, which are fluids that default to the
locale encoding and 'substitute respectively.  (And some string
functions borrow these defaults.  For example, scm_to_locale_string(n)
uses the %default-port-conversion-strategy.)

  - Can/should we eventually change the default port strategy from
    'substitute to 'error (or 'noncharacters if we go that route), so
    that you have to explicitly request a strategy that loses
    information?  (I'm still inclined to think so.)

  - Should file content be treated as "system data", or should it be a
    separate category (even if the defaults for both are the same)?
    I.e., is there any compelling case for being able to override a
    thread's port defaults and sysdata defaults independently, or can
    we just have unified defaults one way or another?
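
As a loose analogy for the first question (this is Python, not Guile, and
the correspondence is my assumption): Python's "replace" handler behaves
roughly like 'substitute, except that it yields U+FFFD rather than "?",
and "strict" behaves roughly like 'error.

```python
raw = b"foo\xb5"

# Lossy default: decoding succeeds, but the original byte is gone.
lossy = raw.decode("utf-8", errors="replace")
assert lossy == "foo\ufffd"
assert lossy.encode("utf-8") != raw

# Strict default: the caller finds out immediately and must choose
# a strategy explicitly if they want to continue.
try:
    raw.decode("utf-8", errors="strict")
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed
```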

...and again, I *think* I'm mostly wondering about an incremental
improvement.  I could imagine that with sufficient resources, some might
also want a way to work with all of the system data as bytevectors.  But
I currently view that as "nice to have": I suspect it wouldn't be all
that much more efficient for common cases (if we do switch to UTF-8
internally), and it would require a lot more work (design and code) to
be anywhere near as convenient.  For example, we have far more support
for manipulating paths as strings than we do for manipulating them as
bytevectors.

[1] https://docs.python.org/3/c-api/init_config.html#c.PyConfig.filesystem_encoding

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
