On Tue, 09 Dec 2025, Rob Browning <[email protected]> wrote:
[...]
> It may be that if Guile were to encode/decode all "system data" using
> that 'noncharacters strategy, then as (eventually) with python, many
> Guile programs, both existing and future, would "just work" without the
> authors needing to be aware of all of the related concerns/complexity
> --- and those who do care would have options.
>
> For example, you'd end up with "foo\ufddb\ufdd5" for a $'foo\xb5' on the
> command line for a UTF-8 locale, instead of just losing the information
> and receiving "foo?" via 'substitute, as you do now.  And were you to
> write that argument to stdout, 'noncharacters would just automatically
> reverse it back to $'foo\xb5'.

This is a concern on Unix platforms and for system-programming tools
like GASH:

  $ LC_ALL=C mkdir $'foo\xb5'
  $ cd $'foo\xb5'
  $ gash -c pwd
  /home/old/tmp/foo?

With the advent of Guile Pre-Scheme as a system programming language, I
think this concern is real and should be addressed properly.
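
To make the loss in the transcript above concrete, here is a small
sketch in Python rather than Guile.  It assumes that Python's "replace"
and "surrogateescape" error handlers are reasonable stand-ins for
'substitute and for the proposed lossless strategy; the variable names
are only illustrative.

```python
raw = b"foo\xb5"  # the directory name bytes from the transcript above

# Lossy decode: the invalid byte becomes U+FFFD, so the original byte
# cannot be recovered when the string is written back out.
lossy = raw.decode("utf-8", errors="replace")
lossy_roundtrip = lossy.encode("utf-8")

# Lossless decode: the invalid byte is carried as a lone surrogate and
# restored when encoding with the same handler.
escaped = raw.decode("utf-8", errors="surrogateescape")
escaped_roundtrip = escaped.encode("utf-8", errors="surrogateescape")

print(lossy_roundtrip == raw)    # False
print(escaped_roundtrip == raw)  # True
```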

Regarding the surrogate-escape, I think it would be possible for a
malicious actor to somehow inject bytes that would go undetected by a
sanitization procedure.  The malicious string would then get converted
back to bytes to be sent to the OS.
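
One way this could play out, sketched with Python's surrogateescape
handler: a string built from escape code points can encode to the same
bytes as an ordinary character, so a check done on the string never
sees that character.  The `is_safe' filter is hypothetical, and whether
a Guile 'noncharacters strategy would share this property is an
assumption, not something I have verified.

```python
def is_safe(name: str) -> bool:
    """Hypothetical sanitizer: reject any name containing 'é'."""
    return "é" not in name

# Two lone surrogates that stand for the raw bytes 0xC3 0xA9.
crafted = "\udcc3\udca9"

# The string-level check passes because no 'é' code point is present.
passed = is_safe(crafted)

# Converting back to bytes for the OS yields exactly the UTF-8 for 'é'.
wire = crafted.encode("utf-8", errors="surrogateescape")
same_bytes = wire == "é".encode("utf-8")

print(passed, same_bytes)  # True True
```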

In the context of system programming, especially with setuid programs,
the surrogate-escape approach seems somewhat dangerous unless it is used
only to convert OS strings given by the OS itself.  This is also a
problem in its own right.  The conversion between runtime strings and OS
strings needs to be done at every boundary to be transparent to users.

> But even if noncharacters were plausible, it raises questions.  For
> example, ports currently have a %default-port-encoding and
> %default-port-conversion-strategy which are fluids that default to the
> locale encoding and 'substitute respectively.  (And some string
> functions borrow these defaults.  For example scm_to_locale_string(n)
> uses the %default-port-conversion-strategy.)
>
>   - Can/should we eventually change the default port strategy from
>     'substitute to 'error (or 'noncharacters if we go that route), so
>     that you have to explicitly request a strategy that loses
>     information?  (I'm still inclined to think so.)

Since this is only a concern for system programming, I would argue that
only such programs would need to change the default port strategy.

Also, I personally would not want the default to be `error' and prefer
`substitute'.  The former would break any program that prints a string
containing a non-ASCII Latin character such as `é' when run on a CI
that has `LANG=C'.
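
As a rough analogue in Python (not Guile), assuming the "ascii" codec
stands in for a `LANG=C' locale and that "replace" and "strict" behave
like `substitute' and `error' respectively:

```python
text = "café"

# substitute-like behaviour: output is degraded, the program continues.
substituted = text.encode("ascii", errors="replace")

# error-like behaviour: the same write raises instead.
try:
    text.encode("ascii", errors="strict")
    raised = False
except UnicodeEncodeError:
    raised = True

print(substituted, raised)  # b'caf?' True
```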

[...]

> ...and again, I *think* I'm mostly wondering about an incremental
> improvement.  I could imagine that with sufficient resources, some might
> also want a way to work with all of the system data as bytevectors.  But
> I currently view that as "nice to have" since I suspect it's a lot more
> work for something that won't be all that much more efficient for common
> cases (if we do switch to UTF-8 internally), and something that would
> require a lot more work (design and code) to be anywhere near as
> convenient.  For example, we have far more support for manipulating
> paths as strings than we do for manipulating them as bytevectors.

If the concern is only about paths, then what we may want is a better
path abstraction than strings.  Opaque path objects with a set of
operations would hide all the details internally, and we could expose a
getter for the underlying bytevector as well as a way to decode it to a
string with the desired conversion strategy, including surrogate-escape.
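
A minimal sketch of what I have in mind, written in Python only for
illustration.  The `OSPath' class and its method names are
hypothetical; nothing like this exists in Guile today, and a real
design would need many more operations.

```python
import posixpath


class OSPath:
    """Hypothetical opaque path: stores the exact bytes the OS gave us."""

    __slots__ = ("_raw",)

    def __init__(self, raw: bytes):
        self._raw = bytes(raw)

    def join(self, other: "OSPath") -> "OSPath":
        # Path operations work on bytes, so no decoding is ever needed.
        return OSPath(posixpath.join(self._raw, other._raw))

    def to_bytes(self) -> bytes:
        # Getter for the underlying byte sequence.
        return self._raw

    def to_string(self, encoding: str = "utf-8", errors: str = "replace") -> str:
        # The caller picks the conversion strategy explicitly.
        return self._raw.decode(encoding, errors)


path = OSPath(b"/home/old/tmp").join(OSPath(b"foo\xb5"))
display = path.to_string()                           # lossy, for humans
lossless = path.to_string(errors="surrogateescape")  # reversible
```

Here the bytes handed back to the OS never pass through a string, so
the conversion strategy only matters when a string is explicitly
requested.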

[...]

Thanks,
Olivier
-- 
Olivier Dion
oldiob.ca
