Olivier Dion <[email protected]> writes:

> Regarding the surrogate-escape, I think it would be possible for a
> malicious actor to somehow inject bytes that would go undetected by a
> sanitization procedure.  The malicious string would then get converted
> back to bytes to be sent to the OS.

Depending on what you mean, and just for reference, I *think* Python's
particular approach may attempt to address this in part by restricting
it to ASCII-compatible encodings, and then having ASCII "smuggling"
provoke an error: https://peps.python.org/pep-0383/#discussion
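For reference, a minimal Python sketch of the behaviour described in that
PEP section, as I understand it.  The file name and the helper
`try_smuggle` are made up for illustration; only the standard
`surrogateescape` error handler is assumed.

```python
# Undecodable bytes (>= 0x80) become lone surrogates U+DC80..U+DCFF on
# decode, and are turned back into the original bytes on encode.
raw = b"report-\xff.txt"
text = raw.decode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")


def try_smuggle(s):
    """Return the encoded bytes, or None if the encoder rejects s."""
    try:
        return s.encode("utf-8", errors="surrogateescape")
    except UnicodeEncodeError:
        return None


# U+DC2F would correspond to the ASCII byte 0x2F ("/").  It is outside
# the U+DC80..U+DCFF range, so the handler refuses to encode it.
smuggled_slash = try_smuggle("\udc2f")
```

So an escaped byte survives the round trip, while an attempt to hide an
ASCII character behind a surrogate raises instead of reaching the OS.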

Of course if it helped, I think strings with noncharacters could be
easily "tagged", at least in the current UTF-8 conversion effort.
We already have "ASCII-only" and "UTF-8" string flavors, and it may be
fairly easy to add a "noncharacters" flag.
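As a rough analogy only (Python marks escaped bytes with lone surrogates
rather than noncharacters, and computes this by scanning rather than
storing a flag), the kind of check such a tag would make cheap might
look like the sketch below.  The name `has_escaped_bytes` is
hypothetical.

```python
def has_escaped_bytes(s):
    """True if s carries bytes smuggled in via surrogateescape."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)


clean = b"plain.txt".decode("utf-8", errors="surrogateescape")
tainted = b"odd-\xfe.txt".decode("utf-8", errors="surrogateescape")
```

A sanitizing function could then reject or specially handle any string
for which the check is true, instead of silently passing it along.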

I assume that a primary case we're talking about here is the concern
that one could hide "normal text", characters/bytes that would have been
escaped or filtered out by any guards, by smuggling them through as
noncharacters.

In any case, I agree that the general topic should be considered with
respect to any potential approaches, and is relevant to all of them
(e.g. if we just use bytevectors, then people have to have
bytevector-flavored versions of every escaping/filtering/etc. function).

> Since this is only a concern for system programming, I would argue that
> only those would need to change the default port strategy.
>
> Also, I personally would not want the default to be `error' and prefer
> `substitute'.  The former would break any program that print a UTF-8
> string that encodes a Latin character such as `é' when run on a CI that
> has `LANG=C'.

At the moment, I still feel like silent data loss should never be
the default, i.e. I feel like you should know, and have to make a
choice, if the content of a file, or a path, or an environment variable
is "unreadable", and we shouldn't just quietly lose the information.
Though we could still make it easy to say "don't care" globally if
that's really often desired.

I suppose I could see considering a practical exception for say stderr.
Otherwise, I feel like either the port should be binary, not text, or
you should add 'substitute yourself, given some knowledge about the
source (and we should make that easy, with the considerations well
documented).
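To make the trade-off concrete, here is a small Python sketch of the two
strategies applied to an ASCII-only target, roughly what a `LANG=C`
environment implies.  Python's `strict` and `replace` handlers stand in
for `error' and `substitute' here; that mapping is my assumption, and
`encode_strict` is a made-up helper.

```python
text = "caf\u00e9"  # "café", with é as the single code point U+00E9


def encode_strict(s):
    """Encode to ASCII, returning None where 'error' would raise."""
    try:
        return s.encode("ascii", errors="strict")
    except UnicodeEncodeError:
        return None


strict_result = encode_strict(text)

# 'replace' always succeeds, but the é is gone and cannot be recovered
# from the output.
lossy = text.encode("ascii", errors="replace")
recovered = lossy.decode("ascii")
```

The strict variant forces the caller to notice, while the substituting
one keeps the program running at the cost of the original content.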

> If the concern is only about paths, then what we may want is a better
> path abstraction than strings.  Opaque path objects with a set of
> operations would hide all the details internally and we could expose
> getters for the underlying bytevector or encode it to a string with the
> desired conversion strategy, including surrogate-escape.

Even with unlimited resources, I wouldn't be sure, but given limited
resources, I currently lean toward thinking that we'd be better off, if
feasible, starting with some way to let people use all of the available
string handling "normally" for system data, since nearly all of it *is*
going to be decodable now.  And I also suspect a sizable
majority of people are going to want and expect to handle paths, command
line arguments, user names, group names, env vars, and file content more
or less as strings as much as possible (*if* we find a reasonable way to
do it).
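As an illustration of what handling system data "normally" as strings
can look like, here is a Python sketch in which ordinary regular
expressions are applied to file names, one of which is not valid UTF-8.
The file names are invented, and Python's `surrogateescape` is only a
stand-in for whatever mechanism we might end up with.

```python
import re

raw_names = [b"notes.txt", b"caf\xe9.txt", b"image.png"]

# Decode everything to strings; the stray Latin-1 byte 0xE9 is carried
# along as U+DCE9 instead of causing an error.
names = [n.decode("utf-8", errors="surrogateescape") for n in raw_names]

# Ordinary string-based pattern matching works on all of them.
txt_names = [n for n in names if re.search(r"\.txt$", n)]

# Converting back yields the exact bytes the OS originally handed over.
txt_raw = [n.encode("utf-8", errors="surrogateescape") for n in txt_names]
```

No bytevector-specific regexp or globbing support is needed here, and
the undecodable name still reaches the OS unchanged.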

Augmenting bytevectors with all of the desirable functionality would I
think be a lot of work, and also still somewhat awkward.  To be
comparable to strings, among other things, we'd need sensible,
bytevector compatible regular expression handling, globbing, parsers,
etc.  And assuming it's opt-in, everyone would have to know, and be
willing, to stop using strings for system data, or programs will work
right until they don't.

As I think you suggested, we could also go with some type-specific path
abstraction, but I'd be very hesitant about that unless it was because
we were going to try to tackle "cross platform paths" (e.g. Common Lisp;
see also, limited resources) --- and then we've still done nothing to
address all the other system data, *and* we have perhaps even more work
to do to provide sensible pattern matching, etc.

Note that I'm not certain about any of this, and I have previously, for
example, favored bytevectors.

-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
