Mike Gran <[email protected]> writes:

> To halfway follow Emacs's lead, Guile could use some of Unicode's
> Private Use Area characters to represent raw bytes.

So when I started this thread, I wasn't aware of a relevant r7rs
proposal:

  https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling
  (cf. https://codeberg.org/scheme/r7rs/issues/51)

which has similarities with Python's "surrogateescape" approach[1], and
perhaps with Emacs' approach, though I think it's *much* more narrowly
construed and more limited.

It may be that if Guile were to encode/decode all "system data" using
that 'noncharacters strategy, then as (eventually) with Python, many
Guile programs, both existing and future, would "just work" without the
authors needing to be aware of all of the related concerns/complexity,
and those who do care would have options.

For example, in a UTF-8 locale you'd end up with "foo\ufddb\ufdd5" for a
$'foo\xb5' on the command line, instead of just losing the information
and receiving "foo?" via 'substitute, as you do now. And were you to
write that argument to stdout, 'noncharacters would automatically
convert it back to $'foo\xb5'.

But even if noncharacters were plausible, it raises questions. For
example, ports currently have a %default-port-encoding and
%default-port-conversion-strategy, which are fluids that default to the
locale encoding and 'substitute respectively. (And some string
functions borrow these defaults. For example, scm_to_locale_string(n)
uses the %default-port-conversion-strategy.)

  - Can/should we eventually change the default port strategy from
    'substitute to 'error (or 'noncharacters if we go that route), so
    that you have to explicitly request a strategy that loses
    information? (I'm still inclined to think so.)

  - Should file content be treated as "system data", or should it be a
    separate category (even if the defaults for both are the same)?
    I.e., is there any compelling case for being able to override a
    thread's port defaults and sysdata defaults independently, or can
    we just have unified defaults one way or another?

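To make the $'foo\xb5' round trip above concrete, here is a rough Python
sketch. It is not Guile code and is not taken from the proposal text: it
assumes the mapping implied by the "foo\ufddb\ufdd5" example (one
noncharacter from U+FDD0..U+FDDF per nibble of each undecodable byte),
and the function names are made up for illustration. Python's "replace"
handler stands in for 'substitute here, though it yields U+FFFD rather
than "?".

```python
# Sketch only. Assumption: each undecodable byte becomes two noncharacters,
# U+FDD0 + high nibble followed by U+FDD0 + low nibble, as the
# "foo\ufddb\ufdd5" example suggests. The function names are hypothetical.
# Limitation: a literal U+FDD0..U+FDDF pair already present in the input is
# indistinguishable from an escaped byte; this sketch does not handle that.

NONCHAR_BASE = 0xFDD0  # U+FDD0..U+FDDF, one noncharacter per nibble value


def decode_noncharacters(data: bytes) -> str:
    """Decode UTF-8, escaping each undecodable byte as two noncharacters."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into data[pos:].
            bad_start = pos + err.start
            bad_end = pos + err.end
            out.append(data[pos:bad_start].decode("utf-8"))
            for byte in data[bad_start:bad_end]:
                out.append(chr(NONCHAR_BASE + (byte >> 4)))
                out.append(chr(NONCHAR_BASE + (byte & 0x0F)))
            pos = bad_end
    return "".join(out)


def encode_noncharacters(text: str) -> bytes:
    """Encode to UTF-8, turning each noncharacter pair back into a raw byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        hi = ord(text[i]) - NONCHAR_BASE
        lo = ord(text[i + 1]) - NONCHAR_BASE if i + 1 < len(text) else -1
        if 0 <= hi < 16 and 0 <= lo < 16:
            out.append((hi << 4) | lo)
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"foo\xb5"  # what $'foo\xb5' puts on the command line

# Lossy decoding: the original byte cannot be recovered afterwards.
lossy = raw.decode("utf-8", errors="replace")
assert lossy.encode("utf-8") != raw

# Python's surrogateescape: the bad byte becomes a lone surrogate.
escaped = raw.decode("utf-8", errors="surrogateescape")
assert escaped == "foo\udcb5"
assert escaped.encode("utf-8", errors="surrogateescape") == raw

# The assumed noncharacters mapping: the bad byte becomes two noncharacters.
nonchar = decode_noncharacters(raw)
assert nonchar == "foo\ufddb\ufdd5"
assert encode_noncharacters(nonchar) == raw
```

Both escaping schemes survive the round trip; the visible difference is
that surrogateescape yields a string containing a lone surrogate, which
is not valid Unicode text, while the noncharacters form stays within
ordinary code points.
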
...and again, I *think* I'm mostly wondering about an incremental
improvement. I could imagine that with sufficient resources, some might
also want a way to work with all of the system data as bytevectors. But
I currently view that as "nice to have" since I suspect it's a lot more
work for something that won't be all that much more efficient for
common cases (if we do switch to UTF-8 internally), and something that
would require a lot more work (design and code) to be anywhere near as
convenient. For example, we have far more support for manipulating
paths as strings than we do for manipulating them as bytevectors.

[1] https://docs.python.org/3/c-api/init_config.html#c.PyConfig.filesystem_encoding

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4
