Hi Mark!

On Mon 23 May 2011 21:42, Mark H Weaver <m...@netris.org> writes:

> The tentative plan is to use normal strings to represent pathnames,
> command-line arguments, environmental variable values, and other such
> POSIX byte strings.

Apologies for not giving you prompt feedback on this idea. Basically I
think it sounds like a great, workable plan.

> For purposes of this email, suppose they are called
> scm_to_permissive_stringn and scm_from_permissive_stringn. On top of
> these we would implement scm_to_permissive_locale_stringn,
> scm_from_permissive_locale_stringn, and some other convenience
> functions.

Sounds good. "Permissive" sounds a bit odd but I can't think of another
name. "Foreign"? "Corrupt"? "Possibly invalid"? "Nonsense"? "Raw"?
"Cooked"? "Bytes"? "scm_from_utf8_byte_string"?

> Since scm_from_permissive_stringn maps invalid bytes to private-use code
> points in the range U+109700..U+1097FF, we must ensure that properly
> encoded code points in that range are mapped to something else.
> Otherwise, two distinct POSIX byte strings might map to the same SCM
> string. The simplest solution is to consider any byte sequence which
> would map to our reserved range to be invalid, and thus mapped one byte
> at a time using this scheme. For example, U+1097FF is represented in
> UTF-8 as 0xF4 0x89 0x9F 0xBF. Although scm_from_stringn would map this
> sequence of bytes to the single code point U+1097FF (when using UTF-8),
> scm_from_permissive_stringn would instead consider this entire byte
> sequence to be invalid, and instead map it to the 4 code points
> U+1097F4, U+109789, U+10979F, U+1097BF.

Works for me.

> So the tentative plan is to provide this alternative mapping, and use it
> whenever accessing POSIX byte strings, whether they be filenames,
> command-line arguments, environment variable values, fields within a
> passwd, group, wtmp, or utmp file, system information (e.g. the hostname
> or information from uname), etc.

Cool.
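
To make the quoted mapping concrete, here is a minimal sketch in C,
assuming the input encoding is UTF-8. The names decode_utf8 and
permissive_decode are hypothetical and are not Guile API; a real
scm_from_permissive_stringn would build an SCM string rather than fill
an array of code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Start of the reserved private-use range U+109700..U+1097FF. */
#define ESCAPE_BASE 0x109700u

/* Try to decode one strictly valid UTF-8 sequence at p (n bytes left).
   Returns its length and stores the code point in *cp, or returns 0
   if the bytes at p do not start a valid sequence. */
static size_t decode_utf8(const unsigned char *p, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (n == 0) return 0;
    if (p[0] < 0x80) { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; min = 0x10000; }
    else return 0;                     /* stray continuation or bad lead byte */

    if (n < len) return 0;             /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Hypothetical stand-in for scm_from_permissive_stringn on UTF-8 input.
   Writes code points to out, which needs room for n entries, and
   returns the number of code points written. */
static size_t permissive_decode(const unsigned char *in, size_t n,
                                uint32_t *out)
{
    size_t i = 0, count = 0;

    while (i < n) {
        uint32_t cp;
        size_t len = decode_utf8(in + i, n - i, &cp);

        if (len == 0 || (cp >= ESCAPE_BASE && cp <= ESCAPE_BASE + 0xFF)) {
            /* Invalid byte, or a valid sequence that would land in the
               reserved range: escape this one byte and move on. */
            out[count++] = ESCAPE_BASE + in[i];
            i += 1;
        } else {
            out[count++] = cp;
            i += len;
        }
    }
    return count;
}
```

With this sketch, the bytes 0xF4 0x89 0x9F 0xBF come out as the four code
points U+1097F4, U+109789, U+10979F, U+1097BF, matching the example in
the quoted text, because the remaining bytes are stray continuation
bytes once the lead byte has been escaped.
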

> We should allow the user to access this mapping directly, via
>
>   scm_{to,from}_permissive_stringn,
>   scm_{to,from}_permissive_locale_stringn,
>   scm_{to,from}_permissive_utf8_stringn,
>
> and also between strings and bytevectors in both Scheme and C:
>
>   permissive-string->utf8,
>   permissive-utf8->string,
>   scm_permissive_string_to_utf8,
>   scm_permissive_utf8_to_string,
>
> and we should probably add procedures to convert between strings and
> bytevectors using other encodings as well, most importantly the locale
> encoding.
>
> We'd also need permissive-string->pointer and
> permissive-pointer->string.
>
> I'm not sure about the names. Suggestions welcome.

I'm liking "bytes". scm_from_locale_byte_stringn. byte-string->utf8.
Perhaps not clear enough though. WDYT?

> Regarding Noah's proposal to allow handling pathnames as sequences of
> path components: both Andy and I like this idea. However, as always,
> the devil's in the details. I'll write more about this in another
> email.

Sure, let's get this lowest level in first. Are you on it? :-) There is
no hurry of course, just so we know...

Cheers,

Andy
-- 
http://wingolog.org/