Re: Filenames and other POSIX byte strings as SCM strings without loss

Andy Wingo Fri, 01 Jul 2011 04:55:55 -0700

Hi Mark!

On Mon 23 May 2011 21:42, Mark H Weaver <m...@netris.org> writes:


> The tentative plan is to use normal strings to represent pathnames,
> command-line arguments, environment variable values, and other such
> POSIX byte strings.

Apologies for not giving you prompt feedback on this idea.  Basically I
think it sounds like a great, workable plan.

> For purposes of this email, suppose they are called
> scm_to_permissive_stringn and scm_from_permissive_stringn.  On top of
> these we would implement scm_to_permissive_locale_stringn,
> scm_from_permissive_locale_stringn, and some other convenience
> functions.

Sounds good.  "Permissive" sounds a bit odd but I can't think of another
name.  "Foreign"?  "Corrupt"?  "Possibly invalid"?  "Nonsense"?  "Raw"?
"Cooked"?  "Bytes"?  "scm_from_utf8_byte_string"?

> Since scm_from_permissive_stringn maps invalid bytes to private-use code
> points in the range U+109700..U+1097FF, we must ensure that properly
> encoded code points in that range are mapped to something else.
> Otherwise, two distinct POSIX byte strings might map to the same SCM
> string.  The simplest solution is to consider any byte sequence which
> would map to our reserved range to be invalid, and thus mapped one byte
> at a time using this scheme.  For example, U+1097FF is represented in
> UTF-8 as 0xF4 0x89 0x9F 0xBF.  Although scm_from_stringn would map this
> sequence of bytes to the single code point U+1097FF (when using UTF-8),
> scm_from_permissive_stringn would instead consider this entire byte
> sequence to be invalid, and instead map it to the 4 code points
> U+1097F4, U+109789, U+10979F, U+1097BF.

Works for me.
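
For illustration, the decoding direction of the scheme quoted above can be sketched in C roughly as follows, assuming UTF-8 input. The names decode_utf8 and permissive_decode are hypothetical and are not Guile API; a real scm_from_permissive_stringn would presumably build on Guile's existing conversion machinery and cover other encodings too. The caller is assumed to supply an output buffer with room for n code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reserved private-use range from the proposal: U+109700..U+1097FF. */
#define RESERVED_BASE 0x109700u

/* Hypothetical helper: strictly decode one UTF-8 sequence at s, with n
   bytes available.  Returns the sequence length and stores the code
   point in *cp, or returns 0 if the bytes are not valid UTF-8. */
static size_t
decode_utf8 (const unsigned char *s, size_t n, uint32_t *cp)
{
  size_t len, i;
  uint32_t c, min;

  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
  else
    return 0;                   /* stray continuation or invalid lead byte */

  if (n < len)
    return 0;                   /* truncated sequence */
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;               /* not a continuation byte */
      c = (c << 6) | (s[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return len;
}

/* Hypothetical stand-in for the decoding half of the proposal.  Writes
   code points to out (room for n entries) and returns how many. */
static size_t
permissive_decode (const unsigned char *s, size_t n, uint32_t *out)
{
  size_t i = 0, count = 0;

  assert (out != NULL && (s != NULL || n == 0));
  while (i < n)
    {
      uint32_t cp;
      size_t len = decode_utf8 (s + i, n - i, &cp);

      if (len == 0 || (cp >= RESERVED_BASE && cp <= RESERVED_BASE + 0xFF))
        {
          /* Invalid, or valid but landing in the reserved range:
             map this single byte and resume at the next one. */
          out[count++] = RESERVED_BASE + s[i];
          i += 1;
        }
      else
        {
          out[count++] = cp;
          i += len;
        }
    }
  return count;
}
```

With this sketch, the bytes 0xF4 0x89 0x9F 0xBF come out as the four code points U+1097F4, U+109789, U+10979F, U+1097BF, matching the example in the quoted text. Because every byte that is not consumed by an accepted sequence maps to its own code point, the mapping stays reversible.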

> So the tentative plan is to provide this alternative mapping, and use it
> whenever accessing POSIX byte strings, whether they be filenames,
> command-line arguments, environment variable values, fields within a
> passwd, group, wtmp, or utmp file, system information (e.g. the hostname
> or information from uname), etc.

Cool.

> We should allow the user to access this mapping directly, via
>
>   scm_{to,from}_permissive_stringn,
>   scm_{to,from}_permissive_locale_stringn,
>   scm_{to,from}_permissive_utf8_stringn,
>
> and also between strings and bytevectors in both Scheme and C:
>
>   permissive-string->utf8,
>   permissive-utf8->string,
>   scm_permissive_string_to_utf8,
>   scm_permissive_utf8_to_string,
>
> and we should probably add procedures to convert between strings and
> bytevectors using other encodings as well, most importantly the locale
> encoding.
>
> We'd also need permissive-string->pointer and
> permissive-pointer->string.
>
> I'm not sure about the names.  Suggestions welcome.

I'm liking "bytes".  scm_from_locale_byte_stringn.  byte-string->utf8.
Perhaps not clear enough though.  WDYT?

> Regarding Noah's proposal to allow handling pathnames as sequences of
> path components: both Andy and I like this idea.  However, as always,
> the devil's in the details.  I'll write more about this in another
> email.

Sure, let's get this lowest level in first.  Are you on it? :-)  There
is no hurry of course, just so we know...

Cheers,

Andy
-- 
http://wingolog.org/
