marcov wrote on Tue, 21 Aug 2012:

In our previous episode, Mattias Gaertner said:

For example under Linux file names are treated as UTF-8 but are only
bytes. They can and they do contain invalid UTF-8 characters.
If your program should support this, you must use a FindFirst
with UTF-8. To be clear: I don't say the default FindFirst under Linux
must be UTF-8, I only say, there must be one version with UTF-8, e.g.
FindFirstU8 and that must directly use the Linux file functions
without conversions.

That's ugly indeed. Since that doesn't mean just a UTF-8 overload,

Since it's just raw bytes, it's actually as much UTF-8 as it is Windows Latin-1.

but that
the entire internal trajectory behind that (searchrec inclusive) must be
1-byte without conversion. Or the 1-byte to UTF-16 and back conversion must
be stable, i.e. invF(F(x)) = x.
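
To make the stability requirement concrete, here is a minimal sketch in Python (used purely for illustration, not FPC code; the helper `round_trips` and the sample file name are made up for this example). It checks whether decoding a raw file name and encoding it again gives back the same bytes, i.e. whether invF(F(x)) = x holds for a given conversion.

```python
def round_trips(raw: bytes, encoding: str, errors: str) -> bool:
    """Return True if decode followed by encode reproduces `raw` exactly."""
    try:
        text = raw.decode(encoding, errors)
    except UnicodeDecodeError:
        # The forward conversion F is not even defined for this input.
        return False
    try:
        return text.encode(encoding, errors) == raw
    except UnicodeEncodeError:
        return False


# Hypothetical file name: Latin-1 bytes, not a valid UTF-8 sequence.
raw_name = b"caf\xe9.txt"

results = {
    "utf-8/strict": round_trips(raw_name, "utf-8", "strict"),
    "utf-8/replace": round_trips(raw_name, "utf-8", "replace"),
    "utf-8/surrogateescape": round_trips(raw_name, "utf-8", "surrogateescape"),
    "latin-1/strict": round_trips(raw_name, "latin-1", "strict"),
}
```

Strict UTF-8 decoding fails outright and the `replace` handler silently loses the original byte, so neither is stable. The `surrogateescape` handler, which Python itself uses for file names, and a plain 1-byte encoding such as Latin-1 both restore the original bytes.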

Other frameworks also have to deal with this, and generally have a particular default and allow the programmer (and sometimes the end user) to override the default behaviour. E.g., glib assumes all file names are UTF-8, but you can change this to "assume file names are encoded in the current user's locale" or to "assume file names are encoded using encoding XYZ" (either programmatically or via an environment variable). Qt assumes they are encoded in the current user's locale, but the programmer can change this to a different code page (no environment variable). In practice, the default Qt and glib behaviour is almost always the same on Linux nowadays, since UTF-8 locales are the default.
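
The "default plus override" policy described above could be sketched roughly as follows. This is an illustration in Python, not how glib or Qt actually implement it: the environment variable name `MYAPP_FILENAME_ENCODING`, the `@locale` token and the function `filename_encoding` are all assumptions made up for this sketch.

```python
import codecs
import locale

# Hypothetical variable name, used only for this sketch.
ENV_VAR = "MYAPP_FILENAME_ENCODING"


def filename_encoding(env, default="utf-8"):
    """Pick the encoding assumed for file names.

    `env` is any mapping of environment variables (e.g. os.environ).
    Without an override the default applies; "@locale" selects the
    current user's locale encoding; anything else must be a known codec.
    """
    override = env.get(ENV_VAR)
    if not override:
        return default
    if override == "@locale":
        return locale.getpreferredencoding(False)
    codecs.lookup(override)  # raises LookupError for an unknown encoding
    return override
```

A program would call `filename_encoding(os.environ)` once at startup and use the result for every file name conversion. Note that this only chooses which encoding to assume; it does nothing for names that are invalid in that encoding.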

I'm not aware of a framework that allows you to say that file names are just random bytes. It would probably be possible to implement this in FPC by adding "support" for the invalid $FFFF code page (both in ansistring and in unicodestring) and never converting anything if that one is used (basically overwrite the destination string's codepage with $FFFF if it's used by the source). Another option is not to support invalid file names in the cross-platform RTL interface at all, so that programs have to use platform-specific APIs to deal with them on platforms that "support" such file names (as with glib and Qt). That could optionally be combined with "raw" overloads of such functions, which could even accept and return arrays of byte rather than strings, in order to avoid any accidental conversions and to make it clear what you're dealing with.
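
As an illustration of what a "raw" interface looks like in practice, here is a small Python sketch (again not FPC code; `list_raw_names` and `demo` are names invented for this example). Python's `os.listdir` returns `bytes` entries when it is given a `bytes` path, so no conversion takes place. The sketch assumes a Linux file system that accepts arbitrary bytes in file names.

```python
import os
import tempfile
from typing import List


def list_raw_names(directory: bytes) -> List[bytes]:
    """Return the directory entries as raw bytes, with no decoding step."""
    return sorted(os.listdir(directory))


def demo() -> List[bytes]:
    """Create a file whose name is not valid UTF-8 and list it back raw."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_dir = os.fsencode(tmp)
        bad_name = b"caf\xe9.txt"  # Latin-1 bytes, invalid as UTF-8
        with open(os.path.join(raw_dir, bad_name), "wb"):
            pass  # an empty file is enough for the demonstration
        return list_raw_names(raw_dir)
```

Because the caller passes and receives byte arrays, there is no point at which an accidental conversion could alter the name, and the types make it obvious that these are not text strings.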


Jonas
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
