Re: Roundtripping in Unicode

Philippe Verdy Tue, 14 Dec 2004 15:36:35 -0800

From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>

Lars Kristan <[EMAIL PROTECTED]> writes:

Hmmmmm, here lies the catch. According to UTC, you need to keep
processing the UNIX filenames as BINARY data. And, also according
to UTC, any UTF-8 function is allowed to reject invalid sequences.
Basically, you are not supposed to use strcpy to process filenames.


No: strcpy passes raw bytes, it does not interpret them according to
the locale. It's not "an UTF-8 function".

Correct: [wc]strcpy() handles "string" instances, but not all string instances are plain text, so they need not obey the UTF encoding rules. They only obey the convention of null termination, with no restriction on the string length, which is measured as a size in [w]char[_t] code units and not as a number of Unicode characters.

This is true of the whole standard C/C++ string library, of Java (String and Character objects, or the native char datatype), and of almost all string-handling libraries in common programming languages.

A "locale" defined as "UTF-8" will run into many problems because of the various ways applications react to encoding "errors" encountered in filenames: exceptions thrown that abort the program, substitution by "?" or U+FFFD that causes the wrong files to be accessed, or files left unprocessed because their names were considered "invalid" although they were actually created by a user of another locale...
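
To make these failure modes concrete, here is a minimal sketch in Python. The two filenames are hypothetical, chosen as names a Latin-1 user might have created; the helper names `strict` and `lossy` are only illustrative. Strict decoding rejects both names, while substitution by U+FFFD maps two different files to the same name.

```python
# Two distinct filenames as raw bytes, as a Latin-1 user might have created them
# (hypothetical names; 0xE9 and 0xE8 are not valid UTF-8 in this position).
names = [b"caf\xe9.txt", b"caf\xe8.txt"]

def strict(raw):
    """Return the decoded name, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

def lossy(raw):
    """Decode, substituting U+FFFD for every undecodable byte."""
    return raw.decode("utf-8", errors="replace")

strict_results = [strict(n) for n in names]
lossy_results = [lossy(n) for n in names]

# ascii() keeps the output printable on any terminal encoding.
print(strict_results)
print([ascii(s) for s in lossy_results])
```

After substitution the two names compare equal, so an application that works on the decoded form can no longer tell which file was meant.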

Filenames are identifiers coded as strings, not as plain-text (even if most of these filename strings are plain-text).

The solution is then to use a locale based on a "relaxed version of UTF-8". Some spoke about defining "NOT-UTF-8" and "NOT-UTF-16" encodings that allow any sequence of code units, but nobody has worked out how to make "NOT-UTF-8" and "NOT-UTF-16" fully and mutually reversible. Now add "NOT-UTF-32" to this nightmare: "NOT-UTF-32" needs to encode 2^32 distinct NOT-Unicode code points, and they must map bijectively to exactly all 2^32 sequences possible in NOT-UTF-16 and NOT-UTF-8. I have not found a solution to this problem, and I don't know whether one even exists; if it does, it is likely to be quite complex.
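
As a rough illustration of what a "relaxed UTF-8" might look like, and of why pairing it with UTF-16 is awkward, the sketch below borrows Python's `surrogateescape` error handler. This is a swapped-in technique that was not part of this discussion, and the filename is hypothetical.

```python
raw = b"caf\xe9.txt"  # hypothetical filename, not valid UTF-8

# Relaxed decoding: each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# The byte-level round trip through the relaxed form is exact.
round_trip = text.encode("utf-8", errors="surrogateescape")

# Strict UTF-16 refuses the lone surrogate, so the relaxed form has no
# counterpart in standard UTF-16.
try:
    text.encode("utf-16-le")
    strict_utf16_ok = True
except UnicodeEncodeError:
    strict_utf16_ok = False

print(repr(text), round_trip == raw, strict_utf16_ok)
```

The relaxed decoder round-trips arbitrary bytes, but only by producing code units that a conforming UTF-16 encoder rejects. Making all the relaxed forms mutually reversible would need a matching relaxation of each of them, which is the open problem described above.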
