However, the problem most often arises when exchanging data between systems, through removable or shared volumes.
Needless to say, these systems were badly designed at their origin, and newer filesystems (and OS APIs) offer much better alternatives, either by storing explicitly on the volume which encoding it uses, or by converting all user-selected encodings to a common kernel encoding such as one of the Unicode encoding schemes (this is what FAT32 and NTFS do for filenames created under Windows, since Windows 98 or NT).
I understand that there may be situations, such as Linux/Unix UFS-like filesystems, where it will be hard to decide which encoding was used for filenames (or simply for the content of plain-text files). For plain-text files, which usually contain enough data, automatic identification of the encoding is possible, and is used with success in many applications (notably in web browsers).
But for filenames, which are generally short, automatic identification is often difficult. UTF-16, however, usually remains easy to identify, because of the unusually high frequency of low-valued bytes at every even or odd position. UTF-8 is also easy to identify thanks to its strict rules (without these strict rules, which forbid certain byte sequences, automatic identification of the encoding becomes very risky).
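
As an illustration, here is a minimal Python sketch of such a heuristic; the zero-byte threshold and the label returned for legacy data are my own illustrative choices, not a standard detection algorithm:

def guess_filename_encoding(name: bytes) -> str:
    # Heuristic sketch only: guess the encoding of a short byte string.
    if len(name) >= 2 and len(name) % 2 == 0:
        # Latin-script text encoded as UTF-16 has a zero high-order byte in
        # nearly every 16-bit code unit, so count zeros at even/odd offsets.
        zeros_even = name[0::2].count(0)
        zeros_odd = name[1::2].count(0)
        units = len(name) // 2
        if zeros_odd >= units // 2 and zeros_even == 0:
            return "utf-16-le"
        if zeros_even >= units // 2 and zeros_odd == 0:
            return "utf-16-be"
    try:
        # Strict UTF-8 decoding rejects the forbidden sequences, so bytes
        # coming from a legacy 8-bit encoding rarely pass this test.
        name.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy-8-bit"

# Example: guess_filename_encoding("résumé.txt".encode("utf-16-le")) returns
# "utf-16-le"; the same name encoded in ISO-8859-1 is reported as
# "legacy-8-bit", because 0xE9 followed by 0x73 is not valid UTF-8.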
If the encoding cannot be identified precisely and explicitly, I think that UTF-16 is much better than UTF-8 (and it also offers a better compromise in total size for names in any modern language). However, it's true that UTF-16 cannot be used on Linux/Unix filesystems because of the null bytes it produces. The alternative is then UTF-8, but it is often larger than legacy encodings.
An alternative can then be a mixed encoding selection, sketched in code after each of the two lists below. When creating files:
- choose a legacy encoding that will most often be able to represent valid filenames without loss of information (for example ISO-8859-1, or Cp1252).
- encode the filename with it.
- try to decode it with a *strict* UTF-8 decoder, as if it were UTF-8 encoded.
- if there's no failure, then you must reencode the filename with UTF-8 instead, even if the result is longer (otherwise the name would later be misread as UTF-8).
- if the strict UTF-8 decoding fails, you can keep the filename in the first 8-bit encoding...
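
A minimal Python sketch of that writing side, assuming ISO-8859-1 as the legacy encoding (the function name and the fallback for names outside the legacy repertoire are my own illustrative choices):

def encode_filename(name: str, legacy: str = "iso-8859-1") -> bytes:
    # Sketch of the writing side of the mixed scheme described above.
    try:
        raw = name.encode(legacy)
    except UnicodeEncodeError:
        # The legacy encoding cannot represent the name at all, so there is
        # no choice but to store it as UTF-8.
        return name.encode("utf-8")
    try:
        # If the legacy bytes would also pass a strict UTF-8 decoder, they
        # would be ambiguous for readers, so re-encode the name as UTF-8.
        raw.decode("utf-8")
        return name.encode("utf-8")
    except UnicodeDecodeError:
        # Unambiguous: a strict UTF-8 decoder will reject these bytes, and a
        # reader will correctly fall back to the legacy encoding.
        return raw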
When parsing files:
- try decoding the filename with *strict* UTF-8 rules. If this does not fail, then the filename was indeed encoded with UTF-8.
- if the strict UTF-8 decoding fails, decode the filename with the legacy 8-bit encoding.
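
And the complementary reading side, again only a sketch under the same assumptions:

def decode_filename(raw: bytes, legacy: str = "iso-8859-1") -> str:
    # Sketch of the reading side: strict UTF-8 first, then the legacy fallback.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)

With these two halves, decode_filename(encode_filename(name)) gives back the original name: any name whose legacy bytes would falsely pass the strict UTF-8 test has already been stored as UTF-8 by the writer.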
But even with this scheme, you will find interoperability problems, because some applications will expect only the legacy encoding, or only the UTF-8 encoding, without trying to decide between the two...

