Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> An alternative can then be a mixed encoding selection:
> - choose a legacy encoding that will most often be able to represent valid filenames without loss of information (for example ISO-8859-1, or Cp1252).
> - encode the filename with it.
> - try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8 encoded.
> - if there's no failure, then you must reencode the filename with UTF-8 instead, even if the result is longer.
> - if the strict UTF-8 decoding fails, you can keep the filename in the first 8-bit encoding...
>
> When parsing files:
> - try decoding filenames with *strict* UTF-8 rules. If this does not fail, then the filename was effectively encoded with UTF-8.
> - if the decoding failed, decode the filename with the legacy 8-bit encoding.
>
> But even with this scheme, you will find interoperability problems because some applications will only expect the legacy encoding, or only the UTF-8 encoding, without deciding...
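[A minimal sketch of the scheme Philippe describes, in Python, assuming ISO-8859-1 (Latin-1) as the legacy encoding; the function names encode_filename/decode_filename are illustrative, not from any real filesystem API:]

def encode_filename(name: str) -> bytes:
    """Prefer the legacy encoding, but fall back to UTF-8 whenever the
    legacy bytes would also parse as strict UTF-8."""
    try:
        legacy = name.encode("latin-1")
    except UnicodeEncodeError:
        # Characters outside Latin-1 can only be stored as UTF-8.
        return name.encode("utf-8")
    try:
        # Strict UTF-8 decode of the legacy bytes: if this succeeds, a
        # reader could not tell the two encodings apart, so store UTF-8
        # instead, even though it is longer.
        legacy.decode("utf-8")
    except UnicodeDecodeError:
        # Legacy bytes are not valid UTF-8, so they are unambiguous.
        return legacy
    return name.encode("utf-8")

def decode_filename(raw: bytes) -> str:
    """Try strict UTF-8 first; on failure, fall back to the legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

[With these rules, for example, the code point sequence <U+0099 U+00C9> stays in Latin-1, since its bytes 99 C9 are not valid UTF-8, while the reverse order <U+00C9 U+0099> must be stored as UTF-8, because its Latin-1 bytes C9 99 would also decode as U+0259 -- the context dependence mentioned below.]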
This technique was described as "adaptive UTF-8" by Dan Oscarsson in August 1998:

http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML012/0738.html

although he did not go as far as Philippe does in actually checking the "adaptively" encoded string to make sure it would be decoded correctly.

All the same, it was decided not to go this route, partly because the auto-detection capability of UTF-8 would be lost, partly because having multiple context-dependent encodings of the same code points would have been a Bad Thing (<99 C9> could be encoded adaptively, but <C9 99> could not), and partly for the reason Philippe mentions -- most existing decoders would expect either Latin-1 or UTF-8, and would choke if handed a mixture of the two.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/