Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> An alternative can then be a mixed encoding selection:
> - choose a legacy encoding that will most often be able to represent valid filenames without loss of information (for example ISO-8859-1, or Cp1252).
> - encode the filename with it.
> - try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8 encoded.
> - if there's no failure, then you must reencode the filename with UTF-8 instead, even if the result is longer.
> - if the strict UTF-8 decoding fails, you can keep the filename in the first 8-bit encoding...
>
> When parsing files:
> - try decoding filenames with *strict* UTF-8 rules. If this does not fail, then the filename was effectively encoded with UTF-8.
> - if the decoding failed, decode the filename with the legacy 8-bit encoding.
>
> But even with this scheme, you will find interoperability problems because some applications will only expect the legacy encoding, or only the UTF-8 encoding, without deciding...
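[A minimal sketch of the scheme Philippe describes, in Python, assuming ISO-8859-1 (Latin-1) as the legacy encoding; the function names encode_filename/decode_filename are illustrative, not from any real filesystem API:]

def encode_filename(name: str) -> bytes:
    """Prefer the legacy encoding, but fall back to UTF-8 whenever the
    legacy bytes would also parse as strict UTF-8."""
    try:
        legacy = name.encode("latin-1")
    except UnicodeEncodeError:
        # Characters outside Latin-1 can only be stored as UTF-8.
        return name.encode("utf-8")
    try:
        # Strict UTF-8 decode of the legacy bytes: if this succeeds, a
        # reader could not tell the two encodings apart, so store UTF-8
        # instead, even though it is longer.
        legacy.decode("utf-8")
    except UnicodeDecodeError:
        # Legacy bytes are not valid UTF-8, so they are unambiguous.
        return legacy
    return name.encode("utf-8")

def decode_filename(raw: bytes) -> str:
    """Try strict UTF-8 first; on failure, fall back to the legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

[With these rules, for example, the code point sequence <U+0099 U+00C9> stays in Latin-1, since its bytes 99 C9 are not valid UTF-8, while the reverse order <U+00C9 U+0099> must be stored as UTF-8, because its Latin-1 bytes C9 99 would also decode as U+0259 -- the context dependence mentioned below.]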
This technique was described as "adaptive UTF-8" by Dan Oscarsson in August 1998:

http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML012/0738.html

although he did not go as far as Philippe does in actually checking the "adaptively" encoded string to make sure it would be decoded correctly.

All the same, it was decided not to go this route, partly because the auto-detection capability of UTF-8 would be lost, partly because having multiple context-dependent encodings of the same code points would have been a Bad Thing (<99 C9> could be encoded adaptively, but <C9 99> could not), and partly for the reason Philippe mentions -- most existing decoders would expect either Latin-1 or UTF-8, and would choke if handed a mixture of the two.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/