On Nov 30, 2007 9:56 PM, Asmus Freytag <[EMAIL PROTECTED]> wrote:
> One thing to realize is that in UTF-8 you can never have a single
> non-ASCII byte. It's only ever two or more in sequence. However, most
> European languages that use non-ASCII characters typically do so with
> single non-ASCII characters.
> (Non-European languages normally can't be represented in CP437, so we
> don't worry about them.)
>
> Your example, which I've copied below (I hope it comes through on the
> repost), shows this effect quite nicely.
>
> In addition, the valid ranges of first and following bytes in multi-byte
> sequences of non-ASCII UTF-8 bytes are restricted, making it even harder
> for a random pattern to be valid UTF-8.
>
> As a result, if a filename is legal UTF-8, it is highly unlikely that it
> could have a reasonable alternative interpretation in CP437. Therefore,
> treating the archive as corrupt if it contains non-ASCII bytes would
> seem a bit draconian for many types of uses. (Your case may be special.)
>
> If, on the other hand, there's ever any doubt as to which *single-byte*
> character set you are dealing with, i.e. if you find systems that use
> neither CP437 nor UTF-8 but some third encoding, then I'd recommend your
> approach, because discriminating between different single-byte character
> sets is something that ranges from impractical to impossible.
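
For anyone reading this thread later, the heuristic described above might be sketched roughly as follows. This is only an illustration in Python, not code from either poster: the helper name decode_archive_filename is hypothetical, and it assumes the only two candidate encodings are UTF-8 and CP437.

```python
def decode_archive_filename(raw: bytes) -> str:
    """Guess the encoding of a raw filename (hypothetical helper).

    Assumption: the name is either UTF-8 or CP437, nothing else.
    """
    try:
        # Strict decoding fails on a lone non-ASCII byte and on lead or
        # continuation bytes outside their permitted ranges.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to CP437, in which every byte
        # value maps to some character.
        return raw.decode("cp437")


# "é" is the single byte 0x82 in CP437 and the two bytes 0xC3 0xA9 in UTF-8.
print(decode_archive_filename(b"caf\x82.txt"))
print(decode_archive_filename(b"caf\xc3\xa9.txt"))
```

Pure ASCII names decode identically either way, so they need no special handling. As the quoted message notes, this kind of guess stops being reliable once a third single-byte character set is in play.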

Thanks for this info; it's been very useful.

Kind regards,
--
Marcos Caceres
http://datadriven.com.au
