RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

Lars Kristan Wed, 08 Dec 2004 02:43:01 -0800

> Needless to say, these systems were badly designed at their origin, and
> newer filesystems (and OS APIs) offer much better alternative, by either
> storing explicitly on volumes which encoding it uses, or by forcing all
> user-selected encodings to a common kernel encoding such as Unicode encoding
> schemes (this is what FAT32 and NTFS do on filenames created under Windows,
> since Windows 98 or NT).
>
The UNIX principle (which I also call variant) has the problem of not knowing the encoding.
The Windows principle (which I also call invariant) has the problem that it HAS to know the encoding.
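
As a rough illustration of the variant problem, the Python sketch below decodes one stored byte string under two different encodings. The file name bytes and the choice of ISO 8859-1 and ISO 8859-2 are assumptions made up for the example; nothing here is specific to any real filesystem.

```python
# The variant (UNIX-style) store keeps only bytes; the encoding is not recorded.
stored_name = b"\xe8aj.txt"  # hypothetical file name as stored on disk

# Two users with different locale settings interpret the same bytes differently.
as_latin1 = stored_name.decode("iso8859_1")  # Western European reading
as_latin2 = stored_name.decode("iso8859_2")  # Central European reading

# The bytes are identical, but the names the users see are not.
names_differ = as_latin1 != as_latin2

# Each reading still maps back to the same bytes in its own encoding,
# so the store itself is consistent; only the interpretation varies.
same_bytes = (
    as_latin1.encode("iso8859_1") == stored_name
    and as_latin2.encode("iso8859_2") == stored_name
)
```

An invariant store would have to decide, at the time the name is written, which of these two readings is meant. That is the sense in which it HAS to know the encoding.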

The Windows principle has another problem. It can store data from any encoding, and it does a good job of trying to represent that data in any encoding, but it cannot guarantee identification in just any encoding. An invariant store can be implemented as UTF-8 or UTF-16. Windows uses UTF-16, and guaranteed identification used to be possible only in UTF-16. Thanks to UTF-8, it can now also be done in 8-bit environments (console, telnet). But for some reason, support for UTF-8 is still limited in some areas, and the missing roundtrip capability may have something to do with it.
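
The roundtrip problem can be sketched in Python. This is only an illustration under assumptions: the byte string is a made-up legacy file name that is not valid UTF-8, and `surrogateescape` is Python's own error handler for carrying such bytes through a string, not something Windows or the Unicode standard provides.

```python
# A hypothetical legacy file name: 0xE9 followed by "." is not valid UTF-8.
raw = b"caf\xe9.txt"

# Strict UTF-8 decoding refuses the data outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding substitutes U+FFFD, so the original byte is gone for good.
lossy = raw.decode("utf-8", errors="replace")
lossy_roundtrip = lossy.encode("utf-8")

# Python's surrogateescape handler maps each undecodable byte to a lone
# surrogate code point, which can be turned back into the same byte later.
escaped = raw.decode("utf-8", errors="surrogateescape")
escaped_roundtrip = escaped.encode("utf-8", errors="surrogateescape")

lossy_is_exact = lossy_roundtrip == raw      # False: identification is lost
escaped_is_exact = escaped_roundtrip == raw  # True: the bytes survive
```

With the lossy conversion, two different invalid names can collapse into the same string, so the name no longer identifies the file. The escaped form keeps them distinct, at the cost of producing strings that are not valid Unicode text.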

I basically agree that the variant approach is not a good one. But the invariant one is not an easy path. It was easier for Windows to take it because, at the time the transition was made, those systems were still single-user. Hence, typically all data was in a single encoding.

Lars
