RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

Lars Kristan Tue, 07 Dec 2004 10:38:01 -0800

Doug Ewell wrote:

> John Cowan <jcowan at reutershealth dot com> wrote:
>
> > Windows filesystems do know what encoding they use. But a filename on
> > a Unix(oid) file system is a mere sequence of octets, of which only 00
> > and 2F are interpreted. (Filenames containing 20, and especially 0A,
> > are annoying to handle with standard tools, but not illegal.)
> >
> > How these octet sequences are translated to characters, if at all,
> > is no concern of the file system's. Some higher-level tools, such as
> > directory listers and shells, have hardwired assumptions, others have
> > changeable assumptions, but all are assumptions.
>
> OK, fair enough. Under a Unixoid file system, a file name consists of a
> more or less arbitrary sequence of bytes, essentially unregulated by the
> OS.
>
> If interpreted as UTF-8, some of these sequences may be invalid, and the
> files may be inaccessible.
>
> This is *exactly* the same scenario as with GB 2312, or Shift-JIS, or KS
> C 5601, or ISO 6937, or any other multibyte character encoding ever
> devised.
>
> This is not a problem that needs to be solved within Unicode, any more
> than it needed to be solved within those other encodings.
>
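
To make the quoted point concrete, here is a rough Python sketch. The helper names and the sample names are invented for illustration only. It treats a file name as raw octets, checks the only two octets the file system itself cares about (00 and 2F), and separately checks whether the same octets happen to decode as UTF-8.

```python
def is_legal_unix_name(name: bytes) -> bool:
    """A single path component may hold any octet except 0x00 and 0x2F ('/')."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


def is_valid_utf8(name: bytes) -> bool:
    """True if the octets decode as strict UTF-8; this is the reader's assumption,
    not something the file system enforces."""
    try:
        name.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample names: plain ASCII, UTF-8 'é', Latin-1 'é', embedded newline.
samples = [b"report.txt", b"caf\xc3\xa9.txt", b"caf\xe9.txt", b"line\nbreak"]

for raw in samples:
    print(raw, "legal:", is_legal_unix_name(raw), "utf-8:", is_valid_utf8(raw))
```

All four samples are legal names as far as the file system is concerned; only the Latin-1 one fails the UTF-8 check.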

Shift-JIS was typically not mixed with other encodings, except for pure 7-bit ASCII. UTF-8 will be. And Shift-JIS had other serious problems, like a trail byte that can be 0x5C, the ASCII backslash. UTF-8 has learned a lot from Shift-JIS. If there is still something to learn, then let's welcome that.
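
A small Python sketch of the backslash problem, for anyone who has not run into it. The function name is made up for this example, and the two characters are just well-known cases whose Shift-JIS trail byte is 0x5C. It lists the bytes below 0x80 that appear inside multi-byte characters, which is where byte-oriented tools can misread them as ASCII.

```python
def ascii_bytes_inside_multibyte(text: str, encoding: str) -> list:
    """Collect bytes below 0x80 that occur inside a multi-byte encoded character."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1:
            hits.extend(b for b in encoded if b < 0x80)
    return hits


# U+30BD (katakana SO) and U+8868: both end in byte 0x5C in Shift-JIS.
text = "\u30bd\u8868"

sjis_hits = ascii_bytes_inside_multibyte(text, "shift_jis")
utf8_hits = ascii_bytes_inside_multibyte(text, "utf-8")

print([hex(b) for b in sjis_hits])  # trail bytes that look like '\'
print([hex(b) for b in utf8_hits])  # empty: UTF-8 multi-byte sequences use only bytes >= 0x80
```

In UTF-8 the second list is always empty, whatever the input, since lead and continuation bytes all have the high bit set.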

Also, Shift-JIS (and other MBCS encodings) were a must for those cultures. UTF-8 is not a must. If there are problems, there will be complaints. And resistance.

Lars
