Re: Unicode and end users

Doug Ewell Mon, 18 Feb 2002 14:08:19 -0800

Lars Kristan <[EMAIL PROTECTED]> wrote:

> Most http servers have a functionality to display filesystem and allow
> changing directory and opening files. Hmmm, marking the generated html
> file as UTF-8 would be a no-no thing then, unless the server guarantees
> that there are no illegal sequences in it (caused by a Latin-1 filename).
> Too bad, cause I would hope I can enter a directory or open a file even
> if it is not displayed correctly. With characters dropped or replaced - I
> have no chance.


I think I'm starting to understand this better.  Your "illegal UTF-8
sequences" are really Latin-1 text embedded in a file or database presumed
to be UTF-8.

I would agree with John that, although the problem of mixed unmarked
charsets in a single document is real, there is nothing that UTF-8 or any
other charset can do to prevent it.

As a matter of fact, UTF-8 gives you as much of an advantage as you can
possibly hope for, *because* of the possibility of invalid sequences.  If
you know your document is part UTF-8 and part Latin-1, you can convert it
by applying the famous "Dan Oscarsson" heuristic of interpreting valid
UTF-8 sequences as UTF-8 and invalid sequences as Latin-1.  This is
strictly a no-no as far as Unicode conformance is concerned, but I
consider this an attempt to repair "mangled text" similar to the example
given in the note after C12 in Unicode 3.1.
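
For illustration, here is a minimal Python sketch of the heuristic described
above. The function name decode_mixed is made up for this example, and it
assumes the input really is a mix of valid UTF-8 and stray Latin-1 bytes. It
relies on Python's strict UTF-8 decoder to decide what counts as "invalid",
and treats each offending byte as one Latin-1 character.

```python
def decode_mixed(data: bytes) -> str:
    """Decode valid UTF-8 sequences as UTF-8 and anything else as Latin-1.

    A repair heuristic for mangled text, not a conformant UTF-8 decoder.
    """
    out = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left in one go.
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice; bytes before it are valid.
            bad = pos + err.start
            out.append(data[pos:bad].decode("utf-8"))
            # Reinterpret the single offending byte as Latin-1.
            out.append(data[bad:bad + 1].decode("latin-1"))
            pos = bad + 1
    return "".join(out)


# "café" once in UTF-8 and once in Latin-1, in the same byte string.
mixed = "caf\u00e9".encode("utf-8") + b" " + "caf\u00e9".encode("latin-1")
print(decode_mixed(mixed))
```

Bytes in the range 0x80-0x9F come out as C1 control characters under this
scheme, which may or may not be what you want if the "Latin-1" is really
Windows-1252.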

Yes, I know the heuristic breaks down in certain cases, such as the string
"NESTLÉ®" where the Latin-1 bytes for the last two characters (0xC9 0xAE)
spell out the UTF-8 representation of U+026E LATIN SMALL LETTER LEZH (ɮ).
I've used this example myself to show the danger of relying on heuristics.
You wouldn't want to play games like this unless you already know the text
is part (valid) UTF-8 and part Latin-1, as in this case, and are willing
to take the risk of an occasional bogus conversion.
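
The failure case is easy to reproduce. The short Python check below (variable
names are only for illustration) encodes the string as Latin-1 and then reads
the last two bytes back as UTF-8:

```python
import unicodedata

# "NESTLÉ®" written with escapes so the source file encoding does not matter.
latin1_bytes = "NESTL\u00c9\u00ae".encode("latin-1")

# The last two bytes are the Latin-1 codes for É (0xC9) and ® (0xAE).
tail = latin1_bytes[-2:]

# Read as UTF-8, the same two bytes form one valid two-byte sequence.
misread = tail.decode("utf-8")

print(tail.hex(), f"U+{ord(misread):04X}", unicodedata.name(misread))
```

Because the whole byte string happens to be valid UTF-8, a "valid means
UTF-8" heuristic has no way to notice that anything went wrong here.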

Just be glad you don't have to untangle a database with part Latin-1, part
CP437, and part HP Roman-8, as I once did.

-Doug Ewell
 Fullerton, California
