Lars Kristan <[EMAIL PROTECTED]> wrote:

> Most http servers have a functionality to display filesystem and allow
> changing directory and opening files. Hmmm, marking the generated html file
> as UTF-8 would be a no-no thing then, unless the server guarantees that
> there are no illegal sequences in it (caused by a Latin-1 filename). Too
> bad, cause I would hope I can enter a directory or open a file even if it is
> not displayed correctly. With characters dropped or replaced - I have no
> chance.

I think I'm starting to understand this better. Your "illegal UTF-8 sequences" are really Latin-1 text embedded in a file or database presumed to be UTF-8.

I would agree with John that, although the problem of mixed unmarked charsets in a single document is real, there is nothing that UTF-8 or any other charset can do to prevent it.

As a matter of fact, UTF-8 gives you as much of an advantage as you can possibly hope for, *because* of the possibility of invalid sequences. If you know your document is part UTF-8 and part Latin-1, you can convert it by applying the famous "Dan Oscarsson" heuristic of interpreting valid UTF-8 sequences as UTF-8 and invalid sequences as Latin-1. This is strictly a no-no as far as Unicode conformance is concerned, but I consider this an attempt to repair "mangled text" similar to the example given in the note after C12 in Unicode 3.1.

Yes, I know the heuristic breaks down in certain cases, such as the string "NESTLÉ®" where the Latin-1 bytes for the last two characters (0xC9 0xAE) spell out the UTF-8 representation of U+026E LATIN SMALL LETTER LEZH (ɮ). I've used this example myself to show the danger of relying on heuristics. You wouldn't want to play games like this unless you already know the text is part (valid) UTF-8 and part Latin-1, as in this case, and are willing to take the risk of an occasional bogus conversion.

Just be glad you don't have to untangle a database with part Latin-1, part CP437, and part HP Roman-8, as I once did.

-Doug Ewell
 Fullerton, California
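
For illustration only, the heuristic described above (valid UTF-8 sequences read as UTF-8, anything else read as Latin-1) could be sketched in Python roughly as follows. This is a minimal sketch, not a conformant decoder and not anyone's actual implementation. The handler name "latin1_fallback" and the function decode_mixed are hypothetical names chosen for this example; only codecs.register_error and the standard "utf-8" and "latin-1" codecs come from Python itself.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: reinterpret undecodable bytes as Latin-1.

    "latin1_fallback" is a hypothetical handler name used for this sketch.
    """
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # err.object is the full input; start/end delimit the bytes that the
    # UTF-8 decoder rejected.
    bad = err.object[err.start:err.end]
    # Every byte value maps to a code point in Latin-1, so this cannot fail.
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode bytes assumed to be part valid UTF-8 and part Latin-1."""
    return data.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    # A stray Latin-1 byte (0xE9) is recovered as U+00E9.
    print(decode_mixed(b"caf\xe9"))
    # The failure case from the text: the Latin-1 bytes 0xC9 0xAE happen to
    # be well-formed UTF-8 for U+026E, so they are not read as Latin-1.
    print(ascii(decode_mixed(b"NESTL\xc9\xae")))  # 'NESTL\u026e'
```

The second call shows the bogus conversion discussed above. The sketch cannot tell that those two bytes were meant as Latin-1, so it is only reasonable on text already known to be a mix of valid UTF-8 and Latin-1.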