Re: Unicode and end users

David Hopwood Tue, 19 Feb 2002 00:45:42 -0800

-----BEGIN PGP SIGNED MESSAGE-----

Lars Kristan wrote:
> John Cowan wrote:
> > Frankly, your problem is insoluble, because you have set up
> > self-contradictory requirements.  Suppose you are dealing with a
> > filesystem where some names are to be interpreted as Latin-1 and others
> > as Latin-2.  The kernel will give you absolutely no help about which
> > charset to use for which names.
>
> Oh well, I did not set up the requirements. They come pretty naturally.
> Everything works fine if I keep the database in UTF-8 (well, raw for UNIX)
> and use UTF-16 => UTF-8 for Windows filenames (sorry, didn't mention those
> so far).
> The same thing should work the other way around, store Windows filenames
> directly into a UTF-16 database and use UTF-8 => UTF-16 conversion for UNIX
> filenames. Hoping that some day most of the data will be UTF-8 makes this
> even more appealing.


If you convert to a different UTF, then unless the conversion is purely
internal and not visible to any other process, you're asserting that the
data is a valid Unicode string. That policy is necessary if the Unicode
Standard is going to enforce use of one-to-one UTFs (and bijective
transcoding between them) for security reasons.

Remember that the database should be treating ill-formed UTF-16 as
an error condition, if it is Unicode-conformant, so you should not
rely on it maintaining a distinction between different ill-formed UTF-16
sequences. That would be using ill-formed sequences to store data, which
would be poor design even if it were conformant to the Unicode Standard.
It's quite different from just minimising the harm done when a user loads
and saves a file under the incorrect assumption that it is UTF-8: the
latter is a good use of UTF-8B; the design you're suggesting is not.
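
To make the "load and save" case concrete, here is a minimal sketch in
Python. It assumes that Python's "surrogateescape" error handler (PEP 383)
is an acceptable stand-in for UTF-8B: it maps each undecodable byte to a
lone surrogate in U+DC80..U+DCFF and restores the byte on encoding. The
function names are made up for illustration.

```python
def load_as_utf8b(data: bytes) -> str:
    # Undecodable bytes become lone surrogates U+DC80..U+DCFF.
    return data.decode("utf-8", errors="surrogateescape")


def save_as_utf8b(text: str) -> bytes:
    # Lone surrogates in U+DC80..U+DCFF turn back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


def is_valid_unicode(text: str) -> bool:
    # A strict encoder refuses lone surrogates, so the in-memory string
    # must not be handed to another process as if it were Unicode.
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


latin1_file = "caf\u00e9".encode("latin-1")   # not valid UTF-8
in_memory = load_as_utf8b(latin1_file)
assert save_as_utf8b(in_memory) == latin1_file  # nothing was lost
assert not is_valid_unicode(in_memory)          # but it is not exportable
```

The round trip is harmless as long as the escaped form stays inside the
program; the strict check is what should guard every external interface.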

There are two ways to guarantee that any filename can be represented:
 - store the filenames as byte sequences (*not* using a UTF-8 type,
   since you can't guarantee they are UTF-8). Convert the Windows
   filenames from UTF-16 to UTF-8, so that they are consistent with most
   of the Unix filenames.
 - store bytes that are not valid UTF-8 using escapes, as described below
   for URIs.

Personally I would use URI-escaping for your database example. Of course
'%' in a filename would have to be escaped as "%25".
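
A rough sketch of that escaping for the database case, in Python. The
helper names and the "percent-escape" handler name are my own inventions;
the only library facilities used are codecs.register_error and re.sub.
Bytes that are not well-formed UTF-8 become %hh, a literal '%' becomes
"%25", and everything else is stored as ordinary text.

```python
import codecs
import re


def _percent_escape_errors(exc):
    # Called by the decoder for each run of bytes that is not valid UTF-8.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("%{:02X}".format(b) for b in bad), exc.end


# "percent-escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("percent-escape", _percent_escape_errors)


def escape_filename(raw: bytes) -> str:
    # Escape literal '%' first so that every '%' in the result starts an escape.
    return raw.replace(b"%", b"%25").decode("utf-8", errors="percent-escape")


def unescape_filename(stored: str) -> bytes:
    # Inverse mapping: %hh becomes one byte, everything else is UTF-8.
    return re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        stored.encode("utf-8"),
    )


name = b"caf\xe9 100%.txt"            # Latin-1 byte plus a literal '%'
stored = escape_filename(name)
assert unescape_filename(stored) == name
```

The stored value is always a valid Unicode string, so the database never
has to carry ill-formed data, and the original bytes can still be recovered.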

(I've just checked whether NTFS allows ill-formed UTF-16 filenames; it does,
at least on NT4.0, but you could reasonably treat that as an error.)

> As for any data that is not - well, the original byte
> sequence can be reconstructed and a re-conversion can be done based on
> user's settings (or selection) at display time. All you need is UTF-8B
> conversion instead of UTF-8.
> 
> How about another question here:
> 
> Most http servers have a functionality to display filesystem and allow
> changing directory and opening files. Hmmm, marking the generated html file
> as UTF-8 would be a no-no thing then, unless the server guarantees that
> there are no illegal sequences in it (caused by a Latin-1 filename).

Marking the generated HTML as UTF-8 is fine; putting the raw bytes from
the filename into it without checking them is not. Doing so is nonconformant
to the Unicode Standard, because the server knows that the HTML file is
UTF-8 and that the filename might not be. It is also nonconformant to
HTML, and will presumably be nonconformant to the internationalised URI
specification.

The server can, however, produce a valid URI for the file by using %hh
escaping to encode any bytes that are not well-formed NFC-normalised UTF-8.
(The NFC normalisation is a requirement of HTML.) It should probably do
the same for the text of the link.
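
As an illustration only, a directory-listing entry could be generated
along these lines in Python. The function names are hypothetical. Note one
simplification: urllib.parse.quote escapes every non-ASCII byte in the
href, which is stricter than escaping only the bytes that are not
well-formed UTF-8, but it is still a valid URI for the same file.

```python
import html
import unicodedata
from urllib.parse import quote


def display_name(raw: bytes) -> str:
    # Keep well-formed UTF-8 as text, show every other byte as %hh.
    text = raw.decode("utf-8", errors="surrogateescape")
    parts = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:        # an escaped undecodable byte
            parts.append("%{:02X}".format(code - 0xDC00))
        elif ch == "%":
            parts.append("%25")
        else:
            parts.append(ch)
    # Normalise the visible text to NFC.
    return unicodedata.normalize("NFC", "".join(parts))


def listing_link(raw: bytes) -> str:
    href = quote(raw)                        # pure ASCII, %hh for the rest
    return '<a href="{}">{}</a>'.format(
        html.escape(href, quote=True), html.escape(display_name(raw))
    )


print(listing_link(b"caf\xe9 & co.txt"))
```

The href carries the exact bytes of the filename, so the file can still be
opened even when the link text cannot show the intended characters.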

> Too bad, cause I would hope I can enter a directory or open a file even
> if it is not displayed correctly. With characters dropped or replaced -
> I have no chance.

%hh escaping will work fine here - better than what you're suggesting, in
fact, because you have no guarantee of what the browser will do when it
sees ill-formed UTF-8.

> Suppose the characters are still there and the file was not marked as UTF-8
> (works as long as all other text is in English) and I selected UTF-8 myself,
> in the browser. You would say there is no way I would want to convert a
> portion of the displayed text to UTF-16?

Converting "%hh" to UTF-16 obviously works, and round-trips properly even
if the URI goes through non-Unicode charsets (say you paste it into an
e-mail message, for example).
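
A small Python check of that round trip, using urllib.parse.quote and
unquote_to_bytes from the standard library. The sample filename is made
up. Because the escaped URI is plain ASCII, transcoding it through UTF-16
or a legacy charset changes nothing, and the original bytes come back out
at the end.

```python
from urllib.parse import quote, unquote_to_bytes

raw_name = b"caf\xe9.txt"          # a Latin-1 filename, not valid UTF-8
uri_path = quote(raw_name)         # "caf%E9.txt", ASCII only

# Trip through UTF-16 (e.g. a Windows API) and back.
via_utf16 = uri_path.encode("utf-16-le").decode("utf-16-le")

# Trip through a non-Unicode charset (e.g. an e-mail in Latin-1) and back.
via_latin1 = via_utf16.encode("latin-1").decode("latin-1")

# The original bytes are recovered exactly.
assert unquote_to_bytes(via_latin1) == raw_name
```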

> Maybe I won't, maybe the system will, when I want to copy something to
> the clipboard...

The system clipboard is external to your program, so ill-formed UTF-16LE
(Windows) or UTF-8 (X-Windows) must not be cut/copied, and must be treated
as an error condition if it is pasted.
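
For the UTF-16 side, the check before copying could look roughly like this
in Python. It is a sketch that scans code units for unpaired surrogates;
checked_for_clipboard is a hypothetical wrapper, not a real clipboard API.

```python
def is_well_formed_utf16le(data: bytes) -> bool:
    # An odd number of bytes cannot be a sequence of 16-bit code units.
    if len(data) % 2:
        return False
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True


def checked_for_clipboard(data: bytes) -> bytes:
    # Hypothetical gate: refuse to export ill-formed UTF-16LE.
    if not is_well_formed_utf16le(data):
        raise ValueError("ill-formed UTF-16LE must not be copied")
    return data
```

The same gate would apply in the other direction: data arriving from a
paste is validated first and rejected if the check fails.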

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPHB/pDkCAxeYt5gVAQGKVwf+LijF89biTgBWWNEtGSff+DmJu5cq9pEk
9k50XPu2QUxjAIv6SdCKJ7U4/Bxc24EkQii/vaB0bpsu4Rq5GbNPfURCWBSqSM5X
vu1SgsmYgshsLNNgiDAqvbrURgfEHvbVVLHwilhhRuf/B7+yicpHkLyT2srtobi5
17taETHaSUxiVQAI5nC6IOx2wQ7PachYh3gRBgKjftWKrOSscm0ROyQTZEdMFcnG
WQa5kduNgmu4MUcUuE873f89GGBjHENgxSwB65LV2AUYOGmDIhMUBR5/AiZpd0WH
Mf627vqBj7PDfmS5WFInHPdr0iUwffW0h6g2xnflRyY3zObv+1dWUw==
=Mi67
-----END PGP SIGNATURE-----
