-----BEGIN PGP SIGNED MESSAGE-----

Lars Kristan wrote:
> John Cowan wrote:
> > Frankly, your problem is insoluble, because you have set up
> > self-contradictory requirements. Suppose you are dealing with a
> > filesystem where some names are to be interpreted as Latin-1 and others
> > as Latin-2. The kernel will give you absolutely no help about which
> > charset to use for which names.
>
> Oh well, I did not set up the requirements. They come pretty naturally.
> Everything works fine if I keep the database in UTF-8 (well, raw for UNIX)
> and use UTF-16 => UTF-8 for Windows filenames (sorry, didn't mention those
> so far).
> The same thing should work the other way around, store Windows filenames
> directly into a UTF-16 database and use UTF-8 => UTF-16 conversion for UNIX
> filenames. Hoping that some day most of the data will be UTF-8 makes this
> even more appealing.

If you convert to a different UTF, unless that is only done internally and
not visible to any other process, you're asserting that the data is a valid
Unicode string. That policy is necessary if the Unicode Standard is going to
enforce use of one-to-one UTFs (and bijective transcoding between them) for
security reasons.

Remember that the database should be treating the ill-formed UTF-16 as an
error condition, if it is Unicode-conformant, so you should not rely on it
maintaining a distinction between different ill-formed UTF-16 sequences.
That would be using ill-formed sequences to store data, which would be poor
design even if it was conformant to the Unicode Standard.

It's quite different to just minimising the harm done when a user loads and
saves a file under the incorrect assumption that it is UTF-8: the latter is
a good use of UTF-8B, the design you're suggesting is not.

There are two ways to guarantee that any filename can be represented:

 - store the filenames as byte sequences (*not* using a UTF-8 type, since
   you can't guarantee they are UTF-8). Convert the Windows filenames from
   UTF-16 to UTF-8, so that they are consistent with most of the Unix
   filenames.

 - store bytes that are not valid UTF-8 using escapes, as described below
   for URIs.

Personally I would use URI-escaping for your database example. Of course
'%' in a filename would have to be escaped as "%25".

(I've just checked whether NTFS allows ill-formed UTF-16 filenames; it does,
at least on NT4.0, but you could reasonably treat that as an error.)

> As for any data that is not - well, the original byte
> sequence can be reconstructed and a re-conversion can be done based on
> user's settings (or selection) at display time. All you need is UTF-8B
> conversion instead of UTF-8.
>
> How about another question here:
>
> Most http servers have a functionality to display filesystem and allow
> changing directory and opening files.
> Hmmm, marking the generated html file
> as UTF-8 would be a no-no thing then, unless the server guarantees that
> there are no illegal sequences in it (caused by a Latin-1 filename).

Marking the generated HTML as UTF-8 is fine; putting the raw bytes from the
filename into it without checking them is not. This is nonconformant to the
Unicode Standard, because the server knows that the HTML file is UTF-8 and
that the filename might not be. It is also nonconformant to HTML, and will
presumably be nonconformant to the internationalised URI specification.

The server can, however, produce a valid URI for the file by using %hh
escaping to encode any bytes that are not well-formed NFC-normalised UTF-8.
(The NFC normalisation is a requirement of HTML.) It should probably do the
same for the text of the link.

> Too bad, cause I would hope I can enter a directory or open a file even
> if it is not displayed correctly. With characters dropped or replaced -
> I have no chance.

%hh escaping will work fine here - better than what you're suggesting, in
fact, because you have no guarantee of what the browser will do when it sees
ill-formed UTF-8.

> Suppose the characters are still there and the file was not marked as UTF-8
> (works as long as all other text is in English) and I selected UTF-8 myself,
> in the browser. You would say there is no way I would want to convert a
> portion of the displayed text to UTF-16?

Converting "%hh" to UTF-16 obviously works, and round-trips properly even if
the URI goes through non-Unicode charsets (say you paste it into an e-mail
message, for example).

> Maybe I won't, maybe the system will, when I want to copy something to
> the clipboard...

The system clipboard is external to your program, so ill-formed UTF-16LE
(Windows) or UTF-8 (X-Windows) must not be cut/copied, and must be treated
as an error condition if it is pasted.
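
To make the %hh escaping discussed above concrete, here is a minimal Python
sketch for the database example. The function names escape_filename and
unescape_filename are hypothetical, chosen only for illustration. The sketch
assumes that well-formed UTF-8 may be stored as-is, that '%' becomes "%25",
and that every byte which is not part of a well-formed UTF-8 sequence becomes
%hh. It does not check NFC normalisation, and it does not escape valid
non-ASCII or reserved characters, both of which a real URI generator would
also have to handle. Python's "surrogateescape" error handler is used only
inside the function; no lone surrogate ever leaves it.

```python
HEX_DIGITS = "0123456789ABCDEFabcdef"


def escape_filename(raw: bytes) -> str:
    """Return a well-formed Unicode string that encodes `raw` losslessly."""
    # Internal step only: bytes that are not well-formed UTF-8 are mapped
    # to lone surrogates U+DC80..U+DCFF, which we replace immediately.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "%":
            out.append("%25")  # keep the escape character unambiguous
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append("%{:02X}".format(cp - 0xDC00))  # the invalid byte
        else:
            out.append(ch)
    return "".join(out)


def unescape_filename(escaped: str) -> bytes:
    """Invert escape_filename, recovering the original byte sequence."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i] == "%":
            digits = escaped[i + 1:i + 3]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("malformed %hh escape")
            out.append(int(digits, 16))
            i += 3
        else:
            # Raises UnicodeEncodeError if the input contains a surrogate,
            # i.e. ill-formed input is treated as an error condition.
            out.extend(escaped[i].encode("utf-8"))
            i += 1
    return bytes(out)


if __name__ == "__main__":
    latin1_name = b"caf\xe9 100%.txt"      # Latin-1 bytes, not valid UTF-8
    print(escape_filename(latin1_name))     # caf%E9 100%25.txt
```

Because the escaped form is an ordinary well-formed string, it can be stored
in a UTF-8 or UTF-16 database, or placed in HTML marked as UTF-8, without
relying on any component preserving ill-formed sequences.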

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPHB/pDkCAxeYt5gVAQGKVwf+LijF89biTgBWWNEtGSff+DmJu5cq9pEk
9k50XPu2QUxjAIv6SdCKJ7U4/Bxc24EkQii/vaB0bpsu4Rq5GbNPfURCWBSqSM5X
vu1SgsmYgshsLNNgiDAqvbrURgfEHvbVVLHwilhhRuf/B7+yicpHkLyT2srtobi5
17taETHaSUxiVQAI5nC6IOx2wQ7PachYh3gRBgKjftWKrOSscm0ROyQTZEdMFcnG
WQa5kduNgmu4MUcUuE873f89GGBjHENgxSwB65LV2AUYOGmDIhMUBR5/AiZpd0WH
Mf627vqBj7PDfmS5WFInHPdr0iUwffW0h6g2xnflRyY3zObv+1dWUw==
=Mi67
-----END PGP SIGNATURE-----