Title: RE: Nicest UTF

Theodore H. Smith wrote:

> What would be the nicest UTF to use?
>
> I think UTF8 would be the nicest UTF.
>

I agree, but not for the reasons you mentioned. There is one other important advantage: UTF-8 can be stored in a way that preserves invalid sequences. I will need to elaborate on that, of course.

1.1 - Let's suppose a perfect world where we decided to have only UTF-16 (perfect in its simplicity, not in its strategy). You have various 8-bit data from the non-perfect past. Any data for which the encoding is known is converted to Unicode. Any errors (invalid sequences, unmappable values) are replaced with U+FFFD and logged or reported.
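
To make this concrete, here is a rough Python sketch of 1.1 (the function name and the windows-1252 sample are mine, just for illustration):

    def to_unicode(raw: bytes, encoding: str) -> str:
        # Undecodable bytes become U+FFFD; the caller is told how many.
        decoded = raw.decode(encoding, errors="replace")
        bad = decoded.count("\ufffd")
        if bad:
            print("warning: %d byte(s) could not be converted" % bad)
        return decoded

    # 0x81 is undefined in windows-1252, so it turns into U+FFFD.
    print(to_unicode(b"na\xefve text \x81", "windows-1252"))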

1.2 - Any data for which the encoding is not known can only be stored in a UTF-16 database if it is converted. One needs to choose a default conversion (say Latin-1, since it is trivial). When a user finds out that the result is not appealing, the data has to be converted back to the original 8-bit sequence, and only then can the user (or an algorithm) try other encodings until the result looks right.
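
A rough sketch of that round trip, with a Python str standing in for the UTF-16 database field (Latin-1 never fails and reverses exactly; the cp1250 sample is just an example):

    raw = "čaj".encode("cp1250")           # some 8-bit data of unknown origin

    stored = raw.decode("latin-1")         # default conversion; displays as 'èaj'
    recovered = stored.encode("latin-1")   # exact round trip back to the bytes
    assert recovered == raw

    print(recovered.decode("cp1250"))      # second attempt, better guess: 'čaj'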

1.3 - One is tempted to use a heuristic algorithm right from the start. But if it makes a wrong decision, you first have to guess what it chose in order to undo it, and only then can you start searching for the correct conversion.

1.4 - I am assuming that storing the history of what was done is not possible or is impractical. There are cases where this assumption is more than valid. In addition to 1.3, there is an even more general problem: you don't know which data was converted using a good hint and which was converted using the default conversion. Once converted, the latter data may even look correct at first, if the conversion affected only a few characters.

1.5 - A better choice for the default conversion would be UTF-8 to UTF-16. If the data really is UTF-8, then we've got exactly what we wanted. For anything else, a lot of data will be lost (converted to many useless U+FFFD characters).
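
Roughly, with the same kind of sample data:

    good = "žluťoučký".encode("utf-8")
    bad  = "žluťoučký".encode("cp1250")    # legacy 8-bit data, not UTF-8

    # Genuine UTF-8 survives intact; anything else degrades into U+FFFD,
    # and (unlike with Latin-1) the original bytes cannot be recovered.
    print(good.decode("utf-8", errors="replace"))
    print(bad.decode("utf-8", errors="replace"))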

2.1 - In a better perfect world, we decide to have only UTF-8. Any data for which the encoding is known is converted to Unicode. Any errors (invalid sequences, unmappable values) are replaced with U+FFFD and logged or reported. This is the same as in the first world, except that the Unicode data is stored as UTF-8.

2.2 - Any data for which the encoding is not known can simply be stored as-is.

2.3 - Again, it is not advisable to attempt to determine the encoding unless the process can be made very reliable. Typically that requires larger chunks of data and may be impossible on small chunks, even if the process is human-assisted.
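
For illustration only, using the third-party chardet library as one example of a heuristic detector (the exact guesses and confidences will vary):

    import chardet                         # pip install chardet

    long_chunk  = ("Съешь же ещё этих мягких французских булок. " * 20).encode("cp1251")
    short_chunk = "ещё".encode("cp1251")

    # Confidence is usually high on the long chunk and low (or the guess
    # plain wrong) on the short one.
    print(chardet.detect(long_chunk))
    print(chardet.detect(short_chunk))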

2.4 - Any data that was stored as-is may contain invalid sequences, but they are stored as such, in their original form. It is therefore possible to raise an exception (an alert) when the data is retrieved, warning the user that additional caution is needed. That was not possible in 1.4.
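
Something along these lines (the function name is mine, just to illustrate the alert):

    def fetch_text(raw: bytes) -> str:
        # The raw bytes keep the invalid sequence untouched, so we can
        # detect it at retrieval time and warn instead of failing silently.
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print("warning: not valid UTF-8 at byte %d; showing a lossy view" % exc.start)
            return raw.decode("utf-8", errors="replace")

    print(fetch_text(b"plain ascii"))
    print(fetch_text(b"legacy \xe8aj"))    # triggers the warning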

3.1 - Unfortunately we live in neither of the two perfect worlds, which makes things even worse. A database on UNIX will typically be (or can be made to be) 8-bit clean, and is therefore perfectly able to handle UTF-8 data. On Windows, however, there is a lot of support for UTF-16, and trying to work in UTF-8 could prove to be a handicap, if not close to impossible.

3.2 - Adding more UTF-8 support to Windows is of course the right thing to do. But that takes time. And it just opens the possibility for everyone to make use of the superior UTF-8 format.

3.3 - For the record: other UTF formats CAN be made equally useful as UTF-8. It requires 128 codepoints (one for each of the byte values 0x80-0xFF that can occur in an invalid sequence). Back in 2002 I tried to convince people on the Unicode mailing list that this should be done, but failed. I am now using the PUA for this purpose. And I am even tempted to hope that nobody ever realizes the need for these 128 codepoints, because the moment they are standardized, all my data will become non-standard.
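
To make the idea concrete, here is a rough sketch of the mapping; the PUA range U+E080..U+E0FF is chosen arbitrarily for illustration:

    PUA_BASE = 0xE000   # so byte 0xNN maps to U+E0NN

    def decode_preserving(raw: bytes) -> str:
        # Valid UTF-8 decodes normally; each offending byte of an invalid
        # sequence is mapped to one of the 128 codepoints U+E080..U+E0FF.
        out = []
        i = 0
        while i < len(raw):
            try:
                out.append(raw[i:].decode("utf-8"))
                break
            except UnicodeDecodeError as exc:
                out.append(raw[i:i + exc.start].decode("utf-8"))
                for b in raw[i + exc.start:i + exc.end]:
                    out.append(chr(PUA_BASE + b))
                i += exc.end
        return "".join(out)

    def encode_restoring(text: str) -> bytes:
        # The reverse mapping restores the original bytes exactly.
        out = bytearray()
        for ch in text:
            if 0xE080 <= ord(ch) <= 0xE0FF:
                out.append(ord(ch) - PUA_BASE)
            else:
                out.extend(ch.encode("utf-8"))
        return bytes(out)

    raw = b"ok \xe8 broken"
    assert encode_restoring(decode_preserving(raw)) == raw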

4.1 - UTF-32 is probably very useful for certain string operations, changing case for example: you can do it in-place, like you could with ASCII. Perhaps it can even be done in UTF-8, I am not sure. But even if it is possible today, it is definitely not guaranteed to remain so, so one shouldn't rely on it.
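
As a narrow illustration, an ASCII-only upcase can at least be done in place on UTF-8, because the bytes of multi-byte sequences are always 0x80 or above (full Unicode case mapping, where lengths can change, is another matter):

    def upcase_ascii_inplace(buf: bytearray) -> None:
        # Touching only bytes 'a'..'z' can never corrupt a multi-byte
        # UTF-8 sequence, because those sequences use bytes >= 0x80.
        for i, b in enumerate(buf):
            if 0x61 <= b <= 0x7A:
                buf[i] = b - 0x20

    buf = bytearray("mixed ascii and čeština".encode("utf-8"))
    upcase_ascii_inplace(buf)
    print(buf.decode("utf-8"))             # MIXED ASCII AND čEšTINA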

4.2 - But UTF-8 is superior. You can make UTF-8 functions ignore invalid sequences and preserve them. As soon as you convert UTF-8 to anything else, though, the problems begin. You cannot preserve invalid sequences if you convert to UTF-16 (except by using unpaired surrogates). You can preserve them when converting to UTF-32, but that again means using undefined values (above 21 bits) and modifying the functions so they leave those values alone. Then again, if one is going to use such values, they should be standardized. And if so, why use the hyper-values at all, why not have them in Unicode?
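
As one concrete flavour of the unpaired-surrogate idea, Python's 'surrogateescape' error handler (PEP 383) maps each invalid byte to a lone low surrogate U+DC80..U+DCFF:

    raw = b"ok \xe8 broken"

    # Each invalid byte becomes a lone low surrogate (here U+DCE8), so the
    # conversion back to bytes is exact -- but the intermediate string is
    # not valid UTF-16/Unicode on its own.
    text = raw.decode("utf-8", errors="surrogateescape")
    print([hex(ord(c)) for c in text if ord(c) >= 0xD800])   # ['0xdce8']

    assert text.encode("utf-8", errors="surrogateescape") == raw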

5.1 - One could say that UTF-8 is inferior because it has invalid sequences to start with. But UTF-16 and UTF-32 also have invalid sequences and/or values. The beauty of UTF-8 is that it can coexist with legacy 8-bit data. One is tempted to think that all we need is to know what is old and what is new, and that this is also a benefit on its own. But that assumption is wrong: you will always come across chunks of data without any external attributes. And isn't that what 'plain text' is all about? To be plain and self-contained. Stateless. Is UTF-16 stateless if it needs the BOM? Is UTF-32LE stateless if we need to know that it is UTF-32LE? Unfortunately we won't be able to get rid of them, but I think they should not be used in data exchange, and not even for storage wherever possible. That is what I see as a long-term goal.


> Its too bad MicroSoft and Apple didn't realise the same, before they
> made their silly UCS-2 APIs.

I think UTF-8 didn't exist at the time they were making the decisions. Or am I wrong?


Lars
