On 06/04/2001 10:47:20 AM "Mark Davis" wrote:

>The best practice for that case is to enforce normalization on data fields
>*when the text is inserted in the field* . If one does, then canonical
>equivalents will compare as equal, whether they are encoded in UTF-8,
>UTF-8s, or UTF-16 (or, for that matter, BOCU).

The argument for UTF-8s is this:

premise: need to be able to compare sort results from different sources
premise: need to do fast sorting, hence use binary comparison
premise: need results of sorting from the different sources to be
comparable
premise: sources may be encoded using UTF-8 or UTF-16
fact: sorting by binary comparison yields different orderings for UTF-8 and
UTF-16
implication: if the results of sorting a source encoded in UTF-8 are to be
compared with results of sorting a source encoded in UTF-16, sorting those
sources by binary comparison does not yield comparable results

claim of proposal: using UTF-8s will yield comparable results
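
To make the "fact" premise concrete, here is a minimal sketch in Python (the
two characters are my own illustration, not anything taken from the
proposal). U+FF21 lies in the range U+E000..U+FFFF, U+10000 is a
supplementary character, and binary comparison of their encoded forms orders
them differently in UTF-8 and UTF-16:

    # Minimal sketch: the same two strings binary-sort differently in UTF-8
    # and UTF-16 (characters chosen purely for illustration).
    bmp  = "\uFF21"      # FULLWIDTH LATIN CAPITAL LETTER A, in U+E000..U+FFFF
    supp = "\U00010000"  # LINEAR B SYLLABLE B008 A, a supplementary character

    utf8_order  = sorted([bmp, supp], key=lambda s: s.encode("utf-8"))
    utf16_order = sorted([bmp, supp], key=lambda s: s.encode("utf-16-be"))

    print([hex(ord(s)) for s in utf8_order])   # ['0xff21', '0x10000']
    print([hex(ord(s)) for s in utf16_order])  # ['0x10000', '0xff21']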

The entire point of the proposal is to yield comparable results. I was
pointing out that generating comparable results also involves
normalisation. Example:

source A:
< i, m, a, g, e, s >
< 00e9, t, u, d, e >

source B:
< e, 0301, t, u, d, e >
< i, m, a, g, e, s >

The two sources contain canonically equivalent data, but because source B
holds the decomposed form, binary comparison places "étude" before "images"
in B and after it in A, whatever the encoding. Therefore, the UTF-8s proposal
does not completely solve the problem of yielding comparable results.
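
Here is a small sketch in Python of that example (my own code, not anything
from an actual implementation), showing that the binary orderings disagree
until the data is normalised:

    import unicodedata

    source_a = ["images", "\u00e9tude"]    # precomposed e-acute
    source_b = ["e\u0301tude", "images"]   # e + combining acute accent

    # Binary sort of the raw, unnormalised UTF-8 bytes: the sources disagree.
    print(sorted(source_a, key=lambda s: s.encode("utf-8")))  # étude last
    print(sorted(source_b, key=lambda s: s.encode("utf-8")))  # étude first

    # Normalising (NFC here) before sorting makes the two results agree.
    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(sorted(map(nfc, source_a), key=lambda s: s.encode("utf-8")))
    print(sorted(map(nfc, source_b), key=lambda s: s.encode("utf-8")))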

You are right in saying that the same problem exists for UTF-8 as well as
UTF-8s. That affects part of the point I was making: I suggested that a
UTF-8s proposal should also incorporate a discussion of the normalisation
issues, but insofar as the normalisation also has to be solved for UTF-8,
they are logically independent. The other part of my point, though, i.e.
that UTF-8s doesn't completely solve their problem, is still valid.

You are right in saying that this can be solved by normalising when the
data is created, and that that is probably more efficient. Now, I had
suggested that the ordering problem might also be addressed as the data is
normalised. If the data is normalised as it is entered into a table, then
doing both at once would entail encoding in normalised UTF-8s. Obviously, I
don't want to maintain that position, so I will modify the suggestion: the
ordering problem can be solved at the same time as normalisation, provided
the normalisation is done at the time of comparison. Clearly, not everyone
will want to do that.
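
To illustrate what I mean by doing both at comparison time, a rough sketch in
Python (my own, and it assumes UTF-16 binary order is the target ordering):

    import unicodedata

    # Normalise each key and compare in UTF-16 binary order, so results from
    # UTF-8 and UTF-16 sources come out the same without needing UTF-8s.
    def sort_key(s):
        return unicodedata.normalize("NFC", s).encode("utf-16-be")

    rows = ["\U00010000", "images", "e\u0301tude"]
    print(sorted(rows, key=sort_key))
    # 'images' first, then 'étude' (the decomposed form sorts as its
    # precomposed equivalent), then the supplementary character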

Now, someone might say, "Well, why are you now backing off? This shows that
UTF-8s is needed." That has not been demonstrated. Exactly the same
arguments for and against UTF-8s still remain; all that has changed is that
my argument that "solving ordering at the time of normalisation eliminates
the need for UTF-8s" has been shown to be invalid. My argument that UTF-8s
does not completely solve the problems that need to be addressed still
stands.

There is yet another problem that needs to be considered in an open system:
in an earlier post, I pointed out that if a source is sending data in some
variant of UTF-8, you need to know which it is. Doug Ewell's observations
are relevant here. Now Mark responded to Doug's comments:

>UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
>supplementary character expressed with two 3-byte values, you know you do
>not have pure UTF-8. If you ever encounter a supplementary character
>expressed with a 4-byte value, you know you don't have pure UTF-8s. If you
>never encounter either one, why does it matter? Every character you read is
>valid and correct.

The only problem is that if you encounter a 4-byte value 80% of the way
through the stream, it may require you to reprocess the entire stream (if
the initial 80% contained characters in the range U+E000..U+FFFF, then the
comparison of sorting results within that 80% may be affected). But one of
the premises here was that this has to be done fast. The hit involved in
reprocessing that 80% may be much more serious than dealing with the
ordering difference between UTF-8 and UTF-16 as the data is received.
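
A rough sketch in Python of that detection problem (again my own
illustration; it assumes the incoming byte keys are well-formed in either
UTF-8 or UTF-8s). The variant only reveals itself when a supplementary
character finally shows up, and reprocessing matters only if characters in
the range U+E000..U+FFFF have already gone by:

    def scan_keys(keys):
        """keys: iterable of byte strings in an as-yet-unknown UTF-8 variant."""
        saw_e000_ffff = False
        for n, key in enumerate(keys):
            i = 0
            while i < len(key):
                b = key[i]
                if 0xF0 <= b <= 0xF4:
                    # 4-byte sequence: a supplementary character in real UTF-8.
                    # If earlier keys held U+E000..U+FFFF, the binary order seen
                    # so far does not match UTF-16 order: reprocess those keys.
                    return ("UTF-8", n, saw_e000_ffff)
                if b == 0xED and i + 1 < len(key) and key[i + 1] >= 0xA0:
                    # 3-byte encoding of a surrogate: UTF-8s-style data.
                    return ("UTF-8s", n, saw_e000_ffff)
                if b in (0xEE, 0xEF):
                    saw_e000_ffff = True   # lead byte for U+E000..U+FFFF
                # advance to the next code point
                i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        return ("undetermined", None, saw_e000_ffff)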

Now, I see that there's another problem that has to be considered: when
you're receiving data from that other source, you don't necessarily know if
it has been normalised. Does the SQL standard say anything about that? (I
gather not.) Therefore, one of three things is needed:

a) the receiver has to deal with normalisation (however inefficient that
may be)
b) there needs to be a way for the sender to inform the receiver that the
data is already normalised (but then, can the receiver trust the sender to
be telling the truth and to have normalised properly?)
c) the SQL standard needs to be updated to require data to be in a specific
normal form (again, there's that trust issue, but I guess that's true with
other things as well)
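
For what it's worth, option a) is straightforward to sketch in Python (my own
illustration; NFC is chosen arbitrarily here), the cost being a decode,
normalise and re-encode on everything received:

    import unicodedata

    # The receiver normalises incoming data itself before relying on binary
    # comparison, rather than trusting the sender to have done so.
    def receive(raw):
        text = raw.decode("utf-8")   # or whichever variant the sender uses
        return unicodedata.normalize("NFC", text).encode("utf-8")

    print(receive("e\u0301tude".encode("utf-8")))   # b'\xc3\xa9tude'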

I'm guessing that currently c doesn't apply (the SQL standard doesn't
require one normal form) and that b is not implemented in open systems.
Therefore, the proponents of the UTF-8s proposal currently either need to
normalise the data on the receiving end, or they need to be doing so in the
data tables as the data is generated and then assuming that to be the case
as the data is transmitted. (If they're not doing either, then the
comparisons they're doing may not be valid.) But the latter situation
amounts to a proprietary, closed system. Clearly, that's something I would
think they'd also want to correct.


- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>


