Thanks. That's Markus's invention.


----- Original Message -----
From: "Carl W. Brown" <[EMAIL PROTECTED]>
Sent: Wednesday, June 06, 2001 11:08
Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

> Mark,
> I like the clever ICU technique for sorting in code point order.
> U_CAPI int32_t U_EXPORT2
> u_strcmpCodePointOrder(const UChar *s1, const UChar *s2) {
>     /* indexed by the top 5 bits of a code unit (c>>11) */
>     static const UChar utf16Fixup[32]={
>         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>         0x2000,                        /* 0xd800..0xdfff -> 0xf800..0xffff */
>         0xf800, 0xf800, 0xf800, 0xf800 /* 0xe000..0xffff -> 0xd800..0xf7ff */
>     };
>     UChar c1, c2;
>     int32_t diff;
>     /* rotate each code unit's value so that surrogates get the highest values */
>     for(;;) {
>         c1=*s1;
>         c1+=utf16Fixup[c1>>11]; /* additional "fix-up" line */
>         c2=*s2;
>         c2+=utf16Fixup[c2>>11]; /* additional "fix-up" line */
>         /* now c1 and c2 are in UTF-32-compatible order */
>         diff=(int32_t)c1-(int32_t)c2;
>         if(diff!=0 || c1==0 /* redundant: || c2==0 */) {
>             return diff;
>         }
>         ++s1;
>         ++s2;
>     }
> }
> The surrogates are shifted up to the high end of the sorting sequence and
> the code points higher than the surrogates are shifted down.  This is a
> low overhead technique that might be included in the Unicode
> Using this technique avoids the need for UTF-8s.  Using this type of
> means that UTF-16 (compared in codepoint order) has the same sorting
> sequence as UTF-8 and UTF-32.  This code preserves the UTF-16 data typing.
> UChar is an unsigned 16 bit integer.
> If you did not want to preserve the unsigned integer you could just add a
> correction factor to the surrogates to make them higher than 0x0000FFFF.
> This would also make them sort higher than the rest of the code points but
> I don't think it would have any less overhead.
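The correction-factor alternative described above can be sketched roughly as follows. This is only an illustration, not ICU code: the names `sort_key` and `compare_code_point_order` are hypothetical, `uint16_t` stands in for UChar, and the offset 0x2800 is an assumption (any offset that lifts 0xD800 to 0x10000 or above would do).

```c
#include <stdint.h>

/* Hypothetical helper: widen a UTF-16 code unit to a 32-bit sort key. */
static int32_t sort_key(uint16_t c) {
    /* Surrogate code units occupy 0xD800..0xDFFF. */
    if ((c & 0xF800) == 0xD800) {
        /* Assumed correction factor: 0xD800 + 0x2800 == 0x10000. */
        return (int32_t)c + 0x2800;
    }
    return (int32_t)c;
}

/* Hypothetical name: compare NUL-terminated UTF-16 strings so that
   well-formed input sorts in code point order. */
int32_t compare_code_point_order(const uint16_t *s1, const uint16_t *s2) {
    for (;;) {
        int32_t k1 = sort_key(*s1);
        int32_t k2 = sort_key(*s2);
        if (k1 != k2 || *s1 == 0) {
            return k1 - k2;
        }
        ++s1;
        ++s2;
    }
}
```

With this sketch, the single unit 0xFF5E compares lower than the surrogate pair 0xD800 0xDC00 (U+10000), which is the reverse of a plain binary UTF-16 compare.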
> The point is that there are techniques that are faster than converting to
> UTF-32, that add very little overhead, and that "do the right thing".  All
> should sort in standard Unicode code point order regardless of encoding.
> This way everyone is reading from the same page.
> Carl
> Note this code fragment is from ICU.  This is Open Source code.  See
> for further details.
> -----Original Message-----
> Behalf Of Carl W. Brown
> Sent: Tuesday, June 05, 2001 11:09 AM
> Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
> Mark,
> Now I understand.
> If they implement a UTF-16 strcmp function that is a case sensitive version
> of a UTF-16 strcasecmp (stricmp) you will get the same result as a UTF-8 or
> UTF-32 compare.  To me, it seems like this is the way to go.
> Normally a strcmp function just loops through the strings comparing them
> character by character.  If the loop checks for surrogates and compares
> UTF-32 code points you will always get the same result for all encodings:
> standard Unicode code point order.
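A loop of the kind described here might look like the sketch below. It is not taken from any library: `next_code_point` and `utf16_strcmp_decoded` are hypothetical names, `uint16_t` stands in for a UTF-16 code unit type, and unpaired surrogates are simply passed through with their own value, which is an assumption rather than a recommendation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read one code point starting at s[*i], advance *i. */
static uint32_t next_code_point(const uint16_t *s, size_t *i) {
    uint32_t c = s[(*i)++];
    if (c >= 0xD800 && c <= 0xDBFF) {          /* lead surrogate */
        uint32_t c2 = s[*i];
        if (c2 >= 0xDC00 && c2 <= 0xDFFF) {    /* followed by a trail surrogate */
            ++*i;
            return 0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00);
        }
    }
    return c;  /* BMP code unit, NUL, or unpaired surrogate */
}

/* Hypothetical name: compare NUL-terminated UTF-16 strings by code point. */
int utf16_strcmp_decoded(const uint16_t *s1, const uint16_t *s2) {
    size_t i1 = 0, i2 = 0;
    for (;;) {
        uint32_t c1 = next_code_point(s1, &i1);
        uint32_t c2 = next_code_point(s2, &i2);
        if (c1 != c2) {
            return c1 < c2 ? -1 : 1;
        }
        if (c1 == 0) {
            return 0;
        }
    }
}
```

This does more work per character than a fix-up of the code units, since every surrogate pair is fully decoded before it is compared.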
> Ultimately this is the "do it right the first time" way of implementing
> Unicode.
> Carl
> -----Original Message-----
> From: Mark Davis [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 04, 2001 9:23 PM
> To: Carl W. Brown; [EMAIL PROTECTED]
> Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
> Nobody has ever proposed binary compares between UTF-8 and UTF-16 strings.
> The scenario is:
> Client software uses UTF-16.
> Database software uses UTF-8s.
> Client wants to have string A < string B if and only if Database has A < B
> (where A and B are in the respective client/database encodings).
> The point of standardization (for those who favor it) is that you can then
> properly tag the data in the database when transferring it between
> systems (instead of either incorrectly tagging it as UTF-8, or correctly
> tagging it with a private name -- but one that other people don't
> understand).
> I don't think the companies in favor of UTF-8s are trying to avoid
> supporting supplementary characters at all. They see it (rightly or wrongly)
> as a way to solve a problem they have in this scenario without a
> hit.
> Mark
> ----- Original Message -----
> From: "Carl W. Brown" <[EMAIL PROTECTED]>
> Sent: Monday, June 04, 2001 12:55
> Subject: RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
> > Mark,
> >
> > I think that I am missing some point.  From what I hear the issue is that
> > they want a way to support identical compares.  This order is not important.
> > What is important is that they collate the same.
> >
> > Point #1 - I don't understand why this is a standards issue.  The way you
> > build keys is an internal design issue.  You can use BOCU or whatever.
> >
> > Point #2 - You cannot do binary compares between UTF-16 and UTF-8.
> > You must:
> >
> > Use UTF-16 for all keys
> > Use UTF-8 for all keys
> > Convert all UTF-16 keys to UTF-8 for compares
> > Convert all UTF-8 keys to UTF-16 for compares
> >
> > For one of the first two cases there is no issue.
> >
> > For the second two you must convert.  If you look at the total
> > overhead of converting two UCS-2 characters to two UTF-8s characters it is
> > likely to be less overhead to convert a pair of UTF-16 surrogates to a
> > single UTF-8 character or to convert a UTF-8 character to a pair of
> > surrogates.
> >
> >
> >
> > This leaves me very confused as to the reason for requesting UTF-8s.  The only
> > other reason that comes to mind is the "red herring" reason.  If they told
> > you the real reason you would never approve it.
> >
> > I know you are familiar with the efforts to upgrade ICU to support
> > It was not easy and some of the situations were very subtle.  One
> > problem is the issue of what is a character.  The nice 1 to 1 mapping in
> > is gone.  UTF-16 is now just another MBCS with all of its inherent problems.
> >
> > It becomes very tempting for a developer whose software may not be as
> > well organized as ICU to decide to foist the problem of UTF-16 back on
> > the user and the OS by ignoring surrogates altogether.  If they support
> > UTF-8 then they have a problem because they cannot just ignore
> > surrogates.  If the Unicode Consortium legitimizes UTF-8s then they can
> > make it someone else's problem.  It puts them in a position to compel
> > others to add UTF-8s support because it is a sanctioned form of Unicode.
> >
> >
> >
> > If you endorse UTF-8s then please set up some restrictions as to its use.
> >
> > 1) All interfaces supporting UTF-8s must also support UTF-8.
> >
> > 2) All data passed to an interface or stored by a system using UTF-8s must
> > be retrievable in UTF-8 format.
> >
> > 3) All data stored in UTF-8s must be retrievable with UTF-8 keys.
> >
> > If this is not done you will end up bifurcating UTF-8 use.  If I buy one
> > component using UTF-8 and another using UTF-8s the end user will have a
> > real mess on their hands converting back and forth and dealing with
> > Unicode in two sorting sequences depending on the interface.
> >
> > We might as well be asking the user to work in code pages again.  It is
> > like designing applications that are required to support both Shift JIS and
> > simultaneously.
> >
> > Carl
> >
> > -----Original Message-----
> > Behalf Of Mark Davis
> > Sent: Monday, June 04, 2001 8:47 AM
> > Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
> >
> >
> > I am not, myself, in favor of UTF-8s. However, I do want to point out a
> > few things.
> >
> > 1) Normalization does not particularly favor one side or the other.
> >
> > A binary compare is used because of performance, typically when you don't
> > care about the internal ordering from an international perspective (such
> > as a B-Tree for file systems). It does not prevent you from later imposing
> > a localized sort order (e.g. when the files are displayed in a window,
> > they can be sorted by name (or date, or author, etc.) at that time).
> >
> > For performance reasons, in that case it is simply not a good idea to do
> > normalization when you compare. You are choosing a binary compare simply
> > because it is a fast, well-defined comparison operation. Invoking
> > normalization at comparison time will defeat one of the goals. While
> > normalization at comparison can be pretty fast (only take the slow path
> > when the Quickcheck fails -- as described in #15), it will never be
> > anywhere near as fast as binary compare.
> >
> > The best practice for that case is to enforce normalization on data
> > *when the text is inserted in the field*. If one does, then canonical
> > equivalents will compare as equal, whether they are encoded in UTF-8,
> > UTF-8s, or UTF-16 (or, for that matter, BOCU).
> >
> > 2. Auto-detection does not particularly favor one side or the other.
> >
> > UTF-8 and UTF-8s are strictly non-overlapping. If you ever encounter a
> > supplementary character expressed with two 3-byte values, you know you do
> > not have pure UTF-8. If you ever encounter a supplementary character
> > expressed with a 4-byte value, you know you don't have pure UTF-8s. If you
> > never encounter either one, why does it matter? Every character you read
> > is valid and correct.
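The distinction drawn in this paragraph can be sketched as a simple byte scan. The sketch below is illustrative only: the enum and the function `classify` are hypothetical names, the input is assumed to be otherwise well-formed (this is not a validator), and it follows the paragraph's premise that a lead/trail surrogate pair written as two 3-byte sequences does not occur in pure UTF-8.

```c
#include <stddef.h>

/* Hypothetical result type for the scan. */
enum utf8_flavor { FLAVOR_EITHER, FLAVOR_UTF8, FLAVOR_UTF8S, FLAVOR_NEITHER };

/* Hypothetical name: decide which form a byte buffer can be, based only on
   how supplementary characters are expressed. */
enum utf8_flavor classify(const unsigned char *p, size_t n) {
    int saw_four_byte = 0;       /* F0..F4 lead byte: 4-byte form */
    int saw_surrogate_pair = 0;  /* ED A0..AF xx ED B0..BF xx: two 3-byte values */
    for (size_t i = 0; i < n; ++i) {
        if (p[i] >= 0xF0 && p[i] <= 0xF4) {
            saw_four_byte = 1;
        }
        if (i + 5 < n &&
            p[i] == 0xED && (p[i + 1] & 0xF0) == 0xA0 &&
            p[i + 3] == 0xED && (p[i + 4] & 0xF0) == 0xB0) {
            saw_surrogate_pair = 1;
        }
    }
    if (saw_four_byte && saw_surrogate_pair) return FLAVOR_NEITHER;
    if (saw_four_byte) return FLAVOR_UTF8;
    if (saw_surrogate_pair) return FLAVOR_UTF8S;
    return FLAVOR_EITHER;  /* no supplementary characters seen */
}
```

Text with no supplementary characters comes back as `FLAVOR_EITHER`, which matches the point that in that case the distinction does not matter.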
> >
> > Auto-detection works on the basis of statistical probability. With
> > sufficient non-ASCII characters, the chance that text obeys the UTF-8
> > restrictions and is not UTF-8 is very low (see Martin Duerst's messages on
> > this from some time ago*). Essentially the same is true of UTF-8s.
> >
> > Mark
> >
> > * Martin, it'd be nice to resurrect your note into one of the Unicode
> >
> > ----- Original Message -----
> > Sent: Monday, June 04, 2001 00:10
> > Subject: Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)
> >
> >
> > > In a message dated 2001-06-03 18:04:17 Pacific Daylight Time,
> > > [EMAIL PROTECTED] writes:
> > >
> > > >  It would seem to me that there's
> > > >  another issue that has to be taken into consideration here:
> > > >  normalisation.  You can't just do a simple sort using raw binary
> > > >  comparison; you have to normalise strings before you compare them,
> > > >  even if the comparison is a binary compare.
> > >
> > > I would be surprised if that has even been considered.  Normalization is
> > > one of those fine details of Unicode, like directionality and character
> > > properties, that may be completely unknown to a development team that thinks
> > > the strict binary order of UTF-16 code points makes a suitable
> > > order.  This is a sign of a company or development team that thinks Unicode
> > > support is a simple matter of handling 16-bit characters instead of 8-bit.
> > >
> > > While we are at it, here's another argument against the existence of
> > > both UTF-8 and this new UTF-8s.  Recently there was a discussion about
> > > the use of the U+FEFF signature in UTF-8 files, with a fair number of
> > > Unicode experts arguing against its necessity because UTF-8 is so easy
> > > to detect heuristically.  Without reopening that debate, it is worth
> > > noting that UTF-8s could not be distinguished from UTF-8 by that
> > > technique.  By D29, UTF-8s must support encoding of unpaired surrogates
> > > (as UTF-8 already does), and thus a UTF-8s sequence like
> > > ED A0 80 ED B0 80 could ambiguously represent either the two unpaired
> > > surrogates U+D800 U+DC00 or the legitimate Unicode code point U+10000.
> > > Such a sequence -- the only difference between UTF-8 and UTF-8s -- could
> > > appear in either encoding, but with different interpretations, so
> > > auto-detection would not work.
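To make the byte sequence in this example concrete, the sketch below encodes one supplementary code point both ways. It is an illustration under stated assumptions: the function names are hypothetical, the UTF-8s form is taken to be each UTF-16 surrogate written as its own 3-byte sequence (as the thread describes), and the caller is assumed to pass a code point in the range U+10000..U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: standard 4-byte UTF-8 form of a supplementary code point. */
size_t encode_utf8_supp(uint32_t cp, unsigned char out[4]) {
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: 3-byte form of one 16-bit value in 0x0800..0xFFFF. */
static void encode3(uint32_t u, unsigned char *out) {
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}

/* Hypothetical name: UTF-8s form, i.e. each surrogate as a 3-byte sequence. */
size_t encode_utf8s_supp(uint32_t cp, unsigned char out[6]) {
    uint32_t v = cp - 0x10000;
    encode3(0xD800 + (v >> 10), out);        /* lead surrogate */
    encode3(0xDC00 + (v & 0x3FF), out + 3);  /* trail surrogate */
    return 6;
}
```

For U+10000 the first function yields F0 90 80 80 and the second yields ED A0 80 ED B0 80, the sequence quoted above. Because ED sorts below EE and EF, the UTF-8s form places supplementary characters before U+E000..U+FFFF in a binary compare, as UTF-16 does.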
> > >
> > > Summary: UTF-8s is bad.
> > >
> > > -Doug Ewell
> > >  Fullerton, California
> > >
> >
> >
> >
> >
