RE: UTS#10 (collation) : French backwards level 2, and word-breakers.

Philippe Verdy Mon, 12 Jul 2010 20:45:32 -0700

> From: "Kenneth Whistler" <k...@sybase.com>
> Philippe Verdy wrote:
>
> > "Kenneth Whistler" <k...@sybase.com> wrote:
> > > Huh? That is just preprocessing to delete portions of strings
> > > before calculating keys. If you want to do so, be my guest,
> > > but building in arbitrary rules of content suppression into
> > > the UCA algorithm itself is a non-starter.
> >
> > I have definitely not asked for adding such preprocessing.
>
> No, effectively you just did, by asking for special weighting for
> things like parentheticals at the end of strings.


And you have completely misinterpreted it. I did not say that it
implied preprocessing for special weighting.

I used it as a justification for the fact that the end of the string
need not be scanned at all in the VERY FREQUENT cases where the
string contains multiple words (including parenthetical clarifications,
but also any kind of phrase, sentence or text). That is why we should
be able to compare all collation levels within each word in isolation,
and scan the rest of the string only if the words already compared
turn out to be binary identical.
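
The word-by-word comparison described above could be sketched as follows.
This is only an illustration of the idea in this message, not the
comparison specified by UTS #10 (which compares one level across the whole
string before moving to the next). The weights come from a hypothetical
toy_weights() function and a made-up ACCENTS table, not from the DUCET,
and words are split on whitespace rather than by a real word-breaker.

```python
from itertools import zip_longest

# Hypothetical accent table: letter -> (base letter, secondary weight).
ACCENTS = {"é": ("e", 1), "è": ("e", 2), "ô": ("o", 1)}

def toy_weights(ch):
    """Return made-up (primary, secondary, tertiary) weights for one character."""
    lower = ch.lower()
    base, secondary = ACCENTS.get(lower, (lower, 0))
    tertiary = 1 if ch != lower else 0  # uppercase sorts after lowercase
    return (ord(base), secondary, tertiary)

def word_key(word):
    """All primaries of the word, then all secondaries, then all tertiaries."""
    weights = [toy_weights(c) for c in word]
    return tuple(tuple(w[level] for w in weights) for level in range(3))

def compare_by_words(a, b):
    """Compare word by word, using every level inside a word before moving on.

    Returns (result, words_scanned), where result is -1, 0 or 1.
    """
    scanned = 0
    for wa, wb in zip_longest(a.split(), b.split()):
        scanned += 1
        if wa is None:  # a ran out of words first
            return -1, scanned
        if wb is None:  # b ran out of words first
            return 1, scanned
        ka, kb = word_key(wa), word_key(wb)
        if ka != kb:
            return (-1 if ka < kb else 1), scanned
    return 0, scanned
```

With these toy weights, compare_by_words("cote (alpha)", "côte (beta)")
is decided on the first word alone, so the parenthetical at the end of
each string is never looked at.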

The only "special weighting" I spoke about concerned a single
special empty collation element (implicitly FFFF.0000.0000.0000, with
FFFF treated as -1), whose insertion is useful between the fields of
multi-field sorts (such as SQL's ORDER BY and GROUP BY clauses in a
SELECT). It requires shifting all primary weights by 1 if you want
non-negative values only. Note, however, that when collation weights
are serialized into a byte stream to compute collation keys, additional
shifts occur anyway, for example to avoid signed-byte ordering problems
in Java, or simply to compress the weights to just the number of bits
needed for each collation level. In fact, this does not require any
change to the format or data of the DUCET, as there will never be any
collision of primary weights.
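
A minimal sketch of the field separator, under stated assumptions: only
primary weights are modelled, toy_primary() is a hypothetical stand-in
for a DUCET lookup whose values are assumed to stay within 0..0xFFFE,
and row_weights() / row_sort_key() are illustrative names, not part of
any existing library. The separator is kept as a signed -1 while the
weights are built, then everything is shifted by 1 so that it
serializes as 0 and sorts below every real weight.

```python
import struct

FIELD_SEPARATOR = -1  # the FFFF primary from the message, read as signed -1

def toy_primary(ch):
    """Hypothetical primary weight; assumed to lie in 0..0xFFFE."""
    return ord(ch.lower())

def row_weights(fields):
    """Primary weights of all fields, with a separator between fields."""
    weights = []
    for index, field in enumerate(fields):
        if index:
            weights.append(FIELD_SEPARATOR)
        weights.extend(toy_primary(c) for c in field)
    return weights

def row_sort_key(fields):
    """Shift every weight by 1 and pack as big-endian unsigned 16-bit values."""
    shifted = [w + 1 for w in row_weights(fields)]
    return struct.pack(f">{len(shifted)}H", *shifted)
```

For example, sorted(rows, key=row_sort_key) places ("a", "z") before
("ab", "a"), as a field-by-field comparison would, whereas comparing
the plain concatenations "az" and "aba" gives the opposite order.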

When just generating sort key strings, the unsigned 16-bit collation
weights decoded from the DUCET or from tailored tables are stored in a
standard 32-bit or 64-bit signed integer register or local variable,
which can easily hold the additional -1 value used at the primary
level for the field separator. When comparing strings, the separator
is not even needed: you just have to know whether the corresponding
field in the other row being compared is also at its end, because you
compare these fields separately.

-- Philippe.
