Philippe Verdy wrote:

> "Kenneth Whistler" <k...@sybase.com> wrote:
> > Huh? That is just preprocessing to delete portions of strings
> > before calculating keys. If you want to do so, be my guest,
> > but building in arbitrary rules of content suppression into
> > the UCA algorithm itself is a non-starter.
> 
> I have definitely not asked for adding such preprocessing.

No, effectively you just did, by asking for special weighting for
things like parentheticals at the end of strings.

> My examples
> just gave practical examples where the least significant parts of the
> strings are almost always at the end of the text.

And the point of that is what? If the most significant part of a
string is at the front, that is the part that will get processed
first by incremental comparison, anyway.

> And
> there's no reason that backwards processing will consider them more
> important than the beginning of the text.

And you are completely missing the point that in order to handle
French accent backwards weighting correctly, you don't need to
process the entire string backwards. You can do this correctly
using an incremental comparison from the *front* (by your own
assessment, more significant) part of the string, if you build
the incremental comparison algorithm carefully.
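
To make that concrete, here is a rough Python sketch of the kind of
front-to-back incremental comparison I have in mind. The two-level
(primary, secondary) elements and the per-character weight table are
hypothetical simplifications (no contractions, expansions, or higher
levels), but the control flow is the point: the text is only ever
scanned forward, and the comparison stops at the first primary
difference.

    def collation_elements(text, weights):
        # Forward scan yielding (primary, secondary) pairs from a
        # hypothetical per-character table; real UCA elements involve
        # contractions and expansions that this sketch ignores.
        for ch in text:
            yield weights.get(ch, (ord(ch), 0x20))

    def french_compare(a, b, weights):
        sec_a, sec_b = [], []             # secondary weights seen so far
        ea = collation_elements(a, weights)
        eb = collation_elements(b, weights)
        while True:
            # Advance each side to its next non-ignorable primary,
            # collecting secondary weights along the way.
            pa = pb = None
            for p, s in ea:
                if s:
                    sec_a.append(s)
                if p:
                    pa = p
                    break
            for p, s in eb:
                if s:
                    sec_b.append(s)
                if p:
                    pb = p
                    break
            if pa != pb:
                # First primary difference decides the order; the rest
                # of both strings is never read.
                return -1 if (pa or 0) < (pb or 0) else 1
            if pa is None:
                break                     # primaries exhausted and equal
        # French tailoring: secondaries compare in reverse order, but
        # they were all gathered during the same forward pass, so no
        # backward scan of the original text is ever needed.
        ra, rb = sec_a[::-1], sec_b[::-1]
        if ra != rb:
            return -1 if ra < rb else 1
        return 0

With a toy table such as {'é': (ord('e'), 0x2E), 'ô': (ord('o'), 0x2F)},
french_compare('côte', 'coté', table) comes out negative (the French
order), even though a plain forward comparison of the secondaries would
say the opposite, and neither string was ever walked backwards.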

> I have thought about it sufficiently: 2-3 years ago I wanted to fix
> the definitely broken sorts in French Wiktionary categories (notably
> the categories containing full lists of words per language, where
> items were apparently missing just because they were sorted many pages
> away).

O.k. so somebody was goofing up on the sorting, and it is a good
thing that you fixed it.

> I also added that *storing* additional collation strings introduces a
> significant cost. This cost cannot be ignored, so in practice only the
> leading part of the sort keys/collation strings will be effectively
> stored.

Yes, but what does a decision about storing truncated sort keys have
to do with the issue of French accent processing?

> Wikipedia for example truncates ALL its sort keys used in categories
> to a limit not exceeding about 64 bytes, even though page names can be
> up to 255 UTF-8 bytes, so that not even ALL the primary differences
> will be stored ...

In which case, for those longer strings, you are going to have to
do disambiguating compares to recover any secondary differences --
and it doesn't matter whether you are talking about German accented
characters or French accented characters.

> Then, during the effective sort (using QuickSort or similar), you'll
> end up comparing many pairs of strings multiple times in case of
> equality, without depending on collation strings. And YES of course,
> you'll process the string pairs level by level, and the case where
> you'll find long strings sorting as equal at the primary level is not
> exceptional, so this means that you'll have to reparse both strings a
> second time from their beginning for the next level,

A requirement that is forced by deciding to store truncated sort keys,
*regardless* of how you are handling the secondary weights.

> ... even if you
> could have detected very early that there was a secondary difference
> on the first word, without even fully processing the first level on
> the two long strings.

And my point continues to be (which you don't seem to understand)
that a properly designed incremental comparison will do exactly
that.

> I've run experiments, and performance counting (counting the
> number of pairs compared, and the number of times they need to be
> reread from the backing store, and accumulating their relative
> distance) clearly shows that comparing word by word (all levels for
> each word performed at once before going to the next word) gives a
> significant and measurable advantage due to data locality, but also
> because you'll effectively scan shorter sequences to detect a
> secondary difference in the first words (such as accents) or tertiary
> differences (such as letter case). 

And all those calculations apparently depend on mistaken assumptions
about how incremental multi-level collation comparisons can be
constructed.

> And it gives much more consistent results once you are
> using truncated collation strings (and only those truncated strings,
> without additional comparisons of pairs in additional sort passes).

Truncated stored sort keys?

Or truncated stored strings?

If you truncate the sort keys, then you access the original strings
and compare them (incrementally) to do the tie-breaking.

If you truncate the stored strings, then you are out of luck if they
compare equal, because you've truncated off the parts that might
distinguish them.
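
A minimal sketch of the first case, assuming a hypothetical record
layout where each row carries its truncated sort key plus the original
string, and where full_compare stands in for whatever incremental
collation comparison is in use (e.g. the french_compare sketch above):

    KEY_LIMIT = 64   # hypothetical truncation limit, as in the Wikipedia example

    def keyed_compare(rec_a, rec_b, full_compare):
        key_a, str_a = rec_a
        key_b, str_b = rec_b
        if key_a != key_b:
            # The truncated keys already determine the order.
            return -1 if key_a < key_b else 1
        if len(key_a) < KEY_LIMIT:
            # Equal keys that were never truncated: the rows really tie.
            return 0
        # Equal keys that hit the truncation limit: disambiguate with an
        # incremental comparison of the original strings.
        return full_compare(str_a, str_b)

That fallback is exactly the disambiguating compare mentioned above,
and it works the same way whether the differences being recovered are
German or French accents.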

> The cases where strings will compare equal on the first level are in
> fact EXTREMELY frequent in SQL queries performing master-detail
> joins, ordered by a master key then by detail keys, or when computing
> aggregates (such as SELECT SUM()... GROUP BY master key, or using a
> clause like HAVING column1=aggregate(column2)). This occurs in almost
> all statistical/financial reports. 

Yes. But are you claiming that has some relevance to your
Wikipedia case? If the structure of topics in Wikipedia is such
that too many of the truncated stored keys are comparing equal,
then somebody at Wikipedia is making the wrong tradeoffs between
structuring topic strings and truncating the stored keys for them.

> And given that you work for Sybase,
> you should already know that sorting/grouping is the most costly
> operation performed by the SQL engine, where all optimisations are
> expected: multi-field collation/sorting is also a very basic operation
> found in nearly all queries (including inserts or unsorted/ungrouped
> selects using unique indexes or primary/foreign key indexes, because
> indexes always have to maintain their collation in an accurate but
> fast way).

Yes. Hence the need for highly optimized incremental string compares
which otherwise fit the optimization strategies needed for data joins.

You certainly don't get that by requiring that a processing step
be in place that actually processes the entire comparand string
from the *end* of the string in order to handle accents correctly.
Nor would it be helped, IMO, by putting in place word boundary
processing as part of the string compares: that would only result
in the introduction of further complication and all kinds of
ambiguity about the edge cases where word boundaries might or might
not line up with fields for the joins.

The engineering requirement is for the fastest possible comparison
of two strings A and B, to determine whether they are less than,
equal to, or greater than, and additionally, in the case of
A less than B, whether A is a *prefix* of B (a fact which is used
in optimization strategies to deal with your cases like
statistical/financial reports, with long identical prefixes on
compared strings). In my opinion, that can be done in the context
of UCA as currently defined -- even accounting for French accents
for a French tailoring -- quite efficiently, and would not be
improved by putting in place word parsing of strings as part of
the collation.
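
As a sketch of that contract only (plain code-point comparison stands
in here for the real multi-level, tailored compare):

    from enum import Enum

    class Rel(Enum):
        LESS = -1
        EQUAL = 0
        GREATER = 1
        LESS_AND_PREFIX = 2   # A < B because A is a strict prefix of B

    def compare_with_prefix(a, b):
        # Three-way comparison that additionally reports when A is a
        # strict prefix of B, the extra fact used to skip re-examining
        # long shared prefixes in grouped or ordered scans.
        n = min(len(a), len(b))
        for i in range(n):
            if a[i] != b[i]:
                return Rel.LESS if a[i] < b[i] else Rel.GREATER
        if len(a) == len(b):
            return Rel.EQUAL
        return Rel.LESS_AND_PREFIX if len(a) < len(b) else Rel.GREATER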

--Ken



