"Kenneth Whistler" 
> To: verd...@wanadoo.fr
> Cc: unicode@unicode.org, k...@sybase.com
> Subject: Re: UTS#10 (collation) : French backwards level 2, and word-breakers.
> 
> 
> Philippe Verdy said:
> 
> > A basic word-breaker using only the space separator would marvelously
> > improve the speed of French sorting even if backwards ordering occurs,
> > just because it would significantly improve the data locality in the
> > implementation and would considerably reduce the reallocations and use
> > of large buffers.
> 
> I disagree. Any implementation of an incremental comparison based on
> UCA which is making significant use of reallocations and large buffers
> isn't optimized in the first place.

You are speaking of comparison of text pairs. That's not the same as computing 
sort keys.

And you have already used the case of sort keys in a prior message.

> An optimized incremental comparison walks down both string 1 and
> string 2, pulling up the relevant comparable collation elements
> and tracking state for primary, secondary, and tertiary weights
> as it goes, bailing with an answer at the first available point
> that the comparison becomes definitively non-equal. That can be
> done without *any* buffer allocation or reallocation, using
> relatively tiny buffers allocated on the stack, and definitely
> has better "data locality" than first trying to chunk strings at
> word boundaries.

Regarding data locality you're still wrong, because you haven't understood what
I said:

I mean the case where the relative order of two strings is fully determined after
scanning only their beginnings, as soon as ANY difference at any level appears in
the first word that differs.

This case is really very common. And for data locality, it means that the tails of
BOTH texts will never be scanned to make the decision, and those tails can be
thousands of bytes long on both strings. There will be a SINGLE pass over each word
of each string, even if, I DO admit it, each word may itself need to be scanned
several times.
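
Here is a rough sketch in C of what I mean, assuming only a trivial break-on-space
breaker; compare_word_all_levels() is just a hypothetical placeholder standing for
the full per-word UCA work (normalization, collation element lookups, all levels),
not a real API:

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Placeholder: full UCA comparison of one word at all levels (normalization,
   collation element lookup, level 1..3 weights).  Reduced here to a byte
   comparison only so that the sketch compiles and runs. */
static int compare_word_all_levels(const char *a, size_t alen,
                                   const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int r = memcmp(a, b, n);
    if (r != 0) return r;
    return (alen > blen) - (alen < blen);
}

/* Extract the next space-delimited word starting at *pos. */
static size_t next_word(const char *s, size_t len, size_t *pos, const char **word)
{
    while (*pos < len && s[*pos] == ' ') (*pos)++;
    *word = s + *pos;
    size_t start = *pos;
    while (*pos < len && s[*pos] != ' ') (*pos)++;
    return *pos - start;
}

/* Compare two texts word by word: each word of each text is read once, and the
   tails of both texts are never touched when an earlier word already decides. */
int compare_texts_by_word(const char *s1, size_t l1, const char *s2, size_t l2)
{
    size_t p1 = 0, p2 = 0;
    while (p1 < l1 || p2 < l2) {
        const char *w1, *w2;
        size_t n1 = next_word(s1, l1, &p1, &w1);
        size_t n2 = next_word(s2, l2, &p2, &w2);
        int r = compare_word_all_levels(w1, n1, w2, n2);
        if (r != 0) return r;          /* decided on this word: stop here */
    }
    return 0;
}

int main(void)
{
    const char *a = "concert baroque a Paris";
    const char *b = "concert classique a Paris";
    /* Decided on the 2nd word; the common tail "a Paris" is never read. */
    printf("%d\n", compare_texts_by_word(a, strlen(a), b, strlen(b)));
    return 0;
}

The point is structural: each word of each string is pulled once, and the common
tails are never touched once an earlier word has decided the order.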

But consider as well the need for breaking text into collation elements: this is
really the complex area of UCA, because it has to be performed several times, using
quite complex decisions. These decisions have an even higher complexity than
word-breaking.

If you have to scan each word several times, you'll have to repeat this complex
logic, which must also take into account the default grapheme cluster boundaries to
determine normalization correctly. In all cases this requires multiple bufferings
and rebufferings: accumulating the default grapheme clusters, normalizing them,
then looking up the matching 1-to-N and M-to-N collation mappings to get the
collation weights to compare at some level.

And all this work that occurs within one word will have to be performed again: it
will be much slower if these multiple passes over default grapheme boundaries are
separated in time by having scanned BOTH texts entirely at one level just to see
that they still compare equal, when you could have stopped early by detecting ANY
difference within the same word.

Breaking on words, even if it requires a very modest buffering, will significantly
improve processing time, because each word in the long texts will be scanned only
once, and all the rest will occur within the small and constantly reused buffer.

And this just requires the same implementation to generate the collation string of
each word individually within this small reused buffer, both when generating and
storing sort keys and when comparing string pairs.

I don't forget that in most practical cases, sorts will operate on texts whose
collation keys have been only partly generated and truncated, because truncated
keys really do speed things up and reduce the number of compares to perform. When
you are sorting texts, any fast algorithm will still need to compare the same text
against a variable number of other candidates for swapping (on average this number
of compares per string in a list of size N grows in O(log N) with QuickSort, but
the worst case may still grow in O(N)).

So the cumulative effect of low data locality when scanning any "long" text made of
more than one word (e.g. a bibliographic entry, with author name, book title,
subtitle, chapter title...) will be multiplied by O(log N), up to O(N). And data
locality will be much worse when you have to rescan the source texts each time (in
very different locations in memory, possibly even in external storage requiring I/O
with its own internal buffering, or in swapped-out memory pages) for the many
text-pair compares, than when those per-level passes are performed only within the
small local buffers for each word.
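
Just to put a number on that multiplier, here is a tiny counting comparator run
through the standard qsort(); the keys are synthetic and the comparison is a plain
strcmp(), this only illustrates how many times each string gets compared:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long compare_count = 0;

/* Comparator that counts how many pair comparisons the sort performs. */
static int counting_cmp(const void *a, const void *b)
{
    compare_count++;
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void)
{
    enum { N = 10000 };
    static char storage[N][16];
    static const char *keys[N];
    for (int i = 0; i < N; i++) {
        /* Synthetic keys only; a real test would use French text. */
        snprintf(storage[i], sizeof storage[i], "k%09u",
                 (i * 2654435761u) % (unsigned)N);
        keys[i] = storage[i];
    }
    qsort(keys, N, sizeof keys[0], counting_cmp);
    printf("N = %d, total compares = %ld, compares per key = %.1f\n",
           N, compare_count, (double)compare_count / N);
    /* Typically on the order of log2(N), i.e. a dozen or more compares per key,
       so any per-compare cost (rescanning long source texts) is paid that many
       times per string. */
    return 0;
}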

> > A word breaker anyway has a very tiny impact on performance, compared
> > to the gain you can have by using smaller data buffers: most of the
> > cost of UCA falls within the huge cost of normalizations and complex
> > analysis of M-to-N mappings of collation elements, and in
> > interpreting/locating the DUCET table entries.
> 
> That much I *do* agree with, however. If you have to normalize
> entire strings before starting comparison, that is a significant
> processing cost. The answer to that is architectural -- if you
> have to repeatedly normalize strings before comparing them,
> then store the normalized strings and compare *those* instead.
> Also, even in the case where dealing with "wild" data input that
> cannot be assumed to be pre-normalized, a test to check whether
> a string is in NFC is *much* quicker than actually doing the
> normalization, and for most data the answer is yes, anyway.
> 
> As for the M-to-N mappings and the collation element lookup,
> that is unavoidable, however else you build the comparison, so
> you just do the best job to make the lookup of collation
> elements fast.

And admit it: these mappings, which require determining conditional breaks across
one or more default grapheme clusters (which themselves only require a finite-state
automaton), are more complex than word-breaking, which just adds a single state to
the same scanning automaton. It is best to avoid performing all these scans
multiple times, and best if you can perform the DUCET lookups only once:

- you'll need a recyclable input buffer, whose small size is bounded by word
lengths, for scanning default grapheme clusters and performing normalization only
once per cluster;

- you'll need a few weight buffers (one per collation level), whose sizes are also
more or less proportional to word lengths, so that all the M-to-N lookups in the
DUCET (or its tailoring) are performed only once, during the first pass for
level 1;

- no extra buffering is needed for level 2 and beyond: everything is already in
these small buffers, which can simply be scanned linearly without any breakers or
any additional lookups (see the sketch after this list).
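
A sketch of that buffer layout, with assumed sizes and with lookup_collation_element()
as a fake stand-in for the real DUCET/tailoring lookup (the names and weight values
here are mine, not UTS#10's):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define WORD_MAX 64   /* grows only for pathological "words" */

/* Per-word working set: one input buffer plus one weight buffer per level.
   Everything is filled during the single level-1 pass; levels 2 and 3 are then
   read back linearly with no further breaking or table lookups. */
typedef struct {
    uint32_t input[WORD_MAX];   /* normalized code points of the current word */
    uint16_t w1[WORD_MAX];      /* primary weights   */
    uint16_t w2[WORD_MAX];      /* secondary weights (read backwards for French) */
    uint16_t w3[WORD_MAX];      /* tertiary weights  */
    size_t   n_input, n_weights;
} word_buffers;

/* Placeholder for the real DUCET / tailoring lookup of one collation element
   (possibly M-to-N); reduced here to one element per code point. */
static void lookup_collation_element(uint32_t cp,
                                     uint16_t *p1, uint16_t *p2, uint16_t *p3)
{
    *p1 = (uint16_t)cp;  /* pretend primary = code point */
    *p2 = 0x0020;        /* fake default secondary */
    *p3 = 0x0002;        /* fake default tertiary  */
}

/* Single pass over the word: fills all three weight buffers at once, so that
   level 2 and level 3 comparisons only re-read w2[] and w3[]. */
static void fill_word_weights(word_buffers *b)
{
    b->n_weights = 0;
    for (size_t i = 0; i < b->n_input && b->n_weights < WORD_MAX; i++) {
        size_t k = b->n_weights++;
        lookup_collation_element(b->input[i], &b->w1[k], &b->w2[k], &b->w3[k]);
    }
}

int main(void)
{
    word_buffers b = { .input = { 0x0063, 0x00F4, 0x0074, 0x00E9 }, /* "côté" */
                       .n_input = 4 };
    fill_word_weights(&b);
    for (size_t i = 0; i < b.n_weights; i++)
        printf("%04X/%04X/%04X ", (unsigned)b.w1[i], (unsigned)b.w2[i],
               (unsigned)b.w3[i]);
    printf("\n");
    return 0;
}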

> But your argument that a word breaker "has a very tiny impact
> on performance" compared to normalization or collation element
> lookup is moot, if the addition of the word breaker isn't
> going to improve the incremental comparison, anyway -- which
> is my argument. And unless your word breaker really is
> a trivial one (i.e. break on spaces), it isn't correct to
> maintain that it is a "tiny impact" compared to normalization,
> for example.

Yes, I maintain it: this is an EXTREMELY TINY impact, as the word breaker CANNOT
add more complexity than:

- the combining sequence breaker used for normalization substitutions and
reordering, needed for Unicode conformance (this requires an input buffer anyway,
which you can keep until the last collation level for binary compares if all the
collation weights compare equal within the same word);

- and the default grapheme cluster breaker needed for UCA conformance (it is also
at this level that a word-breaker will work, adding at most one additional state,
or none, to the finite-state automaton: this adds no extra buffering, just a single
extra test shared by both breakers, as sketched after this list);

- and the collation element breakers needed to perform substitutions and/or lookups
in the table of collation weights: it is at this level that you need small buffers
during the level-1 pass, to accumulate the collation weights that ''may'' be used
later; but if your collation is limited to level 1 you don't even need these
buffers, and will just need the first input buffer already used for normalization
to perform the final binary compares.
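
As a minimal sketch of that middle point: once the default grapheme cluster breaker
is there (deliberately reduced to a stub below), the word breaker is one extra test
in the very same scan; is_grapheme_boundary() and is_word_boundary() are
hypothetical names, not real APIs:

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the default grapheme cluster breaker (UAX #29) that UCA needs
   anyway; stubbed here so that every code point is its own cluster. */
static bool is_grapheme_boundary(const uint32_t *text, size_t len, size_t i)
{
    (void)text; (void)len; (void)i;
    return true;
}

/* The "word breaker" on top of it: at a grapheme boundary, a single extra test on
   the current code point decides whether a word also ends here.  No extra
   buffering, no extra pass: one added condition in the same scan. */
static bool is_word_boundary(const uint32_t *text, size_t len, size_t i)
{
    return is_grapheme_boundary(text, len, i)
        && (i == len || text[i] == 0x0020 /* SPACE */);
}

int main(void)
{
    const uint32_t text[] = { 'r','u','e',' ','d','e',' ','l','a',' ','P','a','i','x' };
    size_t len = sizeof text / sizeof text[0];
    for (size_t i = 0; i <= len; i++)
        if (is_word_boundary(text, len, i)) printf("word boundary at %zu\n", i);
    return 0;
}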


For French, for example, as all words are at most 25 characters long, and given
that only 3 levels are needed for the most precise case, this just requires:

- a small buffer of 25 code points (or UTF-16 code units) for input and
normalization. This buffer may be resizable in exceptional cases with "pathological
words" (in fact codes or numbers) like AAAAAAAAAAAAAAAAAAAAAAA;

- 2 small buffers for 25 secondary weights and 25 tertiary weights. Only the
level-2 weight buffer needs to be processed in backwards order, and there's
actually no need to flip it for pass 2: it can simply be read from the end (see
the sketch after this list);

- so the total buffering need is 75 integers (at most) per string in
non-pathological cases, and all texts will be processed only once: data locality
will be excellent, even if, for safety or edge cases, those buffers are resizable;

- yes, you may also stop buffering if these buffers reach their maximum size in
pathological cases, but in that case you'll have to re-perform the scans (you can
still very easily avoid the most common worst case and the cost of these
additional passes, which occurs when the normalized inputs are binary equal).
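
And here is how the backwards level 2 falls out of that: the small secondary-weight
buffer is simply read from the end, never flipped or copied. The weight values are
made up for the example (they are not the real DUCET secondaries):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* French level 2: compare the secondary weights of two words by walking their
   small per-word buffers from the end toward the start.  The buffers are never
   reversed or copied; only the iteration order changes. */
static int compare_level2_backwards(const uint16_t *w2a, size_t na,
                                    const uint16_t *w2b, size_t nb)
{
    size_t ia = na, ib = nb;
    while (ia > 0 && ib > 0) {
        uint16_t a = w2a[--ia];
        uint16_t b = w2b[--ib];
        if (a != b) return a < b ? -1 : 1;
    }
    if (ia == ib) return 0;          /* both exhausted together */
    return ia > 0 ? 1 : -1;          /* longer remaining sequence sorts after */
}

int main(void)
{
    /* Illustrative secondary weights only: 0x20 = no accent, 0x25 = an accent.
       "coté" vs "côte": reading backwards, the last accent difference is seen
       first, giving côte < coté as French expects. */
    const uint16_t cote_acute[] = { 0x20, 0x20, 0x20, 0x25 };  /* coté */
    const uint16_t cote_circ[]  = { 0x20, 0x25, 0x20, 0x20 };  /* côte */
    printf("%d\n", compare_level2_backwards(cote_acute, 4, cote_circ, 4)); /* > 0 */
    return 0;
}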


And with exactly the same shared source code as above, you can either return each
generated collation weight immediately, for a function computing a collation string
for storage, or use it for comparing pairs of texts: you'll have built a collation
weight enumerator for level 1, a separate enumerator for the other levels, and
separate code returning the normalized code points. All 3 kinds of enumerators can
be used to compare strings directly within those small buffers; you may then (see
the sketch after this list):

- flush all 3 buffers entirely between words when performing only text compares;

- or feed a collation string / sort key output buffer, if you want to store it
(which you may also limit immediately to the maximum sort key length that your
storage will accept).
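
A rough sketch of those two consumers of the same per-word weights; the names and
the naive 16-bit weight encoding are assumptions of mine, not the actual sort key
format of UTS#10:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Consumer 1: append this word's level-1 weights to a stored sort key, stopping
   silently at the maximum key length the storage accepts. */
static size_t append_to_sort_key(uint8_t *key, size_t keylen, size_t keymax,
                                 const uint16_t *w1, size_t n)
{
    for (size_t i = 0; i < n && keylen + 2 <= keymax; i++) {
        key[keylen++] = (uint8_t)(w1[i] >> 8);
        key[keylen++] = (uint8_t)(w1[i] & 0xFF);
    }
    return keylen;
}

/* Consumer 2: compare this word's level-1 weights directly; the caller simply
   reuses (flushes) the same small buffers for the next word. */
static int compare_level1(const uint16_t *w1a, size_t na,
                          const uint16_t *w1b, size_t nb)
{
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++)
        if (w1a[i] != w1b[i]) return w1a[i] < w1b[i] ? -1 : 1;
    return (na > nb) - (na < nb);
}

int main(void)
{
    const uint16_t w1[] = { 0x1C47, 0x1CAA, 0x1D77 };   /* made-up primaries */
    uint8_t key[8];
    size_t len = append_to_sort_key(key, 0, sizeof key, w1, 3);
    printf("stored %zu bytes of sort key\n", len);   /* 6, or less if truncated */
    printf("%d\n", compare_level1(w1, 3, w1, 2));    /* 1: longer word sorts after */
    return 0;
}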


Your approach, on the opposite, will still use the input buffer for normalization
(my approach does not change it), and an output buffer for the collation string (so
my approach makes no difference there), but it will perform many more lookups, as
the complex collation element breaking and the complex lookups of collation weights
will be performed on each pass (even if you, too, can avoid the extra passes in the
worst case where two binary-identical strings compare equal).



