UTS#10 (collation) : French backwards level 2, and word-breakers.

Philippe Verdy Sun, 04 Jul 2010 17:33:32 -0700

Collation (for French) normally uses backwards ordering of collation
weights at level 2:


«
  4.3 Form Sort Key
  Step 3. The sort key is formed by successively appending all
non-zero weights from the collation element array. The weights are
appended from each level in turn, from 1 to 3. (Backwards weights are
inserted in reverse order.)
»

However, I think that this produces over-long sequences, since it
reverses ALL the secondary weights of arbitrarily long texts. Not only
would this rule have a severe performance impact, but it is actually
not needed for French.
What is needed is JUST to reverse the collation weights associated with
single words (or compound words, including those containing an
apostrophe). So the reversal should apply only to separate spans of
text after word-breaking (see UAX #29).

For example, with the sentence
  «Pour être heureux, ne vivons pas cachés ! »,
it is quite enough to reverse the secondary weights within each of
the following spans:
  <span>Pour </span>
  <span>être </span>
  <span>heureux, </span>
  <span>ne </span>
  <span>vivons </span>
  <span>pas </span>
  <span>cachés !</span>
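
As a rough illustration of the difference, here is a minimal Python
sketch. It is not the UCA: the secondary weights are toy values (0x20
for a base letter, 0x21 + offset for a combining accent), spaces and
punctuation are assumed to be ignorable, and whitespace splitting
stands in for a real UAX #29 word-breaker. The function names are
hypothetical.

```python
import unicodedata

COMMON = 0x20  # toy secondary weight of a base letter (assumption, not DUCET)


def secondary_weights(text):
    """Return toy secondary weights for the letters and accents of `text`."""
    weights = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Toy accent weight: 0x21 for U+0300, 0x22 for U+0301, and so on.
            weights.append(0x21 + ord(ch) - 0x0300)
        elif ch.isalpha():
            weights.append(COMMON)
        # Spaces and punctuation are treated as ignorable here (assumption).
    return weights


def backwards_whole(text):
    """Reverse the secondary weights of the complete string."""
    return secondary_weights(text)[::-1]


def backwards_per_word(text):
    """Reverse the secondary weights inside each whitespace-delimited span."""
    out = []
    for span in text.split():
        out.extend(secondary_weights(span)[::-1])
    return out


sentence = "Pour être heureux, ne vivons pas cachés !"
for span in sentence.split():
    print(span, [hex(w) for w in backwards_per_word(span)])
```

With the per-span variant, the weights of each word can be emitted as
soon as that word has been read; the whole-string variant has to
buffer every secondary weight of the text first.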

A word-breaking step (UAX #29), based on "extended grapheme
clusters" as above, or on shorter "legacy grapheme clusters" where
spaces, punctuation and spacing marks would be separated, should be
performed at the end of step 4.1, before step 4.2 of the UCA algorithm.

Step 4.3 then needs to be applied only between those word breaks,
instead of on the complete string.

This will then correctly sort an itemized list of definitions like:
   * être (en anglais, “to be”) : v. aux. irrégulier du 3è groupe
   * été (en anglais, “summer”) : n. m. – 2è saison de l’année.
   * ...
or other, simpler lists of personal names, toponyms, book titles, and
so on, because the reversal of accent differences would actually apply
only within the first word of each item (the other words would be
considered only if two items have the same initial word).
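
To make the effect on sorting concrete, here is a small Python sketch
under one possible reading of the proposal: each word gets its own
(primary, reversed secondary) key, and items are compared word by
word. The weights are toy values rather than DUCET weights, whitespace
splitting replaces a real word-breaker, and the names key_whole and
key_per_word are hypothetical.

```python
import unicodedata


def levels(text):
    """Toy (primary, secondary) weights of `text`; not DUCET values."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text.lower()):
        if unicodedata.combining(ch):
            secondary.append(0x21 + ord(ch) - 0x0300)  # toy accent weight
        elif ch.isalpha():
            primary.append(ord(ch))
            secondary.append(0x20)  # toy weight of a base letter
        # Spaces and punctuation are ignored at both levels (assumption).
    return primary, secondary


def key_whole(text):
    """Whole-string key: all primaries, then all secondaries reversed."""
    primary, secondary = levels(text)
    return (primary, secondary[::-1])


def key_per_word(text):
    """One (primaries, reversed secondaries) pair per word, in word order."""
    keys = []
    for word in text.split():
        primary, secondary = levels(word)
        if primary:  # skip spans made only of ignorable characters
            keys.append((primary, secondary[::-1]))
    return keys


items = ["coté a", "côte à"]
print(sorted(items, key=key_whole))     # ['coté a', 'côte à']
print(sorted(items, key=key_per_word))  # ['côte à', 'coté a']
```

With the whole-string key, the accent on the last word decides the
order of the two items; with the per-word key, the first word decides,
and the later words only matter when the first words are identical.
The two definitions listed above (être, été) already differ at the
primary level, so both keys order them the same way.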

Note that the punctuation and spaces that may cause a word break to
be detected will often be ignored at the first two collation levels
(i.e. they would have a 0000 collation weight at these levels),
notably in collations tailored for specific locales (such as French),
as opposed to the generic locale-neutral collation (the "root" locale
of CLDR, using the DUCET).

Can UTS#10 (currently in review), which describes the UCA algorithm,
say where a word breaker may be used? This would also offer huge
optimization opportunities for computing collation weights in most
languages (not just French), notably because it would greatly reduce
the internal buffering needed to build the substring of collation
weights for each separate collation level.

It would also be useful to reserve in the DUCET a specific collation
weight at the primary level (with a lower value than that of the
collation-level separator, if one is used), or a range of such
weights, for word separation (or other kinds of hierarchical logical
separation). This could really speed up the computation of collation
weights for long sentences (notably, it would allow collation strings
to be appended directly on the fly, separated by this separator
weight).
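
A minimal Python sketch of such on-the-fly concatenation follows. The
values WORD_SEP and LEVEL_SEP are hypothetical and chosen only so that
the word separator sorts below the level separator, as suggested
above; UTS #10 itself uses 0000 as the level separator and defines no
word separator. The letter and accent weights are toy values too.

```python
import unicodedata

WORD_SEP = 0x00   # hypothetical word-separator weight (not in the DUCET)
LEVEL_SEP = 0x01  # toy level separator; all other weights here are >= 0x20


def word_key(word):
    """Toy two-level key: primaries, level separator, reversed secondaries."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            secondary.append(0x21 + ord(ch) - 0x0300)  # toy accent weight
        elif ch.isalpha():
            primary.append(ord(ch))
            secondary.append(0x20)  # toy weight of a base letter
    if not primary:
        return []  # span made only of ignorable characters
    return primary + [LEVEL_SEP] + secondary[::-1]


def sort_key(text):
    """Append each word's key as it is read, separated by WORD_SEP."""
    key = []
    for word in text.split():
        part = word_key(word)
        if part:
            if key:
                key.append(WORD_SEP)
            key.extend(part)
    return key


print(sort_key("été pas"))
print(sort_key("côte à") < sort_key("coté a"))  # True
```

Each word's key is complete before the next word is read, so only the
weights of the current word need to be buffered.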

My opinion is that, by default, at least the most basic word-breaker
(on breakable whitespace, including explicit line-break controls, and
possibly on sentence breaks if available) should be used to limit the
effect of backwards reordering of collation weights at any level, in
any practical implementation of the UCA. This applies notably to
implementations of the UCA with the French locale in database engines,
for building their indexes and for supporting the « ORDER BY » clause,
text comparison operators like >, <, >=, <=, and « BETWEEN...AND »,
aggregates like « MIN() » and « MAX() », and operators based on text
similarity such as =, !=, and « LIKE ».

Philippe.
