https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #232 from Philippe Verdy <verd...@wanadoo.fr> 2011-08-29 01:29:45 UTC ---
If table contents are static, it is true that their collation weights could be
delivered along with the visible table.

In an earlier message above, I already proposed that the function that is
needed to sort categories, the one that computes the UCA collation keys, be
exposed as a ParserFunction. This would obviously allow any wiki page to
compose web tables containing both the visible data and the associated
collation key.
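
To illustrate the idea, here is a minimal JavaScript sketch of the helper that
such a ParserFunction would wrap. Everything in it is hypothetical: the
function name, the toy weight table and the hex key format are mine, not an
existing MediaWiki API.

    // Toy primary weights standing in for the DUCET; the real keys would be
    // computed by the same UCA code that already sorts categories.
    var toyPrimary = { 'a': 0x15ef, 'e': 0x15ff, 'm': 0x16af };

    // Hypothetical helper behind something like {{#collationkey:...}}:
    // returns a hex string that sorts bytewise like the visible text.
    function collationKey(text) {
      var key = '';
      var nfd = text.toLowerCase().normalize('NFD');
      for (var i = 0; i < nfd.length; i++) {
        var w = toyPrimary[nfd[i]];
        if (w !== undefined) key += w.toString(16);
        // combining marks and unknown characters are simply dropped here
      }
      return key;
    }

    collationKey('Éma'); // "15ff16af15ef" — usable as a hidden sort key in a cell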

Many very complex templates that are used to generate sortable
pseudo-collation keys would no longer be needed, or would be largely
simplified. In fact, I even proposed initially to expose this function before
modifying the schema of categories: there was no immediate urgency to modify
that schema, as long as we could experiment with assigning sort keys when
categorizing a page, using that ParserFunction.

And we would probably have been able to experiment with a better schema,
including one that would have allowed multiple collation keys (generated from
different locale parameters), something that has been completely forgotten
(and that will require another change to the database schema in the future, to
support multiple collations according to users' localisation preferences and
expectations).

Anyway, my work on a JS implementation of UCA is already well advanced, and is
in fact now more advanced on some points that MySQL still does not support
correctly, notably contractions and expansions.

And very soon I'll add support for contexts (see UTS #35, the LDML
specification, for what I mean by these terms), because a context is just a
trivial combination of contractions and expansions.
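
To make that concrete, here is a small JavaScript sketch; all names, weights
and the rule itself are hypothetical. An LDML rule like "&b < y | x" ("x sorts
just after b, but only when preceded by y") can be stored as the contraction
"yx", expanding to the normal elements of y followed by the elements of the
tailored x.

    // Toy collation elements: [primary, secondary, tertiary] triples.
    var ces = { 'b': [[0x1600, 0x20, 0x02]], 'y': [[0x1700, 0x20, 0x02]] };
    var contractions = {};

    function addContextRule(prefixChar, newChar) {
      // Toy tailoring: place newChar just after 'b' by bumping b's primary.
      var tailored = [[0x1600 + 1, 0x20, 0x02]];
      contractions[prefixChar + newChar] = ces[prefixChar].concat(tailored);
    }

    addContextRule('y', 'x');
    // contractions['yx'] -> [[0x1700,0x20,0x02], [0x1601,0x20,0x02]]
    // A lone "x" (not preceded by "y") keeps its untailored elements.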

It can also be implemented as a separate function, to generate shorter keys.
I'm still not very concerned with choosing the most compact format for
representing collation keys, but I have already worked a lot on how to
automatically compact the sets of collation weights for each level,
suppressing all the unnecessary gaps that appear in the standard DUCET table
(gaps that are only provided as a convenience for very basic implementations
of UCA that use no tailorings, or only very basic tailorings and no
contractions/expansions).
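
A minimal sketch of that gap suppression (the function name and the weights
are only illustrative): collect the primary weights the table actually uses,
then renumber them consecutively while preserving their order.

    function compactWeights(usedWeights) {
      var sorted = usedWeights.slice().sort(function (x, y) { return x - y; });
      var remap = {}, next = 1;            // 0 stays reserved for "ignorable"
      for (var i = 0; i < sorted.length; i++) {
        if (!(sorted[i] in remap)) remap[sorted[i]] = next++;
      }
      return remap;
    }

    compactWeights([0x15ef, 0x1600, 0x1700, 0x1600]);
    // 0x15ef -> 1, 0x1600 -> 2, 0x1700 -> 3: same order, far fewer bits needed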

For now, the most critical algorithm (the one that takes the most time when
computing collation keys, or when directly comparing two strings) is computing
the NFD: I can save time when comparing pairs of strings (for example by
processing only the beginning of the strings until a collation difference is
found, which does not always require the strings to be fully converted to
NFD), but not when computing a full collation key (unless I also add a
constraint limiting the length of the generated collation keys, a constraint
that allows stopping early). I have looked at several implementations of
Unicode normalizers in Javascript, and I'm not satisfied with them, as they
are clearly not optimal.
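
Here is a sketch of that early-exit comparison, under simplifying assumptions:
primary level only, the built-in normalize() standing in for the lazy
incremental NFD described above, and primaryOf being a hypothetical weight
lookup.

    function comparePrimary(a, b, primaryOf) {
      var na = a.normalize('NFD'), nb = b.normalize('NFD');
      var i = 0, j = 0;
      for (;;) {
        var wa = 0, wb = 0;
        // Advance each side to its next non-ignorable primary weight.
        while (i < na.length && (wa = primaryOf(na[i++])) === 0) {}
        while (j < nb.length && (wb = primaryOf(nb[j++])) === 0) {}
        if (wa !== wb) return wa < wb ? -1 : 1; // first difference: stop early
        if (wa === 0) return 0;                 // both exhausted: equal
      }
    }

    // Toy weight function: combining marks are primary-ignorable.
    var w = { 'a': 1, 'b': 2 };
    function primaryOf(ch) {
      return w[ch] !== undefined ? w[ch] : (/\p{M}/u.test(ch) ? 0 : ch.codePointAt(0));
    }
    comparePrimary('áb', 'ab', primaryOf); // 0 — the accent never stops the scan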

In fact, both the NFD transform and the generation of collation weights for
each level are linked: if we sort only on the primary collation level, we lose
too much time computing the full NFD transform, which produces many details
that are finally ignored. This means I am also trying to perform a reduced NFD
transform that removes those details. Such a transform is effectively what the
Unicode standard calls a "collation mapping" (something more powerful than
just a "case mapping", or even a "case folding").

Such "collation mapping" would be extremely useful for implementing the "first
letter" classification in categories, or even to provide thiner classifications
in very populated categories (for example allowing two letters). This needs is
in fact exactly equivalent to searching for the smallest string that has the
smallest collation keys containing only two collation weights per collation
level, and with a collation set to only one level, and this also can be largely
optimized so that the NFD transform will remove all collation-ignorable details
that would never be needed to compute the collation key.
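
Reusing the collationMap() sketch above (still purely illustrative), the
"first letter" or "first two letters" classification reduces to:

    function firstLetters(title, n) {
      // A real implementation would cut after n collation elements, not n
      // characters, so that an expansion like "ß" -> "ss" counts as one letter.
      return collationMap(title).slice(0, n);
    }

    firstLetters('Évêché', 1);  // "e"  — groups with "Eveche", "évêché", ...
    firstLetters('Ébauche', 2); // "eb" — a finer bucket for a crowded category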

All this is an interesting subject of research (and even ICU does not provide
such a general mechanism...).

I will also be innovative in how to provide a very compact representation of
tailorings. I also have ideas on how to represent, store, query and cache the
precomputed lookup tables for tailorings, and on a standard interface that
will allow pluggable implementations in native code (if possible and needed)
or in other languages. The same interface would also cover non-UCA collations
(including dummy collations such as binary sorting, or sorting by standard
case folding or standard case mappings), as well as complex collations
requiring much more mapping data than the DUCET plus a set of collation orders
and contractions/expansions: for example, sorting Han ideographs by
radical/stroke, or sorting based on romanizations and other transliterations,
which are all language-dependent. (I have not yet developed anything about
transliterators, not even for the standard romanizations of Chinese and
Japanese, which require lookups in a large dictionary, such as the huge CJK
properties file found in the Unicode Character Database, because they cannot
be performed algorithmically like the standard romanizations of Russian or
Arabic.)
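
A sketch of what such a pluggable interface could look like in JavaScript (the
shape and the names are my assumptions, not a published API): every collation,
UCA-based or not, exposes the same compare() and getSortKey() members, so a
binary or case-folded collation can be swapped in for a tailored UCA one.

    var binaryCollation = {
      id: 'binary',
      getSortKey: function (s) { return s; }, // code-unit order: the string itself
      compare: function (a, b) { return a < b ? -1 : a > b ? 1 : 0; }
    };

    var caseFoldedCollation = {
      id: 'casefold',
      getSortKey: function (s) { return s.toLowerCase(); }, // stand-in for case folding
      compare: function (a, b) {
        return binaryCollation.compare(
          caseFoldedCollation.getSortKey(a),
          caseFoldedCollation.getSortKey(b));
      }
    };

    ['a', 'B'].sort(binaryCollation.compare);     // ["B", "a"] — raw code points
    ['a', 'B'].sort(caseFoldedCollation.compare); // ["a", "B"] — folded first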

If you think about it more closely, ALL the existing algorithms that transform
one Unicode text into another can be thought of as "collation mappings", based
on a language-dependent definition of a multilevel collation order. Collation
is the central concept, and the distinctions between algorithms are no
different from the distinctions between collation tailorings (a tailoring may
be language-neutral or language-dependent).

I will expose my solutions later, first on the Unicode mailing list, the ICU
mailing list and the jQuery mailing list, expecting that there will be
interesting contributions (or useful checks and corrections), before this can
become a generic library reusable in other projects like MediaWiki. I'll use a
licence that is compatible both with free licences (GPL-like) and open-source
licences (OSI), as well as with commercial projects. It will probably be
BSD-like.
