https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #232 from Philippe Verdy <verd...@wanadoo.fr> 2011-08-29 01:29:45 UTC --- If table contents are static, it's true that their collation weights could be delivered as well with the visible table. In a past message above, I had already proposed that the function that already is needed to sort categories, and that computes the UCA collation keys, would be exposed as a ParserFunction. This would then obviously allow any wiki page to compose web tables containing both the visible data and the associated collation key. Many very complex templates, that are used to generate sortable pseudo-collation keys would no longer be needed or would be largely simplified. In fact, I even proposed initially to first expose this function before even modifying the schema of categories : there was no immediate mergency to modify this schema, as long as we could experiment assigning sort keys when categorizing a page, using that ParserFunction. And probably we would have been able to experiment with a better schema, including one that would have allowed multiple collation keys (generated from different locale parameters), something that has been forgotten completely (and that will require a new change in the database schema in the future, to support multiple collations according to users localisation preferences (and expectations). Anyway, my work on JS implementation of UCA is already well advanced, and in fact more advanced now on some points that MySQL still does not support correctly, notably contractions and expansions. And very soon I'll add support for contexts (see UTR#35 to see what I mean with those terms), because it's a trivial use of contractions and expansions, combined But it can be implemented also as a separate function, to generate shorter keys, I'm stil not very concerned by choosing the most compact format for representing collation keys, but I have already worked a lot in how to compact automatically the sets of collation weights for each level, suppressing all unnecessary gaps that are shown in the standard DUCET table, and only provided for convenience with very basic implementation of UCA not using tailorings, or only using very basic tailorings and no contraction/expansions). For now, the most critical algorithm (that takes most time when computing collation keys, or when directly comparing two strings) is the computing the NFD: I can save time when comparing pair of strings (for example processing only the begining of strings ontil a collation difference is found, and this does not always require strings being fully converted to NFD), but not when computing any collation key (unless I also add a constraint limiting the length of the generated collation keys, this constraint allowing to stop early). I have looked at severla implementation of Unicode normalizers in Javascript, I'm not satisfied with them as they are clearly not optimal. In fact both the NFD transform and the generation of collation weights for each level are linked: if we sort only on primary collation level, we loose too much time in computing the NFD transform, that provides too many unnecessary details that will be finally ignored: this means that I'm also trying to perform a NFD transform, that removes these details. Such transform is effectively what the Unicode standard calls a "collation mapping" (something more powerful than just a "case mapping", or even a "case folding"). Such "collation mapping" would be extremely useful for implementing the "first letter" classification in categories, or even to provide thiner classifications in very populated categories (for example allowing two letters). This needs is in fact exactly equivalent to searching for the smallest string that has the smallest collation keys containing only two collation weights per collation level, and with a collation set to only one level, and this also can be largely optimized so that the NFD transform will remove all collation-ignorable details that would never be needed to compute the collation key. All this is an interesting subject of research (and even ICU does not provide such a general mechanism...). I will be also innovative in how to provide a very compact representation of tailorings. I also have ideas on how to represent, store, query and cache the precomputed lookup tables for tailorings. And on a standard interface that will allow plugable implementations in native code (if possible and needed), or for use with other languages, but also with other non-UCA based collations (including dummy collations, such as binary sorting, or sorting by standard case folding, or by standard case mappings), or complex collations requiring a lot more mapping data than the DUCET and a set of collation orders and contractions/expansions (for example for sorting Han ideographs on radical/stroke, or for sorting based on romanisations and other translitterations, that are all language-dependant; but for which I have not developed anything for now about translitterators, not even for the standard romanizations of Chinese and Japanese, that require lookup in a large dictionary, such as the huge CJK properties file found in the Unicode database as it cannot be performed algorithmically like standard romanizations of Russian or Arabic). If you think about it more closely, ALL the existing algorithms that transform any Unicode text into another can be thought as "collation mappings", based on a language-dependant definition of a multilevel collation order. Collation is the central concept, and the distinctions between algorithms are not different from distinctions between collation tailorings (a tailoring may be language-neutral, or be language-dependant). I will expose my solutions later, first on the Unicode mailing list, on the ICU mailing list, on the jQuery maiing list, expecting that there will be interesting contributions (or useful checks and corrections), before it can become a generic library that can be reused in other projects like MediaWiki. I'll use a licence that will be compatible both with free licences (GPL-like) and open-source licences (OSI), as well as with commercial projects. It will probably be BSD-like. -- Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. _______________________________________________ Wikibugs-l mailing list Wikibugs-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikibugs-l