[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Thu, 22 Jul 2010 02:13:59 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #174 from Philippe Verdy <verd...@wanadoo.fr> 2010-07-22 09:13:30 UTC ---
> If we're gonna discuss anything, let's discuss the current implementation
> plan: it is the only relevant plan to discuss at this point.

This is EXACTLY what I was discussing: proposing an implementation design that
also takes into account that collations will need to evolve over time. For
example, the UCS repertoire is evolving (so the DUCET table is modified), and
collation rules are being corrected for some languages in the CLDR project.
Each change will generate a new internal localeid to support it, and different
keys may coexist during the transition; the old collation rule (with its old
sortkeys) will finally be deleted once the new sortkeys have been fully
recomputed.

So this is clearly not "blah blah". And it is certainly relevant, given that
you're considering implementing some or all of the suggestions (and you'll have
to test your solutions, including their performance impact).

I propose a simple model that can be very space-efficient, and that also avoids
reparsing all pages if ever a collation rule is changed, or if additional
collation rules are added in a category to support multiple sort orders
(notably within Chinese categories that could support different orders).
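
A minimal sketch of that model, in Python for brevity. All names here
(SortkeyStore, add_collation, and so on) are hypothetical and are not
MediaWiki's schema or API. It also assumes that the raw sort text of each page
is kept alongside the computed keys, which is one way to avoid reparsing pages
when a rule is added or replaced:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

KeyFn = Callable[[str], str]  # computes a sortkey from a page's sort text


@dataclass
class SortkeyStore:
    """Hypothetical in-memory model: one sortkey per (collation id, page)."""
    collations: Dict[int, KeyFn] = field(default_factory=dict)
    keys: Dict[Tuple[int, str], str] = field(default_factory=dict)
    # Assumption: the raw sort text is stored, so no page is ever reparsed.
    sort_text: Dict[str, str] = field(default_factory=dict)

    def add_page(self, title: str, text: str) -> None:
        self.sort_text[title] = text
        for cid, fn in self.collations.items():
            self.keys[(cid, title)] = fn(text)

    def add_collation(self, cid: int, fn: KeyFn) -> None:
        """Register a new rule (or a new version of one) under a fresh id."""
        self.collations[cid] = fn
        for title, text in self.sort_text.items():
            self.keys[(cid, title)] = fn(text)

    def drop_collation(self, cid: int) -> None:
        """Delete an old rule together with its old sortkeys."""
        del self.collations[cid]
        for k in [k for k in self.keys if k[0] == cid]:
            del self.keys[k]

    def listing(self, cid: int) -> List[str]:
        rows = [(key, title) for (c, title), key in self.keys.items() if c == cid]
        return [title for key, title in sorted(rows)]


store = SortkeyStore()
store.add_collation(1, lambda s: s)      # old rule: plain binary order
for t in ("apple", "Zebra", "banana"):
    store.add_page(t, t)
store.add_collation(2, str.lower)        # corrected rule under a new id
both_orders = (store.listing(1), store.listing(2))  # both usable in transition
store.drop_collation(1)                  # old keys removed once id 2 is complete
```

During the transition both ids can serve listings; the same mechanism would
let one category carry several sort orders at once.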

My proposal does not even depend on the backend SQL server's capabilities: all
it has to support is a binary order on ASCII-only VARCHAR(n) columns, which
store the computed and truncated sortkeys. These are generated by the PHP
front-end (using ICU) and then serialized as ASCII only. This means that the
simplest "ORDER BY" queries to retrieve correctly ordered lists of pages will
work independently of the backend.
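
To illustrate, here is a rough sketch in Python rather than PHP. The function
toy_collation_key is only a stand-in for a real ICU sort key (it ranks base
letters first and accents second), and the table and column names are made up.
Hex encoding is one possible ASCII-only serialization that preserves the binary
order of the key bytes:

```python
import sqlite3
import unicodedata
from typing import List


def toy_collation_key(text: str) -> bytes:
    """Stand-in for an ICU sort key: base letters, then accents as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    primary = "".join(c for c in decomposed if not unicodedata.combining(c))
    secondary = "".join(c for c in decomposed if unicodedata.combining(c))
    return primary.encode("utf-8") + b"\x00" + secondary.encode("utf-8")


def ascii_sortkey(text: str, max_len: int = 32) -> str:
    """Hex-serialize the key (ASCII only, order-preserving) and truncate it."""
    return toy_collation_key(text).hex()[:max_len]


def ordered_titles(titles: List[str]) -> List[str]:
    """Store the keys in a plain text column and sort with a bare ORDER BY."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE catlinks (title TEXT, sortkey VARCHAR(32))")
    db.executemany(
        "INSERT INTO catlinks VALUES (?, ?)",
        [(t, ascii_sortkey(t)) for t in titles],
    )
    rows = db.execute("SELECT title FROM catlinks ORDER BY sortkey").fetchall()
    db.close()
    return [r[0] for r in rows]


titles = ["zebra", "Émile", "eagle", "étude", "etude"]
by_sortkey = ordered_titles(titles)
by_raw_bytes = sorted(titles, key=lambda t: t.encode("utf-8"))
```

Here by_sortkey places the accented titles next to their unaccented neighbours,
while by_raw_bytes pushes them after "zebra". The database only ever compares
ASCII strings byte by byte.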

The function used in PHP to generate the binary-ordered sortkey (the one that
will actually be stored) should also be accessible in MediaWiki as a builtin
parser function that takes two parameters: the text and the locale code.

For example, as {{SORTKEY:text|locale}}, where the ''locale'' parameter is
optional and defaults to the {{CONTENTLANGUAGE}} of the project.
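
The calling convention could look roughly like the following Python sketch.
The names CONTENT_LANGUAGE and TAILORED_AFTER_Z are hypothetical, and the
tailoring table is a toy covering only the Swedish letters that sort after "z";
a real implementation would delegate to ICU with CLDR rules:

```python
import unicodedata
from typing import Optional

CONTENT_LANGUAGE = "de"  # hypothetical stand-in for {{CONTENTLANGUAGE}}

# Toy tailoring: letters that a locale treats as separate letters after "z".
TAILORED_AFTER_Z = {"sv": "åäö"}


def sortkey(text: str, locale: Optional[str] = None) -> str:
    """Return an ASCII-only, untruncated sortkey for text in the given locale."""
    locale = locale or CONTENT_LANGUAGE  # optional parameter, project default
    tailored = TAILORED_AFTER_Z.get(locale, "")
    parts = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in tailored:
            # "7b" is "{", the byte right after "z" (0x7a); the index keeps
            # the tailored letters in their own relative order.
            parts.append("7b%02x" % tailored.index(ch))
        else:
            # Otherwise ignore accents: keep only the base character.
            base = unicodedata.normalize("NFD", ch)[0]
            parts.append(base.encode("utf-8").hex())
    return "".join(parts)


words = ["zebra", "öl", "ost"]
swedish = sorted(words, key=lambda w: sortkey(w, "sv"))
default = sorted(words, key=sortkey)  # falls back to CONTENT_LANGUAGE
```

With the Swedish tailoring "öl" sorts after "zebra"; with the default locale it
sorts next to "ost". The result is left untruncated, so a caller such as a
sortable table can use it as is.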

This builtin parser function could also be used to correctly sort the sortable
MediaWiki tables inserted in articles, using the sortkeys it generates,
provided that the generated sortkey is readable and effectively serialized as
ASCII only. The function itself does not necessarily have to truncate its
result, even though the result will be truncated when it is used to store
sortkeys in the schema.

This parser function should even be the first thing to implement, before even
changing the category-page indexes, because it can be applied and tested
immediately in existing categories (for example by using categorizing templates
in Wiktionary), without even upgrading the SQL schema.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
