[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Wed, 21 Jul 2010 23:01:33 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #172 from Philippe Verdy <verd...@wanadoo.fr> 2010-07-22 06:00:34 
UTC ---
>> But if the SQL engine does not have such support, this must be implemented in
>> the PHP code and collation keys can be stored in a new datacolumn (the extra
>> data column can be added or filled conditionnally : if the SQL engine 
>> supports
>> the needed collations, this column can remain NULL to save storage space).
>> 
>If you sort this stuff in PHP, you need to grab the entire list before you can
>reliably sort it. Doing that for [[Category:Living people]] has no chance of
>staying within the memory limit.

And this was false. Because you assume that the generation of sort keys has to
be done on database queries for listing the content of a category, when instead
this generation os sortkeys can be done safely only on database inserts/updates
for each separate page.

What I wanted to say is that the computed sortkeys will have to be stored. But
several sort keys for the same page in the same category are possible (one for
each collation locale indicated by the target category).

There will be no memory limit, but of course this will have a storage cost, as
the stored sort keys will have to be queried along with the list of pages to
display. 

The good question to ask is: where do we store these sortkeys? Currently we
have a SQL relation table containing a unique index on (categoryID, sortkey,
pageID) and this is enough to perform the join with the table of pages. However
thre's still only one sortkey per page and per category.

That sortkey is needlessly using the pageID within the generated sortkey (this
is is visible when crossing a page limit and navigating throught pages) so in
fact the unique index is on (categoryID, "augmented sortkey"). Conceptually
wrong and bogous (I think this was just a fast patch when there were unicity
problems and multiple pages could be specified with the same sortkey).

What is limiting you of changing the relation table containing the list of
pages in categories, of using instead a unique index on:
 (categoryID, sortkey, pageID, localeID)
where the localeID is one of the locales supported by the project, which
specifies: the language for which a collation is being performed, and a
collation variant (for example, in Chinese, sort by radical/strokes with
locale="zh-Hans", or sort by pinyin with locale="zh-Latn")

The generation of the concent of the sortkey column is the only major problem
requiring a design decision. This is where it should not even depend on the SQL
engine, and where it can be implemented within PHP, using the PHP extension
that allows using ICU functions. That string does not have to be extremely long
and does not have to be be humane readable.

It can be safely be stored with a reasonnable length limit. So ICU-generated
sortkeys are still safe if they get truncated. Notably because the unique index
on:
 (categoryID, sortkey, pageID, localeID)
is also unique on its restriction:
 (categoryID, pageID, localeID)
And the sortkey generated by ICU, even if it's a string of binary bytes can
still safely be stored in a table index that does not support blobs but want
only "VARCHAR(n)" types, by serializing the binary sortkey to a safe encoding
(the most basic that will work is hexadecimal) that does not even require the
support of Unicode or UCA collation. Just use an ASCII only column to store the
computed binary sortkey serialized as an ASCII-only string.

But if the database engine supports strings of bytes, just don't serialize the
blob, use the supported SQL type that can store it directly, for example
VARBINARY(n), if it remains sortable in binary order.

With this design, you are completely independant of the SQL engine, it will
work identically on MySQL, PostrgresSQL, or others. And you'll have solved the
problems of collation with multiple locales according to their rules, and
possibly according to visitors preferences.

Note: above, the localeID is not directly a VARCHAR(n) containing "en", or
"zh-Hans". It is an arbitrary unique numeric identifier that maps in fact to a
collation rule within a locale, and this collation rule may need to be updated
from time to time: when upgraded a collation, you'll generate additional keys
with a new localeID. And when this is done the table of supported collations
will indicate which localeID is the current one to use, and you'll be able to
perform easiky the cleanup of old sortkeys that where computed within the old
rule.

It's not a complicate design, and it offers stability warranties and supports
as well the possibility of upgrading the collations.

The possibility of offering multiple sortkeys for the same page in the same
category comes as a bonus, and you can assign "localeID=0" to store the
user-specified sortkey that has been assigned in a page using the second
parameter of the [[category:pagename|sortkey]] link or the parameter of the
{{DEFAULTSORT:sortkey}} if this paramter was missing (this will avoid having to
reparse the pages just to retrieve this user-specified sortkey).

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l

[Bug 164] Support collation by a certain locale (sorting order of characters)

Reply via email to