[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Thu, 22 Jul 2010 10:58:18 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #183 from Aryeh Gregor <simetrical+wikib...@gmail.com> 2010-07-22 17:57:42 UTC ---
Okay, look, if you make posts this long I'm not going to be able to read them. 
You have to state your points briefly and clearly, or else I'm just not going
to have time to read through carefully and figure out exactly what you're
trying to say.  Walls of text are really not helpful.

Furthermore, you're focusing too narrowly on implementation details, like the
exact interface a function should implement.  Please try to state your concerns
primarily in terms of use-cases or problems -- scenarios that the
implementation will need to handle.  Secondarily, it's fine if you propose
specific solutions, but I can't evaluate your solutions unless you state
clearly what problems they're trying to solve.

I'll comment on a few things you said.  If I miss anything important, please
restate it *briefly*, and I can address that too.

(In reply to comment #172)
> What I wanted to say is that the computed sortkeys will have to be stored. But
> several sort keys for the same page in the same category are possible (one for
> each collation locale indicated by the target category).

If we store separate sortkeys per category link per locale, it would take
vastly too much space, since there are likely to be dozens of locales. 
categorylinks is already a big table; we can't justify making it ten times
bigger for this.  Each category can have one locale and only one locale.

> That sortkey needlessly includes the pageID within the generated sortkey (this
> is visible when crossing a page limit and navigating through pages), so in
> fact the unique index is on (categoryID, "augmented sortkey"). Conceptually
> wrong and bogus (I think this was just a fast patch when there were uniqueness
> problems and multiple pages could be specified with the same sortkey).

It is not needless.  We need a unique key to sort by, or else paging doesn't
work if many pages have the same sort key (e.g., if it was manually assigned). 
See bug 4912.
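
A minimal sketch of why the tie-breaker matters, using Python with an in-memory
SQLite table.  The column names follow categorylinks (cl_to, cl_sortkey,
cl_from), but the data, the page size and both helper functions are made up for
illustration; this is not the query MediaWiki actually runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks (cl_to TEXT, cl_sortkey TEXT, cl_from INTEGER)"
)
# Five pages share the same manually assigned sortkey "A"; one page has "B".
rows = [("Cats", "A", page_id) for page_id in range(1, 6)] + [("Cats", "B", 6)]
conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", rows)


def page_by_sortkey_only(after_key, limit=2):
    """Continue after a sortkey alone: every remaining row with that key is skipped."""
    cur = conn.execute(
        "SELECT cl_from FROM categorylinks "
        "WHERE cl_to = 'Cats' AND cl_sortkey > ? "
        "ORDER BY cl_sortkey LIMIT ?",
        (after_key, limit),
    )
    return [r[0] for r in cur]


def page_by_unique_key(after_key, after_id, limit=2):
    """Continue after the unique pair (sortkey, page id)."""
    cur = conn.execute(
        "SELECT cl_sortkey, cl_from FROM categorylinks "
        "WHERE cl_to = 'Cats' "
        "AND (cl_sortkey > ? OR (cl_sortkey = ? AND cl_from > ?)) "
        "ORDER BY cl_sortkey, cl_from LIMIT ?",
        (after_key, after_key, after_id, limit),
    )
    return cur.fetchall()


def walk_all(limit=2):
    """Visit every page of the listing using the unique key as the cursor."""
    seen, key, page_id = [], "", 0
    while True:
        batch = page_by_unique_key(key, page_id, limit)
        if not batch:
            return seen
        seen.extend(pid for _, pid in batch)
        key, page_id = batch[-1]


# After a first page of two "A" rows, continuing "after A" jumps straight to "B".
second_page_sortkey_only = page_by_sortkey_only("A")
all_pages = walk_all()
```

With the sortkey alone as the cursor, three of the five "A" pages are never
shown; with (sortkey, page id) every page is listed exactly once.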

> The generation of the content of the sortkey column is the only major problem
> requiring a design decision. This is where it should not even depend on the
> SQL engine, and where it can be implemented within PHP, using the PHP
> extension that allows using ICU functions. That string does not have to be
> extremely long and does not have to be human-readable.

Yes, we will certainly use ICU or something.  GerardM has said he much prefers
CLDR, so maybe we'll use that instead.

> It can safely be stored with a reasonable length limit. So ICU-generated
> sortkeys are still safe if they get truncated. Notably because the unique
> index on:
>  (categoryID, sortkey, pageID, localeID)
> is also unique on its restriction:
>  (categoryID, pageID, localeID)
> And the sortkey generated by ICU, even if it's a string of binary bytes, can
> still safely be stored in a table index that does not support blobs but wants
> only "VARCHAR(n)" types, by serializing the binary sortkey to a safe encoding
> (the most basic that will work is hexadecimal) that does not even require the
> support of Unicode or UCA collation. Just use an ASCII-only column to store
> the computed binary sortkey serialized as an ASCII-only string.
> 
> But if the database engine supports strings of bytes, just don't serialize the
> blob, use the supported SQL type that can store it directly, for example
> VARBINARY(n), if it remains sortable in binary order.
> 
> With this design, you are completely independent of the SQL engine; it will
> work identically on MySQL, PostgreSQL, or others. And you'll have solved the
> problems of collation with multiple locales according to their rules, and
> possibly according to visitors' preferences.

This was always the intended design.  (But note that truncating these sort keys
might not always be fully safe: it could cause tons of collisions, depending on
the nature of the sort key generation algorithm.  These would be resolved
arbitrarily, by page_id, which is confusing to users.)
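
To make both points concrete, here is a small Python sketch.  The byte strings
are invented stand-ins for binary sort keys (not real ICU output), and the page
ids and the three-byte limit are likewise assumptions chosen to show the effect.

```python
# Hypothetical binary sort keys; a real collation library would produce these.
keys = [
    b"\x2a\x30\x01\x05",
    b"\x2a\x30\x01\x09",
    b"\x2a\x31\x01\x05",
    b"\xff\x00",
    b"\x00\xff",
]

# Lowercase hex maps each byte to two ASCII characters whose order matches the
# byte's numeric order, so sorting the hex strings equals sorting the raw bytes.
hex_sorted = sorted(k.hex() for k in keys)
hex_order_matches = [bytes.fromhex(h) for h in hex_sorted] == sorted(keys)

# Hypothetical pages: page id -> full sort key.
pages = {
    101: b"\x2a\x30\x01\x09",
    102: b"\x2a\x30\x01\x05",
    103: b"\x2a\x31\x01\x05",
}

LIMIT = 3  # assumed column length limit, in bytes

# Order by (sort key, page id), first with full keys, then with truncated keys.
full_order = sorted(pages, key=lambda pid: (pages[pid], pid))
truncated_order = sorted(pages, key=lambda pid: (pages[pid][:LIMIT], pid))
```

Here `full_order` is [102, 101, 103], but after truncation pages 101 and 102
share the same key and fall back to page id, giving [101, 102, 103].  The
listing stays unique and pageable, but the visible order is no longer the
collation's order.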

> It's not a complicated design, and it offers stability guarantees and also
> supports the possibility of upgrading the collations.

Upgrading the collation can be done in-place.  The worst case is that
categories sort weirdly for a few hours.  Also, we would only realistically
have to change the collation often on smaller wikis, since the largest wikis
should have high-quality collations that cover almost all their pages to begin
with.  I don't think we need to adjust the schema to prepare for this.
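
A rough sketch of what "sort weirdly for a few hours" means, in Python.  The two
collation functions, the titles and the batch size are all hypothetical; the
point is only that while rows are being rewritten in place, old and new keys
coexist in the same column.

```python
def old_collation(title):
    """Hypothetical old scheme: raw UTF-8 byte order (uppercase sorts first)."""
    return title.encode("utf-8")


def new_collation(title):
    """Hypothetical new scheme: case-insensitive order."""
    return title.casefold().encode("utf-8")


titles = ["apple", "Banana", "cherry", "Date"]
# Stand-in for the sortkey column: title -> stored sort key.
table = {t: old_collation(t) for t in titles}


def listing():
    """Category listing as a reader would see it right now."""
    return sorted(table, key=lambda t: table[t])


def upgrade_in_batches(batch_size=2):
    """Rewrite keys in place, recording the listing after each batch."""
    snapshots = []
    for i in range(0, len(titles), batch_size):
        for t in titles[i:i + batch_size]:
            table[t] = new_collation(t)
        snapshots.append(listing())
    return snapshots


before = listing()
snapshots = upgrade_in_batches()
```

The listing starts as Banana, Date, apple, cherry; after the first batch it is
Date, apple, Banana, cherry (neither the old nor the new order); after the last
batch it settles at apple, Banana, cherry, Date.  No schema change is involved,
only a transient mix of keys.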

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l