[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Sun, 25 Jul 2010 10:24:09 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #189 from Aryeh Gregor <simetrical+wikib...@gmail.com> 2010-07-25 17:22:32 UTC ---
(In reply to comment #186)
> That's a bad assumption: even the highest quality collations will need to be
> updated from time to time:
> - Unicode evolves and new characters get encoded (new versions are published
> about every 1-2 years after synchronization and final balloting at both ISO
> WG2 and the UTC).
> - The content of the Unicode DUCET is NOT stable: characters are inserted in
> the sequence so that the full list of collation weights needs to be offset
> where the new characters get inserted.
> - Collations for languages get corrected. We should be able to upgrade these
> rules when the CLDR project produces new tailorings (CLDR updates are
> published separately, about every 1-2 years.)

It's not critical for most wikis to use the very latest collations, though. 
The English Wikipedia, for example, will do fine with even very out-of-date
collations, since the article names overwhelmingly use characters that have
been in Unicode for an extremely long time.  The same is true for most other
large wikis, and on small wikis changing collations will be fast in any event.

However, Tim Starling independently suggested a cl_collation column, and my
initial implementation does have one.  The current implementation doesn't
allow the same row to be stored multiple times with different collations, so
sorting might still be wrong while the collation is being updated, but this
might not be a big problem in practice.  If it is, we can go with your idea of
extending the primary key
to (cl_collation, cl_from, cl_to) instead of (cl_from, cl_to).
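
To illustrate the difference between the two primary keys, here is a minimal
sketch using Python's sqlite3 module.  The table names links_narrow and
links_wide, the column types, and the collation names "old" and "new" are all
assumptions made for the example; this is not the actual MediaWiki schema.

```python
import sqlite3

COLUMNS = """
    cl_from      INTEGER NOT NULL,  -- page ID (type is an assumption)
    cl_to        TEXT    NOT NULL,  -- category name
    cl_collation TEXT    NOT NULL,  -- collation used to build cl_sortkey
    cl_sortkey   TEXT    NOT NULL
"""

def make_tables():
    """Create two hypothetical tables that differ only in their primary key."""
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE links_narrow ({COLUMNS},"
               " PRIMARY KEY (cl_from, cl_to))")
    db.execute(f"CREATE TABLE links_wide ({COLUMNS},"
               " PRIMARY KEY (cl_collation, cl_from, cl_to))")
    return db

def can_store_both(db, table):
    """Try to store one category link under two collations at once."""
    rows = [(1, "People", "old", "SMITH"), (1, "People", "new", "smith")]
    try:
        db.executemany(
            f"INSERT INTO {table} (cl_from, cl_to, cl_collation, cl_sortkey)"
            " VALUES (?, ?, ?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        # The second row collides with the first on (cl_from, cl_to).
        return False
```

With the narrow key the second insert is rejected, so a row can carry only one
collation at a time; with the wide key both versions can coexist while an
update is in progress.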

> Anyway you did not reply to the idea of first developing the parser functions
> and testing them. Developing the SQL schema extension should not be attempted
> before at least the first function {{SORTKEY:text|locale|level}} has been
> fully developed and tested on specific pages (it can be tested easily in
> tables).
>
> The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed
> immediately to develop the schema, but will be useful to restore the
> functionality of headings. Headings don't need to be stored as they can be
> computed on the fly, directly by reading sequentially the sorted result set
> from the SQL query:

I tried to figure out what you were talking about here, and eventually figured
out that you just want me to expose Language::convertToSortkey() and
Language::firstLetterForLists() (as they're currently called) as parser
functions, for the benefit of lists.  That might be useful, and should be easy
to do, although it's not part of my assigned job here.
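
The quoted idea of computing headings on the fly from the sorted result set
could look roughly like the following Python sketch.  The function names
first_letter() and headings() are hypothetical stand-ins, not the real
Language methods, and the first-letter rule is a placeholder that a real
collation would replace.

```python
from itertools import groupby

def first_letter(title):
    """Placeholder heading rule: the uppercased first character."""
    return title[:1].upper()

def headings(sorted_titles, heading_for=first_letter):
    """Group titles that are already in sort order under their headings.

    The input is read sequentially, so nothing about the headings needs to
    be stored; each one is derived from the title as it goes by.
    """
    return [(heading, list(group))
            for heading, group in groupby(sorted_titles, key=heading_for)]
```

This only works if the input is already sorted consistently with the heading
rule, which is what the SQL query is assumed to guarantee.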

> You can compute headings from the returned page names, or from the existing
> stored "cl_sortkey" which should be used now ONLY to store the plain-text
> specified in articles with {{DEFAULTSORT:sortkey}} and
> [[category:...|sortkey]].

I've introduced cl_raw_sortkey to store that, while retaining cl_sortkey for
actual sorting.  Based on your feedback, it might make more sense to rename
cl_raw_sortkey to cl_sortkey_prefix, put {{defaultsort:}} into it, and make it
the empty string by default.

(In reply to comment #187)
> This does not require any change in the schema. This can be made immediately
> by MediaWiki developers and will not influence the development of all
> corrections needed for bug #164 itself.

Correct.  I'll probably do it anyway in the course of the work I'm doing,
though, since I'll be rewriting CategoryPage.php's sort logic in any case.

> For example in people's names whose page name is "FirstNames LastName" but
> that we want to sort as if they were "LastName, FirstNames" by indicating only
> {{DEFAULTSORT:LastName !}} (it should not be needed to include the FirstNames
> in the wiki text, as this sort hint will not be unique and the group of pages
> using the same hint will still need to sort within this group using their
> natural order).

I can append the page title to cl_raw_sortkey before computing cl_sortkey. 
That shouldn't be a problem.  As noted above, maybe renaming it to
cl_sortkey_prefix would make more sense.
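
As a rough sketch of appending the page title to the prefix before computing
the sortkey, in Python: the function name build_sortkey(), the newline
separator, and the use of str.upper as the collation function are all
assumptions for illustration.

```python
def build_sortkey(prefix, title, collate=str.upper):
    """Compute a sortkey from a user-supplied prefix plus the page title.

    An empty prefix (assumed to be the default) means the title alone is
    used.  The newline separator is an assumption; it sorts before printable
    characters, so a shorter prefix sorts ahead of a longer one sharing it.
    """
    raw = prefix + "\n" + title if prefix else title
    return collate(raw)

def sort_pages(pages):
    """Sort (prefix, title) pairs by their computed sortkey."""
    return sorted(pages, key=lambda page: build_sortkey(*page))
```

Pages sharing the prefix "Smith" then end up grouped together and ordered by
title within the group, without the first names being repeated in the wiki
text.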

(In reply to comment #188)
> An "interface" function is DEFINITELY NOT an "implementation" detail. My
> comment #180 _correctly_ describes and summarizes what is really wanted.

Unfortunately, it's hard for me to understand what you're saying.  However, I
think I've got it now, at least mostly.  I don't think parser functions are
essential here, and they're not strictly part of this bug.  They can be added
after the initial implementation is finished.

> It correctly explains the dependencies and why any change in the SQL schema
> can be and should be delayed. In fact I consider the step (1) in my plan to
> have a high priority on all the rest, and it does not imply any immediate
> change in the schema to support it.

Writing those functions is not part of my job.  I expect the i18n people will
handle that.  I'm just doing a prototype, which is agnostic about what sortkey
function you give it.  My current sortkey function is just strtoupper(), for
testing.  (This does improve English sorting when $wgCapitalLinks is off.)
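
To show what "agnostic about the sortkey function" means, here is a small
Python sketch.  The name sort_titles() is hypothetical, and str.upper stands
in for PHP's strtoupper().

```python
def sort_titles(titles, sortkey=str.upper):
    """Sort titles by whatever sortkey function the caller plugs in."""
    return sorted(titles, key=sortkey)

# Titles as they might appear when first letters are not forced to uppercase.
titles = ["apple", "Banana", "cherry", "Date"]

# Plain code-point order puts every uppercase-initial title first.
by_codepoint = sort_titles(titles, sortkey=lambda t: t)

# Uppercasing the key interleaves them the way an English reader expects.
by_upper = sort_titles(titles)
```

Swapping in a real collation later would only mean passing a different
function; the code around it stays the same.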

> - to develop the new schema for stored sortkeys, based only on the internal
> PHP functions implementing {{SORTKEY:text|locale|level}}. The schema should
> NOT be deployed before the collations have been tested extensively by users
> and their internal data structures reflect the wanted collations order and
> tailorings.

This is basically what I'm doing, except I'm not going to work out exactly what
sortkey function to use.

> - to develop the {{COLLATIONMAP:text|locale|level|clusters}} parser function
> (for later inclusion in the generation of the correct "column headings" in the
> lists displayed by category pages, because for now these headings are almost
> useless for anything else than English, or in heavily populated categories).

I'm not doing this part, but I'm setting up the framework so that i18n people
can work on it (minus the parser function, which can be added later).  At
least, if your COLLATIONMAP is meant to be anything like my
Language::firstLetterForLists().  I'm not sure if it is.  If it's not, please
explain better, because I don't follow.
