On 5/12/09 1:49 PM, Aryeh Gregor wrote:
> On Tue, May 12, 2009 at 4:38 PM, Brion Vibber<br...@wikimedia.org>  wrote:
>> * Collation use for sorting needs to be double-checked to confirm it
>> wouldn't interfere with present uniqueness constraints
>
> Since cl_sortkey isn't part of any unique key, this appears not to be
> an issue for this use.  Of course, it's an issue for every other
> sorted list of titles, but those can't have custom sort keys specified
> to begin with and don't seem to be included in this proposal.  Perhaps
> they should be, though.  In that case we'd probably end up needing an
> extra column in every single table that includes the page title, just
> for sorting (but we'd be able to use flexible algorithms to generate
> the sort key, rather than being stuck with MySQL's).

As a general issue, we also need to consider how to page through 
collation-sorted lists, since different inputs may produce the same 
sort key. At the moment I think category lists are paged by offset 
(bad!), so we should make sure whatever we do here plans for that.
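
To make the paging concern concrete, here is a minimal sketch of paging by key instead of by offset, with the page ID as a tie-breaker so that rows sharing a sort key are neither skipped nor repeated. It uses an in-memory SQLite table purely for illustration; the simplified schema and the fetch_page/demo helpers are assumptions, not MediaWiki's actual code.

```python
import sqlite3

def fetch_page(conn, category, after=None, limit=3):
    """Return up to `limit` (sortkey, page_id) rows that follow the `after` key."""
    if after is None:
        cur = conn.execute(
            "SELECT cl_sortkey, cl_from FROM categorylinks "
            "WHERE cl_to = ? ORDER BY cl_sortkey, cl_from LIMIT ?",
            (category, limit),
        )
    else:
        sortkey, page_id = after
        # Continue strictly after the last row seen; cl_from breaks ties
        # between rows whose sort keys are equal.
        cur = conn.execute(
            "SELECT cl_sortkey, cl_from FROM categorylinks "
            "WHERE cl_to = ? "
            "AND (cl_sortkey > ? OR (cl_sortkey = ? AND cl_from > ?)) "
            "ORDER BY cl_sortkey, cl_from LIMIT ?",
            (category, sortkey, sortkey, page_id, limit),
        )
    return cur.fetchall()

def demo():
    """Page through a category in which three pages share one sort key."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema for illustration only.
    conn.execute(
        "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_sortkey TEXT)"
    )
    rows = [
        (1, "Docs", "apple"),
        (2, "Docs", "resume"),
        (3, "Docs", "resume"),
        (4, "Docs", "resume"),
        (5, "Docs", "zebra"),
    ]
    conn.executemany("INSERT INTO categorylinks VALUES (?, ?, ?)", rows)
    pages, after = [], None
    while True:
        page = fetch_page(conn, "Docs", after, limit=2)
        if not page:
            break
        pages.append(page)
        after = page[-1]  # the last (sortkey, page_id) seen becomes the cursor
    return pages
```

Calling demo() walks the category two rows at a time. Paging on the sort key alone would lose or repeat some of the "resume" rows whenever a page boundary falls inside that run.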

>> * Multilingual sites possibly not well served by table-wide
>> language-specific coding
>
> utf8 sorting would be a lot better than binary sorting for any site,
> I'm pretty sure.  (I assume utf8 sorts sanely and not according to
> codepoint.)

Well, "utf8" by itself doesn't tell you anything specific there... :) 
It's a character set with several collations: there's a "general" one 
as well as a "binary" one, and the binary one would behave the same as 
what we do now (except for not supporting 4-byte characters AT ALL).

http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
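
As a quick illustration of the 4-byte limitation: characters outside the Basic Multilingual Plane take four bytes in UTF-8, which a three-byte-per-character charset cannot store. The helper names below are made up for this sketch.

```python
def utf8_length(ch):
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

def fits_in_three_bytes(text):
    """True if every character of `text` encodes to at most three UTF-8 bytes."""
    return all(utf8_length(ch) <= 3 for ch in text)

# U+00E9 (e with acute) needs 2 bytes, U+8A9E (a CJK ideograph) needs 3,
# and U+1D11E (musical G clef, outside the BMP) needs 4.
samples = ["\u00e9", "\u8a9e", "\U0001D11E"]
lengths = [utf8_length(ch) for ch in samples]
```

A title containing something like U+1D11E would fail fits_in_three_bytes, so it could not be stored intact in such a column.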

For a multilingual site we'd probably end up using utf8_unicode_ci, 
which at least partially implements the Unicode Collation Algorithm 
(UCA). That sounds kind of confusing, since even a glance at 
http://www.unicode.org/reports/tr10/ makes it quite explicit that 
collation properties are language-dependent... presumably it's an 
un-tailored version which won't have most language-specific properties.

>> Doing our own localized sort key encoding and adding another indexed
>> column to sort on would avoid some dependency issues but has its own
>> deployment and maintenance difficulties.
>
> You don't need another column for categorylinks, you can use the
> existing cl_sortkey, so that should be relatively easy to deploy.  It
> doesn't help with non-category use cases, of course.

You would if you need to store a processed sort key for indexing that's 
not in the form of displayable characters (e.g., the output of the UCA).
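
A rough sketch of what "processed sort key" means here. This is NOT the UCA, just a toy stand-in (toy_sort_key is a made-up name): it builds a byte string from an accent-stripped, case-folded primary level, a 0x00 separator, and the original title as a tie-breaker. The result is meant for byte-wise comparison, not for display, which is why it wouldn't belong in a column that doubles as human-readable text.

```python
import unicodedata

def toy_sort_key(title):
    """Build a byte-comparable sort key. Illustrative only, not the UCA."""
    # Primary level: decompose, drop combining marks, then fold case.
    decomposed = unicodedata.normalize("NFD", title)
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    # The 0x00 separator sorts below any character byte, so a shorter
    # primary key still sorts before a longer one sharing its prefix.
    # The original title after it keeps distinct titles distinct.
    return primary.encode("utf-8") + b"\x00" + title.encode("utf-8")

titles = ["Zebra", "apple", "\u00c9clair", "eclair"]
by_binary = sorted(titles)                  # plain codepoint order
by_key = sorted(titles, key=toy_sort_key)   # order by the processed key
```

With plain codepoint order "Zebra" sorts first and the accented title last; with the processed key the two "eclair" variants sit next to each other between "apple" and "Zebra".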

>> It would also be possible to use a separate column for the collated
>> sorting while using MySQL 4.1+'s native collations, if the uniqueness
>> constraints are a problem, but this is still dependent on rolling out an
>> upgrade from 4.0.
>
> In that case we may as well make it like cl_sortkey and populate it
> ourselves, surely.

For the unique case of categorylinks, yes. For everything else, the 
additional columns are not already present.

-- brion

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
