[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Thu, 19 Nov 2009 16:12:10 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=164






--- Comment #146 from Philippe Verdy <verd...@wanadoo.fr>  2009-11-20 00:11:53 
UTC ---
I did not make assumptions about SQL engines, because this message says exactly
the opposite: there are several varieties, and MediaWiki could/should also work
with SQL backends other than just MySQL, even if MySQL is open source (it is
now mainly supported by Sun, which has its own strategy for it).

But I really wanted to note that the page id is of no use for sorting results,
given that we really want to sort by full page name, either as a single entity
or split into two parts if we consider the namespace...

And maybe into more parts, if we consider the subpage names that are also part
of the full page name, by specially tailoring the "/" character in the
collation order so that "a/c" will still sort with "a", and not after "ä" or
"a!". This is done by giving a less-than-minimum primary weight to the "/"
character used as the subpage separator (a value even lower than that of
characters with default-ignorable primary weights).
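
To illustrate the effect, here is a minimal Python sketch. It does not use a
real UCA implementation: the primary() helper is a crude stand-in (casefold,
then strip combining marks), and page_sort_key() is a hypothetical name. It
only shows that comparing the name segment by segment behaves like giving "/"
a weight lower than every other character.

```python
import unicodedata

def primary(segment):
    # Crude stand-in for primary weights (an assumption, not real UCA):
    # casefold, decompose, and drop combining marks so "ä" matches "a".
    decomposed = unicodedata.normalize("NFD", segment.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def page_sort_key(full_name):
    # One (primary, raw) pair per subpage level. Tuple comparison makes a
    # shorter path sort before its own subpages, as if "/" had a weight
    # lower than any real character.
    return tuple((primary(seg), seg) for seg in full_name.split("/"))

titles = ["ä", "a!", "a/c", "a", "ab"]
hierarchical = sorted(titles, key=page_sort_key)
print(hierarchical)    # "a/c" comes directly after "a"
print(sorted(titles))  # code point order puts "a!" between "a" and "a/c"
```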

Another way to look at multicolumn hierarchical sort keys is to think of them
as if the columns were separated in a single string by an implicit (non-coded)
character with a non-null but less-than-minimal primary weight, lower than that
of default-ignorable characters. In UCA implementations there is a reserved
weight for this function: fully ignorable characters get the null primary
weight, which does not participate at all in the computed collation key; this
invisible character takes the next weight; then the default-ignorable
characters come just after, followed by combining characters, then whitespace,
and then generally, in this order: punctuation and symbols, numerals, letters
and ideographs.

This implicit separator may be the same as the one that separates the series
of weights at each level within the same source string. Many UCA
implementations, including ICU, do not need to encode that level separator in
the computed multilevel collation key, as they allocate the collation weights
in non-overlapping numeric ranges, where the highest one is used for the
primary weights; but they still maintain a reserved value for easily computing
multicolumn sort keys by simple concatenation, with the encoded binary primary
weight of this column separator placed between the columns.
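
The concatenation idea can be sketched in Python as follows. This is only an
illustration under stated assumptions: column_key() uses raw UTF-8 bytes in
place of real collation weights, and COLUMN_SEPARATOR (the byte 0x01) is a
hypothetical reserved value, not the one any actual UCA implementation uses.

```python
COLUMN_SEPARATOR = b"\x01"  # hypothetical reserved weight, below all others

def column_key(text):
    # Stand-in for a binary collation key: plain UTF-8 bytes (assumption).
    key = text.encode("utf-8")
    if any(byte <= 0x01 for byte in key):
        raise ValueError("column collides with the reserved separator")
    return key

def multicolumn_key(columns):
    # Join the per-column keys with the reserved low separator.
    return COLUMN_SEPARATOR.join(column_key(col) for col in columns)

pages = [("Help!", "A"), ("Help", "B"), ("Help", "A")]

by_tuple = sorted(pages, key=lambda cols: tuple(column_key(c) for c in cols))
by_concat = sorted(pages, key=multicolumn_key)
by_colon = sorted(pages, key=lambda cols: ":".join(cols))

print(by_concat == by_tuple)  # the low separator preserves column order
print(by_colon == by_tuple)   # ":" sorts after "!", so this order differs
```

Because the separator is lower than every byte that can occur inside a column,
a single binary comparison of the concatenated key gives the same result as
comparing the columns one by one.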

Note also that in UCA, some groups of distinct strings whose collation keys are
computed at the maximum level (generally level 4 with the DUCET, but possibly
more in collations tailored for some complex languages) can still have the
same collation key: this will be true if the strings contain different
characters but are still canonically equivalent (i.e. they have the same NFC
form). One string in each such group will be in NFC form; the others are in
alternate non-canonical forms and may even be longer.
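
A small Python check of canonical equivalence, using only the standard
unicodedata module. The two sample strings are my own illustration: "é" as a
single code point, and as "e" followed by a combining acute accent.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one code point (already in NFC form)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings differ in binary form...
same_binary = precomposed == decomposed

# ...but they are canonically equivalent: same NFC form.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

print(same_binary, same_nfc)
print(len(precomposed), len(decomposed))  # the non-NFC form is longer
```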

This is not a problem for language-sensitive collation, but it does sometimes
require adding a final binary comparison between the Unicode-encoded strings,
to make sure that the sort order will be stable across database updates. Such
a comparison need not be stored in the collation key itself, given that we
will still retrieve the full page names from the database for display, and not
just the collation key, which is only used in the ORDER BY clause.
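
A rough Python sketch of that final tie-break. Here collation_key() is only a
stand-in (NFC normalization) for a real collation key, and stable_sort_key()
is a hypothetical name; in SQL terms this corresponds to ordering by the
stored key first and by the page name itself second.

```python
import unicodedata

def collation_key(title):
    # Stand-in for a real collation key (assumption): NFC normalization,
    # so canonically equivalent titles get identical keys.
    return unicodedata.normalize("NFC", title)

def stable_sort_key(title):
    # Collation key first, then the raw string as a binary tie-break.
    return (collation_key(title), title)

titles_a = ["e\u0301t\u00e9", "\u00e9t\u00e9", "abc"]  # two spellings of "été"
titles_b = list(reversed(titles_a))                    # same rows, other order

# Without the tie-break, the result depends on the incoming row order.
print(sorted(titles_a, key=collation_key) == sorted(titles_b, key=collation_key))

# With the tie-break, the result is the same whatever the row order.
print(sorted(titles_a, key=stable_sort_key) == sorted(titles_b, key=stable_sort_key))
```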

