https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #146 from Philippe Verdy <verd...@wanadoo.fr> 2009-11-20 00:11:53 UTC ---
I did not make assumptions about SQL engines; my message says exactly the opposite: there are several of them, and MediaWiki could, and should, also work with SQL backends other than MySQL, even though MySQL is open source (it is now mainly supported by Sun, which has its own strategy for it).

What I really wanted to note is that the page id is of no use for sorting results, given that we want to sort by full page name, either as a single entity or split into two parts if we treat the namespace separately. It could even be split into more parts if we consider subpage names, which are also part of the full page name: the "/" character can be specially tailored in the collation order so that "a/c" still sorts with "a", and not after "ä" or "a!". This is done by giving the "/" used as subpage separator a primary weight lower than the minimum (a value that sorts even before characters with default-ignorable primary weights).

Another way to look at multicolumn hierarchical sort keys is to think of the columns as joined in a single string by an implicit (non-coded) character whose primary weight is non-null but lower than that of the default-ignorable characters. UCA implementations have a reserved weight for this purpose: fully ignorable characters get the null primary weight, which does not participate at all in the computed collation key; this invisible character takes the next weight; the default-ignorable characters come just after, followed by combining characters, then whitespace, and then generally punctuation and symbols, numerals, letters and ideographs, in that order.
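To make the subpage idea concrete, here is a minimal Python sketch. It is not MediaWiki code and it does not implement UCA: `toy_primary` and `subpage_sort_key` are hypothetical names, and the "primary weight" is crudely approximated by case folding and dropping combining marks. Splitting the title on "/" and comparing the parts as a tuple has the same effect as giving "/" a weight lower than every real character, because a shorter tuple that is a prefix of a longer one always sorts first.

```python
import unicodedata


def toy_primary(segment: str) -> str:
    """Toy stand-in for a UCA primary key (assumption, not real UCA):
    case-fold, decompose, and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", segment.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def subpage_sort_key(full_title: str):
    """Hierarchical key: one (primary, raw) pair per subpage level.

    Splitting on "/" makes the separator sort below every character,
    so all subpages of "a" stay grouped directly after "a".
    """
    return tuple((toy_primary(part), part) for part in full_title.split("/"))


titles = ["\u00e4", "a!", "a/c", "a", "a/b", "ab"]  # "\u00e4" is "ä"

# Plain code point order lets "a!" slip in between "a" and its subpages.
by_code_point = sorted(titles)
# -> ['a', 'a!', 'a/b', 'a/c', 'ab', 'ä']

# Hierarchical order keeps "a/b" and "a/c" directly after "a".
by_hierarchy = sorted(titles, key=subpage_sort_key)
# -> ['a', 'a/b', 'a/c', 'ä', 'a!', 'ab']
```

In a database the same effect would come from storing a precomputed key in which the column separator is encoded with the reserved low weight, rather than from splitting at query time.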
This implicit separator may be the same as the one that separates the series of weights at each level within a single source string. Many UCA implementations, including ICU, do not need to encode that separator in the computed multilevel collation key, because they allocate the collation weights in non-overlapping numeric ranges, with the highest range used for the primary weights; but they still keep a reserved value so that multicolumn sort keys can be computed by simple concatenation, using the encoded binary primary weight of this column separator.

Note also that in UCA, some groups of distinct strings can still have the same collation key even when the key is computed at the maximum level (generally level 4 with the DUCET, but possibly more in collations tailored for some complex languages). This happens when the strings contain different characters but are canonically equivalent (i.e. they have the same NFC form): one string in each group is in NFC form, and the others are in alternate non-canonical forms and may even be longer. This is not a problem for language-sensitive collation, but it sometimes requires a final binary comparison between the Unicode-encoded strings, to make sure that the sort order stays stable across database updates. That comparison need not be stored in the collation key itself, given that we still retrieve the full page name from the database for display, and not just the collation key, which is only used in the ORDER BY clause.
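The final binary comparison can be sketched in Python as a two-part sort key. This is only an illustration under stated assumptions: `toy_collation_key` is a hypothetical stand-in that uses NFD normalization to mimic the property that canonically equivalent strings share a collation key, and `stable_sort_key` appends the raw string as a tie-breaker, roughly what a second ORDER BY column on the stored page name would do.

```python
import unicodedata


def toy_collation_key(title: str) -> str:
    """Hypothetical stand-in for a full-strength collation key.

    Like a real UCA key, it is identical for canonically equivalent
    strings; here that is achieved simply by normalizing to NFD.
    """
    return unicodedata.normalize("NFD", title)


def stable_sort_key(title: str):
    """Collation key first, then the raw string as a binary tie-breaker."""
    return (toy_collation_key(title), title)


composed = "\u00e9"     # "é" as one precomposed code point (NFC form)
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Distinct strings, same collation key.
same_key = toy_collation_key(composed) == toy_collation_key(decomposed)

# Without the tie-breaker the result depends on the input order.
unstable_1 = sorted([composed, decomposed], key=toy_collation_key)
unstable_2 = sorted([decomposed, composed], key=toy_collation_key)

# With the tie-breaker the order is the same whatever the input order.
stable_1 = sorted([composed, decomposed], key=stable_sort_key)
stable_2 = sorted([decomposed, composed], key=stable_sort_key)
```

The tie-breaker is computed from the page name that is retrieved anyway, so nothing extra has to be stored in the key.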