https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #189 from Aryeh Gregor <simetrical+wikib...@gmail.com> 2010-07-25 17:22:32 UTC ---

(In reply to comment #186)
> That's a bad assumption: even the highest quality collations will need to
> be updated from time to time:
> - Unicode evolves and new characters get encoded (new versions are
>   published about every 1-2 years after synchronization and final
>   balloting at both ISO WG2 and the UTC).
> - The content of the Unicode DUCET is NOT stable: characters are inserted
>   in the sequence so that the full list of collation weights needs to be
>   offset where the new characters get inserted.
> - Collations for languages get corrected. We should be able to upgrade
>   these rules when the CLDR project produces new tailorings (CLDR updates
>   are published separately, about every 1-2 years).

It's not critical for most wikis to use the very latest collations,
though. The English Wikipedia, for example, will do fine with even very
out-of-date collations, since the article names overwhelmingly use
characters that have been in Unicode for an extremely long time. The
same is true for most other large wikis; on small wikis, on the other
hand, changing collations will be fast in any event.

However, Tim Starling independently suggested a cl_collation column, and
my initial implementation does have one. The current implementation
doesn't allow the same row to be stored multiple times with different
collations, so sorting might still be wrong while the collation is being
updated, but this might not be a big problem in practice. If it is, we
can go with your idea of extending the primary key to
(cl_collation, cl_from, cl_to) instead of (cl_from, cl_to).

> Anyway you did not reply to the idea of first developing the parser
> functions and testing them. Developing the SQL schema extension should
> not be attempted before at least the first function
> {{SORTKEY:text|locale|level}} has been fully developed and tested on
> specific pages (it can be tested easily in tables).
>
> The second function {{COLLATIONMAP:text|locale|level|clusters}} is not
> needed immediately to develop the schema, but will be useful to restore
> the functionality of headings. Headings don't need to be stored as they
> can be computed on the fly, directly by reading sequentially the sorted
> result set from the SQL query:

I tried to figure out what you were talking about here, and eventually
worked out that you just want me to expose Language::convertToSortkey()
and Language::firstLetterForLists() (as they're currently called) as
parser functions, for the benefit of lists. That might be useful, and
should be easy to do, although it's not part of my assigned job here.

> You can compute headings from the returned page names, or from the
> existing stored "cl_sortkey" which should be used now ONLY to store the
> plain-text specified in articles with {{DEFAULTSORT:sortkey}} and
> [[category:...|sortkey]].

I've introduced cl_raw_sortkey to store that, while retaining cl_sortkey
for actual sorting. Based on your feedback, it might make more sense to
rename cl_raw_sortkey to cl_sortkey_prefix, put {{defaultsort:}} into
it, and make it the empty string by default.

(In reply to comment #187)
> This does not require any change in the schema. This can be made
> immediately by MediaWiki developers and will not influence the
> development of all corrections needed for bug #164 itself.

Correct. I'll probably do it anyway in the course of the work I'm doing,
though, since I'll be rewriting CategoryPage.php's sort logic in any
case.

> For example in people's names whose page name is "FirstNames LastName"
> but that we want to sort as if they were "LastName, FirstNames" by
> indicating only {{DEFAULTSORT:LastName !}} (it should not be needed to
> include the FirstNames in the wiki text, as this sort hint will not be
> unique and the group of pages using the same hint will still need to
> sort within this group using their natural order).

I can append the page title to cl_raw_sortkey before computing
cl_sortkey. That shouldn't be a problem. As noted above, maybe renaming
it to cl_sortkey_prefix would make more sense.

(In reply to comment #188)
> An "interface" function is DEFINITELY NOT an "implementation" detail. My
> comment #180 _correctly_ describes and summarizes what is really wanted.

Unfortunately, it's hard for me to understand what you're saying.
However, I think I've got it now, at least mostly. I don't think parser
functions are essential here, and they're not strictly part of this bug.
They can be added after the initial implementation is finished.

> It correctly explains the dependencies and why any change in the SQL
> schema can be and should be delayed. In fact I consider the step (1) in
> my plan to have a high priority on all the rest, and it does not imply
> any immediate change in the schema to support it.

Writing those functions is not part of my job. I expect the i18n people
will handle that. I'm just doing a prototype, which is agnostic about
what sortkey function you give it. My current sortkey function is just
strtoupper(), for testing. (This does improve English sorting when
$wgCapitalLinks is off.)

> - to develop the new schema for stored sortkeys, based only on the
>   internal PHP functions implementing {{SORTKEY:text|locale|level}}. The
>   schema should NOT be deployed before the collations have been tested
>   extensively by users and their internal data structures reflect the
>   wanted collations order and tailorings.

This is basically what I'm doing, except I'm not going to work out
exactly what sortkey function to use.

> - to develop the {{COLLATIONMAP:text|locale|level|clusters}} parser
>   function (for later inclusion in the generation of the correct "column
>   headings" in the lists displayed by category pages, because for now
>   these headings are almost useless for anything else than English, or
>   in heavily populated categories).

I'm not doing this part, but I'm setting up the framework so that i18n
people can work on it (minus the parser function, which can be added
later). At least, if your COLLATIONMAP is meant to be anything like my
Language::firstLetterForLists(). I'm not sure if it is. If it's not,
please explain better, because I don't follow.
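
For illustration only, here is a minimal sketch of the sortkey scheme
discussed in this comment: a per-link sortkey prefix (empty by default),
the page title appended to it, and a pluggable sortkey function applied
to the result. It is written in Python rather than MediaWiki's PHP, and
the names CategoryLink, build_sortkey and sort_links, as well as the
newline separator between prefix and title, are assumptions made for the
example and not part of the actual prototype. str.upper stands in for
the strtoupper() placeholder mentioned above.

```python
from typing import Callable, Iterable, List, NamedTuple


class CategoryLink(NamedTuple):
    """Hypothetical stand-in for one categorylinks row."""
    title: str                # page title
    sortkey_prefix: str = ""  # text from DEFAULTSORT or [[category:...|sortkey]]


def build_sortkey(link: CategoryLink,
                  collate: Callable[[str], str] = str.upper) -> str:
    """Append the page title to the prefix, then apply the sortkey function.

    The "\n" separator is an assumption: it sorts below printable
    characters, so a shared prefix groups pages and the title breaks ties.
    """
    if link.sortkey_prefix:
        raw = link.sortkey_prefix + "\n" + link.title
    else:
        raw = link.title
    return collate(raw)


def sort_links(links: Iterable[CategoryLink],
               collate: Callable[[str], str] = str.upper) -> List[CategoryLink]:
    """Order links by their computed sortkey; the sortkey function is pluggable."""
    return sorted(links, key=lambda link: build_sortkey(link, collate))


if __name__ == "__main__":
    links = [
        CategoryLink("John Smith", "Smith"),
        CategoryLink("Zebra"),
        CategoryLink("Alice Smith", "Smith"),
        CategoryLink("apple"),
    ]
    print([link.title for link in sort_links(links)])
    # prints ['apple', 'Alice Smith', 'John Smith', 'Zebra']
```

Because the sortkey function is passed in as a parameter, a real
collation routine could replace str.upper without touching the rest of
the logic, which is the sense in which the prototype is agnostic about
the sortkey function.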