https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #190 from Philippe Verdy <verd...@wanadoo.fr> 2010-07-26 18:38:10 UTC ---

Yes, Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but COLLATIONMAP is a more generic concept that reflects what is defined in the Unicode Standard Annexes, which describe various mappings (including collation mappings, but also case mappings).

One bad thing about the name Language::firstLetterforList(s) is that it implies that this should only be the first letter. In fact, for many locales, the significant unit is not the letter but the collation element (for example digraphs like "ch" or trigraphs like "c’h").

For some categories, it would also be convenient to be able to use longer substrings containing more than one grapheme cluster. In Wiktionary, for lists of terms belonging to a language, or in lists of people's names, a category may need to be indexed and anchored with section headings containing the first 2 or 3 grapheme clusters, because the first grapheme is not discriminating enough and will remain identical in all columns of the list displayed on one page, and sometimes even on several or many successive pages: the first-letter heading does not help, and is just unneeded visual pollution.

For other categories that have very little content, too many headings are frequently added, and these also appear as pollution. Being able to suppress all of them, by specifying 0 grapheme clusters in that category, will better help readers locate the wanted item.

The collation map also has several levels of implementation, which match exactly the collation levels used to generate sort keys.

----
About sort keys now:

As sort keys are intended to be opaque binary objects, they cannot be returned directly by a parser function without being serialized to safe Unicode text, even if that text means nothing to the reader.
That's why I proposed a Base-36 mapping to plain ASCII, which will still sort correctly in binary order and can be used in sortable tables. It could use any arbitrary base that sorts correctly using binary ordering and that uses ONLY valid Unicode characters. The chosen base should be easy to compute, but none of the standard Base-64 variants qualify (there's no guarantee for the last two "letters" of any Base-64 variant). We could use Base-62 (using all 10 digits and the 26 pairs of Basic Latin letters), or Base-32 (simpler to compute, but it generates longer texts).

The intent is not really to make the string visible in pages, but to help generate accurate sort keys in sortable columns. For now, these sort keys are generated by templates as invisible text spans (using a CSS style="display:none" attribute), but ideally the templates used in sortable tables that generate custom sort keys should put them in some HTML attribute that can be specified on table cells and that the JavaScript will use directly. In my opinion, these opaque strings should be as compact as possible, but still safe for inclusion in pages, and directly usable by simple JavaScript without requiring any complex reimplementation of UCA in the JavaScript code.

Why do I think that exposing the functions as parser functions will be useful? Because it allows the implementation to be tested extensively on lots of cases, but only within a limited set of pages, long before the schema is developed, finalized and finally deployed. In other words, it will not block the development of the schema update, as long as we agree about what the essential functions to support are, i.e. their interface, which will be exposed (partly) in parser functions.
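
To illustrate the order-preserving text serialization of sort keys discussed above, here is a minimal sketch in Python. It assumes the Base-32 option with the alphabet 0-9 then A-V (the same alphabet as "base32hex" in RFC 4648, but without "=" padding); the function name encode_sortkey is hypothetical and not part of any existing MediaWiki interface. Because the alphabet is in ascending ASCII order, comparing the encoded strings byte by byte gives the same order as comparing the raw binary keys.

```python
# Assumption: Base-32 with an alphabet in ascending ASCII order
# (digits first, then uppercase letters), and no padding.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_sortkey(raw: bytes) -> str:
    """Serialize a binary sort key to ASCII text that sorts the same way."""
    bits = 0    # bit accumulator
    nbits = 0   # number of pending bits in the accumulator
    out = []
    for byte in raw:
        bits = (bits << 8) | byte
        nbits += 8
        # Emit one character for every complete group of 5 bits.
        while nbits >= 5:
            nbits -= 5
            out.append(ALPHABET[(bits >> nbits) & 0x1F])
        bits &= (1 << nbits) - 1  # keep only the pending bits
    if nbits:
        # Pad the last partial group with zero bits on the right.
        out.append(ALPHABET[(bits << (5 - nbits)) & 0x1F])
    return "".join(out)


if __name__ == "__main__":
    keys = [b"\x01", b"", b"\xff", b"ab", b"a", b"\x00\x00", b"\x00"]
    by_binary = sorted(keys)
    by_text = sorted(keys, key=encode_sortkey)
    print(by_binary == by_text)
```

A Base-36 or Base-62 variant would be more compact but needs division instead of bit shifts, which is the trade-off mentioned above.
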
Exposing these parser functions is also useful because I'm convinced that their syntax will not change, and that what they return will be well known:

- The {{COLLATIONMAP:}} function is very well described and will effectively return human-readable text. Its formal implementation should follow the standard Unicode definitions. You can expose it at least in a test wiki, where you'll be able to track the results and the progress of its implementation very easily (just create a page containing test words in various languages, arranged in a wikitable).

- The {{SORTKEY:}} function can be exposed as well (and tested on the same test page for various languages, using the same list of words). Its result will be opaque to humans, and compact. It will be easy to assert that it generates the expected sort order by using it in a sortable wikitable.

Both functions will be deployable rapidly, even on wikis that won't want to apply the schema change (so they will continue to use a single collation order for ALL their categories, but will still be able to sort specific categories using another supplied locale matching the category name).

If you think about it, changing the SQL schema may be rejected in the end by lots of people. Exposing the parser functions will provide a convenient alternative that can be deployed much more simply, and with MUCH LESS risk, using the existing facilities offered by [[category:...|sortkey]] and {{DEFAULTSORT:sortkey}}, except that their parameter will be computed using the exposed {{SORTKEY:}} function:

  {{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}

or:

  [[category:...|{{SORTKEY:text|locale|level}}]]

both being generalizable through helper templates.

There is such an existing helper template, named [[Modèle:Clé de tri]], in the French Wiktionary. It will NO LONGER need to be passed the article name stripped of accents, extra punctuation and apostrophes (which are ignored at collation level 1); this parameter will simply be ignored.
Currently we use the template like this:

  {{clé de tri}}

when the article name contains nothing other than Basic Latin letters or spaces; otherwise we have to pass:

  {{clé de tri|Basic latin}}

and we constantly need to verify that the passed parameter is correct. Instead, the template would just generate this very simple code:

  {{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}}}}}

As {{SORTKEY:text|locale|level}} will use by default:

  locale={{CONTENTLANGUAGE}}|level=1

this will be sufficient for the French Wiktionary. In fact, it may also be sufficient for the English Wikipedia. But in the Chinese Wikipedia, one may still want to be able to use:

  {{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}|{{int:lang}}}}}}

to support the user's preferred collation order (traditional radical/stroke order, simplified radical/stroke order, Bopomofo order, Pinyin order).

(Note also that section headings ("first letter") will have to be "translated" to correctly report the "first letter" of the Pinyin romanization, because the page names listed will continue to display their distinctive ideographs! The only way to do that is to use the collation mapping exposed by {{COLLATIONMAP:}}.)

But you'll note that it won't be possible to sort the categories using multiple locales this way, because the page will be stored and indexed by parsing it with {{int:lang}} forced to {{CONTENTLANGUAGE}}, which will just be "zh", using only the simplified radical/stroke order by default. To support more collations, the categories will need to support them explicitly (but this would force the page to be reparsed multiple times, once for each additional locale specified in the category).

The alternative would be to create multiple parallel categories, one for each sort order, but then each article would have to specify all these categories. My opinion is that the same category should be sortable using different locales, and that's why categories should support multiple sort keys per indexed page, one for each specified locale.
Some wikis will only sort on {{CONTENTLANGUAGE}} by default, but the Chinese Wiktionary will benefit from automatically sorting all categories using at least the default "zh" locale, which is an alias for "zh-hans", plus the "zh-hant" locale for traditional radical/stroke order, "zh-latn" for the Pinyin order, and "zh-bpmf" for the Bopomofo order. The exact locale to which "zh" corresponds will be a user preference, but users will be able to navigate by clicking automatically generated links that let them select the other collation orders supported specifically by the category, or by default throughout the wiki project.

For example, the Chinese Wiktionary will display links on the page showing the available choices as:

  Sort by : [default] | [simplified radical/stroke] | [traditional radical/stroke] | [pinyin] | [bopomofo]

How can this be possible? Either the wiki project specifies that all categories will support these 4 orders, or the category page lists explicitly the additional sort orders that the category will support.

The [default] link will use the index prefixes specified in the existing syntaxes [[category:...|sortkeyprefix]] or {{DEFAULTSORT:sortkeyprefix}}. All the other links will display the list sorted using the additional locales specified, but will ignore the sortkeyprefixes specified in categorized pages, directly or indirectly via templates.

To add the additional sort orders to a category, you'll just need to insert in the category page some code like:

  {{SORTAS:zh-hans}}
  {{SORTAS:zh-hant}}
  {{SORTAS:zh-latn}}
  {{SORTAS:zh-bpmf}}

No article needs to be changed: these additional sort orders will just discard/ignore the sortkeyprefix (specified with {{DEFAULTSORT:sortkeyprefix}} or in [[category:...|sortkeyprefix]]) when generating the actual opaque sort key.
However, if the wiki project offers several project-wide default locales, the sortkeyprefix specified in pages will be honored ONLY for these locales, and made immediately visible as the preselected [default] link in the choice of sort orders.

Let's say, for example, that the Chinese Wiktionary wants to support by default only the "zh-hans" and "zh-hant" collations. This means that all categories will contain [default] sort keys computed for these two collations, from the text consisting of:

  {{{sortkeyprefix}}}{{KEYSEPARATOR}}{{PAGENAME}}

if a sortkeyprefix is specified, or just:

  {{PAGENAME}}

if no sortkeyprefix is specified.

A constant {{KEYSEPARATOR}} will be used, which should sort lower than every other valid text usable in {{{sortkeyprefix}}} or {{PAGENAME}}. Ideally, this should be a control character like U+000A (LF) or U+0009 (TAB), after making sure that:

- this character will never appear in a valid {{{sortkeyprefix}}} or {{PAGENAME}} (MediaWiki already processes blanks and converts them to plain SPACE);

- this character will have a NON-ZERO (i.e. non-ignorable) primary collation weight that is smaller than all other collation weights. Its primary collation weight should then be 1 (if needed, the non-zero collation weights coming from the DUCET or from localized tailoring will just have to be offset by 1);

- this character will have a ZERO collation weight for all the remaining supported levels in each locale.

For all the additional sort orders specified in category pages, the {{{sortkeyprefix}}} will be ignored, as well as the {{KEYSEPARATOR}}, so the pages will just be indexed on {{PAGENAME}} within the specified locale.

In the example of the Chinese Wiktionary, a category specifying:

  {{SORTAS:zh-hans}}
  {{SORTAS:zh-hant}}
  {{SORTAS:zh-latn}}
  {{SORTAS:zh-bpmf}}

will generate 4 additional (non-[default]) sort keys, which will be added to the two sort keys already generated for "zh-hans" and "zh-hant", except that they will ignore the specified sortkeyprefixes.
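
The rules above for choosing what text gets collated can be sketched as follows, in Python. This only builds the (locale, source text) pairs that would be fed to the collator, not the opaque sort keys themselves. The function name sortkey_sources and its parameters are hypothetical, and U+000A is assumed as the {{KEYSEPARATOR}}, as suggested above.

```python
# Assumption: U+000A (LF) plays the role of {{KEYSEPARATOR}}.
KEY_SEPARATOR = "\n"


def sortkey_sources(pagename, sortkeyprefix, default_locales, sortas_locales):
    """Return the distinct (locale, text) pairs that need a sort key.

    default_locales: the wiki's project-wide [default] collations,
                     which honor the sortkeyprefix.
    sortas_locales:  the locales listed with {{SORTAS:...}} in the
                     category page, which ignore the sortkeyprefix.
    """
    sources = set()
    for locale in default_locales:
        if sortkeyprefix:
            text = sortkeyprefix + KEY_SEPARATOR + pagename
        else:
            text = pagename
        sources.add((locale, text))
    for locale in sortas_locales:
        # Additional sort orders index the page on its name only.
        sources.add((locale, pagename))
    return sources


if __name__ == "__main__":
    defaults = ["zh-hans", "zh-hant"]
    sortas = ["zh-hans", "zh-hant", "zh-latn", "zh-bpmf"]
    with_prefix = sortkey_sources("中文", "zhongwen", defaults, sortas)
    without_prefix = sortkey_sources("中文", "", defaults, sortas)
    print(len(with_prefix), len(without_prefix))  # prints "6 4"
```

Using a set means that a page without sortkeyprefix gets only one key per locale, since its [default] key and its {{SORTAS:...}} key are computed from the same text.
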
In that Chinese Wiktionary example, this means that up to 6 sort keys will be generated: 1 or 2 for "zh-hans", 1 or 2 for "zh-hant", and only 1 for each of "zh-latn" and "zh-bpmf".

In the English Wiktionary or on Commons, which will only use the "en" default collation order (identical to {{CONTENTLANGUAGE}}), it will be possible to specify, for specific categories, an additional sort order when the category is directly related to a specific language. By default, that category will be sorted using the English collation rules, but it will be possible to select the additional specified collation order (in which case the sortkeyprefix specified in indexed pages will be ignored, and the list will just be shown using the natural order of page names in the manually selected sort order).

So [[Category:Chinese]] in the English Wiktionary will be able to specify at least:

  {{SORTAS:zh-hans}}

[[Category:German]] in the English Wiktionary will be able to specify at least:

  {{SORTAS:de|2}}

and [[Category:French]] in the English Wiktionary will be able to specify at least:

  {{SORTAS:fr|2}}

This should be enough to view the natural order of these languages (French requires collation level 2 for correct ordering: it groups letters with the same accents, and sorts accents in backward order at this level)...

Note also that if the category is presented in any selected locale, the section headings will also be computed in that same locale, and with the same collation level. By default, headings will show only 1 grapheme cluster. But if needed, you can specify:

  {{SORTAS:fr|2|3}}

to indicate that the category should use the first three French grapheme clusters (determined from collation mappings) for the headings, if the category is heavily populated, so you'll get the following headings:

  a, à, â, ... aa, aar, ab, aba, abc, ac, aca, ace, acé, ... ad, ae, aé, ...
  b, ba, bac, bad, baf, bag, bai, bal, bam, ban, bar, bas, bat, bât, ....
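
As a rough illustration of how such headings could be derived from page names, here is a minimal Python sketch. It is only an approximation under stated assumptions: a "grapheme cluster" is taken to be a base character plus its combining marks after NFD decomposition, level 1 drops the accents and level 2 keeps them, and case is always folded. A real implementation would use the collation mapping of the selected locale (so that units like "ch" can count as one element). The function name heading is hypothetical.

```python
import unicodedata


def heading(pagename, level=1, clusters=1):
    """Approximate the section heading for a page name.

    level:    1 ignores accents, 2 keeps them (case is always ignored).
    clusters: number of leading grapheme clusters kept; 0 gives an
              empty heading, i.e. headings are suppressed.
    """
    text = unicodedata.normalize("NFD", pagename.lower())
    groups = []
    for ch in text:
        if unicodedata.combining(ch):
            # Combining mark: attach to the previous base at level 2,
            # drop it at level 1.
            if level >= 2 and groups:
                groups[-1] += ch
        else:
            groups.append(ch)
    return unicodedata.normalize("NFC", "".join(groups[:clusters]))


if __name__ == "__main__":
    for name in ["bateau", "Bâtiment", "acéré"]:
        print(name, heading(name, level=1), heading(name, level=2, clusters=3))
```

With level=2 and clusters=3, "bateau" and "Bâtiment" fall under the distinct headings "bat" and "bât", while at level 1 with one cluster both fall under "b".
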
Note that at level 2, the same heading will contain all words sharing the same letters and accents; case will be ignored. Headings don't need to be stored: they are generated on the fly from the index prefixes (returned in the result set, but only if this is one of the wiki's [default] sort orders, because otherwise the prefix will be empty if the user selected a non-[default] sort order) and the page name (also present in the result set).

Note that if you still want to present a category ordered so that all lowercase letters sort together, separately from all uppercase letters, you'll need to indicate a separate collation order by specifying an additional non-default locale in that specific category. This will probably not look natural to most users, which is why it will be a distinct locale and not the default. For example, to use it in English or in French:

  {{SORTAS:en-x-uc}}<!-- ALL uppercase before lowercase in collation level 1 -->
  {{SORTAS:fr-x-lc}}<!-- ALL lowercase before uppercase in collation level 1 -->

This variant means that the natural tailoring is modified so that case differences are considered primary differences in that specific, distinct localized collation. There will no longer be any tertiary difference, but twice as many headings will be generated. Here, as no maximum number of grapheme clusters is specified, only 1 grapheme cluster will be used in the headings, so you'll get the headings:

  A, B, C, D, ... Z, a, b, c, d, ... z (with en-x-uc)
  a, b, c, d, ... z, A, B, C, D, ... Z (with fr-x-lc)

In both cases there will be no headings separating differences of accents (because we did not indicate the collation level, it remains 1).

Such options MUST use a standard BCP47 code, so the option needs to be at least 5 characters long after the language code, or must be used after the standard BCP47 suffix "-x-" for private extensions that do not indicate a language.
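
The effect of making case a primary difference can be sketched in Python as follows. This is a simplified model, not the UCA: each character is reduced to a (case rank, folded letter) pair, and accents are dropped because the level stays at 1. The function name case_first_key and its upper_first flag are hypothetical stand-ins for the "-x-uc" and "-x-lc" variants.

```python
import unicodedata
from functools import partial


def case_first_key(text, upper_first=True):
    """Level-1 key where case is a primary difference.

    upper_first=True models "-x-uc" (ALL uppercase before lowercase),
    upper_first=False models "-x-lc" (ALL lowercase before uppercase).
    """
    key = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # accents are ignored at collation level 1
        case_rank = 0 if ch.isupper() == upper_first else 1
        key.append((case_rank, ch.lower()))
    return key


if __name__ == "__main__":
    words = ["zebra", "Apple", "apple", "Zoo", "été"]
    print(sorted(words, key=case_first_key))
    print(sorted(words, key=partial(case_first_key, upper_first=False)))
```

The first character of each key gives the heading group, so the uppercase-first variant yields headings A ... Z followed by a ... z, and the lowercase-first variant the reverse.
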