[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Mon, 26 Jul 2010 11:38:58 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #190 from Philippe Verdy <verd...@wanadoo.fr> 2010-07-26 18:38:10 UTC ---
Yes, Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but
COLLATIONMAP is a more generic concept which reflects what is defined in the
Unicode standard annexes, which speak about various mappings (including
collation mappings, but also case mappings).

One bad thing about the name Language::firstLetterforList(s) is that it implies
that this should only be the first letter. In fact, for many locales, the
significant unit is not the letter, but the collation element (for example
digraphs like "ch" or trigraphs like "c’h").
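
To illustrate the difference between a first letter and a first collation
element, here is a minimal Python sketch. It is not MediaWiki code: the
CONTRACTIONS table and the function name are assumptions made up for this
example, and a real implementation would take the contractions from the
locale's collation tailoring.

```python
# Hypothetical per-locale contraction tables, for illustration only.
CONTRACTIONS = {
    "br": ["c'h", "c\u2019h", "ch"],  # Breton: "c'h" and "ch" are letters
    "cs": ["ch"],                     # Czech: "ch" is a letter
}


def first_collation_element(word, locale):
    """Return the leading collation element of word, lowercased."""
    lowered = word.lower()
    # Try the longest contractions first so that "c'h" wins over "c".
    candidates = sorted(CONTRACTIONS.get(locale, []), key=len, reverse=True)
    for contraction in candidates:
        if lowered.startswith(contraction):
            return contraction
    # No contraction applies: fall back to the first character.
    return lowered[:1]
```

With this table, "chata" gives "ch" in "cs" while "chat" gives "c" in "fr".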

For some categories, it would also be convenient to be able to use longer
substrings, containing more than one grapheme cluster. In Wiktionary, for lists
of terms belonging to a language, or in lists of people's names, a category may
need to be indexed and anchored with section headings containing the first 2 or
3 grapheme clusters, because the first grapheme cluster is not discriminating
enough and will remain identical in all columns of the displayed list on one
page, and sometimes even on several or many successive pages: the first-letter
heading does not help, and is just unneeded visual pollution.

For other categories that have very little content, too many headings are
frequently added, and they also appear as pollution. Being able to suppress all
of them, by specifying 0 grapheme clusters in that category, will better help
readers locate the wanted item.
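
A rough Python sketch of such a heading computation follows. It assumes a
simplified notion of grapheme cluster (a base character plus the combining
marks that follow it), which is only an approximation of the Unicode
segmentation rules; the function names are hypothetical.

```python
import unicodedata


def grapheme_clusters(text):
    """Split text into simplified grapheme clusters (base + combining marks)."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    # Recompose each cluster so that headings display precomposed letters.
    return [unicodedata.normalize("NFC", c) for c in clusters]


def heading_prefix(name, n=1):
    """Heading made of the first n clusters; n=0 yields no heading at all."""
    return "".join(grapheme_clusters(name)[:n])
```

Here heading_prefix(name, 0) returns an empty string, which a category could
treat as "do not emit headings".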

The collation map also has several levels of implementations, which match
exactly the same levels as collation levels used to generate sort keys.

----

About sort keys now:

As sort keys are intended to be opaque binary objects, they do not qualify for
being returned directly by a parser function, without being exposed through a
serialization to safe Unicode text, even if it means nothing to the reader.
That's why I proposed a Base-36 mapping to plain ASCII which will still sort
correctly in binary order, for use in sortable tables, but it could use any
arbitrary base that sorts correctly using binary ordering, and that uses ONLY
valid Unicode characters.

The chosen base should be easy to compute, but the standard Base-64 variants do
not qualify (there's no guarantee about the last two "letters", which differ
between Base-64 variants). We could use Base-62 (using all 10 digits, and the
26 pairs of Basic Latin letters), or Base-32 (simpler to compute, but it will
generate longer texts). The intent is not really to make the string visible in
pages, but to help in the generation of accurate sort keys in sortable columns.
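
As a sketch of the Base-32 option (an illustration only, not a proposed final
format), the following Python function serializes a binary sort key with an
alphabet whose characters are in ascending ASCII order, and without padding,
so that comparing the resulting strings gives the same order as comparing the
original bytes. The alphabet and the function name are assumptions of this
example.

```python
# 32 symbols in ascending ASCII order: digits first, then uppercase letters.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_sortkey(key: bytes) -> str:
    """Serialize a binary sort key to ASCII text that sorts like the bytes."""
    bits = 0      # pending bits not yet emitted
    nbits = 0     # number of pending bits
    out = []
    for byte in key:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5:
            nbits -= 5
            out.append(ALPHABET[(bits >> nbits) & 31])
        bits &= (1 << nbits) - 1          # keep only the pending bits
    if nbits:
        # Left-align the last partial group; no padding character is added.
        out.append(ALPHABET[(bits << (5 - nbits)) & 31])
    return "".join(out)
```

Each group of 5 bits becomes one character, so the text is 1.6 times longer
than the binary key, which is the price paid for the simple computation.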

For now these sort keys are generated by templates as invisible text spans
(using a CSS style="display:none" attribute), but ideally, the templates used
in sortable tables that generate custom sort keys should put them in some HTML
attribute that can be specified on table cells, and that the JavaScript will
use directly. In my opinion, these opaque strings should be as compact as
possible, but still safe for inclusion in pages, and directly usable by
simple JavaScript without requiring any complex reimplementation of UCA in
the JavaScript code.


Why do I think that exposing the functions as parser functions will be useful?
Because it allows the implementation to be tested extensively on lots of
cases, but only within a limited set of pages, long before the schema is
developed, finalized and finally deployed.

In other words, it will not block the development of the schema update, as long
as we agree about what are the essential functions to support, i.e. their
interface that will be exposed (partly) in parser functions.

Also because I'm convinced that the exposed parser functions will not have
their syntax changed, and that what they return will be well known:

- The {{COLLATIONMAP:}} function is very well described and will effectively
return human-readable text. Its formal implementation should follow the
standard Unicode definitions.

You can expose it at least in a test wiki where you'll be able to track very
easily the result and progress of its implementation (just create a page
containing test words in various languages, arranged in a Wikitable).

- The {{SORTKEY:}} function can be exposed as well (and tested on the same test
page for various languages, using the same list of words). Its result will be
opaque to humans and compact. It will be easy to assert that it generates the
expected sort order by using it in a sortable wikitable.


Both functions will be deployable rapidly, even on wikis that won't want to
apply the schema change (so they will continue to use a single collation order
for ALL their categories, and will anyway be able to sort specific categories
using another supplied locale matching the category name).

If you think about it, changing the SQL schema may be rejected in the end by
lots of people. Exposing the parser functions will provide a convenient
alternative that can be deployed much more simply, and with MUCH LESS risk,
using the existing facilities offered by [[category:...|sortkey]] and
{{DEFAULTSORT:sortkey}}, except that their parameter will be computed using the
exposed {{SORTKEY:}} function:

  {{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}

or:

  [[category:...|{{SORTKEY:text|locale|level}}]]

both being generalizable through helper templates.


There is such an existing helper template named [[Modèle:Clé de tri]] in French
Wiktionary, to which we will NO LONGER need to pass the article name without
accents, extra punctuation or apostrophes (ignored at collation level 1),
this parameter becoming ignored. Currently we use the template like this:

  {{clé de tri}}

when the article name contains nothing else than basic Latin letters or spaces,
otherwise we have to pass:

  {{clé de tri|Basic latin}}

And we constantly need to verify that the passed parameter is correct. Instead,
the template would just generate this very simple code:

  {{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}}}}}

As {{SORTKEY:text|locale|level}} will use by default:

  locale={{CONTENTLANGUAGE}}|level=1

this will be sufficient for French Wiktionary. In fact it may also be
sufficient in English Wikipedia.
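
To give an idea of what a level 1 key ignores, here is a crude Python
approximation. It is NOT the UCA and takes no locale into account: it simply
drops combining accents and punctuation and folds case, which is roughly what
we currently do by hand when filling the parameter of the helper template. The
function name is hypothetical.

```python
import unicodedata


def approx_level1_key(text):
    """Crude stand-in for a level 1 sort key: no accents, punctuation or case."""
    kept = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue                      # accents only matter at level 2
        if unicodedata.category(ch).startswith("P"):
            continue                      # punctuation and apostrophes ignored
        kept.append(ch)
    return "".join(kept).casefold()       # case only matters at level 3
```

A real {{SORTKEY:}} would return an opaque key built from collation weights,
not readable text like this.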


But in Chinese Wikipedia, one may still want to be able to use:

  {{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}|{{int:lang}}}}}}

to support the preferred collation order of the user (traditional radical/stroke
order, simplified radical/stroke order, Bopomofo order, Pinyin order).

(Note also that section headings ("first letter") will have to be "translated"
to correctly report the "first letter" of the Pinyin romanization, because the
page names listed will continue to display their distinctive ideographs! The
only way to do that is to use the collation mapping exposed by
{{COLLATIONMAP:}}.)

But you'll note that it won't be possible to sort the categories using multiple
locales, so the page will be stored and indexed by parsing it using
{{int:lang}} forced to {{CONTENTLANGUAGE}}, which will just be "zh", using only
the simplified radical/stroke order by default.

To support more collations, the categories will need to support them explicitly
(but this would force the page to be reparsed multiple times, once for each
additional locale specified in the category).

The alternative would be to create multiple parallel categories, one for each
sort order, but then each article will have to specify all these categories.

My opinion is that the same category should be sortable using different
locales, and that's why they should support multiple sort keys per indexed
page, one for each specified locale. Some wikis will only sort on the
{{CONTENTLANGUAGE}} by default, but the Chinese Wiktionary will benefit from
sorting automatically all categories using at least the default "zh" locale
which is an alias for "zh-hans", plus the "zh-hant" locale for traditional
radical/stroke order, "zh-latn" for the Pinyin order, and "zh-bpmf" for the
Bopomofo order.

The exact locale to which "zh" corresponds will be a user preference, but one
will be able to navigate by clicking the automatically generated links that
will allow them to specify the other collation orders supported specifically by
the category or by default throughout the wiki project.

For example, the Chinese Wiktionary will display links on the page showing the
available choice as:

 Sort by: [default] | [simplified radical/stroke] | [traditional radical/stroke] | [pinyin] | [bopomofo]

How can this be possible? Either the wikiproject specifies that all categories
will support these 4 orders, or the category page will list explicitly the
additional sort orders that will be supported by the category.

The [default] link will use the index prefixes specified in the existing
syntaxes [[category:...|sortkeyprefix]] or {{DEFAULTSORT:sortkeyprefix}}

All the other links will display the list sorted using the additional locales
specified, but will ignore the sortkeyprefixes specified in categorized pages
or indirectly via templates.

To add the additional sort orders in a category, you'll just need to insert in
the category page some code like:

{{SORTAS:zh-hans}}
{{SORTAS:zh-hant}}
{{SORTAS:zh-latn}}
{{SORTAS:zh-bpmf}}

No article needs to be changed: these additional sort orders will just
discard/ignore the sortkeyprefix when generating the actual opaque sort key
(specified with {{DEFAULTSORT:sortkeyprefix}} or in
[[category:...|sortkeyprefix]]).

However, if the wikiproject offers several project-wide default locales, the
sortkeyprefix specified in pages will be honored ONLY for these locales,
and made immediately visible as the preselected [default] link, in the choice
of sort orders.



Let's say for example that the Chinese Wiktionary wants to support by default
only the "zh-hans" and "zh-hant" collations.

This means that all categories will contain [default] sort keys computed for
these two collations, from the text consisting of:

  {{{sortkeyprefix}}}{{KEYSEPARATOR}}{{PAGENAME}}

if a sortkeyprefix is specified, or just {{PAGENAME}} if no sortkeyprefix is
specified.

A constant {{KEYSEPARATOR}} will be used that should sort lower than every
other valid text usable in {{{sortkeyprefix}}} or {{PAGENAME}}. Ideally, this
should be a control character like U+000A (LF) or U+0009 (TAB), after making
sure that:

- this character will never appear in a valid {{{sortkeyprefix}}} or
{{PAGENAME}} (MediaWiki already processes blanks and converts them to plain
SPACE)

- this character will have a NON-ZERO (i.e. not ignorable) primary collation
weight that is smaller than all other collation weights. Its primary collation
weight should then be 1 (if needed, the collation weights coming from the DUCET
or from localized tailoring will just have to be offset by 1, if they are
non-zero)

- this character will have a ZERO collation weight for all the remaining
supported levels in each locale

For all the additional sort orders specified in category pages, the
{{{sortkeyprefix}}} will be ignored as well as the {{KEYSEPARATOR}}, so the
pages will just be indexed on {{PAGENAME}}, within the specified locale.
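
The composition of the text to be collated can be sketched in Python as
follows. The function name and its default_order flag are hypothetical, LF is
taken as the separator, and plain code point comparison stands in for the real
collation weights (LF is lower than any printable character, which mimics the
"lowest primary weight" requirement).

```python
KEYSEPARATOR = "\n"   # assumed choice: U+000A, never present in titles


def key_text(pagename, sortkeyprefix=None, default_order=True):
    """Text to collate for one page in one sort order of a category."""
    if default_order and sortkeyprefix:
        # [default] orders honor the prefix given in the page.
        return sortkeyprefix + KEYSEPARATOR + pagename
    # Additional orders (and pages without prefix) use the page name alone.
    return pagename
```

Because the separator sorts lowest, every page with prefix "A" comes before
every page with prefix "AB", whatever the page names are.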

In the example Chinese Wiktionary, a category specifying:

{{SORTAS:zh-hans}}
{{SORTAS:zh-hant}}
{{SORTAS:zh-latn}}
{{SORTAS:zh-bpmf}}

will generate 4 additional (non-[default]) sort keys, that will be added to the
two sort keys already generated for "zh-hans" and "zh-hant", except that they
will ignore the specified sortkeyprefixes.

This means that it will generate up to 6 sort keys: 1 or 2 for "zh-hans", 1 or
2 for "zh-hant", and only 1 for each of "zh-latn" and "zh-bpmf".
In the English Wiktionary or on Commons, that will only use the "en" default
collation order (identical to {{CONTENTLANGUAGE}}), it will be possible to
specify, for specific categories, an additional sort order when the category is
directly related to a specific language.

By default, that category will be sorted using the English collation rule, but
it will be possible to select the additional specified collation order (in
which case the sortkeyprefix specified in indexed pages will be ignored, and
the list will just be shown by using the natural order of page names in the
manually clicked sort order).

So the [[Category:Chinese]] in English Wiktionary will be able to specify at
least:

{{SORTAS:zh-hans}}

And the [[Category:German]] in English Wiktionary will be able to specify at
least:

{{SORTAS:de|2}}

And the [[Category:French]] in English Wiktionary will be able to specify at
least:

{{SORTAS:fr|2}}

And this should be enough to be able to view the natural order of these
languages (French will require collation level 2 for correct ordering by
grouping letters with the same accents, and sorting them in backward order in
this level)...

Note also that if the category is presented in any selected locale, the
section headings will also be computed in that same locale, and with the same
collation level. By default it will show only 1 grapheme cluster. But if needed
you can specify:

{{SORTAS:fr|2|3}}

to indicate that the category should use the first three French grapheme
clusters (determined from collation mappings) for the headings, if the category
is heavily populated, so you'll get the following headings:

a, à, â, ... aa, aar, ab, aba, abc, ac, aca, ace, acé, ... ad, ae, aé, ... b,
ba, bac, bad, baf, bag, bai, bal, bam, ban, bar, bas, bat, bât, ....

(Note that at level 2, the same heading will contain all words sharing the same
letters and accents. Case will be ignored.)
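
To make the proposed parameters concrete, here is a small Python sketch that
reads {{SORTAS:locale|level|clusters}} declarations from a category page. The
syntax is only the one proposed in this comment, not an existing MediaWiki
feature, and the defaults (level 1, 1 grapheme cluster) are the ones described
here; the regular expression and function name are assumptions.

```python
import re

# {{SORTAS:locale}}, {{SORTAS:locale|level}} or {{SORTAS:locale|level|clusters}}
SORTAS_RE = re.compile(
    r"\{\{SORTAS:([A-Za-z0-9-]+)(?:\|(\d+))?(?:\|(\d+))?\}\}"
)


def parse_sortas(wikitext):
    """Return (locale, level, clusters) for each SORTAS declaration found."""
    declarations = []
    for locale, level, clusters in SORTAS_RE.findall(wikitext):
        # Missing parameters come back as empty strings: apply the defaults.
        declarations.append((locale, int(level or 1), int(clusters or 1)))
    return declarations
```

For instance, "{{SORTAS:fr|2|3}}" yields ("fr", 2, 3) and "{{SORTAS:zh-hans}}"
yields ("zh-hans", 1, 1).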

Headings don't need to be stored, they are generated on the fly from the index
prefixes (returned in the result set, but only if this is one of the wiki's
[default] sort orders, because otherwise it will be empty if the user selected
a non-[default] sort order) and the pagename (also present in the result set).

Note that if you still want to present a category ordered so that all lowercase
letters will sort together, separately from all uppercase letters, you'll
need to indicate a separate collation order, by specifying an additional
non-default locale in that specific category. This will probably not look
natural to most users, which is why it will be a distinct locale and not the
default. For example, to use it in English or in French:

{{SORTAS:en-x-uc}}<!-- ALL uppercase before lowercase in collation level 1-->
{{SORTAS:fr-x-lc}}<!-- ALL lowercase before uppercase in collation level 1 -->

This variant means that the natural tailoring is modified so that case
differences will be considered as primary differences in that specific distinct
localized collation. There will no longer be any tertiary difference, but there
will be twice as many headings generated (here, as no maximum number of
grapheme clusters is specified, only 1 grapheme cluster will be used in the
headings), so you'll get the headings:

 A, B, C, D, ... Z, a, b, c, d, ... z, (with en-x-uc)
 a, b, c, d, ... z, A, B, C, D, ... Z, (with fr-x-lc)

And in both cases there will be no headings separating differences of accents
(because we did not indicate the collation level, the collation level remains
1).
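
The effect of making case a primary difference can be imitated in Python with
per-character keys, as in the sketch below. This is only an illustration on
unaccented letters, not a UCA tailoring; the two function names are
hypothetical stand-ins for the "-x-uc" and "-x-lc" variants.

```python
def uppercase_first_key(word):
    """Like "en-x-uc": every uppercase letter sorts before every lowercase."""
    return [(ch.islower(), ch.lower()) for ch in word]


def lowercase_first_key(word):
    """Like "fr-x-lc": every lowercase letter sorts before every uppercase."""
    return [(ch.isupper(), ch.lower()) for ch in word]


words = ["apple", "Zebra", "Apple", "zebra"]
upper_first = sorted(words, key=uppercase_first_key)
lower_first = sorted(words, key=lowercase_first_key)
```

With the first key the list starts with "Apple" and "Zebra"; with the second
it starts with "apple" and "zebra".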

Such options MUST use a standard BCP47 code, so the option needs to be at least
5 characters long after the language code, or must be used after the standard
BCP47 suffix "-x-" for private extensions that are not indicating a language.

