Re: UCA and Russian letter Ё

Leif H Silli Sun, 23 Dec 2012 07:30:11 -0800

Ken,

A basic question: does the UCA algorithm consider the Russian Ye and the Russian Yo as equal with regard to sort order? Or is it not meant to solve that issue?


Leif Halvard Silli



------- Original message -------

From: Whistler, Ken <[email protected]>
To: [email protected], [email protected]
Cc: [email protected], [email protected]
Sent: 21/12/'12, 22:49

Leo Broukhis said:
> Granted, not yet, but by itself the argument is invalid. Unicode
> collation rules are descriptive;
I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to accommodate various needs for particular cases. It is not intended as a descriptive mechanism.
Perhaps you are referring to LDML, which includes a formal mechanism for describing a particular collation in terms of the default table and tailoring options and parameterization options of the UCA.
> if, for example, a language happens to sort accents backwards, this
> rule has to be - and is - accommodated despite its apparent
> illogicality;
Backwards accent secondary weighting was actually included primarily because of prior art in collation standards, because of the need to be able to synchronize the UCA algorithm with ISO 14651, and because it makes it easier to explain how folks can implement versions of multi-level collation which can pass the conformance tests of the Canadian sorting standard, etc.
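
As a rough illustration of what backwards secondary weighting means, here is a toy two-level key in Python. It is only a sketch under stated assumptions: `toy_sort_key` and its weights are invented for this example and are not the UCA or its default table. Base letters are compared first; accents only break ties, read either left to right or right to left.

```python
import unicodedata

def toy_sort_key(text, backward_secondary=False):
    """Toy two-level key (hypothetical, not the UCA weight table)."""
    primary = []    # base letters, case-folded
    secondary = []  # one accent slot per base letter (0 = no accent)
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Record the accent against the preceding base letter.
            if secondary:
                secondary[-1] = ord(ch)
        else:
            primary.append(ch.lower())
            secondary.append(0)
    if backward_secondary:
        # Compare accents starting from the end of the string.
        secondary.reverse()
    return (primary, secondary)

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=toy_sort_key)
backward = sorted(words, key=lambda w: toy_sort_key(w, backward_secondary=True))
print(forward)
print(backward)
```

With the forward rule the first accent difference from the left decides; with the backward rule the last one does, which is why "côte" and "coté" swap places between the two results.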
> along the same lines, if a language happens to make a distinction
> discussed in this thread, it has to be accommodated just as well.
No, I don't think so.
It is rather easy to come up with distinctions or collation requirements which simply cannot be accommodated within the intended bounds of the UCA. For example, sorting all numerical expressions mixed with text strictly by their numeric values, or sorting all abbreviations (or some specified list of them) as if they were spelled out, and so forth.
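
To make the numeric case concrete, the following Python sketch shows the kind of preprocessing that has to happen outside a plain string comparison. The function name `numeric_aware_key` is hypothetical, and the key is a simple stand-in for illustration, not part of the UCA.

```python
import re

def numeric_aware_key(text):
    """Split text into digit runs and other runs; compare digit runs as integers."""
    key = []
    # With a capturing group, re.split puts the digit runs at odd indices.
    for i, part in enumerate(re.split(r"(\d+)", text)):
        if not part:
            continue
        if i % 2:
            key.append((0, int(part), ""))        # numeric run
        else:
            key.append((1, 0, part.casefold()))   # text run
    return key

names = ["file10", "file9", "file100"]
plain = sorted(names)                          # character-by-character order
numeric = sorted(names, key=numeric_aware_key)  # numeric-value order
print(plain)
print(numeric)
```

A character-by-character comparison puts "file9" last because "9" sorts after "1"; the preprocessing step is what restores numeric order.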
Many lexicographical ordering rules cannot be fully accommodated within the context of the UCA algorithm, which is a multilevel *string comparison* specification, and not a dictionary ordering specification.
> My question is as follows: does UCA have to be modified (e.g. by
> adding another bit flag "word-initial primary" next to the existing
> "backward secondary") to support the feature if it were to be
> implemented, or is there a way to achieve the "new Russian online
> collation" within the existing UCA without modifying the strings to
> be sorted before the application of the algorithm?
I don't think there is any out-of-the-box way to use UCA so that an implementation would automatically recognize a word boundary context and weight characters conditionally based on that context. So no, I don't think you could get an implementation to do that without first marking up text with additional characters to indicate word boundaries and then tailoring the weight table to weight sequences including that markup accordingly.
This is actually derived trivially from the fact that UCA knows nothing whatsoever about word boundaries. At core, it is just a mechanism to take a string input and provide an output vector of collation weights. You would have to hook it up to a text segmentation algorithm to even identify "words", and then that text segmentation algorithm would itself have to be tailored and tuned to whatever language you had in mind, because the criteria for identifying "words" will vary from language to language, and even orthography to orthography.
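
The markup-then-tailor approach could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the marker character, the names `mark_word_starts` and `toy_key`, the toy weights, and the rule itself (word-initial Ё gets its own primary weight, while elsewhere it differs from Е only at the secondary level). The regex segmentation is a naive stand-in for a real, language-tuned text segmentation algorithm.

```python
import re

MARK = "\u2060"            # hypothetical word-start marker (WORD JOINER)
YE_PRIMARY = 2 * ord("е")  # toy primary weight shared by Е and non-initial Ё

def mark_word_starts(text):
    """Naive segmentation: prefix every run of word characters with MARK."""
    return re.sub(r"\w+", lambda m: MARK + m.group(0), text)

def toy_key(text):
    """Toy (primary, secondary) weight vectors with a MARK+ё contraction."""
    text = text.lower()
    primary, secondary = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == MARK and nxt == "ё":
            # Tailored sequence: word-initial ё sorts just after е at the primary level.
            primary.append(YE_PRIMARY + 1)
            secondary.append(0)
            i += 2
        elif ch == MARK:
            i += 1  # a marker before any other letter is ignored
        elif ch == "ё":
            # Elsewhere ё has the primary weight of е and a secondary difference.
            primary.append(YE_PRIMARY)
            secondary.append(1)
            i += 1
        else:
            primary.append(2 * ord(ch))
            secondary.append(0)
            i += 1
    return (primary, secondary)

words = ["мел", "ёж", "мёд", "ель"]
unmarked = sorted(words, key=toy_key)
marked = sorted(words, key=lambda w: toy_key(mark_word_starts(w)))
print(unmarked)
print(marked)
```

Without the markup step the key function has no way to tell a word-initial ё from any other, so "ёж" sorts before "ель"; only after the segmentation pass inserts markers can the tailored sequence take effect.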
But there is another possible sense of the question, "does UCA have to be modified... to support...", i.e. is the UTC somehow required to augment the algorithm to support some particular kind of behavior for a particular language's sorting rules, just because someone has turned up particular odd behavior. And I think the answer to that is clearly no. Oh, and by the way, I don't think LDML must (or should) be augmented to enable it to describe any and all lexicographical ordering practices, either. That isn't the function of LDML.
--Ken
