Ken,
A basic question: does the UCA algorithm consider the Russian Ye and the
Russian Yo as equal with regard to sort order? Or is it not meant to solve
that issue?
Leif Halvard Silli
------- Original message -------
From: Whistler, Ken <ken.whist...@sap.com>
To: l...@mailcom.com, jkorp...@cs.tut.fi
Cc: unicode@unicode.org, ken.whist...@sap.com
Sent: 21/12/'12, 22:49
Leo Broukhis said:
> Granted, not yet, but by itself the argument is invalid. Unicode
> collation rules are descriptive;
I'm not sure what you mean by that. UTS #10 is a *specification* of an
algorithm, with various options for tailoring and parameterization which
make it possible to accommodate various needs for particular cases. It is
not intended as a descriptive mechanism.
Perhaps you are referring to LDML, which includes a formal mechanism for
describing a particular collation in terms of the default table and
tailoring options and parameterization options of the UCA.
> if, for example, a language happens to sort accents backwards, this
> rule has to be - and is - accommodated despite its apparent
> illogicality;
Backwards accent secondary weighting was actually included primarily
because of prior art in collation standards, because of the need to be
able to synchronize the UCA algorithm with ISO 14651, and because it makes
it easier to explain how folks can implement versions of multi-level
collation which can pass the conformance tests of the Canadian sorting
standard, etc.
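To make the backwards secondary option concrete, here is a minimal sketch in Python. It is not the UCA and does not use the DUCET: toy_key is a hypothetical helper, and it assumes code points can stand in for primary weights and that combining marks found by NFD decomposition can stand in for secondary weights. The only point is what reversing the secondary level does to the order.

```python
import unicodedata

def toy_key(text, backward_secondary=False):
    """Toy two-level key. Assumption: code points stand in for real weights."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Primary level: base letters only, accents ignored.
    primary = [ord(ch) for ch in decomposed if not unicodedata.combining(ch)]
    # Secondary level: 0 for a base letter, the mark's code point for an accent.
    secondary = [ord(ch) if unicodedata.combining(ch) else 0 for ch in decomposed]
    if backward_secondary:
        # Compare accents starting from the end of the string.
        secondary.reverse()
    return (primary, secondary)

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=toy_key)
backward = sorted(words, key=lambda w: toy_key(w, backward_secondary=True))
print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

With the flag off, the first accent difference from the left decides; with it on, the last one does, which is the ordering the Canadian standard asks for.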
> along the same lines, if a language happens to make a distinction
> discussed in this thread, it has to be accommodated just as well.
No, I don't think so.
It is rather easy to come up with distinctions or collation requirements
which simply cannot be accommodated within the intended bounds of the UCA.
For example, sorting all numerical expressions mixed with text strictly by
their numeric values, or sorting all (or some specified list of)
abbreviations as if they were spelled out, and so forth.
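As a rough illustration of why the numeric case sits outside plain string comparison, the sketch below handles only the simplest version of it (runs of decimal digits) by preprocessing the strings before they are compared. numeric_key is a hypothetical helper, not part of any collation library, and nothing here touches real numerical expressions such as fractions, signs, or spelled-out numbers.

```python
import re

def numeric_key(text):
    """Split into alternating text and digit runs; digit runs become ints.

    re.split with a capturing group always yields text at even indexes and
    digits at odd indexes, so keys compare str with str and int with int.
    """
    parts = re.split(r"(\d+)", text)
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]

titles = ["chapter 10", "chapter 9", "chapter 2"]
print(sorted(titles))                   # ['chapter 10', 'chapter 2', 'chapter 9']
print(sorted(titles, key=numeric_key))  # ['chapter 2', 'chapter 9', 'chapter 10']
```

The numeric interpretation happens in a step that rewrites the input, not in the comparison of weights itself.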
Many lexicographical ordering rules cannot be fully accommodated within
the context of the UCA algorithm, which is a multilevel *string
comparison* specification, and not a dictionary ordering specification.
> My question is as follows: does UCA have to be modified (e.g. by
> adding another bit flag "word-initial primary" next to the existing
> "backward secondary") to support the feature if it were to be
> implemented, or is there a way to achieve the "new Russian online
> collation" within the existing UCA without modifying the strings to
> be sorted before the application of the algorithm?
I don't think there is any out-of-the-box way to use UCA so that an
implementation would automatically recognize a word boundary context and
weight characters conditionally based on that context. So no, I don't
think you could get an implementation to do that without first marking up
text with additional characters to indicate word boundaries and then
tailoring the weight table to weight sequences including that markup
accordingly.
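A minimal sketch of that two-step approach, in Python, might look like the following. Everything in it is an assumption made for illustration: MARK is an arbitrary private-use character, mark_word_starts uses a crude regex in place of real text segmentation, the weights are code points rather than DUCET values, and the rule itself (ё gets its own primary weight only at the start of a word, and otherwise differs from е only at the secondary level) is a hypothetical stand-in for whatever the requested collation actually specifies.

```python
import re
import unicodedata

MARK = "\uE000"  # assumption: a private-use character reserved as word-start markup
YE = "\u0435"    # CYRILLIC SMALL LETTER IE
YO = "\u0451"    # CYRILLIC SMALL LETTER IO

def mark_word_starts(text):
    # Step 1: mark up the text. A regex boundary is a crude stand-in for
    # a real, language-tuned segmentation algorithm.
    return re.sub(r"\b(?=\w)", MARK, text)

def toy_key(text):
    # Step 2: weight the marked-up string with a "tailored" toy table.
    marked = mark_word_starts(unicodedata.normalize("NFC", text.lower()))
    primary, secondary = [], []
    i = 0
    while i < len(marked):
        if marked.startswith(MARK + YO, i):
            # Tailored sequence: marker + yo sorts as its own letter after ye.
            primary.append(ord(YE) + 0.5)
            secondary.append(0)
            i += 2
            continue
        ch = marked[i]
        i += 1
        if ch == MARK:
            continue  # the marker alone carries no weight
        if ch == YO:
            primary.append(ord(YE))  # same primary as ye elsewhere in a word
            secondary.append(1)
        else:
            primary.append(ord(ch))
            secondary.append(0)
    return (primary, secondary)

words = ["мёд", "ёж", "мед", "ель"]
print(sorted(words, key=toy_key))  # ['ель', 'ёж', 'мед', 'мёд']
```

Note that the comparison step never learns what a word is; it only sees an extra character that the tailored table happens to weight as part of a sequence.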
This is actually derived trivially from the fact that UCA knows nothing
whatsoever about word boundaries. At core, it is just a mechanism to take
a string input and provide an output vector of collation weights. You
would have to hook it up to a text segmentation algorithm to even
identify "words", and then that text segmentation algorithm would itself
have to be tailored and tuned to whatever language you had in mind,
because the criteria for identifying "words" will vary from language to
language, and even orthography to orthography.
But there is another possible sense of the question, "does UCA have to be
modified... to support...", i.e. is the UTC somehow required to augment
the algorithm to support some particular kind of behavior for a particular
language's sorting rules, just because someone has turned up particular
odd behavior. And I think the answer to that is clearly no. Oh, and by the
way, I don't think LDML must (or should) be augmented to enable it to
describe any and all lexicographical ordering practices, either. That
isn't the function of LDML.
--Ken