Hi. Thank you very much, Philippe, for your information.
Unfortunately, I must admit that I have no experience using the collation 
algorithm to sort words or strings of words, so I cannot help back up 
your data on efficiency.
If a low-weight field/space separator provides uniformity then perhaps I can 
vote for your low-weight field separator -- would it be at all like  
http://www.unicode.org/reports/tr10/tr10-21.html#Combining_Grapheme_Joiner ?  
(I still think a space character's level can simply be customized, and that 
this character might be used to produce a different sort between di lillo and 
dilillo and also be used to block reordering and allow word-by-word comparison 
with the right code.)
Word-by-word comparison should of course be more efficient, as it lets 
you stop after reading just enough words to distinguish one string from 
another.
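
A rough sketch of that early stop, in Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, words are split on whitespace rather than with real word-boundary rules, and each word's "key" is just its casefolded text rather than real UCA collation weights.

```python
import re
from itertools import zip_longest


def word_keys(text):
    """Lazily yield one comparison key per whitespace-separated word.

    The key is the casefolded word, a stand-in for real collation weights.
    """
    for match in re.finditer(r"\S+", text):
        yield match.group().casefold()


def compare_word_by_word(a, b):
    """Compare two strings one word at a time.

    Returns (result, words_examined), where result is -1, 0 or 1.
    Comparison stops at the first word position that differs.
    """
    examined = 0
    for key_a, key_b in zip_longest(word_keys(a), word_keys(b)):
        examined += 1
        if key_a is None:      # a ran out of words first, so a sorts first
            return -1, examined
        if key_b is None:      # b ran out of words first, so b sorts first
            return 1, examined
        if key_a != key_b:
            return (-1 if key_a < key_b else 1), examined
    return 0, examined


# Only the first two words are read before a decision is reached.
print(compare_word_by_word("di lillo anna maria", "di marco anna maria"))  # (-1, 2)
```

Because the words are produced lazily, the trailing words of long strings are never tokenized once an earlier word has settled the order.
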
At the same time, as I think about it, I can think of no proper names in 
French where the words differ only by level 2 accents* -- although I am not 
French.


So I can see Kenneth's point somewhat here, too.

Sorry that I am not more help.
 

(* The French words distinguished by level 2 marks include ou and où, and then 
there are the varieties of cote with and without accents; see: 
http://www.wordreference.com/fren/cote ; the only other thing I can think of 
are masculine past participles such as râpé, which sometimes are distinct 
from related verbs/nouns/adjectives only by an accent.
This also happens in Spanish, particularly among the pronouns;
see:
http://users.ipfw.edu/jehle/courses/pronoun1.htm  
[also note that 'éste' is a demonstrative masculine pronoun, 'this one,'  
while 'este' is the masculine form of the adjective 'this', or else 'este' 
can mean 'East', 
and then 'esté' is the present subjunctive of 'estar':
http://www.wordreference.com/es/en/translation.asp?spen=este])



Best wishes with this,


--C. E. Whitehead
cewcat...@hotmail.com

 


From: Philippe Verdy (verd...@wanadoo.fr)
Date: Tue Jul 13 2010 - 09:20:10 CDT 
> From: "CE Whitehead" <cewcat...@hotmail.com> 
> To: verd...@wanadoo.fr, unicode@unicode.org 
> Cc: 
> Subject: RE: UTS#10 (collation) : French backwards level 2, and word-breakers. 
> 
> 
>> Hi, I am sort of confused; so there is no way now to put some of the weights 
>> in reverse order at the secondary level while skipping word boundaries? 
>> Philippe Verdy's suggestion seems reasonable, in general; however I think 
>> that not reversing the weights at word boundaries at level 2 should be 
>> simply an option for French; also I do believe that there is already a way 
>> to identify word boundaries at the primary level in the DUCET but I may be 
>> wrong -- that is the characters that define word boundaries, non-spacing 
>> characters, white space, are defined. 
>> 
>> So is the point then to define all word separators -- whether in the form of 
>> white space, a mandatory line break, etc. -- with a single weight in the 
>> DUCET? (Sorry to be so confused.) 

> There's already a standard annex covering word boundaries, and other 
> boundaries : lines, sentences, default grapheme clusters (including 
> ZWJ and ZWNJ, and the 8 Thai/Lao prepended characters, and sequences 
> that include double diacritics)... plus the combining sequence 
> boundaries that are part of the core standard for normalizations. 
Yes.
> Word boundaries are among the simplest boundaries to compute (at 
> least for alphabetic scripts; this is certainly more complex for East 
> Asian scripts, but for those scripts these boundaries are not very 
> useful for collation purposes). 

> No need to reinvent the wheel specifically for UTS#10 collation, which 
> already needs the default grapheme cluster boundaries as the smallest 
> boundaries (each spanning one or more entire combining sequences), 
> possibly extended to cover multiple default grapheme clusters in 
> language-specific clusters (for M-to-1 and M-to-N weight mappings). 
Yes.
> Yes I know also that isolated combining characters may also receive 
> their own collation weights, because they are not necessarily combined 
> within M-to-1 or M-to-N weight mappings. 

> But the way the UCA and the DUCET are built is to make sure that the result 
> will be consistent within at least the default grapheme clusters, 
> independently of language-specific tailorings (but I'm not sure that 
> the UCA algorithm addresses all the consistency issues to make sure 
> that this will be true for all the default grapheme clusters, 
> including in tailorings, when only the combining sequence boundaries 
> are really secured). 

> UTS#10 is also helping us to define the "non-default" grapheme 
> clusters perceived in various languages. It is really a complement to 
> the existing UAX for boundaries, that goes beyond just the purpose of 
> sorting and can be used even without considering any collation weights 
> and independently of collation levels. For example these boundaries 
> can be used in full text indexing, in orthographic correctors, and in 
> semantic analysis of encoded texts, and they may also help for 
> enhancing the usability of text editors, or for text 
> selection/extraction in browsers. 

> As the introduction of backwards levels in UCA was made apparently 
> specifically for French collation, it really forgot one aspect of 
> French collation: that this is only wanted within single words (the 
> most significant secondary differences of accents are to be found at 
> end of each word separately, but not at end of texts of arbitrary 
> length). 
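
The difference described above can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the UCA itself: the weights are invented (primary = base character, secondary = the combining marks found after NFD decomposition), the space is treated as an ordinary primary character, words are split on a plain space, and the function names are hypothetical.

```python
import unicodedata


def weights(text):
    """Return a list of (primary, secondary) pairs, one per base character.

    Toy weights: primary is the casefolded base character, secondary is a
    tuple of the code points of its combining marks (empty if unaccented).
    """
    pairs = []
    for ch in unicodedata.normalize("NFD", text.casefold()):
        if unicodedata.combining(ch):
            if pairs:
                primary, secondary = pairs[-1]
                pairs[-1] = (primary, secondary + (ord(ch),))
        else:
            pairs.append((ch, ()))
    return pairs


def key_backward_whole(text):
    """Secondary weights reversed across the whole string."""
    w = weights(text)
    return [p for p, _ in w], [s for _, s in reversed(w)]


def key_backward_per_word(text):
    """Secondary weights reversed inside each word; words stay in order."""
    primaries = [p for p, _ in weights(text)]
    secondaries = []
    for word in text.split(" "):
        secondaries.extend(s for _, s in reversed(weights(word)))
        secondaries.append(())  # unaccented separator between words
    return primaries, secondaries


# Within a single word both keys give the usual French order.
words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=key_backward_per_word))

# Across several words the two keys disagree.
a, b = "coté cote", "cote côte"
print(sorted([a, b], key=key_backward_whole))     # accent in the LAST word decides
print(sorted([a, b], key=key_backward_per_word))  # accent in the FIRST word decides
```

With whole-string reversal, the accent nearest the end of the entire text dominates, so the first word's accent is outranked by one in the second word. Reversing inside each word keeps the earlier word more significant, and only the position of the accent within a word is read from the end.
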

And even if Kenneth at Sybase thinks that this would complicate things 
or slow down collation, my own experience demonstrates just the 
opposite, precisely for French collation. He himself recognizes that 
French UCA collation is slow, but the cause of this slowness is 
that word boundaries were forgotten in an algorithm that is already 
considering smaller boundaries, and he also admits that a small input 
buffer will be needed for correct handling of normalization, 
canonical equivalence of results, and Unicode process conformance. 
Handling word boundaries has absolutely no impact on the effective 
performance of collations without a backwards level (including the default 
locale-neutral "root" collation directly derived from the DUCET). 

-- Philippe. 
 


                                          
