"CE Whitehead" <cewcat...@hotmail.com> wrote: > At the same time, as I think about it, I can think of no proper names in > French where the words differed only by level 2 accents* -- although I am not > French.
The most common case is not within proper names, but within the various forms of conjugated verbs and their past participles, in the first group.

Backwards ordering comes from French phonology, which puts the stress on the final pronounced syllable (ignoring additional mute 'e' and mute consonants at the end of words, so effectively ignoring some orthographic syllable boundaries). This allows dictionaries to sort together (and present within the same article) many grammatical variations of the same word, according to how they are effectively pronounced. This is not done in English, whose orthography is only very weakly transparent, so English just uses the default ordering of orthographic base letters regardless of their role in the phonology. The phonology of a language is so significant within the « natural » ordering/collation of its words that, when the orthography is sufficiently transparent, spans of letters used as digrams/trigrams are treated as a single letter (this occurs in Spanish, for example, because its orthography is strongly transparent).

Ideally, in French, we would also sort the '-er' termination of verbs in the same collation group as '-é', '-ée', '-ées', after the collation group for '-e', '-es', '-ent'; but this is not done, as there are too many exceptions. It would require semantic knowledge, which is too complicated a requirement for a collation/sort algorithm in most cases, but it is something that French dictionaries actually use: they also consider other regularly conjugated forms, as well as other regular variations of nouns and adjectives, which are collated together in the same group sharing the same definition.

Because the stress is on the final syllable, any difference of accents in this syllable takes a much higher importance in collation than variations of accents on prior syllables. This is the MAJOR justification of why backwards ordering of level 2 differences should occur only within each word, never across word boundaries: separate words carry their own separate stress on important vowels that are clearly differentiated by their accent.

You should also note that backwards ordering of differences is needed only for a very few cases of accented letters in French that exhibit significant differences:
- /ô$/ versus /o$/ : open versus close 'o', in very few word pairs like « môle » vs. « mole ».
- /e(s|nt)?$/ (mute 'e') versus /(e[s]?[t]?[s]?|[èê][stz][s]?)$/ (open or mid-close 'e') versus /(é[e]?[s]?)$/ (closed 'e').

A pure lexical analysis cannot completely disambiguate the phonology (for example the '-ent' finals, which are mute in verbs and nasalized in adverbs). The first case above is now often insignificant for many modern French speakers, as is the difference between mid-close and closed 'e', so this just leaves the case of differentiating the closed 'é' from the open 'è'/'ê' (which also includes the unaccented 'e' in some foreign words like toponyms, or in French words where 'e' appears before a non-mute final consonant).

In summary, only one letter is really concerned by a significant differentiation by its accent, and this single letter is 'é', with its acute accent, which creates a major difference with all other forms of the letter 'e'. All other French accents are unimportant (and the circumflex is already progressively disappearing from the orthography above 'i' and 'u' in most cases, just because it is NEVER pronounced by itself and gives no additional semantic differences in written texts).
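To make those final-'e' groups concrete, here is a minimal sketch in Python, with simplified patterns transcribed from the regex-like notation above; classify_final_e() is a hypothetical helper for illustration, not part of any collation library:

    import re

    # Simplified transcriptions of the patterns above; this is NOT a complete
    # phonological analysis (e.g. the mute '-ent' of verbs cannot be separated
    # from the nasalized '-ent' of adverbs without semantic knowledge).
    FINAL_E_PATTERNS = [
        ("closed", re.compile(r"é(e|es|s)?$")),     # aimé, aimée, aimés, aimées
        ("open",   re.compile(r"[èê][stz]?s?$")),   # après, forêt
        ("mute",   re.compile(r"e(s|nt)?$")),       # aime, aimes, aiment
    ]

    def classify_final_e(word):
        for label, pattern in FINAL_E_PATTERNS:
            if pattern.search(word):
                return label
        return None

    assert classify_final_e("aimé") == "closed"
    assert classify_final_e("forêt") == "open"
    assert classify_final_e("aiment") == "mute"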
When you consider this, backwards processing will only really be needed in a word if there's a 'é' (E WITH ACUTE) as the last vowel within that word, and when this differentiates it from the same word (ignoring case) without accents. Almost all these frequent cases occur with French verbs of the very wide 1st group, and their past participles, which may also be used separately as adjectives. This clearly limits the interest of backwards processing of level 2, given that all other accents may just as well be sorted in the forwards direction.

Do you see what I mean? You can treat it very easily using ONLY FORWARDS levels, by assigning a higher collation weight on the PRIMARY level for the letter 'é' or 'É', ONLY when it occurs at the end of a word (possibly followed by 'e', 'es' or 's' before the word separation). This just requires a small look-ahead of at most 3 characters to detect the end of words. In summary, for French, it is enough to write a primary-level tailoring like:

/e/ < /é(e|es|s)?$/ < 'f'

exactly as if this final 'é' were a digram (similar to the special level-1 tailoring of 'ch' as a distinct base letter in Spanish), and where "$" means a word boundary according to UAX #29 (which itself uses forward processing with a small look-ahead). You can then safely use ONLY forwards processing for ALL collation levels in French, using the SAME and very small look-ahead input buffer:
- for standard Unicode normalizations (NFC/NFD) of combining characters (required for Unicode process conformance);
- for standard UAX #29 combining sequences (grouping also sequences joined by ZWJ and optionally terminated by ZWNJ);
- for standard UAX #29 grapheme cluster boundaries (not necessarily the "extended" boundaries, so you'll not include prepending Thai/Lao letters);
- for standard UAX #29 word boundaries (not necessarily the "extended" boundaries, so you'll not include trailing whitespace), and only if collating at level 2 or more (the additional level-1 tailoring above may be dropped when collation is limited to level 1 only);
- for other UCA tailorings using M-to-1 or M-to-N collation mappings;
- and finally to buffer ONLY words when comparing strings level by level, and only when comparing pairs of texts. This buffer is cleared completely at each word boundary (for French, this means that the input buffer will contain AT MOST 25 characters);
- if you are generating a collation string from a single input text, this last step is avoided, but you keep the need for the two previous steps; in fact, for French, the look-ahead input buffer then reduces to only 4 UTF-16 code units (or 4 code points).

Is that clearer? In other words, backwards processing is not even needed for French: we can be much smarter and produce the correct intended sort order without it (but ONLY if a word-break detector is inserted in the algorithm, something that will not add any significant cost).
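Here is a minimal sketch of that forward-only scan, assuming illustrative integer weights (NOT real DUCET values) and a hypothetical is_word_char() helper standing in for a real UAX #29 word-boundary test:

    # Illustrative primary weights only; real UCA weights differ.
    W_BASE = {"e": 50, "é": 50, "f": 60}   # untailored 'é' shares the primary weight of 'e'
    W_FINAL_E_ACUTE = 55                   # tailored: word-final 'é' sorts between 'e' and 'f'

    def is_word_char(ch):
        # Hypothetical stand-in for a UAX #29 word-character test.
        return ch.isalpha()

    def primary_weights(text):
        """Single forward scan; look-ahead of at most 3 characters."""
        text = text.lower()
        weights = []
        i = 0
        while i < len(text):
            ch = text[i]
            if ch == "é":
                # Skip an optional 'es', 'e' or 's' suffix, then test for a
                # word boundary: at most 3 characters of look-ahead in total.
                j = i + 1
                for suffix in ("es", "e", "s"):
                    if text[j:j + len(suffix)] == suffix:
                        j += len(suffix)
                        break
                if j >= len(text) or not is_word_char(text[j]):
                    weights.append(W_FINAL_E_ACUTE)
                    i += 1
                    continue
            weights.append(W_BASE.get(ch, ord(ch)))  # fallback for this sketch only
            i += 1
        return weights

With this tailoring, « aimé » receives the weights of 'a', 'i', 'm' followed by W_FINAL_E_ACUTE, while the 'é' of « réel » (not word-final) keeps the ordinary primary weight of 'e'.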
And this demonstrates that Kenneth at Sybase WAS CLEARLY WRONG when he affirmed (without testing the solution, or without understanding it) that adding a UAX #29 breaker to the UCA algorithm would reduce performance and would greatly complicate the UCA algorithm (which ALREADY needs an input look-ahead buffer, contrary to what he implied when he said that we don't need any buffer, assuming that a buffer would only be needed to generate sort keys before comparing pairs of texts). In fact it will greatly increase the performance of French collation, and will also greatly improve data locality (thanks to the MUCH smaller input look-ahead). And it will have absolutely no impact on other languages that don't need the detection of word boundaries (something that remains to be demonstrated, even though the case of French has been accepted but "solved" using the wrong tool!).

I want to present the details of my solution (which is certainly highly preferable to the existing "backwards" trick), both to the UTC members and to those currently participating in the ISO technical working group preparing an international standard about collation. But I also want to add another important note about the very high utility of standard UAX #29 segmentations in the collation of ALL locales (which cannot safely ignore them completely):

----

Think about it again: a word breaker (or at least a grapheme cluster breaker, if word breaks make no difference) will also greatly improve the performance of other languages (including English!) when comparing pairs of texts: all the needed levels will be treated word by word, before going to the next word, when comparing pairs of arbitrarily long texts. It is just as usable when generating collation strings from an arbitrarily long text.

Suppose that you have to generate a level-2 collation string from an arbitrarily long text containing words like « Aaaaa Bbbbbb ... Zzzzzz »:
- the word boundaries will be between « Aaaaa », «  », « Bbbbbb », «  », ..., « Zzzzzz »;
- let's say that the word « Aaaaa » generates higher collation weight values « AAAAA » at the primary level, and lower collation weight values « aaaaa » at the secondary level;
- let's say that the space separators (detected as separate UAX #29 segments) generate a higher collation weight value « _ » at the primary level, and a lower collation weight value «   » at the secondary level;
- the usual UCA collation string is built like this, requiring the whole input text to be kept in the buffer: "AAAAA_BBBBBB_..._ZZZZZZaaaaa bbbbbb ... zzzzzz";
- but you can just as well build the collation string like this, where the buffer only needs to hold one UAX #29 segment (a word, for example) at a time: "AAAAAaaaaa_ BBBBBBbbbbbb_ ..._ ZZZZZZzzzzzz".

In other words, the resulting collation string has exactly the same length, but you absolutely don't need an arbitrarily long input buffer to generate a correct collation string: the existing UAX #29 breaks offer excellent opportunities to generate the binary serialization of all levels immediately at each boundary, and allow you to clear that input buffer completely of all characters of that segment of text. You just keep the unprocessed look-ahead characters that may have been read, restoring them into the buffer, as they are part of the next segment; that's why the input buffer should probably be circular, even if it's resizable in case there's ever a longer segment.
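A short sketch of this segment-by-segment construction, assuming toy weights and a hypothetical word_segments() helper in place of a real UAX #29 word breaker:

    import re

    def word_segments(text):
        # Hypothetical stand-in for a UAX #29 word breaker: alternating runs
        # of word characters and separators.
        return re.findall(r"\w+|\W+", text)

    def weights(segment, level):
        # Toy weights only: primary = case-folded code point; secondary = 2
        # for an upper-case letter, 1 otherwise. Real UCA weights differ.
        if level == 1:
            return [ord(c.lower()) for c in segment]
        return [2 if c.isupper() else 1 for c in segment]

    def interleaved_sort_key(text, levels=2):
        key = bytearray()
        for segment in word_segments(text):
            # All levels of this segment are serialized before moving on, so
            # the input buffer never needs to hold more than one segment.
            for level in range(1, levels + 1):
                for w in weights(segment, level):
                    key += w.to_bytes(2, "big")
                key += (0).to_bytes(2, "big")   # level separator, as in UCA keys
        return bytes(key)

Keys built this way are of course only comparable with keys built the same way; the point of the sketch is the bounded input buffer, not byte compatibility with standard UCA sort keys.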
For English, which ONLY needs the smallest « grapheme cluster » boundaries (in case some accents or foreign characters occur), the input buffer needed to generate all the binary weights for each successive level will not very frequently go beyond 2 code points; in fact, for most texts it will need to keep only 1 character of look-ahead (it may grow slightly beyond that only when using the DUCET to treat all non-Basic-Latin Unicode characters, notably some rare Vietnamese, Hangul and Indic default grapheme clusters).

My solution is general enough that each language can specify the level of UAX #29 segmentation it needs to process all possible input texts, as a parameter within its UCA tailoring specifications, because this is the only thing that will really influence (along with the existing M-to-1 and M-to-N mappings in the tailoring rules) the effective buffer length needed for the look-ahead. For English, or for the ROOT locale implied by the Unicode DUCET (and used as a reference within the CLDR project), the default UAX #29 break level to be used within UCA will just be the « default grapheme cluster » boundaries.

Note that the UCA algorithm ALREADY requires the detection of AT LEAST the « default grapheme cluster » boundaries (you should not use any finer level): with a smaller segmentation level, it would not be safe to run the algorithm if the input text is not fully normalized, and it would not be safe to pass over ZWJ and ZWNJ as if they belonged to segments to be collated separately from the rest of their combining sequences.
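As a sketch of that per-locale parameter (the names and values here are purely illustrative; no such table exists in CLDR today):

    # Hypothetical tailoring parameter: the UAX #29 segmentation level that
    # the collator should apply for each locale.
    SEGMENTATION_LEVEL = {
        "root": "grapheme_cluster",   # default, implied by the DUCET
        "en":   "grapheme_cluster",   # 1-2 code points of look-ahead in practice
        "fr":   "word",               # needed for the word-final 'é' tailoring
    }

-- Philippe.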