The deadline is here... Please find attached the last version of my proposal for better handling of apostrophes in DUTR#29.
All the criticism I received was very valuable, and most of it has been included in the proposal, in one form or another. Thanks. _ Marco
Proposal to accommodate French and Italian elision rules in Unicode's DUTR#29

Author: Marco Cimarosti

Rationale - The existing word-boundary rules of DUTR#29 (version 2) are designed to capture the meaning of apostrophes in English (and many other languages), where an apostrophe normally sits inside a word, as in "don't" or "Marco's".

The behavior of apostrophes is quite different in Italian and French, where an apostrophe normally marks elision, i.e. the deletion of the last vowel of a word that occurs before another word starting with a vowel. E.g. "d'Unicode" (d' elision of de = "of"), or "l'Angleterre" (l' from la = "the"), "d'un'altr'annata" (elision of di una altra annata: "of a past year"). The two (or more) words are graphically joined (no space before or after the apostrophe). The apostrophe is part of the word that precedes it, and an implicit word break comes after it.

Implementing this behavior in the default definition of UTR#29 is important to accommodate the large French- and Italian-speaking communities, as well as the needs of people writing in other languages, who often use loanwords or quotations from these popular languages.

Proposed heuristic - The present proposal is based on the observation that elision apostrophes are always followed by a vowel, whereas English-style "joining" apostrophes are normally followed by a consonant. The issue is complicated by the fact that both French and Italian have mute H's, which can interfere with the algorithm.

The proposal defines three new character classes: ElisionVowel (containing all the meaningful vowels in French and Italian), ElisionMute (containing only the letter H in upper and lower case), and ElisionApostrophe (containing the characters used as apostrophes). The characters contained in the new classes are removed from the classes where they used to be (ALetter and MidLetter).
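As a rough illustration of the heuristic, here is a minimal Python sketch. It is not the rule table of the proposal: the contents of the three classes below are assumptions chosen for illustration (plain a/e/i/o/u plus their accented forms as vowels, U+0027 and U+2019 as apostrophes), and the function names are hypothetical. It treats an apostrophe as an elision mark, with a break after it, when it is preceded by a consonant and followed either by a vowel or by H plus a vowel.

```python
import unicodedata

# Assumption: illustrative class contents, not the actual sets of the proposal.
ELISION_APOSTROPHE = {"'", "\u2019"}   # ASCII apostrophe, right single quote
ELISION_MUTE = {"h", "H"}              # mute H
BASE_VOWELS = set("aeiou")             # accented forms handled via NFD below


def is_elision_vowel(ch):
    """True if ch is a vowel once accents are stripped (e.g. 'é' -> 'e')."""
    base = unicodedata.normalize("NFD", ch)[0].lower()
    return base in BASE_VOWELS


def is_consonant(ch):
    """True for any letter that is not counted as a vowel here."""
    return ch.isalpha() and not is_elision_vowel(ch)


def break_after_apostrophe(text, i):
    """True if text[i] is an apostrophe that looks like an elision mark.

    Patterns accepted: consonant + apostrophe + vowel,
    and consonant + apostrophe + H + vowel.
    """
    if text[i] not in ELISION_APOSTROPHE:
        return False
    if i == 0 or not is_consonant(text[i - 1]):
        return False
    rest = text[i + 1:i + 3]
    if rest[:1] and is_elision_vowel(rest[0]):
        return True
    return (len(rest) == 2 and rest[0] in ELISION_MUTE
            and is_elision_vowel(rest[1]))


def split_elisions(token):
    """Split a space-free token after every elision apostrophe."""
    pieces, start = [], 0
    for i in range(len(token)):
        if break_after_apostrophe(token, i):
            pieces.append(token[start:i + 1])
            start = i + 1
    pieces.append(token[start:])
    return pieces


print(split_elisions("d'un'altr'annata"))  # ["d'", "un'", "altr'", 'annata']
print(split_elisions("l'homme"))           # ["l'", 'homme']
print(split_elisions("don't"))             # ["don't"]
```

With these assumed sets, the elided forms come out as separate words with the apostrophe attached to the first one, while English forms such as "don't" and "Marco's" stay whole.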
The new classes are used to define two new rules (before current rule 6) for elision apostrophes, which cover the cases C'V and C'hV (where C is a consonant and V is a vowel). Several rules are slightly changed because the former classes ALetter and MidLetter are now split into two or more classes.

Open issues - Although this proposal might improve the handling of some common cases in two common languages, many edge cases remain that can only be solved by tailoring the algorithm for specific languages. For instance, the "c'h" trigraph of the Breton language, or the "g'" digraph of Uzbek, would be unduly split by the default definition when followed by a vowel.

Discussion - This proposal has been discussed publicly on the Unicode Public E-mail List. I thank all the people who took part in the discussion. All the criticism I received was very valuable, and most of it has been incorporated in this version, in one form or another.

Note - The proposed changes are concentrated in Table 2 (Default Word Boundaries). Proposed additions are colored in green and underlined, proposed deletions are colored in ...

Table 2. Default Word Boundaries
...

