On Sun, 20 Oct 2024 at 10:48, Charlotte Eiffel Lilith Buff via Unicode <[email protected]> wrote:
> As I understand it (and I believe this was even the wording used in
> previous versions of UAX #15), the script-specific exclusions exist because
> for a handful of characters the fully decomposed form is the preferred
> representation in regular usage. This makes sense to me for the precomposed
> Hebrew letters because with so many combining marks with unique CCC values,
> it just seems easier to deal exclusively with combining character sequences
> and not have some random marks “glue” themselves to the base letter. The
> two-part Tibetan subjoined letters are similar in this regard.
> However, the Indic nuktas seem entirely unproblematic and in fact not all
> precomposed letters with nukta are composition-excluded: Devanagari has ऩ,
> ऱ, and ऴ for example.
>
> Does anyone remember what led to these specific decisions or knows where
> to find the relevant documents if they exist?

I certainly wasn’t involved in Unicode when the relevant documents were discussed, as I was busy learning the letters in the Basic Latin block¹, but I looked at some of them a couple of years ago.

- Revision 9 of then-DUTR² #15 <https://www.unicode.org/reports/tr15/tr15-9.html>, dated 1998-11-23, and entered into the registry <https://www.unicode.org/L2/L1998/Register-1998.html> as L2/98-404, does not mention composition exclusions.
- The first revision (10) that mentions characters *excluded from being primary composites* is <https://www.unicode.org/reports/tr15/tr15-10.html#Definitions>, dated 1998-12-16. The rationale is indeed that *This would be to match common practice for scripts that use fully decomposed forms.* The sole example given is FB31.
- The next revision (11) includes a list of composition exclusions: <https://www.unicode.org/reports/tr15/tr15-11.html#Primary%20Exclusion%20List%20Table>, dated 1999-02-25. This list includes 0958..095F.

Between revisions 9 and 10, we have UTC #78, whose minutes are L2/98-419 <https://www.unicode.org/L2/L1998/98419.pdf>.
See the discussion in the section titled “Normalization [Document L2/98-404]”, and in particular the last comment from Ken Whistler.

Between revisions 10 and 11, we have UTC #79, in whose minutes L2/99-054R <https://www.unicode.org/L2/L1999/99054r.htm#79-0>, in the section “Proposed Draft UTR #15, Unicode Normalization”, we get a similar comment from Ken towards the end.

The minutes of UTC #80, L2/99-176 <https://www.unicode.org/L2/L1999/99176.htm>, have some discussion of normalization, and motion 80-M25 letting the editorial committee change the composition exclusions table; but by that point 0958 is already in there, so digging there isn’t going to help.

However, some later documents provide relevant context:

- L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf> (p. 17, in the section on Devanagari).
- L2/01-305 <https://www.unicode.org/L2/L2001/01305-india-resp.txt> (section on Devanagari).

So there was clear feedback from India that U+0958 क़ and friends should be discouraged; presumably the UTC must have been aware of that in 1999.

On the distinction between क़ vs. ऴ, I guess this is related to ऴ being atomic in ISCII; in turn that is because while ऴ is decomposable, corresponding letters in other ISCII scripts (ழ, ఴ, ഴ) are not. See also point (viii) of L2/01-304 <https://www.unicode.org/L2/L2001/01304-feedback.pdf>; there still was a desire to make the encodings similar between the scripts.

I am sure Ken can provide more details.

Best regards,

Robin Leroy

―
¹ As well as a few from the Latin-1 Supplement and Latin Extended-A blocks.
² This predates L2/00-118 <https://www.unicode.org/L2/L2000/00118-parts.txt> and UTC decision 83-C6 <https://www.unicode.org/L2/L2000/00115.htm#83-C6> which gave us the terms UAX and UTS.
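
For anyone who wants to check the asymmetry discussed in this thread (ऩ, ऱ and ऴ recompose under NFC, while 0958..095F and FB31 stay decomposed), here is a minimal Python sketch using the standard `unicodedata` module. The helper names `code_points` and `recomposes` are made up for illustration, and the output reflects whatever Unicode version the Python interpreter bundles; it is only a way to observe the behaviour, not a statement about how the exclusions were decided.

```python
import unicodedata


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def recomposes(ch):
    """Return True if NFC maps the canonical decomposition of ch back to ch.

    Hypothetical helper for illustration: for a precomposed character with a
    canonical decomposition, False means NFC leaves it decomposed, which is
    what happens for composition-excluded characters.
    """
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.normalize("NFC", nfd) == ch


# Precomposed characters mentioned in the thread.
samples = [
    "\u0929",  # DEVANAGARI LETTER NNNA
    "\u0931",  # DEVANAGARI LETTER RRA
    "\u0934",  # DEVANAGARI LETTER LLLA
    "\u0958",  # DEVANAGARI LETTER QA
    "\u095F",  # DEVANAGARI LETTER YYA
    "\uFB31",  # HEBREW LETTER BET WITH DAGESH
]

for ch in samples:
    nfd = unicodedata.normalize("NFD", ch)
    nfc = unicodedata.normalize("NFC", ch)
    print(
        code_points(ch),
        unicodedata.name(ch),
        "| NFD:", code_points(nfd),
        "| NFC:", code_points(nfc),
        "| recomposes:", recomposes(ch),
    )
```

Only code points are printed, so the script does not depend on the terminal being able to display Devanagari or Hebrew.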
