> Unlike NFKC and NFKD, NFLC and NFLD would be an extensible superset
> based on MUTABLE character properties (these can also be called
> "decomposition mappings", except that once a character is added to the
> new property file, it won't be removed, so there is some stability as
> well: the decision to "deprecate" an old encoding can only be made if
> there is a new recommendation, and if that recommendation ever changes
> and is itself deprecated, the previous "legacy decomposition mappings"
> can still be decomposed again to the newly recommended decompositions).
> Unlike with NFKC and NFKD, a "legacy decomposition" is not "final" in
> all future versions: a future version may remap it by just adding new
> entries for the characters newly considered "legacy" and no longer
> recommended. This new properties file would allow evolution and
> adaptation to human languages, and would allow correcting past errors
> in the standard. The file should have this form:
>
> # deprecated codepoint(s) ; new preferred sequence ; Unicode version in which it was deprecated
> 101234 ; 101230 0300... ; 10.0
>
> This file can also be used to deprecate some old variation sequences,
> or some old clusters made of multiple characters that are not
> deprecated in isolation.
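To make the proposed layout concrete, here is a minimal Python sketch of a reader for such a file. Nothing here is an existing UCD parser: the names `LegacyMapping` and `parse_legacy_mappings` are hypothetical, and the sketch assumes three semicolon-separated fields, space-separated hexadecimal code points, `#` comments, and a dotted version number, as in the sample line quoted above.

```python
from typing import List, NamedTuple, Tuple


class LegacyMapping(NamedTuple):
    """One entry of the (hypothetical) legacy mappings file."""
    source: str               # deprecated code point(s)
    target: str               # new preferred sequence
    version: Tuple[int, ...]  # Unicode version of the deprecation


def _hex_sequence(field: str) -> str:
    # "101230 0300" -> the two-character string U+101230 U+0300
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_legacy_mappings(text: str) -> List[LegacyMapping]:
    mappings = []
    for raw in text.splitlines():
        # Drop comments and skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {raw!r}")
        source = _hex_sequence(fields[0])
        target = _hex_sequence(fields[1])
        version = tuple(int(part) for part in fields[2].split("."))
        mappings.append(LegacyMapping(source, target, version))
    return mappings
```

Keeping the version as a tuple of integers means entries can be sorted by the version in which they were deprecated without any string comparison surprises (for example 10.0 sorting after 9.0).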
Another note: this new decomposition mapping file for NFLC and NFLD, where NFLC is defined as NFC(NFLD), has some stability requirements, and it must be guaranteed that NFD(NFLD) = NFD: the "legacy mapping forms" must be a conforming process respecting canonical equivalences:

- Unlike the main UCD file for canonical decompositions, the decompositions listed there are not limited to mapping one character to one or two characters.
- The first column should be given in NFC form; the NFD form may also be used, and this does not change the result. It is NOT required that the first column be in NFKC or NFKD form (so the decompositions previously recommended by a "compatibility mapping" in the main UCD can be ignored: that was just a suggestion, and a requirement only for NFKC and NFKD). This allows NFLC and NFLD to correct past errors in the permanently frozen NFKC and NFKD decompositions.
- The mapping done here is permanent but versioned (by the first version of Unicode deprecating a character or sequence). Being permanent means that the deprecation cannot be removed, but it can still be changed if the target string (preferably listed in NFC form) contains some newly deprecated characters (which will be added separately).
- If the target of the mapping contains other deprecated characters or sequences (added to the same file), the decompositions listed there become recursive: a derived data file can be produced listing only the new recommended mappings.
- If a source string "SATB" is canonically equivalent to "SBTA", and "SA" is listed as a legacy sequence mapped to be replaced by "X" in this file, then the NFLD process will not just decompose "SATB" into NFD("XTB"), but will also decompose "SBTA" into NFD("XBT").
- If a source string "SATB" is NOT canonically equivalent to "SBTA", and "SA" is listed as a legacy sequence mapped to be replaced by "X" in this file, then the NFLD process will decompose "SATB" into NFD("XTB"), but will not automatically decompose "SBTA" into NFD("XBT").

The CLDR project can then use NFL(C/D), rather than NFK(C/D), as a better source for deriving collation elements (in the DUCET or root locale): NFL(C/D) will follow the new recommendations and will correctly adapt the collation orders for legacy encodings.

Tailored collations (per-locale) are not required to use the compatibility mappings in the main UCD file, or those in this file; they will use them only if they are based on the DUCET or the collation order of the "root" locale. For that purpose, tailored collations may specify an alternate set of "compatibility or legacy mappings" (to apply after NFC or NFD normalization, which is still required).

Maybe the CLDR project would like these derived collation elements to be orderable (so that it can infer and order the new relative weights needed for ordering strings containing "legacy characters"), but that may require another column in the legacy mappings data file (in my opinion the "Unicode version" field already offers a suitable relative ordering by default).
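The recursive mappings and the derived data file described in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a specification of NFLD: the function names are hypothetical, the table is a plain dictionary from deprecated sequence to preferred sequence, and matching is contiguous only, after both the table and the input have been put in NFD. A real process would also have to handle sources whose characters are separated by reordered combining marks (the "SATB" / "SBTA" case), which this sketch does not attempt.

```python
import unicodedata
from typing import Dict


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def apply_once(text: str, table: Dict[str, str]) -> str:
    """One left-to-right pass, trying the longest source first."""
    keys = sorted((k for k in table if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def derive_final_mappings(raw: Dict[str, str],
                          max_rounds: int = 32) -> Dict[str, str]:
    """Expand targets that themselves contain deprecated sequences."""
    table = {nfd(src): nfd(dst) for src, dst in raw.items() if src}
    derived = {}
    for src, dst in table.items():
        for _ in range(max_rounds):
            expanded = apply_once(dst, table)
            if expanded == dst:
                break
            dst = expanded
        else:
            # Assumption: a well-formed file never maps in a cycle.
            raise ValueError(f"mapping for {src!r} does not stabilize")
        derived[src] = dst
    return derived


def nfld(text: str, raw: Dict[str, str]) -> str:
    """Sketch of NFLD: NFD, replace legacy sequences, NFD again."""
    return nfd(apply_once(nfd(text), derive_final_mappings(raw)))


def nflc(text: str, raw: Dict[str, str]) -> str:
    """NFLC is defined above as NFC(NFLD)."""
    return unicodedata.normalize("NFC", nfld(text, raw))
```

As a usage example, a table such as `{"A": "BC", "B": "D"}` (plain letters standing in for deprecated characters) yields the derived table `{"A": "DC", "B": "D"}`. With an illustrative entry mapping U+0149 to U+02BC U+006E, an input containing U+00E9 and the same input with U+0065 U+0301 produce identical `nfld` output, because both the table and the input go through NFD before matching.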