Not really, because the derived primary weights are already present in the DUCET, within the expansions of CJK compatibility characters. We need to take into account how the DUCET values are computed, in order to deduce between which sequences containing CJK ideographs the compatibility characters will fit in an ordered sequence of strings, including at the primary level of collation.
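For reference, the UTS #10 rule that produces these derived weights can be sketched like this (the helper name and the simplified Han ranges are mine, not normative):

```python
# Sketch of the UTS #10 implicit (derived) primary weight rule, as of
# UCA 6.0. A code point yields two 16-bit primary weights: a base plus
# the high bits (cp >> 15), then the low 15 bits with the top bit set.
# The Han range tests below are simplified; the real rule uses the
# Unified_Ideograph property.
def implicit_primary(cp):
    if 0x4E00 <= cp <= 0x9FFF or 0xF900 <= cp <= 0xFAFF:
        base = 0xFB40   # core Han
    elif 0x3400 <= cp <= 0x4DBF or 0x20000 <= cp <= 0x2A6DF:
        base = 0xFB80   # extended Han (extensions A, B, ...)
    else:
        base = 0xFBC0   # any other code point without an explicit weight
    return base + (cp >> 15), (cp & 0x7FFF) | 0x8000
```

For example, U+4E00 maps to the weight pair (FB40, CE00), which is exactly how the CJK compatibility expansions end up interleaved with the unified ideographs at the primary level.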
Yes, renumbering is not an easy task if we use the current DUCET format, which does not offer an easy view of how it is really structured. That's why I really don't like the fact that it fixes arbitrary weight values without exposing their relative properties. I much prefer the compressed syntax used in LDML, like "&a<<<A<<à<<<À<b<<<B...", which does not fix any arbitrary weights, because that is clearly not needed: we have much more freedom in how weights are generated, and the syntax is much more expressive, without the additional (and in fact unneeded) fixing of those arbitrary weight values.

I say this because I am implementing this idea (for now I still have problems computing the contextual rules, notably with a context on the right side, but absolutely no problem representing the pseudo collation elements for minimum script primaries). What I am in fact writing is yet another representation where I just set:

  level1 = [Latn], a, b, ..., [Grek], ..., [Cyrl], ...
  level2 = a, à, b, ...
  level3 = a, A, à, À, b, B

using a simple and very compact format (not requiring any weight values) made of simple ordered lists, with absolutely NO complex operators between them. The commas here just illustrate the ordered-list format, which is implicit; it can be abbreviated as:

  level1=[Latn]a-z...[Grek]...[Cyrl]...
  level2=aàb...
  level3=aAàÀbB

No reset is needed (resets are implicit from the presence of a collation element at level N+1 which is already positioned in the list for level N). I can still use delimiters to represent contractions, and I don't even need to represent expansions. If needed, I can add supplementary statistics about the usage of primary ranges in a specific language.
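A rough sketch of how such per-level lists can be compared directly, without any stored weight values (the NFD/case-fold fallback here is only a crude stand-in for the implicit inheritance between levels):

```python
from fractions import Fraction
import unicodedata

# Illustrative only: each level is nothing but an ordered list, and an
# element's weight at that level is its rational position in [0..1[.
LEVELS = [
    ["a", "b"],                     # level 1: base letters
    ["a", "à", "b"],                # level 2: accents differentiated
    ["a", "A", "à", "À", "b", "B"], # level 3: case differentiated
]

def key(s):
    """Collation key: one tuple of rational weights per level.

    A character absent from a level's list is reduced by NFD (dropping
    accents) and then case folding, a simplification of the implicit
    inheritance from lower levels described above.
    """
    out = []
    for lst in LEVELS:
        row = []
        for ch in s:
            c = ch
            if c not in lst:
                c = unicodedata.normalize("NFD", c)[0]
            if c not in lst:
                c = c.lower()
            row.append(Fraction(lst.index(c), len(lst)))
        out.append(tuple(row))
    return tuple(out)
```

With these lists, "a", "A" and "à" share the same level-1 weight and only separate at levels 2 and 3, exactly as the ordered lists state, and no arbitrary numeric weight ever had to be written down.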
From this data, I deduce rational weights. I can also infer an optimal Huffman or arithmetic coding (only needed when generating collation keys; such data is not needed for just comparing Unicode strings, because I can always split the half-open range of rationals [0..1[ into as many partitions as wanted: no fixed bit precision).

Adding a tailoring just consists of specifying an ordered list of collation elements (single characters or contractions), each list specifying the collation level at which the collation elements are differentiated. I also don't need to load levels 2..N at all if performing only a level-1 collation, and this is enough for representing contextual collation elements as well. Things like collation mappings are much easier to perceive and specify correctly.

This format can even represent all case mappings and case foldings existing in the UCD, just as specialized but limited collations. Compatibility decomposition mappings of the UCD are also represented as a tailoring. Even the canonical decomposition mappings are represented by a "level-infinite" list like the above (though it cannot represent the canonical combining classes).

I no longer need to specify the gaps for interleavings, because they are implicitly present at all levels, and almost infinitely tailorable (up to the maximum precision of rationals). The same format can be used to represent the mappings used for transcoding from/to non-Unicode encodings, once again as a specialized tailoring.

I can also represent the collation level numbers themselves as rationals. For example, with Hangul, I can set lists for level 1 listing only the leading consonants, level 1.3 listing these consonants plus LV syllables and vowel jamos, and level 1.6 adding LVT syllables and trailing consonants: no more need of "trailing weights".

Maybe you don't see the interest for now, simply because it is very different from what you have designed in ICU.
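A tiny illustration of why rationals leave room at every gap (Fraction here just stands in for any exact-rational type):

```python
from fractions import Fraction

# Illustrative only: a tailored element inserted between two neighbours
# just takes a midpoint in [0..1[. Exact rationals grow in precision as
# needed, so there is no fixed bit width to exhaust and no gap to
# reserve in advance.
def between(lo, hi):
    """A weight strictly between two existing rational weights."""
    return (lo + hi) / 2

w_a, w_b = Fraction(1, 4), Fraction(1, 2)  # weights of two neighbours
w = w_a
for _ in range(100):       # 100 successive tailorings between them
    w = between(w, w_b)    # each one still lands strictly inside
    assert w_a < w < w_b
```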
But many of the complicated cases and exceptions that you need to handle with complex code in ICU would be simplified a lot.

Finally, my initial desire when posting the comment about derived weights was about how PUAs are currently ordered (mixed at the primary level with all other unassigned code points and non-characters), when they should be treated as a script by themselves, should be representable using the abbreviated LDML format like "& [Hani] < [Qqqq]", and should be easily reorderable in any tailoring if one does not want them ordered after all sinograms and all other collation elements (except possibly the set of non-characters, and surrogates).

Philippe.

2011/9/13 Mark Davis ☕ <[email protected]>

> I don't think there is any particular value to that restructuring, from
> what I can make of your email.
>
> Note also, with regard to your message about 'real' weights, that there is
> no requirement that implementations preserve the DUCET values, as long as
> the ordering is the same. In particular, CLDR and many implementations use
> the 'fractional' UCA weights, which are derived from the DUCET values, but
> express weights using a variable number of bytes. These are similar to your
> 'rationals' but are really decimal values chunked into bytes, with some
> extra features to allow interleaving and avoid overlap.
>
> http://unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html
>
> Mark
> *— Il meglio è l'inimico del bene —*
>
> On Sun, Sep 11, 2011 at 01:06, Philippe Verdy <[email protected]> wrote:
>
>> I think that the UCA forgets to specify which are the valid primary
>> weights inferred from the default rules used in the current DUCET.
>>
>> # Derived weight ranges: FB40..FBFF
>> # [Hani] core primaries: FB40..FB41 (2)
>> U+4E00..U+9FFF    FB40..FB41 (2)
>> U+F900..U+FAFF    FB41 (1)
>> # [Hani] extended primaries: FB80..FB9D (30)
>> U+3400..U+4DBF    FB80 (1)
>> U+20000..U+EFFFF  FB84..FB9D (29)
>> # Other primaries: FBC0..FBE1 (34)
>> U+0000..U+EFFFF   FBC0..FBDD (30)
>> U+F0000..U+10FFFF FBDE..FBE1 (4)
>> # Trailing weights: FC00..FFFF (1024)
>>
>> It clearly exhibits that the currently assigned ranges of primary weights
>> are way too large for their use.
>>
>> - Sinograms can fully be assigned a first primary weight within a set of
>> only 32 values, instead of the 128 assigned.
>>
>> - This leaves enough room to separate the primary weights used by PUA
>> blocks (both in the BMP and in planes 15 and 16), which just requires 1
>> primary weight for the PUAs in the BMP, and 4 primary weights for the last
>> two planes (if some other future PUA ranges are assigned, for example for
>> RTL PUAs, we could imagine that this count of 5 weights would be extended to
>>
>> - All other primaries will never be assigned to anything outside planes 0
>> to 14, and only for unassigned code points (whose primary weight values
>> should probably be between the first derived primary weights for sinograms
>> and those for the PUA), so they'll never need more than 30 primary weights.
>>
>> Couldn't we remap these default bases for derived primary weights like
>> this, and keep more space for the rest?
>>
>> # Derived weight ranges: FBB0..FBFF (80)
>> # [Hani] core primaries: FBB0..FBB1 (2)
>> U+4E00..U+9FFF    FBB0 (1)  (using base=U+2000 for the 2nd primary weight)
>> U+F900..U+FAFF    FBB1 (1)  (using base=U+A000 for the 2nd primary weight)
>> # [Hani] extended primaries: FBB2..FBCF (30)
>> U+3400..U+4DBF    FBB2 (1)  (using base=U+2000 for the 2nd primary weight)
>> reserved          FBB3 (1)
>> U+20000..U+EFFFF  FBB4..FBCF (26)  (using base=U+n0000 or U+n8000 for the 2nd primary weight)
>> # Other non-PUA primaries: FBD0..FBEF (32)
>> U+0000..U+EFFFF   FBD0..FBED (30)  (using base=U+n0000 or U+n8000 for the 2nd primary weight)
>> reserved          FBEE..FBEF (2)
>> # PUA primaries: FBF0..FBFF (16)
>> U+E000..U+F8FF    FBF0 (1)  (using base=U+n8000 for the 2nd primary weight)
>> reserved          FBF1..FBFB (11)
>> U+F0000..U+10FFFF FBFC..FBFF (4)  (using base=U+n0000 or U+n8000 for the 2nd primary weight)
>> # Trailing weights: FC00..FFFF (1024)
>>
>> This scheme completely frees the range FB40..FBAF, while reducing the
>> gaps currently left, which will never have any use.
>>
>> (In this scheme, I have no opinion on which range would be best for code
>> points assigned to non-characters, but they could all map to FBFF, used
>> here for PUA, but with the second primary weight at the end of the
>> encoding space 8000..FFFF moved to 4000..BFFF, so that the second primary
>> weight for non-characters goes easily into C000..FFFF.)
>>
>> This way, we would keep ranges available for future large non-sinographic
>> scripts (pictographic, non-Han ideographic) that would probably use only
>> derived weights, or for a refined DUCET containing more precise levels or
>> gaps facilitating some derived collation tables (for example in CLDR).
>>
>> And all PUAs would clearly sort within dedicated ranges of primary
>> weights, with a guarantee of all being sorted at the end, after all
>> scripts.
>>
>> -- Philippe.
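P.S. A quick arithmetic check of the counts in my quoted message (the helper function here is just illustrative):

```python
# Quick sketch checking the counts quoted above, which follow from
# splitting code points into 32K half-planes via cp >> 15.
def halfplanes(lo, hi):
    """Number of distinct (cp >> 15) values over an inclusive range."""
    return (hi >> 15) - (lo >> 15) + 1

# All sinogram ranges together fit within far fewer first-weight slots
# than the 128 (FB40..FBBF) currently set aside:
han = (halfplanes(0x4E00, 0x9FFF) + halfplanes(0xF900, 0xFAFF)
       + halfplanes(0x3400, 0x4DBF) + halfplanes(0x20000, 0xEFFFF))
assert han <= 32                               # "a set of only 32 values"
assert halfplanes(0xE000, 0xF8FF) == 1         # BMP PUA: 1 weight
assert halfplanes(0xF0000, 0x10FFFF) == 4      # planes 15-16: 4 weights
```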

