It's not just a question of "I like it or not". The fact is that the standard makes the presence of 0000 required in some steps, and that requirement is wrong: a 0000 weight is NEVER required to create an equivalent collation order. These steps are completely unnecessary and should be removed.
On Fri, Nov 2, 2018 at 2:03 PM Mark Davis ☕️ <m...@macchiato.com> wrote:

> You may not like the format of the data, but you are not bound to it. If
> you don't like the data format (e.g. you want [.0021.0002] instead of
> [.0000.0021.0002]), you can transform it however you want as long as you
> get the same answer, as it says here:
>
> http://unicode.org/reports/tr10/#Conformance
> “The Unicode Collation Algorithm is a logical specification.
> Implementations are free to change any part of the algorithm as long as any
> two strings compared by the implementation are ordered the same as they
> would be by the algorithm as specified. Implementations may also use a
> different format for the data in the Default Unicode Collation Element
> Table. The sort key is a logical intermediate object: if an implementation
> produces the same results in comparison of strings, the sort keys can
> differ in format from what is specified in this document. (See Section 9,
> Implementation Notes.)”
>
> That is what is done, for example, in ICU's implementation. See
> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
> collation elements" and "sort keys" to see the transformed collation
> elements (from the DUCET + CLDR) and the resulting sort keys.
>
> a => [29,05,_05] => 29 , 05 , 05 .
> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
> à => <same>
> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
> À => <same>
>
> Mark
>
> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
>
>> As well, step 2 of the algorithm speaks about a single "array" of
>> collation elements. Actually it's best to create one separate array per
>> level, and append the weights for each level to the relevant array for
>> that level.
>> Steps S2.2 to S2.4 can do this, including for derived collation
>> elements in section 10.1, or variable weighting in section 4.
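A minimal sketch of this per-level variant of step 2, using a tiny hypothetical collation-element table (the weights here are illustrative, loosely modeled on the raw elements Mark showed above, not real DUCET data):

```python
# Hypothetical mini-table: char -> collation element (primary, secondary, tertiary).
# A zero entry means the element is ignorable at that level and contributes nothing.
CE_TABLE = {
    'a': (0x29, 0x05, 0x05),
    'b': (0x2A, 0x05, 0x05),
    '\u0300': (0x00, 0x8A, 0x05),   # combining grave accent: primary-ignorable
}

def per_level_arrays(s, table=CE_TABLE):
    """Variant of UCA step 2: append each weight directly into one array
    per level, never materializing any 0000 placeholder."""
    levels = ([], [], [])
    for ch in s:
        for lvl, w in enumerate(table[ch]):
            if w:                      # skip ignorable levels entirely
                levels[lvl].append(w)
    return levels
```

Comparing the resulting tuples of per-level arrays lexicographically gives the same order as comparing the sort keys of the algorithm as specified.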
>>
>> This also means that for fast string compares, the primary weights can be
>> processed on the fly (without any buffering) if the primary weights
>> differ between the two strings (including when one or both of the two
>> strings end and none of the secondary or tertiary weights seen so far is
>> higher than the minimum weight value for its level).
>> Otherwise:
>> - the first secondary weight higher than the minimum secondary weight
>> value, and all subsequent secondary weights, must be buffered in a
>> secondary buffer;
>> - the first tertiary weight higher than the minimum tertiary weight
>> value, and all subsequent tertiary weights, must be buffered in a tertiary
>> buffer;
>> - and so on for higher levels (each buffer just needs to keep a counter,
>> set when the buffer is first used, indicating how many weights were not
>> buffered while processing and counting the primary weights, because those
>> weights were all equal to the minimum value for the relevant level);
>> - these secondary/tertiary/etc. buffers are only used once you reach the
>> end of the two strings at the primary level without finding any
>> difference: you start by comparing the initial counters of the two
>> buffers, and the buffer with the larger counter value necessarily belongs
>> to the smaller compared string. If both counters are equal, you then
>> compare the weights stored in each buffer, until one buffer ends before
>> the other (the shorter buffer belongs to the smaller compared string). If
>> both weight buffers reach their end, you move to the pair of buffers built
>> for the next level and process them with the same algorithm.
>>
>> Nowhere will you ever need to consider any [.0000] weight, which is just a
>> notation in the format of the DUCET intended to be readable by humans but
>> never needed in any machine implementation.
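The counter-plus-buffer idea for one level might look like this sketch (the minimum secondary weight 0x05 and the weight lists are assumptions for illustration, not DUCET data):

```python
MIN_SEC = 0x05  # assumed minimum secondary weight

def secondary_buffer(weights):
    """Count the leading run of minimum-valued weights instead of storing
    them; buffer everything from the first higher weight onward."""
    skipped, buf = 0, []
    for w in weights:
        if not buf and w == MIN_SEC:
            skipped += 1
        else:
            buf.append(w)
    return skipped, buf

def compare_secondary(w1, w2):
    """Compare two secondary-weight sequences using only the buffers.
    A larger leading-minimum counter means the other string reached an
    above-minimum weight earlier, so the larger counter sorts first."""
    n1, b1 = secondary_buffer(w1)
    n2, b2 = secondary_buffer(w2)
    if n1 != n2:
        return -1 if n1 > n2 else 1
    if b1 != b2:
        return -1 if b1 < b2 else 1   # shorter prefix sorts first
    return 0
```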
>>
>> Now if you want to create sort keys, this is similar, except that you
>> don't have two strings to process and compare; all you want is to create
>> separate arrays of weights for each level. Each level can be encoded
>> separately, but the encoding must be made so that when you concatenate
>> the encoded arrays, the first few encoded *bits* of the secondary or
>> tertiary encodings cannot be larger than or equal to the bits used by the
>> encoding of the primary weights (this only limits how you encode the
>> first weight of each array, as its first encoded *bits* must be lower
>> than the first bits used to encode any weight in the previous levels).
>>
>> Nowhere are you required to encode weights exactly as their logical
>> weight; the encoding is fully reversible and can use any suitable
>> compression technique if needed. As long as you can safely detect where
>> an encoding ends, because it encounters some bits (with lower values)
>> used to start the encoding of one of the higher levels, the compression
>> is safe.
>>
>> For each level, you can reserve a single code used to "mark" the start of
>> the next higher level, followed by some bits indicating which level it
>> is, then followed by the compressed code for that level, made so that
>> each weight is encoded by a code not starting with the reserved mark.
>> That encoding "mark" is not necessarily a 0000: it may be a null byte, or
>> a '!' (if the encoding must be readable as ASCII or UTF-8, and must not
>> use any control, SPACE, or isolated surrogate), and the codes used to
>> encode each weight must not start with a byte lower than or equal to this
>> mark. The binary or ASCII code units used to encode each weight just need
>> to be comparable, so that comparing codes is equivalent to comparing the
>> weights they represent.
>>
>> As well, you are not required to store multiple "marks".
>> This is just one of the possibilities for encoding in the sort key which
>> level is encoded after each "mark", and the marks are not necessarily the
>> same before each level (their length may also vary depending on the level
>> they start): these marks may even be removed completely from the final
>> encoding if the encoding/compression used allows discriminating the level
>> of every weight, with each level encoded in a separate set of values.
>>
>> Typical compression techniques are, for example, differential coding,
>> notably at the secondary and higher levels, and run-length coding to skip
>> sequences of weights that are all equal to the minimum weight.
>>
>> The code units used by the weight encoding for each level may also need
>> to avoid some forbidden values if needed (e.g. when encoding the weights
>> to UTF-8 or UTF-16, or BOCU-1, or SCSU, you cannot use code units
>> reserved for or representing an isolated surrogate in U+D800..U+DFFF, as
>> this would create a string not conforming to any standard UTF).
>>
>> Once again, this means that the sequence of logical weights can safely
>> become a readable string, even suitable for transmission as plain text
>> using any UTF, and compression is also possible in that case: you can
>> create and store a lot of sort keys even for very long texts.
>>
>> However, it is generally better to encode sort keys only for a reasonably
>> discriminant part of the text, e.g. no sort key longer than 255 bytes
>> (created from the start of the original texts): if you compare two sort
>> keys and find that they are equal, and both sort keys have this maximum
>> length of 255 bytes, then you compare the full original texts using the
>> fast-compare algorithm. You don't need to store full sort keys in
>> addition to the original texts. This can save a lot of storage, provided
>> that the original texts are sufficiently discriminated by their start,
>> and that the cases where sort keys were truncated at the limit of 255
>> bytes are exceptional.
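One possible byte encoding along these lines, as a sketch only: the MARK value, the one-byte weight encoding, and the assumption that every weight fits in 0x00..0xFD are illustrative choices, not ICU's actual format.

```python
MARK = 0x01  # single reserved low byte marking the start of the next level

def sort_key(levels):
    """Concatenate per-level weight arrays into one byte string.
    Every encoded weight byte is strictly above MARK, so a plain bytewise
    comparison of two keys orders the levels correctly, and a string whose
    weights are a prefix of another's sorts first without any 0000 weight."""
    out = bytearray()
    for i, weights in enumerate(levels):
        if i:
            out.append(MARK)
        out += bytes(0x02 + w for w in weights)   # assumes w <= 0xFD
    return bytes(out)
```

For example, a key built from a prefix's weight arrays compares lower than the key of any extension of it, reproducing the "abc" < "abcX" guarantee with no level separator equal to zero.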
>>
>> For short texts, however, truncated sort keys may save time at the price
>> of a reasonable storage cost (sort keys can also be encoded with roughly
>> the same size as the original text: compression is modest for the encoded
>> primary level, but frequently very effective for higher levels, where the
>> smaller weights also have fewer possible values, drawn from a smaller
>> set).
>>
>> Notably, for the secondary level used to encode case differences, only 3
>> bits are enough per weight: you just need to reserve the 3-bit value
>> "000" as the "mark" indicating the start of the next higher level, while
>> encoding secondary weights as "001" to "111".
>>
>> (This means that the primary level has to be encoded so that none of its
>> encoded primary weights starts with "000", which marks the start of the
>> secondary level. So primary weights can be encoded with patterns starting
>> with "0001", "001", "01", or "1", followed by other bits: this allows
>> encoding them as readable UTF-8 if the characters are all distinct at the
>> primary level, excluding only the first 16 C0 controls, which need to be
>> preprocessed into escape sequences using the first permitted C0 control
>> as an escape, and escaping that C0 control itself.)
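A sketch of this 3-bit packing (assuming secondary weights have already been remapped into 1..7; padding the final byte with zero bits is a simplification here, since a real implementation would continue packing the next level's codes into the same bit stream):

```python
def pack_secondary(weights):
    """Pack secondary weights (assumed remapped to 1..7) into 3-bit
    fields; the 3-bit pattern 000 is reserved as the next-level mark."""
    bits = ''.join(format(w, '03b') for w in weights)
    bits += '000'                      # mark: start of the next level
    bits += '0' * (-len(bits) % 8)     # simplification: pad to whole bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```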
>>
>> The third level, started by the mark "00" followed by encoded bits
>> indicating that this is the tertiary level and not a higher one, will
>> also be used to encode a small set of weights (in most locales, not more
>> than 8 or 16, so you need only 3 or 4 bits per weight). Using
>> differential coding on 3 bits, you reserve "000" as the "mark" for the
>> next higher level, then use "001" to "111" to encode differential
>> weights, the differences being initially based on the minimum tertiary
>> weight: you use the bit pattern "001" to encode the most frequent minimum
>> tertiary weight, and patterns "01" to "11" plus additional bits to encode
>> other positive or negative differences of tertiary weights, or to use
>> run-length compression. Here also it is possible to map the patterns so
>> that the encoded tertiary weights are readable, valid UTF-8.
>>
>> The fourth level, started by the mark "000", can use the pattern "001" to
>> encode the most frequent minimum quaternary weight, and patterns "010" to
>> "011" followed by other bits to differentially encode the quaternary
>> weights. Here again it is possible to create an encoding for quaternary
>> weights that can use some run-length compression and can also be
>> readable, valid UTF-8!
>>
>> And so on.
>>
>> On Thu, Nov 1, 2018 at 10:04 PM Philippe Verdy <verd...@wanadoo.fr>
>> wrote:
>>
>>> So it should be clear in the UCA algorithm and in the DUCET data table
>>> that "0000" is NOT a valid weight.
>>> It is just a notational placeholder used as ".0000", indicating in the
>>> DUCET format that there is NO weight assigned at the indicated level,
>>> because the collation element is ALWAYS ignorable at this level.
>>> The DUCET could just as well have used the notation ".none", or dropped
>>> every ".0000" from its file (provided it contains a data entry
>>> specifying the minimum weight used for each level).
>>> This notation is only intended to be read by humans editing the file, so
>>> that they don't need to wonder what the level of the first indicated
>>> weight is, or to remember the minimum weight for that level.
>>> But the DUCET table is actually generated by a machine and processed by
>>> machines.
>>>
>>> On Thu, Nov 1, 2018 at 9:57 PM Philippe Verdy <verd...@wanadoo.fr>
>>> wrote:
>>>
>>>> In summary, this step given in the algorithm is completely unneeded
>>>> and can be dropped completely:
>>>>
>>>> *S3.2 <http://unicode.org/reports/tr10/#S3.2>* If L is not 1, append a
>>>> *level separator*
>>>>
>>>> *Note:* The level separator is zero (0000), which is guaranteed to be
>>>> lower than any weight in the resulting sort key. This guarantees that when
>>>> two strings of unequal length are compared, where the shorter string is a
>>>> prefix of the longer string, the longer string is always sorted after the
>>>> shorter—in the absence of special features like contractions. For example:
>>>> "abc" < "abcX" where "X" can be any character(s).
>>>>
>>>> Remove any reference to the "level separator" from the UCA. You never
>>>> need it.
>>>>
>>>> The same goes for this paragraph:
>>>>
>>>> 7.3 Form Sort Keys <http://unicode.org/reports/tr10/#Step_3>
>>>>
>>>> *Step 3.* Construct a sort key for each collation element array by
>>>> successively appending all non-zero weights from the collation element
>>>> array. Figure 2 gives an example of the application of this step to one
>>>> collation element array.
>>>>
>>>> Figure 2. Collation Element Array to Sort Key
>>>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>>>>
>>>> Collation Element Array:
>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
>>>> Sort Key:
>>>> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
>>>>
>>>> It can instead be written with this figure:
>>>>
>>>> Figure 2.
Collation Element Array to Sort Key
>>>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>>>>
>>>> Collation Element Array:
>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
>>>> Sort Key:
>>>> 0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>>>>
>>>> The parentheses mark the collation weights 0020 and 0002 that can be
>>>> safely removed if they are respectively the minimum secondary weight
>>>> and the minimum tertiary weight.
>>>> But note that 0020 is kept in two places, because those occurrences
>>>> are followed by a higher weight, 0021. This holds for any tailored
>>>> collation (not just the DUCET).
>>>>
>>>> On Thu, Nov 1, 2018 at 9:42 PM Philippe Verdy <verd...@wanadoo.fr>
>>>> wrote:
>>>>
>>>>> The 0000 is there in the UCA only because the DUCET is published in a
>>>>> format that uses it, but here also this format is useless: you never
>>>>> need any [.0000] or [.0000.0000] in the DUCET table either. Instead,
>>>>> the DUCET just needs to indicate the minimum weight assigned for
>>>>> every level (except the highest level, where it is "implicitly" 0001,
>>>>> not 0000).
>>>>>
>>>>> On Thu, Nov 1, 2018 at 9:08 PM Markus Scherer <markus....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> There are lots of ways to implement the UCA.
>>>>>>
>>>>>> When you want fast string comparison, the zero weights are useful
>>>>>> for processing -- and you don't actually assemble a sort key.
>>>>>>
>>>>>> People who want sort keys usually want them to be short, so you
>>>>>> spend time on compression. You probably also build sort keys as byte
>>>>>> vectors, not uint16 vectors (because byte vectors fit into more APIs
>>>>>> and tend to be shorter), like ICU does using the CLDR collation data
>>>>>> file. The CLDR root collation data file remunges all weights into
>>>>>> fractional byte sequences, and leaves gaps for tailoring.
>>>>>>
>>>>>> markus
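The trailing-minimum trimming marked by the parentheses in the rewritten Figure 2 earlier in the thread could be sketched as follows (the minimum values 0020 and 0002 are the DUCET's; the function itself is a hypothetical helper):

```python
def trim_level(weights, minimum):
    """Drop the trailing run of minimum-valued weights from one level of
    a sort key. Any minimum weight followed by a higher weight at the
    same level (like the 0020s before 0021 in Figure 2) is kept."""
    end = len(weights)
    while end and weights[end - 1] == minimum:
        end -= 1
    return weights[:end]
```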