On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
I was replying not about the notational representation of the DUCET data table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. Which remains highly confusive, and contains completely unnecessary steps, and just complicates things with absolutely no benefit at all by introducing confusion about these "0000".

Sorry, Philippe, but the confusion that I am seeing introduced is what you are introducing to the unicode list in the course of this discussion.


UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight", it's a notation only

No, it is explicitly a valid weight. And it is explicitly and normatively referred to in the specification of the algorithm. See UTS10-D8 (and subsequent definitions), which explicitly depend on a definition of "A collation weight whose value is zero." The entire statement of what are primary, secondary, tertiary, etc. collation elements depends on that definition. And see the tables in Section 3.2, which also depend on those definitions.
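
To illustrate how those definitions hinge on a zero weight, here is a minimal sketch in Python. It assumes three-level collation elements represented as plain tuples; the function name `classify` and the sample weights are hypothetical and are not taken from the DUCET or from UTS #10.

```python
# Sketch only: classify a collation element by its first non-zero weight.
# A weight of zero is an "ignorable weight"; the element's class is the
# first level at which its weight is not ignorable.

LEVEL_NAMES = ["primary", "secondary", "tertiary"]

def classify(ce):
    """ce is a tuple of weights: (level 1, level 2, level 3)."""
    for level, weight in enumerate(ce):
        if weight != 0:
            return LEVEL_NAMES[level] + " collation element"
    return "completely ignorable collation element"

# Hypothetical weights, for illustration only.
print(classify((0x1C47, 0x0020, 0x0002)))  # level 1 weight is non-zero
print(classify((0x0000, 0x0021, 0x0002)))  # level 1 weight is zero
print(classify((0x0000, 0x0000, 0x0000)))  # every weight is zero
```

Without a defined zero weight, none of these classes could be stated this way.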


(but the notation is used for TWO distinct purposes: one is for presenting the notation format used in the DUCET

It is *not* just a notation format used in the DUCET -- it is part of the normative definitional structure of the algorithm, which then percolates down into further definitions and rules and the steps of the algorithm.

itself to present how collation elements are structured, the other one is for marking the presence of a possible, but not always required, encoding of an explicit level separator for encoding sort keys).

That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It is not part of the *notation* for collation elements, but instead is a magic value chosen for the level separator precisely because zero values from the collation elements are removed during sort key construction, so that zero is then guaranteed to be a lower value than any remaining weight added to the sort key under construction. This part of the algorithm is not rocket science, by the way!
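
A rough sketch of that sort key construction, in Python, may make the point concrete. It assumes collation elements are tuples of three integer weights; `form_sort_key` and `LEVEL_SEPARATOR` are names invented for this illustration, and the weights are made up.

```python
# Sketch only: build a sort key level by level, dropping zero weights and
# inserting a zero level separator between levels.

LEVEL_SEPARATOR = 0x0000

def form_sort_key(collation_elements, levels=3):
    key = []
    for level in range(levels):
        if level > 0:
            # Separator goes between levels, not before the first one.
            key.append(LEVEL_SEPARATOR)
        for ce in collation_elements:
            weight = ce[level]
            if weight != 0:
                # Zero weights never reach the key, so the separator
                # is lower than every weight that does.
                key.append(weight)
    return key

# Hypothetical weights: a primary element followed by a secondary element.
ces = [(0x0100, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002)]
print([f"{w:04X}" for w in form_sort_key(ces)])
```

Because every zero from the collation elements is skipped, the only zeros left in the key are the separators themselves.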

UTR#10 is still needlessly confusive.

O.k., if you think so, you then know what to do:

https://www.unicode.org/review/pri385/

and

https://www.unicode.org/reporting.html

Even the example tables can be made without using these "0000" (for example, in tables showing how to build sort keys, it can present the list of weights split into separate columns, one column per level, without any "0000"). The implementation does not necessarily have to create a buffer containing all weight values in a row, when separate buffers for each level are far superior (and even more efficient, as they can save space in memory).

The UCA doesn't *require* you to do anything particular in your own implementation, other than come up with the same results for string comparisons. That is clearly stated in the conformance clause of UTS #10.
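
As a hedged illustration of "same results", the Python sketch below compares two representations of a key: one buffer per level with no separators, and a single flat key with zero separators. It assumes uncompressed keys and tuples of three integer weights; the function names and weights are hypothetical.

```python
# Sketch only: two key layouts that order the same strings the same way,
# provided no other key compression is applied.

def level_buffers(ces, levels=3):
    """One list of non-zero weights per level, no separators."""
    return [[ce[lvl] for ce in ces if ce[lvl] != 0] for lvl in range(levels)]

def flat_key(ces, levels=3):
    """A single list, with a zero separator between levels."""
    key = []
    for i, buf in enumerate(level_buffers(ces, levels)):
        if i > 0:
            key.append(0)
        key.extend(buf)
    return key

# Hypothetical collation element sequences for three strings.
a = [(0x0100, 0x0020, 0x0002)]
b = [(0x0100, 0x0020, 0x0002), (0x0101, 0x0020, 0x0002)]
c = [(0x0100, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002)]

for x, y in [(a, b), (a, c), (c, b)]:
    # Python compares lists lexicographically, level by level or item by item.
    assert (level_buffers(x) < level_buffers(y)) == (flat_key(x) < flat_key(y))
```

The two layouts agree here only because the keys are stored uncompressed; that is an assumption of the sketch, not a general guarantee.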

https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance

The step "S3.2" in the UCA algorithm should not even be there (it is made in favor of a specific implementation which is not even efficient or optimal),

That is a false statement. Step S3.2 is there to provide a clear statement of the algorithm, to guarantee correct results for string comparison. Section 9 of UTS #10 provides a whole lunch buffet of techniques that implementations can choose from to increase the efficiency of their implementations, as they deem appropriate. You are free to implement as you choose -- including techniques that do not require any level separators. You are, however, duly warned in:

https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators

that "While this technique is relatively easy to implement, it can interfere with other compression methods."

it complicates the algorithm with absolutely no benefit at all); you can ALWAYS remove it completely and this still generates equivalent results.

No you cannot ALWAYS remove it completely. Whether or not your implementation can do so, depends on what other techniques you may be using to increase performance, store shorter keys, or whatever else may be at stake in your optimization.

If you don't like zeroes in collation, be my guest, and ignore them completely. Take them out of your tables, and don't use level separators. Just make sure you end up with conformant results for comparison of strings when you are done. And in the meantime, if you want to complain about the text of the specification of UTS #10, then provide carefully worded alternatives as suggestions for improvement to the text, rather than just endlessly ranting about how the standard is confusive because the collation weight 0000 is "unnecessary".

--Ken

