Re: UCA unnecessary collation weight 0000

Ken Whistler via Unicode Fri, 02 Nov 2018 14:30:02 -0700


On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:

I was replying not about the notational representation of the DUCET data table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. Which remains highly confusive, and contains completely unnecessary steps, and just complicates things with absolutely no benefit at all by introducing confusion about these "0000".

Sorry, Philippe, but the confusion that I am seeing introduced is what you are introducing to the unicode list in the course of this discussion.

UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight", it's a notation only

No, it is explicitly a valid weight. And it is explicitly and normatively referred to in the specification of the algorithm. See UTS10-D8 (and subsequent definitions), which explicitly depend on a definition of "A collation weight whose value is zero." The entire statement of what are primary, secondary, tertiary, etc. collation elements depends on that definition. And see the tables in Section 3.2, which also depend on those definitions.
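To make the dependency on zero weights concrete, here is a minimal sketch of how a three-level collation element could be classified according to which of its weights are zero. It assumes the classification summarized in the Section 3.2 tables; the function name classify and the sample weights are invented for illustration and do not come from UTS #10 or the DUCET.

```python
from typing import Tuple

# A three-level collation element: (primary, secondary, tertiary) weights.
CollationElement = Tuple[int, int, int]


def classify(ce: CollationElement) -> str:
    """Classify a collation element by its first non-zero weight.

    Hypothetical helper: the category names follow the usual UCA
    terminology, but this function is not part of any specification.
    """
    primary, secondary, tertiary = ce
    if primary != 0:
        return "primary"
    if secondary != 0:
        return "secondary"
    if tertiary != 0:
        return "tertiary"
    return "completely ignorable"


# Sample weights are illustrative only.
print(classify((0x1C47, 0x0020, 0x0002)))  # primary
print(classify((0x0000, 0x0024, 0x0002)))  # secondary
print(classify((0x0000, 0x0000, 0x0002)))  # tertiary
print(classify((0x0000, 0x0000, 0x0000)))  # completely ignorable
```

Without a well-defined zero weight, none of these categories could be stated.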

(but the notation is used for TWO distinct purposes: one is for presenting the notation format used in the DUCET

It is *not* just a notation format used in the DUCET -- it is part of the normative definitional structure of the algorithm, which then percolates down into further definitions and rules and the steps of the algorithm.

itself to present how collation elements are structured, the other one is for marking the presence of a possible, but not always required, encoding of an explicit level separator for encoding sort keys).

That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It is not part of the *notation* for collation elements, but instead is a magic value chosen for the level separator precisely because zero values from the collation elements are removed during sort key construction, so that zero is then guaranteed to be a lower value than any remaining weight added to the sort key under construction. This part of the algorithm is not rocket science, by the way!
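A rough sketch of that sort key construction, assuming three levels and no backward levels or other tailoring. The name form_sort_key and the sample weights are hypothetical; this paraphrases the idea behind Section 7.3 and is not the normative text.

```python
from typing import List, Sequence, Tuple

# A three-level collation element: (primary, secondary, tertiary) weights.
CollationElement = Tuple[int, int, int]

LEVEL_SEPARATOR = 0x0000  # lower than any weight that survives in the key


def form_sort_key(elements: Sequence[CollationElement]) -> List[int]:
    """Build a flat sort key: all primaries, then secondaries, then tertiaries.

    Zero weights are skipped, and a zero is written between levels.
    """
    key: List[int] = []
    for level in range(3):
        if level > 0:
            key.append(LEVEL_SEPARATOR)
        for ce in elements:
            weight = ce[level]
            if weight != 0:  # zero weights never enter the key
                key.append(weight)
    return key


# Illustrative weights only; the middle element has a zero primary weight.
elements = [(0x1C47, 0x0020, 0x0002),
            (0x0000, 0x0024, 0x0002),
            (0x1C60, 0x0020, 0x0002)]
print([f"{w:04X}" for w in form_sort_key(elements)])
```

Because every zero from the collation elements is dropped, the only zeros left in the key are the separators, so a key whose level ends early always compares lower than one that continues with a real weight.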


UTR#10 is still needlessly confusive.


O.k., if you think so, you then know what to do:

https://www.unicode.org/review/pri385/

and

https://www.unicode.org/reporting.html

Even the example tables can be made without using these "0000" (for example in tables showing how to build sort keys, it can present the list of weights split in separate columns, one column per level, without any "0000"). The implementation does not necessarily have to create a buffer containing all weight values in a row, when separate buffers for each level are far superior (and even more efficient as it can save space in memory).

The UCA doesn't *require* you to do anything particular in your own implementation, other than come up with the same results for string comparisons. That is clearly stated in the conformance clause of UTS #10.


https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
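As an illustration of "same results, different implementation", the sketch below compares strings two ways: with one buffer per level and no separators, and with a flat key that uses zero as the level separator. The helper names and weights are hypothetical, and the check only covers this simple three-level case with no key compression.

```python
from itertools import product
from typing import List, Sequence, Tuple

CollationElement = Tuple[int, int, int]


def level_buffers(elements: Sequence[CollationElement]) -> Tuple[List[int], ...]:
    """One list of non-zero weights per level; no separators needed."""
    return tuple([ce[level] for ce in elements if ce[level] != 0]
                 for level in range(3))


def flat_sort_key(elements: Sequence[CollationElement]) -> List[int]:
    """The same weights in a single list, with zero between levels."""
    key: List[int] = []
    for index, buffer in enumerate(level_buffers(elements)):
        if index > 0:
            key.append(0)
        key.extend(buffer)
    return key


# Illustrative collation element sequences, not taken from the DUCET.
samples = [
    [(10, 5, 2)],
    [(10, 5, 2), (11, 5, 2)],
    [(10, 6, 2)],
    [(10, 5, 2), (0, 7, 2)],
]

# Both representations order every pair of samples the same way.
for x, y in product(samples, repeat=2):
    assert (level_buffers(x) < level_buffers(y)) == (flat_sort_key(x) < flat_sort_key(y))
print("orderings agree")
```

Either representation is acceptable as long as the comparison results match.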

The step "S3.2" in the UCA algorithm should not even be there (it is made in favor of a specific implementation which is not even efficient or optimal),

That is a false statement. Step S3.2 is there to provide a clear statement of the algorithm, to guarantee correct results for string comparison. Section 9 of UTS #10 provides a whole lunch buffet of techniques that implementations can choose from to increase the efficiency of their implementations, as they deem appropriate. You are free to implement as you choose -- including techniques that do not require any level separators. You are, however, duly warned in:


https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators

that "While this technique is relatively easy to implement, it can interfere with other compression methods."

it complicates the algorithm with absolutely no benefit at all); you can ALWAYS remove it completely and this still generates equivalent results.

No, you cannot ALWAYS remove it completely. Whether or not your implementation can do so depends on what other techniques you may be using to increase performance, store shorter keys, or whatever else may be at stake in your optimization.

If you don't like zeroes in collation, be my guest, and ignore them completely. Take them out of your tables, and don't use level separators. Just make sure you end up with conformant results for comparison of strings when you are done. And in the meantime, if you want to complain about the text of the specification of UTS #10, then provide carefully worded alternatives as suggestions for improvement to the text, rather than just endlessly ranting about how the standard is confusive because the collation weight 0000 is "unnecessary".


--Ken
