Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Philippe Verdy via Unicode
So you finally admit that I was right... and that the specs include
requirements that are not even needed to make UCA work, and that are not
even used by well-known implementations. These are old artefacts which are
now really confusing (instructing programmers to adopt the old deprecated
behavior, before realizing that this was bad advice which just complicated
their task). UCA can be implemented **conformingly** without these, even
for the simplest implementations (where using complex packages like ICU is
not an option, and rewriting one is not either for much simpler goals)
where these incorrect requirements in fact suggest being more inefficient
than really needed.
There's not a lot of work to edit and fix the specs without these
polluting "pseudo-weights".

On Sun, Nov 4, 2018 at 09:27, Mark Davis ☕️ wrote:

> Philippe, I agree that we could have structured the UCA differently. It
> does make sense, for example, to have the weights be simply decimal values
> instead of integers. But nobody is going to go through the substantial
> work of restructuring the UCA spec and data file unless there is a very
> strong reason to do so. It takes far more time and effort than people
> realize to change the algorithm/data while making sure that everything
> lines up without inadvertent changes being introduced.
>
> It is just not worth the effort. There are so, so, many things we can do
> in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher
> benefit.
>
> You can continue flogging this horse all you want, but I'm muting this
> thread (and I suspect I'm not the only one).
>
> Mark
>
>
> On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
>
>> On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:
>>
>>>
>>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>>>
>>> I was replying not about the notational representation of the DUCET data
>>> table (using [.0000....] unnecessarily) but about the text of UTR#10 itself,
>>> which remains highly confusing, and contains completely unnecessary steps,
>>> and just complicates things with absolutely no benefit at all by
>>> introducing confusion about these "0000".
>>>
>>> Sorry, Philippe, but the confusion that I am seeing introduced is what
>>> you are introducing to the unicode list in the course of this discussion.
>>>
>>>
>>> UTR#10 still does not explicitly state that its use of "0000" does not
>>> mean it is a valid "weight"; it is a notation only
>>>
>>> No, it is explicitly a valid weight. And it is explicitly and
>>> normatively referred to in the specification of the algorithm. See UTS10-D8
>>> (and subsequent definitions), which explicitly depend on a definition of "A
>>> collation weight whose value is zero." The entire statement of what are
>>> primary, secondary, tertiary, etc. collation elements depends on that
>>> definition. And see the tables in Section 3.2, which also depend on those
>>> definitions.
>>>
>>> (but the notation is used for TWO distinct purposes: one is for
>>> presenting the notation format used in the DUCET
>>>
>>> It is *not* just a notation format used in the DUCET -- it is part of
>>> the normative definitional structure of the algorithm, which then
>>> percolates down into further definitions and rules and the steps of the
>>> algorithm.
>>>
>>
>> I insist that this is NOT NEEDED at all for the definition; it is
>> absolutely NOT structural. The algorithm still guarantees the SAME result.
>>
>> It is ONLY used to explain the format of the DUCET and the fact that this
>> format does NOT use 0000 as a valid weight, and so can use it as a notation
>> (in fact only a presentational feature).
>>
>>
>>> itself to present how collation elements are structured, the other one
>>> is for marking the presence of a possible, but not always required,
>>> encoding of an explicit level separator for encoding sort keys).
>>>
>>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
>>> is not part of the *notation* for collation elements, but instead is a
>>> magic value chosen for the level separator precisely because zero values
>>> from the collation elements are removed during sort key construction, so
>>> that zero is then guaranteed to be a lower value than any remaining weight
>>> added to the sort key under construction. This part of the algorithm is not
>>> rocket science, by the way!
>>>
>>
>> Here again you are confusing things: a sort key MAY use them as separators
>> if it wants to compress keys by re-encoding weights per level: that's the
>> only case where you may want to introduce an encoding pattern starting with
>> 0, while the rest of the encoding for weights in that level must use
>> patterns not starting with this 0 (the number of bits used to encode this 0
>> does not matter: it is only part of the encoding used on this level, which
>> does not necessarily have to use 16-bit code units per weight).
>>
>>>
>>> Even the example tables can be 

Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Mark Davis ☕️ via Unicode
Philippe, I agree that we could have structured the UCA differently. It
does make sense, for example, to have the weights be simply decimal values
instead of integers. But nobody is going to go through the substantial work
of restructuring the UCA spec and data file unless there is a very strong
reason to do so. It takes far more time and effort than people realize to
change the algorithm/data while making sure that everything lines up
without inadvertent changes being introduced.

It is just not worth the effort. There are so, so, many things we can do in
Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher
benefit.

You can continue flogging this horse all you want, but I'm muting this
thread (and I suspect I'm not the only one).

Mark


On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:
>
>>
>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>>
>> I was replying not about the notational representation of the DUCET data
>> table (using [.0000....] unnecessarily) but about the text of UTR#10 itself,
>> which remains highly confusing, and contains completely unnecessary steps,
>> and just complicates things with absolutely no benefit at all by
>> introducing confusion about these "0000".
>>
>> Sorry, Philippe, but the confusion that I am seeing introduced is what
>> you are introducing to the unicode list in the course of this discussion.
>>
>>
>> UTR#10 still does not explicitly state that its use of "0000" does not
>> mean it is a valid "weight"; it is a notation only
>>
>> No, it is explicitly a valid weight. And it is explicitly and normatively
>> referred to in the specification of the algorithm. See UTS10-D8 (and
>> subsequent definitions), which explicitly depend on a definition of "A
>> collation weight whose value is zero." The entire statement of what are
>> primary, secondary, tertiary, etc. collation elements depends on that
>> definition. And see the tables in Section 3.2, which also depend on those
>> definitions.
>>
>> (but the notation is used for TWO distinct purposes: one is for
>> presenting the notation format used in the DUCET
>>
>> It is *not* just a notation format used in the DUCET -- it is part of the
>> normative definitional structure of the algorithm, which then percolates
>> down into further definitions and rules and the steps of the algorithm.
>>
>
> I insist that this is NOT NEEDED at all for the definition; it is
> absolutely NOT structural. The algorithm still guarantees the SAME result.
>
> It is ONLY used to explain the format of the DUCET and the fact that this
> format does NOT use 0000 as a valid weight, and so can use it as a notation
> (in fact only a presentational feature).
>
>
>> itself to present how collation elements are structured, the other one is
>> for marking the presence of a possible, but not always required, encoding
>> of an explicit level separator for encoding sort keys).
>>
>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
>> is not part of the *notation* for collation elements, but instead is a
>> magic value chosen for the level separator precisely because zero values
>> from the collation elements are removed during sort key construction, so
>> that zero is then guaranteed to be a lower value than any remaining weight
>> added to the sort key under construction. This part of the algorithm is not
>> rocket science, by the way!
>>
>
> Here again you are confusing things: a sort key MAY use them as separators
> if it wants to compress keys by re-encoding weights per level: that's the
> only case where you may want to introduce an encoding pattern starting with
> 0, while the rest of the encoding for weights in that level must use
> patterns not starting with this 0 (the number of bits used to encode this 0
> does not matter: it is only part of the encoding used on this level, which
> does not necessarily have to use 16-bit code units per weight).
>
>>
>> Even the example tables can be made without using these "0000" (for
>> example, in tables showing how to build sort keys, it can present the list
>> of weights split into separate columns, one column per level, without any
>> "0000"). The implementation does not necessarily have to create a buffer
>> containing all weight values in a row, when separate buffers for each level
>> are far superior (and even more efficient, as it can save space in memory).
>>
>> The UCA doesn't *require* you to do anything particular in your own
>> implementation, other than come up with the same results for string
>> comparisons.
>>
> Yes, I know, but the algorithm also does not require me to use these
> invalid 0000 pseudo-weights, which the algorithm itself will always discard
> (in a completely needless step)!
>
>
>> That is clearly stated in the conformance clause of UTS #10.
>>
>> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
>>
>> The step "S3.2" in the UCA 

Re: UCA unnecessary collation weight 0000

2018-11-03 Thread Philippe Verdy via Unicode
On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:

>
> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>
> I was replying not about the notational representation of the DUCET data
> table (using [.0000....] unnecessarily) but about the text of UTR#10 itself,
> which remains highly confusing, and contains completely unnecessary steps,
> and just complicates things with absolutely no benefit at all by
> introducing confusion about these "0000".
>
> Sorry, Philippe, but the confusion that I am seeing introduced is what you
> are introducing to the unicode list in the course of this discussion.
>
>
> UTR#10 still does not explicitly state that its use of "0000" does not
> mean it is a valid "weight"; it is a notation only
>
> No, it is explicitly a valid weight. And it is explicitly and normatively
> referred to in the specification of the algorithm. See UTS10-D8 (and
> subsequent definitions), which explicitly depend on a definition of "A
> collation weight whose value is zero." The entire statement of what are
> primary, secondary, tertiary, etc. collation elements depends on that
> definition. And see the tables in Section 3.2, which also depend on those
> definitions.
>
> (but the notation is used for TWO distinct purposes: one is for presenting
> the notation format used in the DUCET
>
> It is *not* just a notation format used in the DUCET -- it is part of the
> normative definitional structure of the algorithm, which then percolates
> down into further definitions and rules and the steps of the algorithm.
>

I insist that this is NOT NEEDED at all for the definition; it is
absolutely NOT structural. The algorithm still guarantees the SAME result.

It is ONLY used to explain the format of the DUCET and the fact that this
format does NOT use 0000 as a valid weight, and so can use it as a notation
(in fact only a presentational feature).


> itself to present how collation elements are structured, the other one is
> for marking the presence of a possible, but not always required, encoding
> of an explicit level separator for encoding sort keys).
>
> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
> is not part of the *notation* for collation elements, but instead is a
> magic value chosen for the level separator precisely because zero values
> from the collation elements are removed during sort key construction, so
> that zero is then guaranteed to be a lower value than any remaining weight
> added to the sort key under construction. This part of the algorithm is not
> rocket science, by the way!
>

Here again you are confusing things: a sort key MAY use them as separators
if it wants to compress keys by re-encoding weights per level: that's the
only case where you may want to introduce an encoding pattern starting with
0, while the rest of the encoding for weights in that level must use
patterns not starting with this 0 (the number of bits used to encode this 0
does not matter: it is only part of the encoding used on this level, which
does not necessarily have to use 16-bit code units per weight).

>
> Even the example tables can be made without using these "0000" (for
> example, in tables showing how to build sort keys, it can present the list
> of weights split into separate columns, one column per level, without any
> "0000"). The implementation does not necessarily have to create a buffer
> containing all weight values in a row, when separate buffers for each level
> are far superior (and even more efficient, as it can save space in memory).
>
> The UCA doesn't *require* you to do anything particular in your own
> implementation, other than come up with the same results for string
> comparisons.
>
Yes, I know, but the algorithm also does not require me to use these invalid
0000 pseudo-weights, which the algorithm itself will always discard (in a
completely needless step)!


> That is clearly stated in the conformance clause of UTS #10.
>
> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
>
> The step "S3.2" in the UCA algorithm should not even be there (it is made
> in favor of a specific implementation which is not even efficient or optimal),
>
> That is a false statement. Step S3.2 is there to provide a clear statement
> of the algorithm, to guarantee correct results for string comparison.
>

You're wrong; this statement is completely useless in all cases. You still
get the correct results for string comparison without it: a string
comparison can only compare valid weights for each level; it will not
compare any weight past the end of the text in either of the two compared
strings, and nowhere will it compare weights with one of them being 0, unless
this 0 is used as a "guard value" for the end of text and your compare loop
still continues scanning the longer string when the other string has
already ended (this case should be detected much earlier, before
determining the next collation boundary in the string and then computing
its weights for each level).
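The comparison loop described above can indeed be written without any 0000 guard values; running off the end of the shorter weight list is handled by the indices alone. A minimal sketch over one level's weight list, using hypothetical weights:

```python
# Compare one level's weight lists without guard values: the loop stops at
# the shorter list, and a strict prefix sorts first -- no 0000 needed.
def compare_level(wx, wy):
    i = 0
    while i < len(wx) and i < len(wy):
        if wx[i] != wy[i]:
            return -1 if wx[i] < wy[i] else 1
        i += 1
    # One list exhausted: the shorter (prefix) list compares lower.
    return (len(wx) > len(wy)) - (len(wx) < len(wy))

assert compare_level([0x05], [0x05, 0x8A]) == -1  # prefix sorts first
assert compare_level([0x29], [0x29]) == 0
```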

> Section 

Re: UCA unnecessary collation weight 0000

2018-11-03 Thread Philippe Verdy via Unicode
On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:

>
> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>
> I was replying not about the notational representation of the DUCET data
> table (using [.0000....] unnecessarily) but about the text of UTR#10 itself,
> which remains highly confusing, and contains completely unnecessary steps,
> and just complicates things with absolutely no benefit at all by
> introducing confusion about these "0000".
>
> Sorry, Philippe, but the confusion that I am seeing introduced is what you
> are introducing to the unicode list in the course of this discussion.
>
>
> UTR#10 still does not explicitly state that its use of "0000" does not
> mean it is a valid "weight"; it is a notation only
>
> No, it is explicitly a valid weight. And it is explicitly and normatively
> referred to in the specification of the algorithm. See UTS10-D8 (and
> subsequent definitions), which explicitly depend on a definition of "A
> collation weight whose value is zero." The entire statement of what are
> primary, secondary, tertiary, etc. collation elements depends on that
> definition. And see the tables in Section 3.2, which also depend on those
> definitions.
>
OK, it is a valid "weight" when taken *in isolation*, but it is invalid as a
weight at any level.
This does not change the fact, because weights are always relative to a
specific level for which they are defined, and 0000 does not belong to any
one. This weight is completely artificial and introduced completely
needlessly: all levels are completely defined by a closed range of weights,
all of them being non-0000, and all ranges being numerically separated
(with the primary level using the largest range).

I can reread again and again (even the sections you cite), but there's
absolutely NO need of this artificial "0000" anywhere (any clause
introducing it or using it to define something can be safely removed).


Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Richard Wordingham via Unicode
On Fri, 2 Nov 2018 14:27:37 -0700
Ken Whistler via Unicode  wrote:

> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:

> > UTR#10 still does not explicitly state that its use of "0000" does
> > not mean it is a valid "weight"; it is a notation only
> 
> No, it is explicitly a valid weight. And it is explicitly and 
> normatively referred to in the specification of the algorithm. See 
> UTS10-D8 (and subsequent definitions), which explicitly depend on a 
> definition of "A collation weight whose value is zero." The entire 
> statement of what are primary, secondary, tertiary, etc. collation 
> elements depends on that definition. And see the tables in Section
> 3.2, which also depend on those definitions.

The definition is defective in that it doesn't handle 'large weight
values' well.  There is the anomaly that a mapping of a collating element
to [1234.0000.0000][0200.020.002] may be compatible with WF1, but the
exactly equivalent mapping to [1234.020.002][0200.0000.0000] makes the
table ill-formed.  The fractional weight definitions for the UCA eliminate
this '0000' notion quite well, and I once expected the UCA to move to
the CLDRCA (CLDR Collation Algorithm) fractional weight definition.
The definition of the CLDRCA does a much better job of explaining
'large weight values'.  It turns them from something exceptional into a
normal part of its functioning.

> > (but the notation is used for TWO distinct purposes: one is for 
> > presenting the notation format used in the DUCET  
> 
> It is *not* just a notation format used in the DUCET -- it is part of 
> the normative definitional structure of the algorithm, which then 
> percolates down into further definitions and rules and the steps of
> the algorithm.

It's not needed for the CLDRCA!  The statement of the UCA algorithm
does depend on its notation, but it can be recast to avoid these zero
weights.

Richard.


Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Ken Whistler via Unicode


On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
I was replying not about the notational representation of the DUCET
data table (using [.0000....] unnecessarily) but about the text of
UTR#10 itself, which remains highly confusing, and contains completely
unnecessary steps, and just complicates things with absolutely no
benefit at all by introducing confusion about these "0000".


Sorry, Philippe, but the confusion that I am seeing introduced is what 
you are introducing to the unicode list in the course of this discussion.



UTR#10 still does not explicitly state that its use of "0000" does not
mean it is a valid "weight"; it is a notation only


No, it is explicitly a valid weight. And it is explicitly and 
normatively referred to in the specification of the algorithm. See 
UTS10-D8 (and subsequent definitions), which explicitly depend on a 
definition of "A collation weight whose value is zero." The entire 
statement of what are primary, secondary, tertiary, etc. collation 
elements depends on that definition. And see the tables in Section 3.2, 
which also depend on those definitions.



(but the notation is used for TWO distinct purposes: one is for 
presenting the notation format used in the DUCET


It is *not* just a notation format used in the DUCET -- it is part of 
the normative definitional structure of the algorithm, which then 
percolates down into further definitions and rules and the steps of the 
algorithm.


itself to present how collation elements are structured, the other one 
is for marking the presence of a possible, but not always required, 
encoding of an explicit level separator for encoding sort keys).
That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It 
is not part of the *notation* for collation elements, but instead is a 
magic value chosen for the level separator precisely because zero values 
from the collation elements are removed during sort key construction, so 
that zero is then guaranteed to be a lower value than any remaining 
weight added to the sort key under construction. This part of the 
algorithm is not rocket science, by the way!
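The sort key formation Ken describes (UTS #10 Section 7.3) can be sketched in a few lines. This is only an illustration: the weights are made-up small integers rather than real DUCET values, and variable weighting is ignored.

```python
# Simplified sketch of UTS #10 Section 7.3 (Form Sort Keys): zero weights
# from the collation elements are dropped, and a single 0 is appended as
# the level separator, so it compares lower than any remaining weight.
def form_sort_key(collation_elements, levels=3):
    key = []
    for level in range(levels):
        if level > 0:
            key.append(0)  # level separator (the "magic" zero value)
        for ce in collation_elements:
            if ce[level] != 0:  # 0000 weights never reach the sort key
                key.append(ce[level])
    return key

# 'a' followed by a combining accent whose primary weight is zero
# (hypothetical weights, for illustration only):
ces = [(0x29, 0x05, 0x05), (0x0000, 0x8A, 0x05)]
print(form_sort_key(ces))  # [41, 0, 5, 138, 0, 5, 5]
```

Because every real weight in the key is non-zero, the separator is guaranteed to sort before any weight that follows it in a longer key.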


UTR#10 is still needlessly confusing.


O.k., if you think so, you then know what to do:

https://www.unicode.org/review/pri385/

and

https://www.unicode.org/reporting.html

Even the example tables can be made without using these "0000" (for
example, in tables showing how to build sort keys, it can present the
list of weights split into separate columns, one column per level,
without any "0000"). The implementation does not necessarily have to
create a buffer containing all weight values in a row, when separate
buffers for each level are far superior (and even more efficient, as it
can save space in memory).
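The per-level presentation Philippe describes (one column of weights per level, no 0000 anywhere) can be sketched as follows. The weights are hypothetical; the point is only that comparing the levels in order reproduces the effect of the level separator.

```python
# One weight list per level, never materializing any 0000. Comparing the
# resulting tuples lexicographically compares all of level 1 (including by
# length) before level 2 is consulted -- the same effect the level
# separator achieves in a concatenated sort key.
def per_level_key(collation_elements, levels=3):
    return tuple(
        [ce[level] for ce in collation_elements if ce[level] != 0]
        for level in range(levels)
    )

a = [(0x29, 0x05, 0x05)]                          # 'a' (hypothetical weights)
a_accent = [(0x29, 0x05, 0x05), (0, 0x8A, 0x05)]  # 'a' + combining accent
assert per_level_key(a) < per_level_key(a_accent)  # differs only at level 2
```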


The UCA doesn't *require* you to do anything particular in your own 
implementation, other than come up with the same results for string 
comparisons. That is clearly stated in the conformance clause of UTS #10.


https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance

The step "S3.2" in the UCA algorithm should not even be there (it is
made in favor of a specific implementation which is not even efficient
or optimal),


That is a false statement. Step S3.2 is there to provide a clear 
statement of the algorithm, to guarantee correct results for string 
comparison. Section 9 of UTS #10 provides a whole lunch buffet of 
techniques that implementations can choose from to increase the 
efficiency of their implementations, as they deem appropriate. You are 
free to implement as you choose -- including techniques that do not 
require any level separators. You are, however, duly warned in:


https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators

that "While this technique is relatively easy to implement, it can 
interfere with other compression methods."


it complicates the algorithm with absolutely no benefit at all; you
can ALWAYS remove it completely and this still generates equivalent
results.


No you cannot ALWAYS remove it completely. Whether or not your 
implementation can do so, depends on what other techniques you may be 
using to increase performance, store shorter keys, or whatever else may 
be at stake in your optimization.
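Ken's caveat can be made concrete with a toy example. Dropping the separators is only safe when the per-level weight ranges cannot be confused; with hypothetical overlapping ranges, plain concatenation flips an order that the separator gets right:

```python
# Two-level toy sort keys, with and without the zero level separator.
def key_with_sep(primaries, secondaries):
    return primaries + [0] + secondaries

def key_without_sep(primaries, secondaries):
    return primaries + secondaries

# Hypothetical weights whose primary and secondary ranges overlap:
a = ([2], [5])     # one primary weight, one secondary weight
b = ([2, 3], [1])  # two primary weights, one secondary weight

# a < b at the primary level ([2] is a strict prefix of [2, 3]).
assert key_with_sep(*a) < key_with_sep(*b)            # [2,0,5] < [2,3,0]: correct
assert not key_without_sep(*a) < key_without_sep(*b)  # [2,5] > [2,3,1]: order flips
```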


If you don't like zeroes in collation, be my guest, and ignore them 
completely. Take them out of your tables, and don't use level 
separators. Just make sure you end up with conformant result for 
comparison of strings when you are done. And in the meantime, if you 
want to complain about the text of the specification of UTS #10, then 
provide carefully worded alternatives as suggestions for improvement to 
the text, rather than just endlessly ranting about how the standard is 
confusing because the collation weight 0000 is "unnecessary".


--Ken




Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Philippe Verdy via Unicode
I was replying not about the notational representation of the DUCET data
table (using [.0000....] unnecessarily) but about the text of UTR#10 itself,
which remains highly confusing, and contains completely unnecessary steps,
and just complicates things with absolutely no benefit at all by
introducing confusion about these "0000". UTR#10 still does not explicitly
state that its use of "0000" does not mean it is a valid "weight"; it is a
notation only (but the notation is used for TWO distinct purposes: one is
for presenting the notation format used in the DUCET itself to present how
collation elements are structured, the other one is for marking the
presence of a possible, but not always required, encoding of an explicit
level separator for encoding sort keys).

UTR#10 is still needlessly confusing. Even the example tables can be made
without using these "0000" (for example, in tables showing how to build
sort keys, it can present the list of weights split into separate columns,
one column per level, without any "0000"). The implementation does not
necessarily have to create a buffer containing all weight values in a row,
when separate buffers for each level are far superior (and even more
efficient, as it can save space in memory). The step "S3.2" in the UCA
algorithm should not even be there (it is made in favor of a specific
implementation which is not even efficient or optimal); it complicates the
algorithm with absolutely no benefit at all; you can ALWAYS remove it
completely and this still generates equivalent results.


On Fri, Nov 2, 2018 at 15:23, Mark Davis ☕️ wrote:

> The table is the way it is because it is easier to process (and
> comprehend) when the first field is always the primary weight, second is
> always the secondary, etc.
>
> Go ahead and transform the input DUCET files as you see fit. The "should
> be removed" is your personal preference. Unless we hear strong demand
> otherwise from major implementers, people have better things to do than
> change their parsers to suit your preference.
>
> Mark
>
>
> On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy  wrote:
>
>> It's not just a question of "I like it or not". But the fact is that the
>> standard makes the presence of 0000 required in some steps, and the
>> requirement is in fact wrong: this is in fact NEVER required to create an
>> equivalent collation order. These steps are completely unnecessary and
>> should be removed.
>>
>> On Fri, Nov 2, 2018 at 14:03, Mark Davis ☕️ wrote:
>>
>>> You may not like the format of the data, but you are not bound to it. If
>>> you don't like the data format (eg you want [.0021.0002] instead of
>>> [..0021.0002]), you can transform it however you want as long as you
>>> get the same answer, as it says here:
>>>
>>> http://unicode.org/reports/tr10/#Conformance
>>> “The Unicode Collation Algorithm is a logical specification.
>>> Implementations are free to change any part of the algorithm as long as any
>>> two strings compared by the implementation are ordered the same as they
>>> would be by the algorithm as specified. Implementations may also use a
>>> different format for the data in the Default Unicode Collation Element
>>> Table. The sort key is a logical intermediate object: if an implementation
>>> produces the same results in comparison of strings, the sort keys can
>>> differ in format from what is specified in this document. (See Section 9,
>>> Implementation Notes.)”
>>>
>>>
>>> That is what is done, for example, in ICU's implementation. See
>>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
>>> collation elements" and "sort keys" to see the transformed collation
>>> elements (from the DUCET + CLDR) and the resulting sort keys.
>>>
>>> a => [29,05,_05] => 29 , 05 , 05 .
>>> a\u0300 => [29,05,_05][0000,8A,_05] => 29 , 45 8A , 06 .
>>> à => 
>>> A\u0300 => [29,05,u1C][0000,8A,_05] => 29 , 45 8A , DC 05 .
>>> À => 
>>>
>>> Mark
>>>
>>>
>>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
>>> unicode@unicode.org> wrote:
>>>
 As well the step 2 of the algorithm speaks about a single "array" of
 collation elements. Actually it's best to create one separate array per
 level, and append weights for each level in the relevant array for that
 level.
 The steps S2.2 to S2.4 can do this, including for derived collation
 elements in section 10.1, or variable weighting in section 4.

 This also means that for fast string compares, the primary weights can
 be processed on the fly (without needing any buffering) if the primary
 weights are different between the two strings (including when one or both
 of the two strings end, and the secondary or tertiary weights seen until
 then do not include any weight higher than the minimum weight value for
 each level).
 Otherwise:
 - the first secondary weight higher than the minimum secondary weight
 value, and all subsequent secondary weights, must be buffered in a
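The incremental scheme sketched above (decide on the primary weights on the fly, and fall back to lower-level weights only on a primary tie) might look like this. The collation elements are hypothetical triples, and the streaming aspect is only hinted at by the order of the checks:

```python
# Compare two strings given their collation-element triples: primaries are
# compared first and usually decide the answer; secondary and tertiary
# weights are consulted (conceptually, from their buffers) only on a tie.
def incremental_compare(ces_x, ces_y):
    prim_x = [ce[0] for ce in ces_x if ce[0] != 0]
    prim_y = [ce[0] for ce in ces_y if ce[0] != 0]
    if prim_x != prim_y:  # decided at the primary level, no buffering needed
        return -1 if prim_x < prim_y else 1
    for level in (1, 2):  # primaries tied: compare the buffered levels
        lx = [ce[level] for ce in ces_x if ce[level] != 0]
        ly = [ce[level] for ce in ces_y if ce[level] != 0]
        if lx != ly:
            return -1 if lx < ly else 1
    return 0

# A primary difference decides immediately; an accent only matters on a tie.
assert incremental_compare([(0x29, 5, 5)], [(0x2A, 5, 5)]) == -1
```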

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Richard Wordingham via Unicode
On Fri, 2 Nov 2018 14:54:19 +0100
Philippe Verdy via Unicode  wrote:

> It's not just a question of "I like it or not". But the fact is that the
> standard makes the presence of 0000 required in some steps, and the
> requirement is in fact wrong: this is in fact NEVER required to
> create an equivalent collation order. These steps are completely
> unnecessary and should be removed.
> 
> On Fri, Nov 2, 2018 at 14:03, Mark Davis ☕️ wrote:
> 
> > You may not like the format of the data, but you are not bound to
> > it. If you don't like the data format (eg you want [.0021.0002]
> > instead of [.0000.0021.0002]), you can transform it however you
> > want as long as you get the same answer, as it says here:
> >
> > http://unicode.org/reports/tr10/#Conformance
> > “The Unicode Collation Algorithm is a logical specification.
> > Implementations are free to change any part of the algorithm as
> > long as any two strings compared by the implementation are ordered
> > the same as they would be by the algorithm as specified.
> > Implementations may also use a different format for the data in the
> > Default Unicode Collation Element Table. The sort key is a logical
> > intermediate object: if an implementation produces the same results
> > in comparison of strings, the sort keys can differ in format from
> > what is specified in this document. (See Section 9, Implementation
> > Notes.)”

Given the above paragraph, how does the standard force you to use a
special 0000?  Perhaps the wording of the standard can be changed to
prevent your unhappy interpretation.

> > That is what is done, for example, in ICU's implementation. See
> > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
> > collation elements" and "sort keys" to see the transformed collation
> > elements (from the DUCET + CLDR) and the resulting sort keys.
> >
> > a => [29,05,_05] => 29 , 05 , 05 .
> > a\u0300 => [29,05,_05][0000,8A,_05] => 29 , 45 8A , 06 .
> > à => 
> > A\u0300 => [29,05,u1C][0000,8A,_05] => 29 , 45 8A , DC 05 .
> > À => 

As you can see, Mark does not come to the same conclusion as you, and
nor do I.

Richard.



Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
The table is the way it is because it is easier to process (and comprehend)
when the first field is always the primary weight, second is always the
secondary, etc.

Go ahead and transform the input DUCET files as you see fit. The "should be
removed" is your personal preference. Unless we hear strong demand
otherwise from major implementers, people have better things to do than
change their parsers to suit your preference.

Mark


On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy  wrote:

> It's not just a question of "I like it or not". But the fact is that the
> standard makes the presence of 0000 required in some steps, and the
> requirement is in fact wrong: this is in fact NEVER required to create an
> equivalent collation order. These steps are completely unnecessary and
> should be removed.
>
> Le ven. 2 nov. 2018 à 14:03, Mark Davis ☕️  a écrit :
>
>> You may not like the format of the data, but you are not bound to it. If
>> you don't like the data format (eg you want [.0021.0002] instead of
>> [.0000.0021.0002]), you can transform it however you want as long as you
>> get the same answer, as it says here:
>>
>> http://unicode.org/reports/tr10/#Conformance
>> “The Unicode Collation Algorithm is a logical specification.
>> Implementations are free to change any part of the algorithm as long as any
>> two strings compared by the implementation are ordered the same as they
>> would be by the algorithm as specified. Implementations may also use a
>> different format for the data in the Default Unicode Collation Element
>> Table. The sort key is a logical intermediate object: if an implementation
>> produces the same results in comparison of strings, the sort keys can
>> differ in format from what is specified in this document. (See Section 9,
>> Implementation Notes.)”
>>
>>
>> That is what is done, for example, in ICU's implementation. See
>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
>> collation elements" and "sort keys" to see the transformed collation
>> elements (from the DUCET + CLDR) and the resulting sort keys.
>>
>> a =>[29,05,_05] => 29 , 05 , 05 .
>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
>> à => 
>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
>> À => 
>>
>> Mark
>>
>>
>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> As well the step 2 of the algorithm speaks about a single "array" of
>>> collation elements. Actually it's best to create one separate array per
>>> level, and append weights for each level in the relevant array for that
>>> level.
>>> The steps S2.2 to S2.4 can do this, including for derived collation
>>> elements in section 10.1, or variable weighting in section 4.
>>>
>>> This also means that for fast string compares, the primary weights can
>>> be processed on the fly (without needing any buffering) if the primary
>>> weights differ between the two strings (including when one or both of
>>> the two strings end, and the secondary or tertiary weights seen so far
>>> never exceeded the minimum weight value for their level).
>>> Otherwise:
>>> - the first secondary weight higher than the minimum secondary weight
>>> value, and all subsequent secondary weights, must be buffered in a
>>> secondary buffer.
>>> - the first tertiary weight higher than the minimum tertiary weight
>>> value, and all subsequent tertiary weights, must be buffered in a
>>> tertiary buffer.
>>> - and so on for higher levels (each buffer just needs to keep a counter,
>>> when it's first used, indicating how many weights were not buffered while
>>> processing and counting the primary weights, because all these weights were
>>> all equal to the minimum value for the relevant level)
>>> - these secondary/tertiary/etc. buffers will only be used once you reach
>>> the end of the two strings when processing the primary level and no
>>> difference was found: you'll start by comparing the initial counters in
>>> these buffers and the buffer that has the largest counter value is
>>> necessarily for the smaller compared string. If both counters are equal,
>>> then you start comparing the weights stored in each buffer, until one of
>>> the buffers ends before another (the shorter buffer is for the smaller
>>> compared string). If both weight buffers reach the end, you use the next
>>> pair of buffers built for the next level and process them with the same
>>> algorithm.
>>>
>>> Nowhere will you ever need to consider any [.0000] weight, which is just
>>> a notation in the DUCET format intended only to be readable by humans,
>>> but never needed in any machine implementation.
>>>
>>> Now if you want to create sort keys, this is similar except that you
>>> don't have two strings to process and compare; all you want is to create
>>> separate arrays of weights for each level: each level can be encoded
>>> separately, the encoding must be made so that when you'll concatenate 

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
You may not like the format of the data, but you are not bound to it. If
you don't like the data format (eg you want [.0021.0002] instead of
[.0000.0021.0002]), you can transform it however you want as long as you
get the same answer, as it says here:

http://unicode.org/reports/tr10/#Conformance
“The Unicode Collation Algorithm is a logical specification.
Implementations are free to change any part of the algorithm as long as any
two strings compared by the implementation are ordered the same as they
would be by the algorithm as specified. Implementations may also use a
different format for the data in the Default Unicode Collation Element
Table. The sort key is a logical intermediate object: if an implementation
produces the same results in comparison of strings, the sort keys can
differ in format from what is specified in this document. (See Section 9,
Implementation Notes.)”


That is what is done, for example, in ICU's implementation. See
http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
collation elements" and "sort keys" to see the transformed collation
elements (from the DUCET + CLDR) and the resulting sort keys.

a =>[29,05,_05] => 29 , 05 , 05 .
a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
à => 
A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
À => 

Mark


On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> As well the step 2 of the algorithm speaks about a single "array" of
> collation elements. Actually it's best to create one separate array per
> level, and append weights for each level in the relevant array for that
> level.
> The steps S2.2 to S2.4 can do this, including for derived collation
> elements in section 10.1, or variable weighting in section 4.
>
> This also means that for fast string compares, the primary weights can be
> processed on the fly (without needing any buffering) if the primary weights
> differ between the two strings (including when one or both of the two
> strings end, and the secondary or tertiary weights seen so far never
> exceeded the minimum weight value for their level).
> Otherwise:
> - the first secondary weight higher than the minimum secondary weight
> value, and all subsequent secondary weights, must be buffered in a
> secondary buffer.
> - the first tertiary weight higher than the minimum tertiary weight value,
> and all subsequent tertiary weights, must be buffered in a tertiary buffer.
> - and so on for higher levels (each buffer just needs to keep a counter,
> when it's first used, indicating how many weights were not buffered while
> processing and counting the primary weights, because all these weights were
> all equal to the minimum value for the relevant level)
> - these secondary/tertiary/etc. buffers will only be used once you reach
> the end of the two strings when processing the primary level and no
> difference was found: you'll start by comparing the initial counters in
> these buffers and the buffer that has the largest counter value is
> necessarily for the smaller compared string. If both counters are equal,
> then you start comparing the weights stored in each buffer, until one of
> the buffers ends before another (the shorter buffer is for the smaller
> compared string). If both weight buffers reach the end, you use the next
> pair of buffers built for the next level and process them with the same
> algorithm.
>
> Nowhere will you ever need to consider any [.0000] weight, which is just a
> notation in the DUCET format intended only to be readable by humans,
> but never needed in any machine implementation.
>
> Now if you want to create sort keys, this is similar except that you
> don't have two strings to process and compare; all you want is to create
> separate arrays of weights for each level: each level can be encoded
> separately, but the encoding must be made so that when you concatenate the
> encoded arrays, the first few encoded *bits* in the secondary or tertiary
> encodings cannot be larger than or equal to the bits used by the encoding
> of the primary weights (this only limits how you encode the 1st weight in
> each array, as its first encoding *bits* must be lower than the first bits
> used to encode any weight in previous levels).
>
> Nowhere are you required to encode weights exactly like their logical
> weight; the encoding just needs to be fully reversible and can use any
> suitable compression techniques if needed. As long as you can safely
> detect where an encoding ends, because it encounters some bits (with lower
> values) used to start the encoding of one of the higher levels, the
> compression is safe.
>
> For each level, you can reserve only a single code used to "mark" the
> start of another higher level followed by some bits to indicate which level
> it is, then followed by the compressed code for the level made so that each
> weight is encoded by a code not starting by the reserved mark. That
> encoding "mark" 

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
As well the step 2 of the algorithm speaks about a single "array" of
collation elements. Actually it's best to create one separate array per
level, and append weights for each level in the relevant array for that
level.
The steps S2.2 to S2.4 can do this, including for derived collation
elements in section 10.1, or variable weighting in section 4.

This also means that for fast string compares, the primary weights can be
processed on the fly (without needing any buffering) if the primary weights
differ between the two strings (including when one or both of the two
strings end, and the secondary or tertiary weights seen so far never
exceeded the minimum weight value for their level).
Otherwise:
- the first secondary weight higher than the minimum secondary weight
value, and all subsequent secondary weights, must be buffered in a
secondary buffer.
- the first tertiary weight higher than the minimum tertiary weight value,
and all subsequent tertiary weights, must be buffered in a tertiary buffer.
- and so on for higher levels (each buffer just needs to keep a counter,
when it's first used, indicating how many weights were not buffered while
processing and counting the primary weights, because all these weights were
all equal to the minimum value for the relevant level)
- these secondary/tertiary/etc. buffers will only be used once you reach
the end of the two strings when processing the primary level and no
difference was found: you'll start by comparing the initial counters in
these buffers and the buffer that has the largest counter value is
necessarily for the smaller compared string. If both counters are equal,
then you start comparing the weights stored in each buffer, until one of
the buffers ends before another (the shorter buffer is for the smaller
compared string). If both weight buffers reach the end, you use the next
pair of buffers built for the next level and process them with the same
algorithm.
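[Editor's illustration] The level-by-level comparison described above can be sketched as follows. This is an illustrative toy model, not ICU code; the element tuples, the `level_weights`/`compare` names, and the sample data (taken from Figure 3 of UTS #10) are all hypothetical. It compares the non-ignorable weights of each level in turn, which is equivalent to comparing full UCA sort keys, and never materializes any 0000 weight:

```python
# Toy model: a collation element is a (primary, secondary, tertiary) tuple,
# with 0 meaning "ignorable at this level" (the DUCET's 0000 notation).
CAB       = [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002), (0x06EE, 0x0020, 0x0002)]
CAB_UPPER = [(0x0706, 0x0020, 0x0008), (0x06D9, 0x0020, 0x0002), (0x06EE, 0x0020, 0x0002)]
CAB_GRAVE = [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002),
             (0,      0x0021, 0x0002), (0x06EE, 0x0020, 0x0002)]

def level_weights(elements, level):
    """The non-ignorable weights of one level, in order (no 0000 anywhere)."""
    return [ce[level - 1] for ce in elements if ce[level - 1] != 0]

def compare(a, b, levels=3):
    """Return -1/0/1, equivalent to comparing the sort keys of two strings."""
    for level in range(1, levels + 1):
        wa, wb = level_weights(a, level), level_weights(b, level)
        if wa != wb:
            return -1 if wa < wb else 1   # lexicographic compare per level
    return 0
```

In practice the primary level can be compared on the fly and the higher levels buffered lazily, as described above; the outcome is the same.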

Nowhere will you ever need to consider any [.0000] weight, which is just a
notation in the DUCET format intended only to be readable by humans,
but never needed in any machine implementation.

Now if you want to create sort keys, this is similar except that you don't
have two strings to process and compare; all you want is to create separate
arrays of weights for each level: each level can be encoded separately, but
the encoding must be made so that when you concatenate the encoded arrays,
the first few encoded *bits* in the secondary or tertiary encodings cannot
be larger than or equal to the bits used by the encoding of the primary
weights (this only limits how you encode the 1st weight in each array, as
its first encoding *bits* must be lower than the first bits used to encode
any weight in previous levels).

Nowhere are you required to encode weights exactly like their logical
weight; the encoding just needs to be fully reversible and can use any
suitable compression techniques if needed. As long as you can safely detect
where an encoding ends, because it encounters some bits (with lower values)
used to start the encoding of one of the higher levels, the compression is
safe.

For each level, you can reserve a single code used to "mark" the start of
another, higher level, followed by some bits indicating which level it is,
then by the compressed code for that level, built so that each weight is
encoded by a code not starting with the reserved mark. That encoding "mark"
is not necessarily a 0000: it may be a NUL byte, or a '!' (if the encoding
must be readable ASCII or UTF-8 and must not use any control, SPACE, or
isolated surrogate), and the codes used to encode each weight must not
start with a byte lower than or equal to this mark. The binary or ASCII
code units used to encode each weight must just be comparable, so that
comparing codes is equivalent to comparing the weights they represent.

As well, you are not required to store multiple "marks". This is just one
of the possibilities to encode in the sort key which level is encoded after
each "mark", and the marks are not necessarily the same before each level
(their length may also vary depending on the level they are starting):
these marks may be completely removed from the final encoding if the
encoding/compression used allows discriminating the level used by all
weights, encoded in separate sets of values.

Typical compression techniques are, for example, differential coding
(notably in secondary or higher levels) and run-length encoding to skip
sequences of weights all equal to the minimum weight.
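[Editor's illustration] The run-length idea can be sketched in a few lines. This is a toy sketch; the `("RUN", count)` representation is made up for the example and is not CLDR's actual fractional format:

```python
def rle_minimums(weights, min_weight):
    """Collapse runs of the per-level minimum weight into ("RUN", count)
    pairs, leaving all other weights as-is. Illustrative encoding only."""
    out, run = [], 0
    for w in weights:
        if w == min_weight:
            run += 1
        else:
            if run:
                out.append(("RUN", run))
                run = 0
            out.append(w)
    if run:
        out.append(("RUN", run))
    return out
```

For example, the secondary sequence 0020 0020 0021 0020 becomes `[("RUN", 2), 0x21, ("RUN", 1)]`.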

The code units used by the weight encoding for each level may also need to
avoid some forbidden values (e.g. when encoding the weights to UTF-8,
UTF-16, BOCU-1, or SCSU, you cannot use isolated code units reserved for or
representing a surrogate in U+D800..U+DFFF, as this would create a string
not conforming to any standard UTF).

Once again this means that the sequence of logical weight will can 

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 18:39:16 +0100
Philippe Verdy via Unicode  wrote:

> What this means is that we can safely implement UCA using basic
> substitions (e.g. with a function like "string:gsub(map)" in Lua
> which uses a "map" to map source (binary) strings or regexps,into
> target (binary) strings:
> 
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
> 
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized
> order (possibly tuned for the tailored locale) using basic regexps.

Are you sure of this?  Will you publish the algorithm?  Have you
passed the official conformance tests?  (Mind you, DUCET is a
relatively easy UCA collation to implement successfully.)

> - the second ":gsub(mapSecondary)" will substitute any collation
> element by its "intermediary" collation element + tertiary weight.
>
> - the third ":gsub(mapPrimary)" will substitute any "intermediary"
> collation element by its primary weight + secondary weight

Richard.


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 21:13:46 +0100
Philippe Verdy via Unicode  wrote:

> I'm not speaking just about how collation keys will finally be stored
> (as uint16 or bytes, or sequences of bits with variable length); I'm
> just referring to the sequence of weights you generate.


> You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000
> weight, not even during processing, nor in the DUCET table.

If you take the zero weights out, you have a different table structure
to store, e.g. the CLDR fractional weight tables.

Richard.


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Richard Wordingham via Unicode
On Thu, 1 Nov 2018 22:04:40 +0100
Philippe Verdy via Unicode  wrote:

> The DUCET could have as well used the notation ".none", or
> just dropped every ".0000" in its file (provided it contains a data
> entry specifying what is the minimum weight used for each level).
> This notation is only intended to be read by humans editing the file,
> so they don't need to wonder what is the level of the first indicated
> weight or remember what is the minimum weight for that level.
> But the DUCET table is actually generated by a machine and processed
> by machines.

A fair few humans have tailored it by hand.

Richard.


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
So it should be clear in the UCA algorithm and in the DUCET data table that
"0000" is NOT a valid weight.
It is just a notational placeholder written as ".0000", only indicating in
the DUCET format that there's NO weight assigned at the indicated level,
because the collation element is ALWAYS ignorable at this level.
The DUCET could have as well used the notation ".none", or just dropped
every ".0000" in its file (provided it contains a data entry specifying
what is the minimum weight used for each level). This notation is only
intended to be read by humans editing the file, so they don't need to
wonder what is the level of the first indicated weight or remember what is
the minimum weight for that level.
But the DUCET table is actually generated by a machine and processed by
machines.



Le jeu. 1 nov. 2018 à 21:57, Philippe Verdy  a écrit :

> In summary, this step given in the algorithm is completely unneeded and
> can be dropped completely:
>
> *S3.2* If L is not 1, append a *level separator*.
>
> *Note:* The level separator is zero (0000), which is guaranteed to be
> lower than any weight in the resulting sort key. This guarantees that when
> two strings of unequal length are compared, where the shorter string is a
> prefix of the longer string, the longer string is always sorted after the
> shorter—in the absence of special features like contractions. For example:
> "abc" < "abcX" where "X" can be any character(s).
>
> Remove any reference to the "level separator" from the UCA. You never need
> it.
>
> As well this paragraph
>
> 7.3 Form Sort Keys 
>
> *Step 3.* Construct a sort key for each collation element array by
> successively appending all non-zero weights from the collation element
> array. Figure 2 gives an example of the application of this step to one
> collation element array.
>
> Figure 2. Collation Element Array to Sort Key
> 
> Collation Element Array    Sort Key
> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
>
> can be written with this figure:
>
> Figure 2. Collation Element Array to Sort Key
> 
> Collation Element Array    Sort Key
> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
> 0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>
> The parentheses mark the collation weights 0020 and 0002 that can be
> safely removed if they are respectively the minimum secondary weight and
> minimum tertiary weight.
> But note that 0020 is kept in two places because those occurrences are
> followed by a higher weight, 0021. This holds for any tailored collation
> (not just the DUCET).
>
> Le jeu. 1 nov. 2018 à 21:42, Philippe Verdy  a écrit :
>
>> The 0000 is there in the UCA only because the DUCET is published in a
>> format that uses it, but here also this format detail is unnecessary: you
>> never need any [.0000], or [.0000.0000], in the DUCET table either.
>> Instead the DUCET just needs to indicate what is the minimum weight
>> assigned for every level (except the highest level, where it is
>> "implicitly" 0001, and not 0000).
>>
>>
>> Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a
>> écrit :
>>
>>> There are lots of ways to implement the UCA.
>>>
>>> When you want fast string comparison, the zero weights are useful for
>>> processing -- and you don't actually assemble a sort key.
>>>
>>> People who want sort keys usually want them to be short, so you spend
>>> time on compression. You probably also build sort keys as byte vectors not
>>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>>> collation data file remunges all weights into fractional byte sequences,
>>> and leaves gaps for tailoring.
>>>
>>> markus
>>>
>>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
In summary, this step given in the algorithm is completely unneeded and can
be dropped completely:

*S3.2* If L is not 1, append a *level separator*.

*Note:* The level separator is zero (0000), which is guaranteed to be lower
than any weight in the resulting sort key. This guarantees that when two
strings of unequal length are compared, where the shorter string is a
prefix of the longer string, the longer string is always sorted after the
shorter—in the absence of special features like contractions. For example:
"abc" < "abcX" where "X" can be any character(s).

Remove any reference to the "level separator" from the UCA. You never need
it.

As well this paragraph

7.3 Form Sort Keys 

*Step 3.* Construct a sort key for each collation element array by
successively appending all non-zero weights from the collation element
array. Figure 2 gives an example of the application of this step to one
collation element array.

Figure 2. Collation Element Array to Sort Key

Collation Element Array    Sort Key
[.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002

can be written with this figure:

Figure 2. Collation Element Array to Sort Key

Collation Element Array    Sort Key
[.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)

The parentheses mark the collation weights 0020 and 0002 that can be safely
removed if they are respectively the minimum secondary weight and minimum
tertiary weight.
But note that 0020 is kept in two places because those occurrences are
followed by a higher weight, 0021. This holds for any tailored collation
(not just the DUCET).
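[Editor's illustration] The claim that the 0000 separators carry no information, as long as each level's weights occupy disjoint and ordered ranges, can be checked mechanically. The sketch below uses hypothetical names and the element data of Figure 3 of UTS #10; it builds both forms of the sort key and confirms that they order these sample strings identically:

```python
# Collation elements as (primary, secondary, tertiary); 0 = ignorable.
STRINGS = {
    "cab": [(0x0706, 0x20, 0x02), (0x06D9, 0x20, 0x02), (0x06EE, 0x20, 0x02)],
    "Cab": [(0x0706, 0x20, 0x08), (0x06D9, 0x20, 0x02), (0x06EE, 0x20, 0x02)],
    "cáb": [(0x0706, 0x20, 0x02), (0x06D9, 0x20, 0x02),
            (0,      0x21, 0x02), (0x06EE, 0x20, 0x02)],
    "dab": [(0x0712, 0x20, 0x02), (0x06D9, 0x20, 0x02), (0x06EE, 0x20, 0x02)],
}

def sort_key(elems, with_separator, levels=3):
    """Concatenate per-level non-zero weights, with or without 0000 marks."""
    key = []
    for level in range(1, levels + 1):
        if level > 1 and with_separator:
            key.append(0x0000)               # the contested level separator
        key.extend(ce[level - 1] for ce in elems if ce[level - 1])
    return tuple(key)

with_sep    = sorted(STRINGS, key=lambda s: sort_key(STRINGS[s], True))
without_sep = sorted(STRINGS, key=lambda s: sort_key(STRINGS[s], False))
```

For these four strings both orderings come out as cab < Cab < cáb < dab; the separator-free form relies on tertiary weights (here 0002..0008) staying below the minimum secondary weight (0020).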

Le jeu. 1 nov. 2018 à 21:42, Philippe Verdy  a écrit :

> The 0000 is there in the UCA only because the DUCET is published in a
> format that uses it, but here also this format detail is unnecessary: you
> never need any [.0000], or [.0000.0000], in the DUCET table either.
> Instead the DUCET just needs to indicate what is the minimum weight
> assigned for every level (except the highest level, where it is
> "implicitly" 0001, and not 0000).
>
>
> Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a
> écrit :
>
>> There are lots of ways to implement the UCA.
>>
>> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>> People who want sort keys usually want them to be short, so you spend
>> time on compression. You probably also build sort keys as byte vectors not
>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>> collation data file remunges all weights into fractional byte sequences,
>> and leaves gaps for tailoring.
>>
>> markus
>>
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
The 0000 is there in the UCA only because the DUCET is published in a
format that uses it, but here also this format detail is unnecessary: you
never need any [.0000], or [.0000.0000], in the DUCET table either. Instead
the DUCET just needs to indicate what is the minimum weight assigned for
every level (except the highest level, where it is "implicitly" 0001, and
not 0000).
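[Editor's illustration] A minimal, hypothetical reader of the textual notation can recover the non-zero weights and their levels directly, so the 0000 placeholders never reach the in-memory table. The `parse_element`/`nonzero_weights` names are made up for the sketch:

```python
import re

def parse_element(text):
    """Parse a '[.0000.0021.0002]'-style collation element (or '[*...]'
    for variable elements) into a tuple of integer weights."""
    return tuple(int(h, 16) for h in re.findall(r"[0-9A-Fa-f]{4}", text))

def nonzero_weights(ce):
    """Keep only (level, weight) pairs that carry an actual weight."""
    return [(lvl, w) for lvl, w in enumerate(ce, start=1) if w != 0]
```

For example, `nonzero_weights(parse_element("[.0000.0021.0002]"))` yields `[(2, 0x0021), (3, 0x0002)]`, with no 0000 stored anywhere.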


Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a écrit :

> There are lots of ways to implement the UCA.
>
> When you want fast string comparison, the zero weights are useful for
> processing -- and you don't actually assemble a sort key.
>
> People who want sort keys usually want them to be short, so you spend time
> on compression. You probably also build sort keys as byte vectors not
> uint16 vectors (because byte vectors fit into more APIs and tend to be
> shorter), like ICU does using the CLDR collation data file. The CLDR root
> collation data file remunges all weights into fractional byte sequences,
> and leaves gaps for tailoring.
>
> markus
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
Le jeu. 1 nov. 2018 à 21:31, Philippe Verdy  a écrit :

> so you can use these two last functions to write the first one:
>
>   bool isIgnorable(int level, string element) {
> return getLevel(getWeightAt(element, 0)) > getMinWeight(level);
>   }
>
correction:
return getWeightAt(element, 0) > getMinWeight(level);


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a
écrit :

> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>
And no, I see absolutely no case where any 0000 weight is useful during
processing; it does not distinguish any case, even for "fast" string
comparison.

Even if you don't build any sort key, maybe you'll want to return 0000 if
you query the weight for a specific collatable element, but this would be
the same as querying whether the collatable element is ignorable or not at
a given level; this query just returns a true or false boolean, like this
method of a Collator object:

  bool isIgnorable(int level, string element)

and you can also make this reliable for any collator:

  int getLevel(int weight);
  int getMinWeight(int level);
  int getWeightAt(string element, int level, int position);

so you can use these two last functions to write the first one:

  bool isIgnorable(int level, string element) {
return getLevel(getWeightAt(element, 0)) > getMinWeight(level);
  }

That's enough you can write the fast comparison...

What I described is not a complicated "compression"; it is done on the fly,
without any complex transform. All that counts is that any primary weight
value is higher than any secondary weight, and any secondary weight is
higher than any tertiary weight.
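[Editor's illustration] In other words, if the value ranges per level are disjoint and ordered, the level of a weight is recoverable from its value alone. A sketch of such a `getLevel` query; the range boundaries below are made up for the example, not DUCET's actual values:

```python
# (level, low, high) ranges, disjoint and ordered so that every tertiary
# weight < every secondary weight < every primary weight.
LEVEL_RANGES = [(3, 0x0002, 0x001F), (2, 0x0020, 0x00FF), (1, 0x0100, 0xFFFF)]

def get_level(weight):
    """Recover the level of a weight from its value alone."""
    for level, low, high in LEVEL_RANGES:
        if low <= weight <= high:
            return level
    raise ValueError(f"weight {weight:#06x} is outside every level range")
```

With these ranges, `get_level(0x0706)` is 1, `get_level(0x0020)` is 2, and `get_level(0x0002)` is 3, so no 0000 marker is needed to tell the levels apart.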


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
I'm not speaking just about how collation keys will finally be stored (as
uint16 or bytes, or sequences of bits with variable length); I'm just
referring to the sequence of weights you generate.
You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 weight,
not even during processing, nor in the DUCET table.

Le jeu. 1 nov. 2018 à 21:08, Markus Scherer  a écrit :

> There are lots of ways to implement the UCA.
>
> When you want fast string comparison, the zero weights are useful for
> processing -- and you don't actually assemble a sort key.
>
> People who want sort keys usually want them to be short, so you spend time
> on compression. You probably also build sort keys as byte vectors not
> uint16 vectors (because byte vectors fit into more APIs and tend to be
> shorter), like ICU does using the CLDR collation data file. The CLDR root
> collation data file remunges all weights into fractional byte sequences,
> and leaves gaps for tailoring.
>
> markus
>


Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Philippe Verdy via Unicode
For example, Figure 3 in the UTR#10 contains:

Figure 3. Comparison of Sort Keys

 String    Sort Key
1 cab *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0002* 0002 0002
2 Cab *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0008* 0002 0002
3 cáb *0706* 06D9 06EE *0000* 0020 0020 *0021* 0020 *0000* 0002 0002 0002 0002
4 dab *0712* 06D9 06EE *0000* 0020 0020 0020 *0000* 0002 0002 0002


The 0000 weights are never needed, even if any of the source strings
("cab", "Cab", "cáb", "dab") is followed by ANY other string, or if any
other string (higher than "b") replaces their final "b".
What is really important is to understand where the input text (after
initial transforms like reordering and expansion) is broken at specific
boundaries between collatable elements.
But the boundaries between the weights of each part of the sort key can
always be inferred, for example between 06EE and 0020, or between 0020 and
0002.
So this can obviously be changed to just:

Figure 3. Comparison of Sort Keys


 String    Sort Key
1 cab *0706* 06D9 06EE 0020 0020 *0020* *0002* 0002 0002
2 Cab *0706* 06D9 06EE 0020 0020 *0020* *0008* 0002 0002
3 cáb *0706* 06D9 06EE 0020 0020 *0021* 0020 0002 0002 0002 0002
4 dab *0712* 06D9 06EE 0020 0020 0020 0002 0002 0002
As well (emphasized by the black background above):
* when the secondary weights in the sort key are terminated by any sequence
of 0020 (the minimal secondary weight), you can suppress that sequence from
the collation key.
* when the tertiary weights in the sort key are terminated by any sequence
of 0002 (the minimal tertiary weight), you can suppress that sequence from
the collation key.
This gives:

Figure 3. Comparison of Sort Keys

 String    Sort Key
1 cab *0706* 06D9 06EE
2 Cab *0706* 06D9 06EE *0008*
3 cáb *0706* 06D9 06EE 0020 0020 *0021*
4 dab *0712* 06D9 06EE
See the reduction!
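[Editor's illustration] The reduction shown above can be expressed as a small routine. This is an illustrative sketch of the proposal; the per-level minimums follow the DUCET-style values used in the figure, and all names are hypothetical:

```python
MIN = {2: 0x0020, 3: 0x0002}   # minimum secondary and tertiary weights

def reduced_key(elems, levels=3):
    """Build a sort key from (primary, secondary, tertiary) elements,
    dropping each level's trailing minimum weights as proposed above."""
    key = []
    for level in range(1, levels + 1):
        seq = [ce[level - 1] for ce in elems if ce[level - 1]]
        while level in MIN and seq and seq[-1] == MIN[level]:
            seq.pop()                      # strip trailing minimum weights
        key.extend(seq)
    return key
```

Run on the figure's data, "cab" reduces to `0706 06D9 06EE`, "Cab" to `0706 06D9 06EE 0008`, and "cáb" to `0706 06D9 06EE 0020 0020 0021`, matching the reduced table.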

Le jeu. 1 nov. 2018 à 18:39, Philippe Verdy  a écrit :

> I just remarked that there's absolutely NO utility for the collation
> weight 0000 anywhere in the algorithm.
>
> For example in UTR #10, section 3.3.1 gives a collation element:
>   [.0000.0021.0002]
> for COMBINING GRAVE ACCENT. However it can also be simply:
>   [.0021.0002]
> for a simple reason: the secondary or tertiary weights are necessarily
> LOWER than any primary weight (for conformance reasons):
>  any tertiary weight < any secondary weight < any primary weight
> (the set of all weights for all levels is fully partitioned into disjoint
> intervals in the same order, each interval containing all its weights, so
> weights are sorted by decreasing level, then increasing weight in all cases)
>
> This also means that we never need to handle 0000 weights when creating
> sort keys from multiple collation elements, as we can easily detect that
> [.0021.0002] given above starts with a secondary weight 0021 and not a
> primary weight.
>
> As well, we don't need to use any level separator 0000 in the sort key.
>
> This allows more interesting optimizations, and reduction of length for
> sort keys.
> What this means is that we can safely implement UCA using basic
> substitutions (e.g. with a function like "string:gsub(map)" in Lua which
> uses a "map" to map source (binary) strings or regexps, into target
> (binary) strings):
>
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
>
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized order
> (possibly tuned for the tailored locale) using basic regexps.
>
> - the second ":gsub(mapSecondary)" will substitute any collation
> element by its "intermediary" collation element + tertiary weight.
>
> - the third ":gsub(mapPrimary)" will substitute any "intermediary"
> collation element by its primary weight + secondary weight
>
> The "intermediary" collation elements are just like source text, except
> that higher-level differences are eliminated, i.e. all source collation
> element strings are replaced by the collation element string that has the
> smallest collation element weights. They just must be encoded so that they
> are HIGHER than any higher-level weights.
>
> How to do that:
> - reserve the weight range between .0000 (yes! not just .0001) and .001E
> for the last (tertiary) weight; make sure that all other intermediary
> collation elements will use only code units higher than .0020 (this means
> that they can remain encoded in their existing UTF form!)
> - reserve the weight .001F for the case where you don't want to use
> secondary differences (like letter case) and demote them to tertiary
> differences.
>
> This will be used in the second mapping to decompose source collation
> elements into "intermediary collation elements" + tertiary weight. You may
> then decide to leave tertiary weights 

Re: UCA unnecessary collation weight 0000

2018-11-01 Thread Markus Scherer via Unicode
There are lots of ways to implement the UCA.

When you want fast string comparison, the zero weights are useful for
processing -- and you don't actually assemble a sort key.

People who want sort keys usually want them to be short, so you spend time
on compression. You probably also build sort keys as byte vectors not
uint16 vectors (because byte vectors fit into more APIs and tend to be
shorter), like ICU does using the CLDR collation data file. The CLDR root
collation data file remunges all weights into fractional byte sequences,
and leaves gaps for tailoring.

markus