Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Richard Wordingham
On Fri, 15 Mar 2013 21:12:48 -0700, Markus Scherer wrote: On Fri, Mar 15, 2013 at 6:52 PM, Richard Wordingham wrote: (Well, actually the send button was pressed at 01.52 GMT on Saturday.) The point is that no sequence of units (8-bit, 16-bit or whatever the implementation uses) can be

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Markus Scherer
On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: Please give an example of how the low/high split would fail. With the primary collation weights 20, 21, 21 80 and 22 I get the following primary collation weight sequences for one and two collating

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Richard Wordingham
On Sat, 16 Mar 2013 09:29:07 -0700 Markus Scherer markus@gmail.com wrote: On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: Please give an example of how the low/high split would fail. With the primary collation weights 20, 21, 21 80 and 22 I

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Philippe Verdy
2013/3/16 Richard Wordingham richard.wording...@ntlworld.com: On Sat, 16 Mar 2013 09:29:07 -0700 Markus Scherer markus@gmail.com wrote: On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: Please give an example of how the low/high split would

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Philippe Verdy
2013/3/16 Richard Wordingham richard.wording...@ntlworld.com: But with the low/high split scheme, start units have to have low values (e.g. 20, 21, 22) and continuation units have high values (e.g. 80) just to stop this very problem. Note also that all techniques used for data compression can
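
A minimal sketch of the low/high split point quoted above, assuming the weights 20, 21, 21 80 and 22 are hexadecimal unit values and that sort keys are built by simply concatenating the units of each collating element's primary weight. The function names and the second weight set (with a low continuation unit 10) are hypothetical, added only for contrast; they do not come from the thread.

```python
from itertools import product

def flatten(elements):
    """Concatenate the units of each weight into one flat key."""
    return tuple(unit for weight in elements for unit in weight)

def order_preserved(weights):
    """Check strings of one and two collating elements: does comparing
    flat keys agree with comparing the weight sequences element-wise?"""
    strings = [list(p) for n in (1, 2) for p in product(weights, repeat=n)]
    return all(
        (s < t) == (flatten(s) < flatten(t))
        for s in strings
        for t in strings
    )

# Start units low (0x20..0x22), continuation unit high (0x80).
LOW_HIGH = [(0x20,), (0x21,), (0x21, 0x80), (0x22,)]

# Hypothetical counter-example: the continuation unit (0x10) is lower
# than the start units, so 21 10 sorts before 21 22 in the flat key
# even though the element 21 sorts before the element 21 10.
LOW_CONTINUATION = [(0x20,), (0x21,), (0x21, 0x10), (0x22,)]

print(order_preserved(LOW_HIGH))
print(order_preserved(LOW_CONTINUATION))
```

Under these assumptions the first set keeps the flat-key order consistent with the element order, while the second does not.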

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Richard Wordingham
On Sat, 16 Mar 2013 21:58:02 +0100 Philippe Verdy verd...@wanadoo.fr wrote: 2013/3/16 Richard Wordingham richard.wording...@ntlworld.com: On Sat, 16 Mar 2013 09:29:07 -0700 Markus Scherer markus@gmail.com wrote: On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham

Re: Size of Weights in Unicode Collation Algorithm

2013-03-16 Thread Philippe Verdy
2013/3/16 Richard Wordingham richard.wording...@ntlworld.com: If you start with my start = low, continuation = high scheme, you can convert it in an order-preserving manner to a no-prefix scheme by the following simple transform: If a simple weight precedes a continuation weight, add 0·8

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Richard Wordingham
On Thu, 14 Mar 2013 19:13:43 -0700 Markus Scherer markus@gmail.com wrote: On Thu, Mar 14, 2013 at 4:09 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Thu, 14 Mar 2013 14:49:18 -0700 Markus Scherer markus@gmail.com wrote: While variableTop=u2FD5 ... ... but

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Richard Wordingham
On Fri, 15 Mar 2013 13:52:39 -0700 Markus Scherer markus@gmail.com wrote: On Fri, Mar 15, 2013 at 12:50 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: Not quite. The characterisation of variable weights knows nothing of the concept, and that is the problem. That's a

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Markus Scherer
On Fri, Mar 15, 2013 at 3:05 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: In CLDR/ICU's FractionalUCA.txt, all but 40 or so of the primary weights (and many of the secondary weights) use the large weights mechanism. No, they're 32-bit weights expressed by omitting

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Richard Wordingham
On Fri, 15 Mar 2013 16:03:57 -0700 Markus Scherer markus@gmail.com wrote: On Fri, Mar 15, 2013 at 3:05 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: In CLDR/ICU's FractionalUCA.txt, all but 40 or so of the primary weights (and many of the secondary weights) use the

Re: Size of Weights in Unicode Collation Algorithm

2013-03-15 Thread Markus Scherer
On Fri, Mar 15, 2013 at 6:52 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: The fractional refers to the same kind of mechanism as the large weight values in the UCA spec. Yes. The problem is that formally the UCA clearly treats 'large weights' as being in multiple

RE: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Whistler, Ken
Richard Wordingham wrote: Actually, there is a subtle and nasty difference, but probably one that will very rarely strike practical use. Its most obvious manifestation is in the application of the UCA parametric tailoring topVariable=u2FD5. U+2FD5 KANGXI RADICAL FLUTE is the last symbol in

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Markus Scherer
In ICU, setVariableTop() has a documented limitation: It requires that the primary weight has only 1 or 2 bytes. Until a few years ago, this was true for most characters. Since then, Unicode added many more characters and we ran out of space for 2-byte weights, given our constraints. So we use

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Richard Wordingham
On Thu, 14 Mar 2013 14:49:18 -0700 Markus Scherer markus@gmail.com wrote: However, it does not make a lot of sense to set the variable top to something above the currency symbols range -- it's basically an option for an ignore punctuation mode, and you wouldn't want to ignore nearly every

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Richard Wordingham
On Thu, 14 Mar 2013 21:01:10 + Whistler, Ken ken.whist...@sap.com wrote: Richard Wordingham wrote: ...UCA parametric tailoring topVariable=u2FD5 ... The parametric tailoring in question is variableTop, not topVariable, Sorry. and it would be expressed u00u2FD5, not u2FD5. No -

Re: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Markus Scherer
On Thu, Mar 14, 2013 at 4:09 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Thu, 14 Mar 2013 14:49:18 -0700 Markus Scherer markus@gmail.com wrote: However, it does not make a lot of sense to set the variable top to something above the currency symbols range -- it's

Re: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Markus Scherer
On Wed, Mar 13, 2013 at 11:38 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS#10) was to change weights from being 16 bits to just being general non-negative integers. Was this just to accommodate the 4th

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
Richard Wordingham wrote: One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS#10) was to change weights from being 16 bits to just being general non-negative integers. Was this just to accommodate the 4th weight in DUCET (scheduled for deletion in Version 6.3.0), or is it

Re: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Richard Wordingham
On Wed, 13 Mar 2013 21:07:06 + Whistler, Ken ken.whist...@sap.com wrote: Richard Wordingham wrote: One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS#10) was to change weights from being 16 bits to just being general non-negative integers. Was this just to

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
Richard Wordingham wrote: It loosened up the spec, so that the spec itself didn't seem to be requiring that each of the first 3 levels had to be expressed with a full 16 bits in any collation element table. I don't read it that way. But it did allow the 4th weight to go up to 10!