Re: "Deterministic Sorting" (was Re: ZWNJ & Persian Collation)

2003-03-13 Thread Mark Davis
Well, maybe 3 things ;-)

Mark

[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

- Original Message -
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Markus Scherer" <[EMAIL PROTECTED]>; "unicode"
<[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, March 13, 2003 13:04
Subject: "Deterministic Sorting" (was Re: ZWNJ & Persian Collation)


> I want to point out two things.
>
> 1. UCA provides a mechanism for producing a "deterministic" sort (there
> called semi-stable). See step 3.10
> (http://www.unicode.org/reports/tr10/#Step_3).
>
> 2. A "deterministic" sort is actually not needed very often; people confuse
> it with a stable sort. See http://www.unicode.org/reports/tr10/#Stability
>
> 3. If someone did customize the UCA for numeric sorting, the difference
> between 002 and 2 could be a tertiary difference. So even without using
> 3.10, they would be distinguished at level 3.
>
> Mark
> 
> [EMAIL PROTECTED]
> IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
> (408) 256-3148
> fax: (408) 256-0799
>
> - Original Message -
> From: "Markus Scherer" <[EMAIL PROTECTED]>
> To: "unicode" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Wednesday, March 12, 2003 08:48
> Subject: Re: ZWNJ & Persian Collation
>
>
> > Roozbeh Pournader wrote:
> > > Well, anything that is completely ignored in collation creates problems
> > > with deterministic sorting.
> >
> > I don't think you mean "deterministic". UCA is deterministic, it just
> > sorts many strings as equal.
> >
> > > There are certain words in Persian, with
> > > completely different meanings, that only differ in a ZWNJ[1].  Having ZWNJ
> > > ignored by default, means they may appear in this or that order, possibly
> > > based on the original order of input.  I guess this is not what we want
> > > for deterministic collation.
> > >
> > > The desired behavior for ZWNJ, is being treated like punctuations.
> > > Ignored in the first levels, but considered at the end. (Personal Note:
> > > write something for UTC on this.)
> >
> > Possible. I assume that ZWNJ is ignored in UCA because that is the
> > expected behavior for many other languages. Not ignoring ZWNJ is possible
> > with a tailoring that gives it some non-zero weights.
> >
> > Note that many languages require tailorings for at least a couple of
> > characters to follow national standards.
> >
> > markus
> >
> > --
> > Opinions expressed here may not reflect my company's positions unless
> > otherwise noted.




"Deterministic Sorting" (was Re: ZWNJ & Persian Collation)

2003-03-13 Thread Mark Davis
I want to point out two things.

1. UCA provides a mechanism for producing a "deterministic" sort (there
called semi-stable). See step 3.10
(http://www.unicode.org/reports/tr10/#Step_3).

2. A "deterministic" sort is actually not needed very often; people confuse
it with a stable sort. See http://www.unicode.org/reports/tr10/#Stability

3. If someone did customize the UCA for numeric sorting, the difference
between 002 and 2 could be a tertiary difference. So even without using
3.10, they would be distinguished at level 3.
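
To make the three points concrete, here is a minimal Python sketch. It is not the UCA: the "primary weight" is simply the integer value of a digit string, and the function names (primary_weight, leading_zeros, sort_plain, sort_deterministic, sort_with_lower_level) are made up for illustration. The tie-breaker on the string itself is only meant to be similar in spirit to the step 3.10 mechanism mentioned above, not a reproduction of it.

```python
def primary_weight(s: str) -> int:
    # Hypothetical "numeric" primary weight: the integer value of the digits.
    return int(s)


def leading_zeros(s: str) -> int:
    # Hypothetical lower-level weight: how many leading zeros the string has.
    return len(s) - len(s.lstrip("0"))


def sort_plain(items):
    # Python's sorted() is stable: items with equal keys keep their input
    # order, so "002" and "2" come out in whatever order they went in.
    return sorted(items, key=primary_weight)


def sort_deterministic(items):
    # Append the string itself as a final tie-breaker, so the result no
    # longer depends on the order of the input.
    return sorted(items, key=lambda s: (primary_weight(s), s))


def sort_with_lower_level(items):
    # Distinguish "002" from "2" at a lower level instead of using a
    # tie-breaker on the raw string.
    return sorted(items, key=lambda s: (primary_weight(s), leading_zeros(s)))


first = ["002", "2", "10"]
second = ["2", "002", "10"]

print(sort_plain(first), sort_plain(second))
print(sort_deterministic(first), sort_deterministic(second))
print(sort_with_lower_level(first), sort_with_lower_level(second))
```

In this toy setup the plain sort gives different results for the two input orders, while the other two give the same result for both. That is the distinction between a stable sort (ties keep input order) and a "deterministic" one (no ties are left).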

Mark

[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

- Original Message -
From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "unicode" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, March 12, 2003 08:48
Subject: Re: ZWNJ & Persian Collation


> Roozbeh Pournader wrote:
> > Well, anything that is completely ignored in collation creates problems
> > with deterministic sorting.
>
> I don't think you mean "deterministic". UCA is deterministic, it just
> sorts many strings as equal.
>
> > There are certain words in Persian, with
> > completely different meanings, that only differ in a ZWNJ[1].  Having ZWNJ
> > ignored by default, means they may appear in this or that order, possibly
> > based on the original order of input.  I guess this is not what we want
> > for deterministic collation.
> >
> > The desired behavior for ZWNJ, is being treated like punctuations.
> > Ignored in the first levels, but considered at the end. (Personal Note:
> > write something for UTC on this.)
>
> Possible. I assume that ZWNJ is ignored in UCA because that is the
> expected behavior for many other languages. Not ignoring ZWNJ is possible
> with a tailoring that gives it some non-zero weights.
>
> Note that many languages require tailorings for at least a couple of
> characters to follow national standards.
>
> markus
>
> --
> Opinions expressed here may not reflect my company's positions unless
> otherwise noted.




Re: ZWNJ & Persian Collation

2003-03-12 Thread Markus Scherer
Roozbeh Pournader wrote:
> Well, anything that is completely ignored in collation creates problems
> with deterministic sorting.

I don't think you mean "deterministic". UCA is deterministic, it just sorts many strings as equal.

> There are certain words in Persian, with
> completely different meanings, that only differ in a ZWNJ[1].  Having ZWNJ
> ignored by default, means they may appear in this or that order, possibly
> based on the original order of input.  I guess this is not what we want
> for deterministic collation.
>
> The desired behavior for ZWNJ, is being treated like punctuations.
> Ignored in the first levels, but considered at the end. (Personal Note:
> write something for UTC on this.)

Possible. I assume that ZWNJ is ignored in UCA because that is the expected behavior for many other
languages. Not ignoring ZWNJ is possible with a tailoring that gives it some non-zero weights.

Note that many languages require tailorings for at least a couple of characters to follow national
standards.

markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.



Re: ZWNJ & Persian Collation

2003-03-12 Thread Roozbeh Pournader
On Tue, 11 Mar 2003, Markus Scherer wrote:

> The Unicode Collation Algorithm (UCA), for which allkeys.txt is the
> default weight table, does treat ZWNJ and a number of other characters as
> special: they are completely ignored by the UCA, the same as if
> you had stripped them from the text.

Well, anything that is completely ignored in collation creates problems
with deterministic sorting.  There are certain words in Persian, with
completely different meanings, that only differ in a ZWNJ[1].  Having ZWNJ
ignored by default, means they may appear in this or that order, possibly
based on the original order of input.  I guess this is not what we want 
for deterministic collation. 

The desired behavior for ZWNJ, is being treated like punctuations.  
Ignored in the first levels, but considered at the end. (Personal Note:
write something for UTC on this.)

roozbeh

[1] A good example is نام‌های or نامهای (names of) vs. نامه‌ای (a letter).
In their encoding, they differ only in the presence or absence of a ZWNJ,
or in its position within the word.
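
The "ignored in the first levels, but considered at the end" behavior can be sketched in Python. This is only a toy model under stated assumptions: toy_key and sort_words are hypothetical names, plain code point order stands in for real UCA weights, and the three words are my transcription of the footnote's examples as \u escapes (assuming the final letter is U+06CC). A real implementation would instead use a tailoring that gives ZWNJ non-zero weights at a later level.

```python
import itertools

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Assumed spellings of the three words from footnote [1].
names_of_joined = "\u0646\u0627\u0645\u0647\u0627\u06cc"    # no ZWNJ
names_of = "\u0646\u0627\u0645\u200c\u0647\u0627\u06cc"     # ZWNJ after the third letter
a_letter = "\u0646\u0627\u0645\u0647\u200c\u0627\u06cc"     # ZWNJ after the fourth letter
words = [names_of, names_of_joined, a_letter]


def toy_key(word: str):
    # First component: the word with ZWNJ ignored entirely.
    # Second component: the full word, consulted only to break ties.
    return (word.replace(ZWNJ, ""), word)


def sort_words(items):
    return sorted(items, key=toy_key)


# With ZWNJ ignored, all three words compare as equal...
assert len({w.replace(ZWNJ, "") for w in words}) == 1

# ...but with the final component, every input order gives the same output.
outputs = {tuple(sort_words(list(p))) for p in itertools.permutations(words)}
print(len(outputs))
```

Which of the three words comes first in this sketch depends only on raw code point values, so it says nothing about the order a Persian dictionary would want. It only shows that the result no longer depends on input order.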