Re: "Deterministic Sorting" (was Re: ZWNJ & Persian Collation)
Well, maybe 3 things ;-)

Mark

[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Markus Scherer" <[EMAIL PROTECTED]>; "unicode" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, March 13, 2003 13:04
Subject: "Deterministic Sorting" (was Re: ZWNJ & Persian Collation)

> I want to point out two things.
>
> 1. UCA provides a mechanism for producing a "deterministic" sort (there
> called semi-stable). See step 3.10
> (http://www.unicode.org/reports/tr10/#Step_3).
>
> 2. A "deterministic" sort is actually not needed very often; people confuse
> it with a stable sort. See http://www.unicode.org/reports/tr10/#Stability
>
> 3. If someone did customize the UCA for numeric sorting, the difference
> between 002 and 2 could be a tertiary difference. So even without using
> 3.10, they would be distinguished at level 3.
>
> Mark
"Deterministic Sorting" (was Re: ZWNJ & Persian Collation)
I want to point out two things.

1. UCA provides a mechanism for producing a "deterministic" sort (there
called semi-stable). See step 3.10
(http://www.unicode.org/reports/tr10/#Step_3).

2. A "deterministic" sort is actually not needed very often; people confuse
it with a stable sort. See http://www.unicode.org/reports/tr10/#Stability

3. If someone did customize the UCA for numeric sorting, the difference
between 002 and 2 could be a tertiary difference. So even without using
3.10, they would be distinguished at level 3.

Mark

[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "unicode" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, March 12, 2003 08:48
Subject: Re: ZWNJ & Persian Collation

> Roozbeh Pournader wrote:
> > Well, anything that is completely ignored in collation creates problems
> > with deterministic sorting.
>
> I don't think you mean "deterministic". UCA is deterministic, it just
> sorts many strings as equal.
>
> > There are certain words in Persian, with
> > completely different meanings, that only differ in a ZWNJ[1]. Having ZWNJ
> > ignored by default, means they may appear in this or that order, possibly
> > based on the original order of input. I guess this is not what we want
> > for deterministic collation.
> >
> > The desired behavior for ZWNJ, is being treated like punctuations.
> > Ignored in the first levels, but considered at the end. (Personal Note:
> > write something for UTC on this.)
>
> Possible. I assume that ZWNJ is ignored in UCA because that is the
> expected behavior for many other languages. Not ignoring ZWNJ is possible
> with a tailoring that gives it some non-zero weights.
>
> Note that many languages require tailorings for at least a couple of
> characters to follow national standards.
>
> markus
>
> --
> Opinions expressed here may not reflect my company's positions unless
> otherwise noted.
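A rough Python sketch of points 1 and 3 may help here. It is not the UCA: `toy_collation_key`, `semi_stable_key` and `numeric_key` are hypothetical names made up for illustration, and the "collation key" merely drops ZWNJ. The idea shown is that appending the NFD form of the string as a last tie-breaker (as I understand step 3.10) makes the result independent of input order, and that a numeric tailoring could keep 002 and 2 equal at the first level while still telling them apart at a later one.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def toy_collation_key(s):
    """Hypothetical stand-in for a UCA sort key: it only drops ZWNJ."""
    return s.replace(ZWNJ, "")


def semi_stable_key(s):
    """Collation key, then the NFD form of the string as a final tie-breaker."""
    return (toy_collation_key(s), unicodedata.normalize("NFD", s))


def numeric_key(digits):
    """Numeric value first; the leading-zero count is only a late difference."""
    leading_zeros = len(digits) - len(digits.lstrip("0"))
    return (int(digits), leading_zeros)


# The same Persian letters with and without a ZWNJ after the third letter.
with_zwnj = "\u0646\u0627\u0645\u200c\u0647\u0627\u06cc"
without_zwnj = "\u0646\u0627\u0645\u0647\u0627\u06cc"

# Both input orders give the same output order once the tie-breaker is added.
order_a = sorted([with_zwnj, without_zwnj], key=semi_stable_key)
order_b = sorted([without_zwnj, with_zwnj], key=semi_stable_key)

# "2" and "002" share the numeric value and differ only in the later field.
numbers = sorted(["10", "002", "2"], key=numeric_key)
```

With the tie-breaker, `order_a` and `order_b` are identical, which is the "deterministic" property; without it, the two strings would simply compare equal.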
Re: ZWNJ & Persian Collation
Roozbeh Pournader wrote:
> Well, anything that is completely ignored in collation creates problems
> with deterministic sorting.

I don't think you mean "deterministic". UCA is deterministic, it just sorts many strings as equal.

> There are certain words in Persian, with
> completely different meanings, that only differ in a ZWNJ[1]. Having ZWNJ
> ignored by default, means they may appear in this or that order, possibly
> based on the original order of input. I guess this is not what we want
> for deterministic collation.
>
> The desired behavior for ZWNJ, is being treated like punctuations.
> Ignored in the first levels, but considered at the end. (Personal Note:
> write something for UTC on this.)

Possible. I assume that ZWNJ is ignored in UCA because that is the expected behavior for many other
languages. Not ignoring ZWNJ is possible with a tailoring that gives it some non-zero weights.

Note that many languages require tailorings for at least a couple of characters to follow national
standards.

markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
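To make the tailoring idea concrete, here is a minimal Python sketch. It does not use real UCA weights or any collation library: `tailored_key` is a hypothetical two-level key in which ZWNJ is ignored at the first level but contributes a non-zero weight at the last one, roughly the "ignored in the first levels, but considered at the end" behavior asked for above. The particular order it produces among ZWNJ variants is an artifact of the toy weights, not a claim about correct Persian ordering.

```python
from itertools import permutations

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def tailored_key(word):
    """Toy two-level key: ZWNJ is invisible at level one, weighted at the end."""
    primary = word.replace(ZWNJ, "")
    # Assumed last-level weights: 1 for ZWNJ, 0 for everything else.
    last_level = tuple(1 if ch == ZWNJ else 0 for ch in word)
    return (primary, last_level)


# Same letters, differing only in whether and where a ZWNJ appears.
zwnj_after_third = "\u0646\u0627\u0645\u200c\u0647\u0627\u06cc"
zwnj_after_fourth = "\u0646\u0627\u0645\u0647\u200c\u0627\u06cc"
no_zwnj = "\u0646\u0627\u0645\u0647\u0627\u06cc"

words = [zwnj_after_third, zwnj_after_fourth, no_zwnj]

# Sort every possible input order and collect the distinct results.
distinct_orders = {
    tuple(sorted(p, key=tailored_key)) for p in permutations(words)
}
```

Because the three keys now differ at the last level, `distinct_orders` contains a single ordering regardless of how the input was arranged, while a difference in the letters themselves still decides the comparison first.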
Re: ZWNJ & Persian Collation
On Tue, 11 Mar 2003, Markus Scherer wrote:

> The Unicode Collation Algorithm (UCA) for which allkeys.txt is the
> default weight table does treat ZWNJ and a number of other characters as
> special. For these, they are completely ignored by the UCA - same as if
> you stripped them from the text.

Well, anything that is completely ignored in collation creates problems
with deterministic sorting. There are certain words in Persian, with
completely different meanings, that only differ in a ZWNJ[1]. Having ZWNJ
ignored by default, means they may appear in this or that order, possibly
based on the original order of input. I guess this is not what we want
for deterministic collation.

The desired behavior for ZWNJ, is being treated like punctuations.
Ignored in the first levels, but considered at the end. (Personal Note:
write something for UTC on this.)

roozbeh

[1] A good example is نام‌های or نامهای (names of) vs نامه‌ای (a letter).
Their only difference in encoding is the presence or absence of a ZWNJ, or
its position in the word.
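The effect described in this message can be reproduced with a small Python sketch. It assumes a toy key function, `ignoring_key` (a made-up name, not a real collation API), that strips ZWNJ the way a completely ignored character would be treated, and it relies on Python's `sorted` being a stable sort. The two words are the footnote's example spelled out as code points, on the assumption that the ZWNJ sits after the third letter in one and after the fourth in the other.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def ignoring_key(word):
    """Toy key that treats ZWNJ as completely ignored."""
    return word.replace(ZWNJ, "")


# Footnote example, written as escapes to avoid invisible characters in code.
names_of = "\u0646\u0627\u0645\u200c\u0647\u0627\u06cc"
a_letter = "\u0646\u0627\u0645\u0647\u200c\u0627\u06cc"

# The keys are equal, so a stable sort leaves equal items in input order.
first_run = sorted([names_of, a_letter], key=ignoring_key)
second_run = sorted([a_letter, names_of], key=ignoring_key)
```

Here `first_run` and `second_run` contain the same two words in opposite orders, so the output depends on the order of the input.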