Fwd: Re: Question about normalization tests

2012-12-10 Thread Edwin Hoogerbeets
Ah yes, I did indeed miss the "equal to" part. I fixed up my code and
now it works as expected.

Thanks to Mark and Ken for your help and speedy response!

Edwin

On 12/10/2012 12:57 PM, Whistler, Ken wrote:
>
> Your misunderstanding is at the highlighted statement below. Actually
> 0300 **is** blocked from 0061 in this sequence, because it is preceded
> by a character with the same canonical combining class (i.e. U+0305,
> ccc=230). A blocking context is the preceding combining character
> either having ccc=0 or having ccc greater than *or equal to* the
> character being checked.
>
> --Ken





RE: Question about normalization tests

2012-12-10 Thread Whistler, Ken
Your misunderstanding is at the highlighted statement below. Actually 0300 *is* 
blocked from 0061 in this sequence, because it is preceded by a character with 
the same canonical combining class (i.e. U+0305, ccc=230). A blocking context 
is the preceding combining character either having ccc=0 or having ccc greater 
than or equal to the character being checked.

--Ken
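
The blocking rule described above can be sketched in Python. This is a minimal illustration, not code from the thread: the function name `is_blocked` and its index arguments are assumptions made for the example, while `unicodedata.combining` is the standard-library lookup for the canonical combining class.

```python
import unicodedata

def is_blocked(seq: str, a: int, c: int) -> bool:
    """Return True if seq[c] is blocked from the starter seq[a].

    Hypothetical helper: seq[c] is blocked when seq[a] is a starter and
    some character between them has ccc == 0 or ccc >= ccc(seq[c]).
    """
    ccc_c = unicodedata.combining(seq[c])
    return unicodedata.combining(seq[a]) == 0 and any(
        unicodedata.combining(b) == 0 or unicodedata.combining(b) >= ccc_c
        for b in seq[a + 1:c]
    )

# The NFD string from the question: 0061 05AE 0305 0300 0315 0062
nfd = "\u0061\u05AE\u0305\u0300\u0315\u0062"
print(is_blocked(nfd, 0, 2))  # 0305 relative to 0061
print(is_blocked(nfd, 0, 3))  # 0300 relative to 0061
```

With the "or equal to" comparison, 0300 at index 3 is reported as blocked because 0305 (ccc=230) sits between it and the starter, while 0305 itself is not blocked by 05AE (ccc=228).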


> Starting with the NFD decomposition string, we retrieve the combining classes
> for each character from the UnicodeData.txt file:
>
> 0061 - 0
> 05AE - 228
> 0305 - 230
> 0300 - 230
> 0315 - 232
> 0062 - 0
>
> You start at the first character after the starter (0061, with ccc=0), which is
> 05AE. There is no primary composition for the sequence 0061 05AE, so you move
> on.
>
> Looking at 0305, it is not blocked from 0061, so check the primary composition
> for 0061 0305. There is none for that either, so move on.
>
> Looking at 0300, it is also not blocked from 0061, so check the primary
> composition for 0061 0300. There is a primary composition for that sequence,
> 00E0, so replace the starter with that, delete the 0300, and continue. The
> string looks like this now:
>
> 00E0 - 0
> 05AE - 228
> 0305 - 230
> 0315 - 232
> 0062 - 0
>
> Checking 0315 and 0062, they are not blocked, but there is no composition with
> 00E0, so the algorithm ends with the result:
> 00E0 05AE 0305 0315 0062
>
> This disagrees with what it says in the normalization tests file as listed
> above. The question is, did I misunderstand the algorithm, or is this perhaps a
> bug in the data file?
>
> Thanks,
>
> Edwin



Re: Question about normalization tests

2012-12-10 Thread Mark Davis ☕
0300 *is* blocked, because there is a preceding character (0305) that has
the same combining class (230).

Mark

*— Il meglio è l’inimico del bene —*



On Mon, Dec 10, 2012 at 11:55 AM, Edwin Hoogerbeets wrote:

> Looking at 0300, it is also not blocked from 0061, so check the primary
> composition for 0061 0300. There is a primary composition for that
> sequence, 00E0, so replace the starter with that, delete the 0300, and
> continue. The string looks like this now:
>


Question about normalization tests

2012-12-10 Thread Edwin Hoogerbeets
Hi there,

I'm going through NormalizationTest.txt in the 6.3.0d1 database,
and I ran across this line:

0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE
0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300
0315 0062; # (a◌̅◌̕◌̀◌֮b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; ) 
LATIN SMALL LETTER A, COMBINING OVERLINE, COMBINING COMMA ABOVE RIGHT,
COMBINING GRAVE ACCENT, HEBREW ACCENT ZINOR, LATIN SMALL LETTER B

The relevant parts for my question are:

Source: 0061 0305 0315 0300 05AE 0062
NFD: 0061 05AE 0305 0300 0315 0062
NFC: 0061 05AE 0305 0300 0315 0062
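
As a cross-check, the three values above can be reproduced with Python's standard `unicodedata` module. This is a small sketch rather than part of the original message; the helper name `to_hex` is an assumption, and the result reflects whatever Unicode version the Python build ships with, not the 6.3.0d1 data files.

```python
import unicodedata

def to_hex(s: str) -> str:
    """Format a string as space-separated four-digit code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)

# Source column from the test line: 0061 0305 0315 0300 05AE 0062
source = "\u0061\u0305\u0315\u0300\u05AE\u0062"

nfd = unicodedata.normalize("NFD", source)
nfc = unicodedata.normalize("NFC", source)

print("NFD:", to_hex(nfd))
print("NFC:", to_hex(nfc))
```

Both lines should match the NFD and NFC values listed above, which suggests the data file and an independent implementation agree.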

I agree with the NFD decomposition result, but the NFC one seems wrong
to me. Definition D117 in Chapter 3 of the Unicode Standard
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf (I couldn't find
the spec for 6.3 -- hopefully 6.2 is close enough) gives the canonical
composition algorithm used for NFC. The way I interpret it, this is how
the composition proceeds:

Starting with the NFD decomposition string, we retrieve the combining
classes for each character from the UnicodeData.txt file:

0061 - 0
05AE - 228
0305 - 230
0300 - 230
0315 - 232
0062 - 0
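
The same table can be produced without parsing UnicodeData.txt by hand. The sketch below swaps in Python's built-in `unicodedata.combining` for the file lookup, which is an assumption of this example and not what the original message did.

```python
import unicodedata

# NFD string from the test line: 0061 05AE 0305 0300 0315 0062
nfd = "\u0061\u05AE\u0305\u0300\u0315\u0062"

# Pair each code point with its canonical combining class (ccc).
classes = [(f"{ord(ch):04X}", unicodedata.combining(ch)) for ch in nfd]

for code_point, ccc in classes:
    print(f"{code_point} - {ccc}")
```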

You start at the first character after the starter (0061, with ccc=0),
which is 05AE. There is no primary composition for the sequence 0061
05AE, so you move on.

Looking at 0305, it is not blocked from 0061, so check the primary
composition for 0061 0305. There is none for that either, so move on.

Looking at 0300, it is also not blocked from 0061, so check the primary
composition for 0061 0300. There is a primary composition for that
sequence, 00E0, so replace the starter with that, delete the 0300, and
continue. The string looks like this now:

00E0 - 0
05AE - 228
0305 - 230
0315 - 232
0062 - 0

Checking 0315 and 0062, they are not blocked, but there is no
composition with 00E0, so the algorithm ends with the result:
00E0 05AE 0305 0315 0062

This disagrees with what it says in the normalization tests file as
listed above. The question is, did I misunderstand the algorithm, or is
this perhaps a bug in the data file?

Thanks,

Edwin
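
For reference, here is a rough Python sketch of the composition walk with the blocking test that treats an equal combining class as blocking, as the replies in this thread explain. It is not anyone's actual code from the thread. The `COMPOSITES` table is a deliberately tiny, hypothetical stand-in for the primary composites a real implementation would derive from UnicodeData.txt, and the function name `compose` is an assumption.

```python
import unicodedata

# Hypothetical one-entry table: 0061 + 0300 -> 00E0. A real implementation
# would build the full primary-composite table from the Unicode data files.
COMPOSITES = {("\u0061", "\u0300"): "\u00E0"}

def compose(nfd: str) -> str:
    """Simplified canonical composition over an already decomposed string."""
    out = []
    starter = None  # index in `out` of the most recent starter, if any
    for ch in nfd:
        ccc = unicodedata.combining(ch)
        if starter is not None:
            # ch is blocked if any character after the starter has
            # ccc == 0 or ccc >= ccc(ch); note the "or equal to".
            blocked = any(
                unicodedata.combining(b) == 0 or unicodedata.combining(b) >= ccc
                for b in out[starter + 1:]
            )
            composite = COMPOSITES.get((out[starter], ch))
            if not blocked and composite is not None:
                out[starter] = composite  # replace starter, drop ch
                continue
        if ccc == 0:
            starter = len(out)
        out.append(ch)
    return "".join(out)

nfd = "\u0061\u05AE\u0305\u0300\u0315\u0062"
result = compose(nfd)
print(" ".join(f"{ord(ch):04X}" for ch in result))
```

On the string from the question, 0300 is blocked by the preceding 0305, so nothing composes and the output stays equal to the NFD form, in agreement with the test file. Changing `>=` to `>` in the blocking test reproduces the 00E0 result described in the walkthrough above.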