Re: Back to the subject: Folding algorithm and canonical equivalence
On 19/07/2004 23:23, Asmus Freytag wrote:

At 01:56 PM 7/19/2004, Mark Davis wrote: You did point out an oversight; Asmus and I have been working on the issue. Mark

As Mark wrote, your point is taken and we've taken it on board. However, we won't try to *edit* text on the list; that's why we are not engaging in a long discussion of the details (and we've discovered many interesting ones; wait for the next version of the text). In my replies I tend to focus on issues for which I need more information.

Fair enough. I just wondered if I needed to raise this one as a formal feedback issue. From what you say here, I assume not.

A./

PS: Just one final comment:

Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them.

Are the authors of UTR #30 claiming that folding is one of those practical circumstances, or is this just an oversight?

As it turns out, and not surprisingly, realizing that ideal for any arbitrary type of possible folding rule can get complicated (again, I won't go into details right now). There may be situations where an optimization would break canonical equivalence in the face of permissible but unusual, not to say nonsensical, input. That's what is meant by 'practical circumstances'. If the ability to 'correctly' handle combining sequences that are a random mixture of Khmer and Arabic combining marks were to result in severe runtime penalties, would you rather have a 'correct' or a fast implementation?

Again, fair enough. But I would be surprised if this is a real issue with the folding algorithm. Indeed I would expect, given that decomposition, presumably to NFD, is in any case required after the first folding pass, that there would be little or no performance hit in normalising the text to be folded to NFD before the first folding pass.
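[Editor's note: the normalize-first approach suggested above can be sketched concretely. This is a minimal, hypothetical illustration in Python using the standard `unicodedata` module; the `fold` function and its rule table are the editor's own simplification of the UTR #30 draft algorithm, not taken from it. The point it demonstrates: normalizing the input to NFD before applying any folding rules guarantees that canonically equivalent inputs yield identical results, whatever the rule set.]

```python
import unicodedata

def fold(text, rules):
    """Hypothetical folding sketch: normalize to NFD first, then apply
    simple substring-replacement rules until no rule changes the text.
    A simplification of the UTR #30 draft algorithm, shown only to
    illustrate the effect of normalizing before the first pass."""
    text = unicodedata.normalize("NFD", text)
    changed = True
    while changed:
        changed = False
        for src, dst in rules.items():
            folded = text.replace(src, dst)
            if folded != text:
                text, changed = folded, True
    return text

# Two canonically equivalent spellings of a + grave + dot below:
s1 = "a\u0300\u0323"   # marks in typed (non-canonical) order
s2 = "a\u0323\u0300"   # marks in canonical order (ccc 220 before 230)

rules = {"\u0300": ""}  # fold away COMBINING GRAVE ACCENT
assert fold(s1, rules) == fold(s2, rules) == "a\u0323"
```

Because NFD reorders combining marks into canonical order, each rule matches (or fails to match) identically for both inputs; without the initial normalization step, s1 and s2 could fold to different, non-equivalent strings.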
Nobody argues that sequences that are expected to occur in realistic data, including specialized texts, should definitely be handled as expected, even where practicalities require some optimizations.

Yes, but I did make the point that the issue I brought up is not a purely theoretical one, but a very real one for Hebrew with the diacritic removal folding as defined. So, we are all agreed.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Back to the subject: Folding algorithm and canonical equivalence
At 01:56 PM 7/19/2004, Mark Davis wrote: You did point out an oversight; Asmus and I have been working on the issue. Mark

As Mark wrote, your point is taken and we've taken it on board. However, we won't try to *edit* text on the list; that's why we are not engaging in a long discussion of the details (and we've discovered many interesting ones; wait for the next version of the text). In my replies I tend to focus on issues for which I need more information.

A./

PS: Just one final comment:

Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them.

Are the authors of UTR #30 claiming that folding is one of those practical circumstances, or is this just an oversight?

As it turns out, and not surprisingly, realizing that ideal for any arbitrary type of possible folding rule can get complicated (again, I won't go into details right now). There may be situations where an optimization would break canonical equivalence in the face of permissible but unusual, not to say nonsensical, input. That's what is meant by 'practical circumstances'. If the ability to 'correctly' handle combining sequences that are a random mixture of Khmer and Arabic combining marks were to result in severe runtime penalties, would you rather have a 'correct' or a fast implementation?

Nobody argues that sequences that are expected to occur in realistic data, including specialized texts, should definitely be handled as expected, even where practicalities require some optimizations. So, we are all agreed.
Re: Back to the subject: Folding algorithm and canonical equivalence
You did point out an oversight; Asmus and I have been working on the issue. Mark

- Original Message -
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, July 19, 2004 13:21
Subject: Back to the subject: Folding algorithm and canonical equivalence

> There has been extensive discussion in this thread on the specifics of
> accent and diacritic folding. But no one has answered my point, repeated
> below, that there seems to be a conflict between the folding algorithm
> (rather than the details of specific foldings) and the principle of
> canonical equivalence. Specifically, it seems to breach the principle in
> Unicode Conformance Clause C9:
>
> > Ideally, an implementation would always interpret two
> > canonical-equivalent character sequences identically. There are
> > practical circumstances under which implementations may reasonably
> > distinguish them.
>
> Are the authors of UTR #30 claiming that folding is one of those
> practical circumstances, or is this just an oversight?
>
> Peter Kirk
>
> On 17/07/2004 23:25, Peter Kirk wrote:
>
> > I was just reviewing the UTR #30 draft in response to Rick's notice
> > about it. And I believe I may have found a point in which the folding
> > algorithm as given may violate the principle of canonical equivalence.
> > But I would like some clarification from list members before providing
> > formal input on this point.
> >
> > Consider a sequence made up of a base character B and two combining
> > marks M1 and M2, in which the combining class of M1 is less than that
> > of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent
> > representations of the same sequence, but only the former is in
> > canonical order. Suppose that a folding is defined including the
> > operation <B, M2> -> X, but no other relevant operations.
> > When this folding is applied, according to the folding algorithms
> > defined in sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a)
> > the sequence <B, M2, M1> will be folded to <X, M1> and will not be
> > further changed, but the sequence <B, M1, M2> will not be changed at
> > all by the folding because the sequence <B, M2> will never be found.
> > (By contrast, a folding operation <B, M1> -> Y will be applied to
> > both sequences, because the canonical decomposition step converts
> > <B, M2, M1> to <B, M1, M2>, and the folding operation is re-applied
> > and finds a match the second time.) The implication is that folding
> > of two canonically equivalent strings gives different (and not
> > canonically equivalent) results.
> >
> > This is not a purely theoretical point. The Diacritic Folding as
> > specified in
> > http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
> > includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET,
> > but no general rule to delete DAGESH (or any other combining marks; I
> > think there needs to be such a rule, and I have already posted a
> > formal response saying that). Sequences like <BET, DAGESH, vowel>,
> > for a vowel point such as QAMATS whose combining class is lower than
> > that of DAGESH, are very common in Hebrew text, and commonly written
> > in this order, which is logically correct and preferred by current
> > rendering technologies; but the canonical order is in fact
> > <BET, vowel, DAGESH>. Thus both sequences will be found in data,
> > depending on whether or not it has been normalised. The effect of
> > applying Diacritic Folding exactly as specified is that
> > <BET, DAGESH, vowel> is folded to <BET, vowel>, but the canonically
> > equivalent <BET, vowel, DAGESH> is unchanged. (In fact I consider
> > that both should be folded to just BET, but that is not what the
> > current data file specifies.)
> >
> > I hope I have not totally misunderstood the folding algorithm here.
> > But it seems to me that what is missing in the algorithm is an
> > initial step of normalising the data.
> > The introductory text to section 4 seems to suggest that this has
> > been avoided because folding may need to preserve the distinction
> > between NFC and NFD data - although the algorithm as presented does
> > not in fact do this. Since in practice the input data is not
> > necessarily in either NFC or NFD and there is no easy way to detect
> > which is being used, the only meaningful approach is for the user of
> > the folding to specify whether the output of the folding should be
> > NFC or NFD.
> >
> > Of course there might be a real requirement for a folding which, for
> > example, removes DAGESH when combined with BET (but not with other
> > base characters) irrespective of what other combining marks might
> > intervene. But such foldings would need a considerably more powerful
> > folding algorithm.
>
> --
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
Back to the subject: Folding algorithm and canonical equivalence
There has been extensive discussion in this thread on the specifics of accent and diacritic folding. But no one has answered my point, repeated below, that there seems to be a conflict between the folding algorithm (rather than the details of specific foldings) and the principle of canonical equivalence. Specifically, it seems to breach the principle in Unicode Conformance Clause C9:

Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them.

Are the authors of UTR #30 claiming that folding is one of those practical circumstances, or is this just an oversight?

Peter Kirk

On 17/07/2004 23:25, Peter Kirk wrote:

I was just reviewing the UTR #30 draft in response to Rick's notice about it. And I believe I may have found a point in which the folding algorithm as given may violate the principle of canonical equivalence. But I would like some clarification from list members before providing formal input on this point.

Consider a sequence made up of a base character B and two combining marks M1 and M2, in which the combining class of M1 is less than that of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent representations of the same sequence, but only the former is in canonical order. Suppose that a folding is defined including the operation <B, M2> -> X, but no other relevant operations. When this folding is applied, according to the folding algorithms defined in sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the sequence <B, M2, M1> will be folded to <X, M1> and will not be further changed, but the sequence <B, M1, M2> will not be changed at all by the folding because the sequence <B, M2> will never be found. (By contrast, a folding operation <B, M1> -> Y will be applied to both sequences, because the canonical decomposition step converts <B, M2, M1> to <B, M1, M2>, and the folding operation is re-applied and finds a match the second time.)
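[Editor's note: the failure mode described above can be reproduced with real characters. A minimal sketch, assuming Python's `unicodedata`, with U+0323 (combining class 220) playing M1 and U+0300 (combining class 230) playing M2; the plain substring replacement is the editor's stand-in for the draft's step (a), not the draft's actual machinery.]

```python
import unicodedata

M1 = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220
M2 = "\u0300"  # COMBINING GRAVE ACCENT, canonical combining class 230

canonical = "a" + M1 + M2   # <B, M1, M2>: canonical order (lower ccc first)
reordered = "a" + M2 + M1   # <B, M2, M1>: canonically equivalent, unordered

# The two spellings are canonically equivalent: they have the same NFD form.
assert (unicodedata.normalize("NFD", canonical)
        == unicodedata.normalize("NFD", reordered))

# A folding rule <B, M2> -> X applied as a plain substring match:
def naive_fold(s):
    return s.replace("a" + M2, "X")

assert naive_fold(reordered) == "X" + M1   # rule fires: <B, M2> is adjacent
assert naive_fold(canonical) == canonical  # rule never fires: M1 intervenes
```

Two canonically equivalent inputs thus fold to different, non-equivalent outputs, which is exactly the conflict with C9 being described.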
The implication is that folding of two canonically equivalent strings gives different (and not canonically equivalent) results.

This is not a purely theoretical point. The Diacritic Folding as specified in http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET, but no general rule to delete DAGESH (or any other combining marks; I think there needs to be such a rule, and I have already posted a formal response saying that). Sequences like <BET, DAGESH, vowel>, for a vowel point such as QAMATS whose combining class is lower than that of DAGESH, are very common in Hebrew text, and commonly written in this order, which is logically correct and preferred by current rendering technologies; but the canonical order is in fact <BET, vowel, DAGESH>. Thus both sequences will be found in data, depending on whether or not it has been normalised. The effect of applying Diacritic Folding exactly as specified is that <BET, DAGESH, vowel> is folded to <BET, vowel>, but the canonically equivalent <BET, vowel, DAGESH> is unchanged. (In fact I consider that both should be folded to just BET, but that is not what the current data file specifies.)

I hope I have not totally misunderstood the folding algorithm here. But it seems to me that what is missing in the algorithm is an initial step of normalising the data. The introductory text to section 4 seems to suggest that this has been avoided because folding may need to preserve the distinction between NFC and NFD data - although the algorithm as presented does not in fact do this. Since in practice the input data is not necessarily in either NFC or NFD and there is no easy way to detect which is being used, the only meaningful approach is for the user of the folding to specify whether the output of the folding should be NFC or NFD.

Of course there might be a real requirement for a folding which, for example, removes DAGESH when combined with BET (but not with other base characters) irrespective of what other combining marks might intervene. But such foldings would need a considerably more powerful folding algorithm.
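[Editor's note: the Hebrew case can likewise be checked directly. A sketch assuming Python's `unicodedata`, using QAMATS (U+05B8, combining class 18) as a representative vowel point below DAGESH's class 21; the two rule functions are the editor's own illustrations, not the data file's machinery.]

```python
import unicodedata

BET, QAMATS, DAGESH = "\u05d1", "\u05b8", "\u05bc"  # ccc 0, 18, 21

typed = BET + DAGESH + QAMATS                  # common logical typing order
normalized = unicodedata.normalize("NFD", typed)
assert normalized == BET + QAMATS + DAGESH     # canonical order: QAMATS first

# The data file's pairwise rule 05D1 05BC -> 05D1, as a substring match:
def pair_rule(s):
    return s.replace(BET + DAGESH, BET)

assert pair_rule(typed) == BET + QAMATS        # DAGESH removed
assert pair_rule(normalized) == normalized     # unchanged: QAMATS intervenes

# A general "delete DAGESH" rule, as proposed above, treats both
# canonically equivalent spellings identically:
def drop_dagesh(s):
    return s.replace(DAGESH, "")

assert drop_dagesh(typed) == drop_dagesh(normalized) == BET + QAMATS
```

Note that normalizing first does not by itself rescue the pairwise rule: after NFD, BET and DAGESH are never adjacent in either input, so the rule simply never fires (which is at least consistent). Getting both spellings folded to the same DAGESH-free result requires the general mark-deletion rule.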
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/