
Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Tom Hacohen


On 23/11/16 11:45, Daniel Bünzli wrote:
> On Wednesday 23 November 2016 at 12:28, Tom Hacohen wrote:
>> I took a look at the ICU sources, and they explicitly mention this case,
>> so it seems I was mistaken in interpreting the intention of the UAX. I
>> still find it confusing, but based on this thread, it seems to just be me.
>
> It's not only you, I also sometimes get confused by it (see for example [1] and
> subsequent messages). Maybe the operational model could be clarified a bit.

The comment I quoted from the ICU sources clarifies the intention. Maybe
a comment similar to that one would be helpful in the UAX?

Also, thinking about it a bit more, the operational order makes sense
when you consider the CR LF case and extended characters; however, it is
still not obvious from the wording.


Thanks again.

--
Tom.

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Daniel Bünzli

On Wednesday 23 November 2016 at 12:28, Tom Hacohen wrote:
> I took a look at the ICU sources, and they explicitly mention this case,
> so it seems I was mistaken in interpreting the intention of the UAX. I
> still find it confusing, but based on this thread, it seems to just be me.

It's not only you, I also sometimes get confused by it (see for example [1] and 
subsequent messages). Maybe the operational model could be clarified a bit. 

I also think it would be better if the UAX29 didn't use ignore rules at all, so 
that going from rules to implementation is more straightforward --- though I 
understand it may make the spec harder to maintain.

Best,

Daniel

[1] http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0088.html

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Tom Hacohen


On 23/11/16 11:20, Philippe Verdy wrote:
> 2016-11-23 12:00 GMT+01:00 Tom Hacohen:
>> Also take another look at
>> http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
>> specifically the table that shows another way of writing the ignore
>> rule. This again shows my understanding of rule 4 is correct.
>>
>> In particular, look at the following equivalence:
>> X Y × Z W   ⇒   X (Extend | Format)* Y (Extend | Format)* ×
>> Z (Extend | Format)* W
>
> This expansion does not occur before rule WB4; it cannot be used to
> transform rules WB1 to WB3c; this is explicitly stated in the algorithm.
> And because the rule WB3c handles your case, you are misinterpreting the
> specs as if it was applying there too...

I took a look at the ICU sources, and they explicitly mention this case,
so it seems I was mistaken in interpreting the intention of the UAX. I
still find it confusing, but based on this thread, it seems to just be me.


Sorry for the noise.

The comment from the ICU source code:
# Rule 3c   ZWJ x (Extended_Pict | EmojiNRK).  Precedes WB4, so no 
intervening Extend chars allowed.


Thanks for your help,
Tom

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Philippe Verdy

2016-11-23 12:00 GMT+01:00 Tom Hacohen:

>
> Also take another look at
> http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
> specifically the table that shows another way of writing the ignore
> rule. This again shows my understanding of rule 4 is correct.
>
> In particular, look at the following equivalence:
> X Y × Z W   ⇒   X (Extend | Format)* Y (Extend | Format)* × Z
> (Extend | Format)* W
>

This expansion does not occur before rule WB4; it cannot be used to
transform rules WB1 to WB3c; this is explicitly stated in the algorithm.
And because the rule WB3c handles your case, you are misinterpreting the
specs as if it was applying there too...
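
To make the quoted equivalence concrete, here is a minimal Python sketch of the textual expansion. The helper name `expand_rule` is hypothetical and only mirrors the table in the UAX section linked above; as noted, the expansion is meant for the rules that come after WB4, not for WB1 to WB3c.

```python
# Hypothetical helper: rewrite a rule so that ignorable characters are allowed
# after every class symbol except the last one, as in the quoted equivalence.
IGNORE = "(Extend | Format)*"
BOUNDARY_SYMBOLS = {"×", "÷"}


def expand_rule(rule: str) -> str:
    tokens = rule.split()
    # Positions of class symbols (everything that is not a boundary symbol).
    class_positions = [i for i, t in enumerate(tokens) if t not in BOUNDARY_SYMBOLS]
    last = class_positions[-1]
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in class_positions and i != last:
            out.append(IGNORE)
    return " ".join(out)


print(expand_rule("X Y × Z W"))
# X (Extend | Format)* Y (Extend | Format)* × Z (Extend | Format)* W
```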

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Tom Hacohen


On 23/11/16 11:11, Daniel Bünzli wrote:
> On Wednesday 23 November 2016 at 12:00, Tom Hacohen wrote:
>> This looks like a mistaken statement rather than a binding rule.
>
> Well at least to me it's pretty clear that this is not the case.
>
>> Even if that's true, look at my second statement (which you redacted in
>> your reply):
>
> I'm not arguing whether the boundaries produced by this process are good or not.
> I'm just saying that to me, the test data is consistent with the operational
> model and rules of UAX#29 as it exists.

I'm arguing it's not, and I still don't agree with your understanding of
the operational model. Again, take a look at what I wrote in my last email:

> Also take another look at
> http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
> specifically the table that shows another way of writing the ignore
> rule. This again shows my understanding of rule 4 is correct.
>
> In particular, look at the following equivalence:
> X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* × Z
> (Extend | Format)* W


--
Tom

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Philippe Verdy

You say "there's no case where two rules apply". I don't think this is
right: rules apply in precedence order as long as they have not produced
a decision, either "break here" or "no break here". This is especially
important for rules that generate only a replacement, which are executed
in the displayed order, because multiple rules may have their left-hand
side match simultaneously.

You have to read them as if this was:

if (condition1) then (replacement1)
else if (condition2) then (replacement2)
else if (condition3) then (replacement3)
...
else if (conditionN) then (replacementN)

The order of conditions (i.e. the order of rules) is significant when
several of them may be true simultaneously.
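
As an illustration of why the order matters, here is a minimal Python sketch restricted to a toy subset of the word break rules (WB3, WB3a, WB3b and WB999, as I read them in UAX #29). The `decide` function and the class-name strings are assumptions made for this example, not an actual implementation.

```python
from typing import Callable, List, Optional, Tuple

# A rule looks at the classes on both sides of an offset and returns
# "×" (no break), "÷" (break), or None when it does not match.
Rule = Tuple[str, Callable[[str, str], Optional[str]]]

NEWLINES = {"CR", "LF", "Newline"}

RULES: List[Rule] = [
    ("WB3", lambda left, right: "×" if (left, right) == ("CR", "LF") else None),
    ("WB3a", lambda left, right: "÷" if left in NEWLINES else None),
    ("WB3b", lambda left, right: "÷" if right in NEWLINES else None),
    ("WB999", lambda left, right: "÷"),
]


def decide(left: str, right: str) -> Tuple[str, str]:
    """Return (rule name, status) of the first rule that matches."""
    for name, rule in RULES:
        status = rule(left, right)
        if status is not None:
            return name, status
    raise AssertionError("unreachable: WB999 matches every pair")


# CR LF matches WB3, WB3a and WB3b at once; only the first one decides.
print(decide("CR", "LF"))       # ('WB3', '×')
print(decide("CR", "ALetter"))  # ('WB3a', '÷')
```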

Then, after handling the replacement, of course you restart from the
beginning. But what happens to the input stream differs depending on
whether the replacement contains a "break here" or "no break here" (e.g.
rule WB3c) or not (e.g. rule WB4): in the latter case, the substitution
does not advance the input stream, it just transforms it (it changes only
the internal parser state); in the former case, the state is transformed
and all elements in the input stream before the "break here" or
"no break here" are discarded from the input stream, leaving only those
after it.

The input state is a FIFO stack where each element contains:
  { a text buffer (or equivalently an index pointing to the relative end
    position in the input stream buffer) accumulating all characters (or
    bytes) from the input to which the WB class was assigned;
    a WB class (a small integer) to which this input string was mapped
  }
and the input stream buffer.

The automaton processes each rule in the listed order: to see if a rule
matches, it uses only the second member (the WB class) of the elements,
starting from those at the bottom of the stack.

If there are not enough elements in the FIFO stack to match a rule
completely (in "hungry" mode if that rule contains "*" or "+"), it reads
additional bytes or a character from the input stream and appends them to
the top of the input buffer until it can assign a WB class; the new
element, containing just that character and its WB class, is then pushed
to the top of the FIFO stack.

When a tested rule matches one or more elements starting from the bottom of
the FIFO,
* the replacement transforms only these elements in the FIFO: if the
replacement reduces the number of WB class items, the characters in their
internal text buffers are combined as needed; otherwise the WB class is
just replaced in the relevant element of the FIFO stack, and the characters
are kept unchanged.
* Then, if the replacement in that matched rule contains a "break here" or
"no break here" item, all characters at the bottom of the FIFO up to that
position are output: they are popped from the FIFO, but the other items in
the FIFO are kept.

An implementation can optimize this FIFO so that the set of rules
(equivalent to an ordered set of regexps) becomes a finite-state automaton.
But as the set of regexps is ordered, it is possible that for some input a
common prefix of multiple regexps will match simultaneously: their order is
significant.

This is more complex than in the initial specification of word breakers,
where there were no "hungry" regexps and matching occurred only on pairs of
characters, so that you did not need a FIFO (or the FIFO always contained a
single element, never more, and the text buffer in that element was reduced
to just one character or its encoded bytes). In that case the order of
rules was still significant: when several rules could potentially match the
input pair, their order in the specification determined their precedence
(and it was possible to summarize the ordered set of rules with a simple 2D
lookup table).

But if you look at rule WB4: X (Extend | Format | ZWJ)* → X
(which is "hungry" and not bounded in length, and which does not pop any
characters from the input FIFO but still accumulates them in the input state
until it no longer matches longer inputs with "X (Extend | Format | ZWJ)*"),
the simple 2D lookup table approach does not work: it will match partial
input at the same time as other concurrent rules, but concurrent rules must
be ignored if their precedence is lower (because their rule number is
higher).

So the automaton cannot be a finite-state automaton whose state is
represented by only a single integer in a small bounded set (the set of WB
class values).

Note also that the input stream is complemented with additional
pseudo-characters "sot" and "eot" surrounding it: the automaton is
initialized by pushing a {"", sot} element to the FIFO, and when the end of
the stream is reached, it pushes a {"", eot} element to the FIFO. This is
needed for rules WB1 and WB2 (which have the highest precedence in the set
of regexps to match).

The last rule "WB999: Any ÷ Any" is not "hungry" but is equivalent to a
match-all pairs regexp "..", and because

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Daniel Bünzli

On Wednesday 23 November 2016 at 12:00, Tom Hacohen wrote:
> This looks like a mistaken statement rather than a binding rule.
Well at least to me it's pretty clear that this is not the case.

> Even if that's true, look at my second statement (which you redacted in
> your reply):

I'm not arguing whether the boundaries produced by this process are good or not. 
I'm just saying that to me, the test data is consistent with the operational 
model and rules of UAX#29 as it exists. 

Best, 

Daniel

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Tom Hacohen


On 23/11/16 10:52, Daniel Bünzli wrote:
> On Wednesday 23 November 2016 at 11:22, Tom Hacohen wrote:
>> Thank you for your reply, but I don't think the UAX, specifically the
>> line you quoted implies that. The line you quoted says that the process
>> is terminated when a rule matches and produces a boundary status. In
>> Table 1[1], the right-arrow (which is used in rule 4) is listed as a
>> boundary symbol,
>
> Precisely, rules with this *symbol* do not produce a boundary *status* which is
> either boundary or not boundary as mentioned in parens in the line I quoted.

This looks like a mistaken statement rather than a binding rule.

>> so I would argue that one should stop the process and start it again from the
>> start.
>
> At least in the current UAX there is no mention of an idea of stopping and
> restarting the process at all.

Even if that's true, look at my second statement (which you redacted in
your reply):

> Furthermore, in the clarification to rule 4[2] it clearly states: "The
> main purpose of this rule is to always treat a grapheme cluster as a
> single character—that is, as if it were simply the first character of
> the cluster".
>
> This again sides with my understanding that:
> X Extend Y
> should behave exactly the same as
> X Y
> after the extended part.
> Which is exactly what I'm arguing for.

Also take another look at
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
specifically the table that shows another way of writing the ignore
rule. This again shows my understanding of rule 4 is correct.

In particular, look at the following equivalence:
X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* × Z (Extend | Format)* W


--
Tom

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Daniel Bünzli

On Wednesday 23 November 2016 at 11:22, Tom Hacohen wrote:
> Thank you for your reply, but I don't think the UAX, specifically the
> line you quoted implies that. The line you quoted says that the process 
> is terminated when a rule matches and produces a boundary status. In 
> Table 1[1], the right-arrow (which is used in rule 4) is listed as a 
> boundary symbol, 

Precisely, rules with this *symbol* do not produce a boundary *status* which is 
either boundary or not boundary as mentioned in parens in the line I quoted.

> so I would argue that one should stop the process and start it again from the 
> start.

At least in the current UAX there is no mention of an idea of stopping and 
restarting the process at all.

Best, 

Daniel

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Tom Hacohen


On 23/11/16 10:01, Daniel Bünzli wrote:
> On Tuesday 22 November 2016 at 13:07, Tom Hacohen wrote:
>> However, looking at the test case and the UAX[2], this does not look
>> correct. More specifically, because of rule 4:
>> ZWJ Extended GAZ -> ZWJ GAZ
>> And then according to rule 3c, there should be no break opportunity
>> between them.
>
> I'd say this is not the right operational model. From [1]:
>
> "The rules are processed from top to bottom. As soon as a rule matches and produces
> a boundary status (boundary or no boundary) for that offset, the process is
> terminated."
>
> So in this case between COMBINING DIAERESIS and HEAVY BLACK HEART rule WB4
> kicks in. It does not produce a boundary status, it only changes your offset
> context to ZWJ GAZ, as you mention. Now you continue applying the rules
> sequentially with WB6 which does not match, with WB7 which does not match,...
> and you'll get to WB999 which matches and produces a boundary status.
>
> After WB4 you do not restart the matching process from the beginning, as you
> do, leading you to say that WB3c should apply.


Hey Daniel,

Thank you for your reply, but I don't think the UAX, specifically the 
line you quoted implies that. The line you quoted says that the process 
is terminated when a rule matches and produces a boundary status. In 
Table 1[1], the right-arrow (which is used in rule 4) is listed as a 
boundary symbol, so I would argue that one should stop the process and 
start it again from the start.


Furthermore, in the clarification to rule 4[2] it clearly states: "The 
main purpose of this rule is to always treat a grapheme cluster as a 
single character—that is, as if it were simply the first character of 
the cluster".

This again sides with my understanding that:
X Extend Y
should behave exactly the same as
X Y
after the extended part.
Which is exactly what I'm arguing for.

--
Tom

[1] http://www.unicode.org/reports/tr29/#Table_Boundary_Symbols
[2] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Daniel Bünzli

On Tuesday 22 November 2016 at 13:07, Tom Hacohen wrote:
> However, looking at the test case and the UAX[2], this does not look
> correct. More specifically, because of rule 4:
> ZWJ Extended GAZ -> ZWJ GAZ
> And then according to rule 3c, there should be no break opportunity 
> between them. 

I'd say this is not the right operational model. From [1]: 

"The rules are processed from top to bottom. As soon as a rule matches and 
produces a boundary status (boundary or no boundary) for that offset, the 
process is terminated."

So in this case between COMBINING DIAERESIS and HEAVY BLACK HEART rule WB4 
kicks in. It does not produce a boundary status, it only changes your offset 
context to ZWJ GAZ, as you mention. Now you continue applying the rules 
sequentially with WB6 which does not match, with WB7 which does not match,... 
and you'll get to WB999 which matches and produces a boundary status. 

After WB4 you do not restart the matching process from the beginning, as you 
do, leading you to say that WB3c should apply.
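
The difference between the two readings can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it models just WB3c, WB4 and WB999 (WB5 to WB14 and the newline rules are left out), the function name `status_at` and the `restart_after_wb4` flag are hypothetical, and the class names are the ones used in this thread.

```python
from typing import List, Tuple

IGNORED = {"Extend", "Format", "ZWJ"}   # classes skipped by WB4
GLUE = {"Glue_After_Zwj", "EBG"}        # right-hand side of WB3c


def status_at(classes: List[str], i: int, restart_after_wb4: bool = False) -> Tuple[str, str]:
    """Decide the offset between classes[i - 1] and classes[i] (1 <= i < len)."""
    left, right = classes[i - 1], classes[i]
    # WB3c comes before WB4, so it sees the raw neighbours of the offset.
    if left == "ZWJ" and right in GLUE:
        return "WB3c", "×"
    # WB4: never break before an ignored character.
    if right in IGNORED:
        return "WB4", "×"
    # WB4: otherwise skip ignored characters on the left to find X.
    j = i - 1
    while j > 0 and classes[j] in IGNORED:
        j -= 1
    left = classes[j]
    # The reading argued against above: go back to the top after WB4.
    if restart_after_wb4 and left == "ZWJ" and right in GLUE:
        return "WB3c", "×"
    # WB5..WB14 are omitted from this sketch; fall through to WB999.
    return "WB999", "÷"


seq = ["ZWJ", "Extend", "Glue_After_Zwj"]   # 200D 0308 2764
print(status_at(seq, 1))                          # ('WB4', '×')
print(status_at(seq, 2))                          # ('WB999', '÷')
print(status_at(seq, 2, restart_after_wb4=True))  # ('WB3c', '×')
```

Without the restart, the sketch reproduces the [4.0] and [999.0] annotations of the test line under discussion.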

Best, 

Daniel

[1] http://www.unicode.org/reports/tr29/#Notation

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-23 Thread Tom Hacohen

You said:
> So ignore it and test whether the last symbol glues with ZWJ (it should,
> so there's no break in the reference implementation).

Which makes me think you misread the example I quoted. There is a break 
in the reference implementation, though I argue (like you just did) that 
there shouldn't be. So I think you agree with me and also think it's broken.

Otherwise, I'm not sure I fully understand what you are saying, but if 
what you are saying is correct, then following the same logic, other 
rules would fail, specifically:

÷ 0061 × 2060 × 0030 ÷  #  ÷ [0.2] LATIN SMALL LETTER A (ALetter) × 
[4.0] WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷ [0.3]

After the FE here there's no BREAK because:
ALetter Format Numeric -> ALetter Numeric
Which then following rule 9.0 is a no-break.

This is exactly the rule (4) as described in my previous email, just 
with a different follow-up rule (9 instead of 3c). I don't see how rule 
precedence would matter here, as there is no case for which two rules apply.
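
For anyone comparing an implementation against lines like the one quoted above, here is a small Python sketch that reads such a test line. The function name `parse_test_line` is made up for this example, and it assumes the layout seen in this thread: alternating boundary marks and hexadecimal code points, followed by a `#` comment.

```python
from typing import List, Tuple


def parse_test_line(line: str) -> Tuple[List[int], List[bool]]:
    """Return (code points, break flags); flags[k] is the status before code_points[k],
    and the last flag is the status at the end of the text."""
    data = line.split("#", 1)[0].split()
    marks = data[0::2]                                  # ÷ or × at every offset
    code_points = [int(tok, 16) for tok in data[1::2]]  # hex code points in between
    breaks = [mark == "÷" for mark in marks]
    return code_points, breaks


line = "÷ 0061 × 2060 × 0030 ÷  #  ÷ [0.2] LATIN SMALL LETTER A (ALetter) × [4.0] WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷ [0.3]"
print(parse_test_line(line))
# ([97, 8288, 48], [True, False, False, True])
```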

--
Tom.

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-22 Thread Philippe Verdy

Note also this statement at the beginning of the specification:

Single boundaries. Each rule has exactly one boundary position. This
restriction is more a limitation on the specification methods, because a
rule with multiple boundaries could be expressed instead as multiple rules.
For example:
 *  “a b ÷ c d ÷ e f” could be broken into two rules “a b ÷ c d e f” and “a
b c d ÷ e f”
 *  “a b × c d × e f” could be broken into two rules “a b × c d e f” and “a
b c d × e f”

The rules are not built to allow keeping and processing multiple boundary
positions. Only one is considered: once a break or no-break decision is
made at a position, everything before that position is discarded from the
input and will no longer be used by further rules. The engine loops back to
the first rule and looks for matching rules starting from that new boundary
position, without ever looking backward.

Re: Potential contradiction between the WordBreak test data and UAX #29

2016-11-22 Thread Philippe Verdy

IMHO, the ZWJ should glue with the last symbol following your examples.
But the combining diaeresis following the ZWJ extends it (even if in my
opinion it is "defective" and would likely display on a dotted circle in
renderers, but not defective for the string definition of combining
sequences).
So ignore it and test whether the last symbol glues with ZWJ (it should, so
there's no break in the reference implementation).

WB4: X (Extend | Format | ZWJ)*→X

Extend: Grapheme_Extend = Yes. This includes:
  General_Category = Nonspacing_Mark (this includes the combining diaeresis)
  General_Category = Enclosing_Mark
  U+200C ZERO WIDTH NON-JOINER
  plus a few General_Category = Spacing_Mark needed for canonical
equivalence.

So yes, rule WB4 gives: ZWJ "COMBINING DIAERESIS" (EBG | Glue_After_Zwj) →
ZWJ (EBG | Glue_After_Zwj), which eliminates the combining mark from the
input queue.

But rule WB3c comes before and prohibits it:

WB3c: ZWJ × (Glue_After_Zwj | EBG)

This means that you have first:

ZWJ "COMBINING DIAERESIS" GAZ →  ZWJ × "COMBINING DIAERESIS" EBG

and this does not match rule WB4, which does not match:

X × (Extend | Format | ZWJ)*→X

(It cannot remove the extenders if there's a no-break before them; it is
valid only when the break opportunity is still unspecified. As soon as a
rule has produced a "break here" or "no break here" at a given position, you
must advance past this position; the rules are based on a small finite
state machine.) So after:

ZWJ "COMBINING DIAERESIS" GAZ →  ZWJ × "COMBINING DIAERESIS" EBG

what remains in your input queue is just:

"COMBINING DIAERESIS" EBG  (because "ZWJ ×" is already processed, and so ZWJ
is eliminated)

Now comes WB4: X (Extend | Format | ZWJ)* → X

There is no longer any "X" to match before the combining diaeresis: your
input queue starts with the combining diaeresis, which matches "X"; the
following character (EBG) does not match "(Extend | Format | ZWJ)*" (which
matches an empty string and does not contain the combining diaeresis already
matched by "X"), so rule WB4 has no replacement effect and preserves the
initial "X" (i.e. the combining diaeresis).

2016-11-22 13:07 GMT+01:00 Tom Hacohen:

> Dear,
>
> I recently updated libunibreak[1] according to Unicode 9.0.0. I thought I
> implemented it correctly, however it fails against two of the tests in the
> reference test data:
>
> ÷ 200D × 0308 ÷ 2764 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
> (Glue_After_Zwj) ÷ [0.3]
>
> and
>
> ÷ 200D × 0308 ÷ 1F466 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
>
>
> More specifically, it fails in both after the "combining diaeresis". My
> implementation marks it as no break, whereas the test data marks a break.
> The reference implementation, as expected, agrees with the test data.
>
>
> However, looking at the test case and the UAX[2], this does not look
> correct. More specifically, because of rule 4:
> ZWJ Extended GAZ -> ZWJ GAZ
> And then according to rule 3c, there should be no break opportunity
> between them. The reference implementation, however, uses rule 999 here,
> which I believe is incorrect.
>
>
> Am I missing anything, or is this an issue with the reference test data
> and reference implementation?
>
> Thanks,
> Tom.
>
> [1]: https://github.com/adah1972/libunibreak
> [2]: http://www.unicode.org/reports/tr29/#WB1
>
