Koji,

Personally, I don't have a horse in this race, because I am not responsible
for any linebreaking implementation -- so a change for halfwidth katakana
wouldn't matter one way or the other to me.

Secondly, there is no formal stability guarantee constraining Line_Break property
values (other than the generic guarantee that the property itself or
existing aliases cannot be *removed* from the standard). Nor is there
any stability guarantee regarding the rest of the algorithm definition in
UAX #14. So in principle, the UTC could rewrite it completely. But I doubt
that that would be in anybody's interest at this point. ;-)

But as I see it, the way this should work is that the major stakeholders who
*do* have implemented linebreaking algorithms depending on UAX #14 working
in released products (and that would include people speaking for various
browsers and for Apple products in general, I think) should be the ones
either pushing for a change, because it would make their behavior more
correct and acceptable for Japanese, or pushing back *against* a change,
because they depend on UAX #14 stability and would prefer tweaking the
behavior in their implementations instead. So I'd like to see a formal
proposal for a change (specified *exactly* as to the set of characters
affected) brought to the UTC, where implementers and users of ICU could make
the case for or against.
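
As a rough illustration of what "specified exactly" could look like, here is
a minimal Python sketch that enumerates candidate characters and their wide
counterparts. It rests on assumptions that are mine, not part of any
proposal: it selects by character name prefix ("HALFWIDTH KATAKANA"), takes
the counterpart from the <narrow> compatibility decomposition, and the
function name is hypothetical. Python's unicodedata module does not expose
Line_Break, so the actual LB classes would still have to be read from
LineBreak.txt.

```python
import unicodedata


def halfwidth_katakana_candidates():
    """Return (code point, name, wide counterpart) triples.

    Hypothetical helper: selection by name prefix is an assumption,
    not the definition used by any actual proposal.
    """
    result = []
    # Scan the Halfwidth and Fullwidth Forms block, U+FF00..U+FFEF.
    for cp in range(0xFF00, 0xFFF0):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if not name.startswith("HALFWIDTH KATAKANA"):
            continue
        # The compatibility decomposition looks like "<narrow> 30A2".
        tag, _, target = unicodedata.decomposition(ch).partition(" ")
        if tag == "<narrow>" and target:
            result.append((cp, name, int(target, 16)))
    return result


if __name__ == "__main__":
    for cp, name, wide in halfwidth_katakana_candidates():
        print(f"U+{cp:04X} {name} -> U+{wide:04X} {unicodedata.name(chr(wide))}")
```

A list like this, checked against the Line_Break values of both columns,
would be the kind of exact character set a proposal should carry.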

The other thing that I think would need to happen here is that any proposal
should also provide suggested wording for UAX #14 which would explain
why halfwidth katakana specifically need to break with the general principles
that were used 15 years ago to assign LB classes based on East_Asian_Width
considerations, and instead need to match the LB classes of their
fullwidth katakana counterparts. That should be made explicit in the text
of UAX #14, so somebody else doesn't "discover" another inconsistency
between sets of values and try to change things back later on -- not knowing
the rationale for the values.

Because a well-formed proposal for a change like this involves both
a justification for a property value change *and* a corresponding fix
to annex text, I think this is too late in the cycle to be taken as just
beta feedback for the Version 8.0 release, unfortunately. Because of
the potential hit on existing implementations (and test cases), this needs
full review, and should instead be pushed as an early proposal for
the Version 9.0 release cycle.

--Ken

On 5/1/2015 5:33 AM, Koji Ishii wrote:
I support Makoto's proposed change. Nobody benefits from the current
behavior, which is either worked around locally (Firefox, IE) or goes
unnoticed (Chrome). Rather than implementing yet another workaround in
Chrome, I would like to see it finally fixed after 15 years.

If this were an issue where 5 people say break and 5 say not to, or even,
considering the long life of the bug, 9 say break and 1 says not to, I would
understand that Ken’s answer might make more sense. However, I’m quite sure
that this is a 10-0 issue. Everyone using UAX #14 has to choose between
tailoring, leaving it unnoticed, or won’t fix. I think that kind of thing
should be fixed.

Half-width CJK characters should have the same line breaking classes as
their wide counterparts. From that point of view, half-width Hangul being AL
is actually correct. (Note that this is a separate matter from full-width
characters often having different classes from their narrow counterparts.)

Half-width punctuation characters already have the correct classes, so they
are fine. The symbols in U+FFE8-FFEE are AL, which also looks incorrect, but
I cannot find these code points in any CJK legacy encoding. Where did they
come from? The logical approach is to assign them the same classes as their
wide counterparts, but I can’t be sure without knowing where they came from.

Ken, does this change cause problems in terms of the stability policy?

/koji



