Re: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

Ken Whistler Fri, 01 May 2015 07:23:04 -0700


Koji,

Personally, I don't have a horse in this race, because I am not responsible
for any linebreaking implementation -- so a change for halfwidth katakana
wouldn't matter one way or the other to me.

Secondly, there is no formal stability guarantee constraining Line_Break
property values (other than the generic guarantee that the property itself or
existing aliases cannot be *removed* from the standard). Nor is there
any stability guarantee regarding the rest of the algorithm definition in
UAX #14. So in principle, the UTC could rewrite it completely. But I doubt
that that would be in anybody's interest at this point. ;-)

But as I see it, the way this should work is that the major stakeholders who
*do* have implemented linebreaking algorithms depending on UAX #14 working
in released products (and that would include people speaking for various
browsers and for Apple products in general, I think) should be the ones
either pushing for a change, because it would make their behavior more
correct and acceptable for Japanese, or pushing back *against* a change,
because they depend on UAX #14 stability and would prefer tweaking the
behavior in their implementations, instead. So I'd like to see a formal
proposal for a change (specified *exactly* as to the set of characters
affected) brought to the UTC, where implementers and users of ICU could make
the case for or against.

The other thing that I think would need to happen here is that any proposal
should also provide suggested wording for UAX #14 which would explain
why halfwidth katakana specifically need to break with the general principles
that were used 15 years ago to assign LB classes based on East_Asian_Width
considerations, and instead need to match the LB classes of their
fullwidth katakana counterparts. That should be made explicit in the text
of UAX #14, so somebody else doesn't "discover" another inconsistency
between sets of values and try to change things back later on -- not knowing
the rationale for the values.
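
[Illustration, not part of the original message.] One possible way to
enumerate candidate characters for such a proposal is sketched below. It is
only a sketch under stated assumptions: it selects characters by the name
prefix "HALFWIDTH KATAKANA" (a hypothetical selection rule, not one proposed
in this thread), uses NFKC normalization to find each fullwidth counterpart,
and reports East_Asian_Width only, because Python's standard unicodedata
module does not expose Line_Break values. The helper name
halfwidth_katakana_pairs is made up for this example.

```python
import unicodedata


def halfwidth_katakana_pairs():
    """Yield (halfwidth_char, nfkc_counterpart) pairs.

    Assumption: "halfwidth katakana" means every character in
    U+FF61..U+FF9F whose Unicode name starts with "HALFWIDTH KATAKANA".
    A real proposal would have to state its own exact set.
    """
    for cp in range(0xFF61, 0xFFA0):
        ch = chr(cp)
        if unicodedata.name(ch, "").startswith("HALFWIDTH KATAKANA"):
            # NFKC applies the <narrow> compatibility mapping.
            yield ch, unicodedata.normalize("NFKC", ch)


def describe(ch):
    """Return 'U+XXXX (East_Asian_Width)' for a single character."""
    return f"U+{ord(ch):04X} ({unicodedata.east_asian_width(ch)})"


if __name__ == "__main__":
    for narrow, wide in halfwidth_katakana_pairs():
        counterparts = " ".join(describe(c) for c in wide)
        print(f"{describe(narrow)} -> {counterparts}")
```

The Line_Break class of each character would still have to be looked up
separately, for example in LineBreak.txt from the Unicode Character Database.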

Because a well-formed proposal for a change like this involves both
a justification for a property value change *and* a corresponding fix
to annex text, I think this is too late in the cycle to be taken as just
beta feedback for the Version 8.0 release, unfortunately. Because of
the potential hit on existing implementations (and test cases), this needs
full review, and should instead be pushed as an early proposal for
the Version 9.0 release cycle.

--Ken

On 5/1/2015 5:33 AM, Koji Ishii wrote:

I support Makoto for the change. Nobody should appreciate that behavior, either
worked around locally (Firefox, IE) or unnoticed (Chrome). Rather than
implementing yet another workaround in Chrome, I wish it would finally be fixed
after 15 years.

If this issue were one where 5 people say break and 5 say not to, or,
considering the long life of the bug, 9 say break and 1 says not to, I would
understand that Ken’s answer might make more sense. However, I’m quite sure
that this is a 10-0 issue. Everyone using UAX#14 has to choose between
tailoring, leaving it unnoticed, or won’t fix. I think that kind of thing had
better be fixed.

Half-width CJK should follow the same line breaking class as their wide
counterparts. From that point of view, half-width Hangul being AL is actually
correct. (Note that this is not the same as full-width characters oftentimes
having different classes from their narrow counterparts.)

Half-width punctuation characters already have correct classes, so they’re
fine. Symbols in U+FFE8-FFEE are AL, which also looks incorrect, but I do not
find these code points in any CJK legacy encoding. Where did they come from?
The logical approach is to assign the same classes as their wide counterparts,
but I can’t be sure without knowing where they came from.

Ken, does this change cause problems in terms of the stability policy?

/koji
