Re: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

Asmus Freytag (t) Fri, 01 May 2015 12:18:26 -0700

Thank you, Ken, for your dedicated archeological efforts.

I would like to emphasize that, at the time, UAX#14 reflected observed behavior, in particular (but not exclusively) for MS products, some of which (at the time) used an LB algorithm that effectively matched an untailored UAX#14.

However, recently, the W3C has spent considerable effort to look into different layout-related algorithms and specifications. If, in that context, a consensus approach is developed that would point to a better "default" behavior for untailored UAX#14-style line breaking, I would regard that as a critical mass of support to allow UTC to consider tinkering with such a long-standing set of property assignments.

This would be true, especially, if it can be demonstrated that (other than matching legacy behavior) there's no context that would benefit from the existing classification. I note that this was something several posters implied.

So, if implementers of the legacy behavior are amenable to achieving this by tailoring, and if the change augments the number of situations where untailored UAX#14-style line breaking can be used, that would be a win that might offset the cost of a disruptive change.

We've heard arguments why the proposed change is technically superior for Japanese. We now need to find out whether there are contexts where a change would adversely affect users/implementers. Following that, we would look for endorsements of the proposal from implementers or other standards organizations such as W3C (and, if at all possible, agreement from those implementers who use the untailored algorithm now). With these three preconditions in place, I would support an effort of the UTC to revisit this question.


A./

On 5/1/2015 9:48 AM, Ken Whistler wrote:

Suzuki-san,

On 5/1/2015 8:25 AM, suzuki toshiya wrote:


Excuse me, is there any record of the discussion of how the UAX#14
class for halfwidth katakana was decided 15 years ago? If there is,
I would like to see a sample text (of halfwidth katakana) and the
expected layout result for it.


The *founding* document for the UTC discussion of the initial
Line_Break property values 15 years ago was:

http://www.unicode.org/L2/L1999/99179.pdf

and the corresponding table draft (before approval and conversion
into the final format that was published with UTR #14 -- later
/UAX/ #14) was:

http://www.unicode.org/L2/L1999/99180.pdf

There is nothing different or surprising in terms of values there.
The halfwidth katakana were lb=AL and the fullwidth katakana were
lb=ID in that earliest draft, as of 1999.

What is new information, perhaps, is the explicit correlation that can
be found in those documents with the East_Asian_Width properties, and
the explanation in L2/99-179 that the EAW property values were
explicitly used to make distinctions for the initial LB values.

There is no sample text or expected layout results from that time
period, because that was not the basis for the original UTC decisions
on any of this.

Initial LB values were generated based on existing General_Category
and EAW values, using general principles. They were not generated by
examining and specifying in detail the line breaking behavior for
every single script in the standard, and then working back from those
detailed specifications to attempt to create a universal specification
that would replicate all of that detailed behavior. Such an approach
would have been nearly impossible, given the state of all the data,
and might have taken a decade to complete.
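
As an illustration of the kind of property-driven derivation described above, here is a minimal sketch. It is not the procedure from L2/99-179: the function name `initial_lb_guess` and the rule "Wide or Fullwidth gives ID, anything else gives AL" are assumptions made only for this example, and General_Category is ignored entirely. Only Python's standard `unicodedata.east_asian_width` is used.

```python
import unicodedata

def initial_lb_guess(ch: str) -> str:
    """Hypothetical rule: guess a Line_Break class from East_Asian_Width alone.

    This is an illustrative assumption, not the actual derivation used
    for the published Line_Break data.
    """
    eaw = unicodedata.east_asian_width(ch)
    # Wide (W) and Fullwidth (F) characters are treated like ideographs;
    # everything else, including Halfwidth (H), falls back to alphabetic.
    return "ID" if eaw in ("W", "F") else "AL"

fullwidth_ka = "\u30AB"  # KATAKANA LETTER KA
halfwidth_ka = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA

for ch in (fullwidth_ka, halfwidth_ka):
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch), initial_lb_guess(ch))
```

Under this assumed rule, the halfwidth and fullwidth forms of the same kana land in different classes purely because their East_Asian_Width values differ, which matches the AL/ID split reported for the 1999 draft.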

That said, Japanese line breaking was no doubt considered as part of
the overall background, because the initial design for UTR #14 was
informed by experience in implementation of line breaking algorithms
at Microsoft in the 90's.


You commented that the UAX#14 class should not be changed but
the tailoring of the line breaking behaviour would solve
the problem (as Firefox and IE11 did). However, some developers
may wonder "there might be a reason why UTC put halfwidth-katakana
to AL - without understanding it, we could not determine whether
the proposed tailoring should be enabled always, or enabled
only for a specific environment (e.g. locale, surrounding text)".


See above, in L2/99-179. *That* was the justification. It had nothing
to do with specific environment, locale, or surrounding text.
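
For readers unfamiliar with what "tailoring" means in practice, the following is a minimal sketch. Everything in it is assumed for illustration: `DEFAULT_LB` is a two-entry excerpt using the values quoted in this thread (not the full data file), and `line_break_class` and `JA_TAILORING` are hypothetical names, not part of any real library or of the Firefox or IE11 implementations.

```python
# Two-entry excerpt of default classes, using the values quoted in this thread.
DEFAULT_LB = {
    "\u30AB": "ID",  # KATAKANA LETTER KA (fullwidth)
    "\uFF76": "AL",  # HALFWIDTH KATAKANA LETTER KA
}

# Hypothetical tailoring: treat the halfwidth form like its fullwidth counterpart.
JA_TAILORING = {"\uFF76": "ID"}

def line_break_class(ch, tailoring=None):
    """Return the tailored class if one is given, else the default class.

    Characters missing from the excerpt get "XX" (unknown).
    """
    if tailoring is not None and ch in tailoring:
        return tailoring[ch]
    return DEFAULT_LB.get(ch, "XX")

print(line_break_class("\uFF76"))                # default lookup
print(line_break_class("\uFF76", JA_TAILORING))  # tailored lookup
```

The default table is left untouched, and the override is applied only by callers that opt in. Whether to opt in always or only in some environments is the implementer's decision.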


If UTC can supply the "expected layout result for halfwidth-
katakana (used to define the class in current UAX#14)", it
would be helpful for the developers to evaluate the proposed
tailoring algorithm.


UAX #14 was never intended to be a detailed, script-by-script
specification of line layout results. It is a default, generic, universal
algorithm for line breaking that does a decent, generic job of
line breaking in generic contexts without tailoring or specific
knowledge of language, locale, or typographical conventions in use.

UAX #14 is not a replacement for full specification of kinsoku
rules for Japanese, in particular. Nor is it intended as any kind
of replacement for JIS X 4051.

Please understand this: UAX #14 does *NOT* tell anyone how
Japanese text *should* line break. Instead, it is Japanese typographers,
users and standardizers who tell implementers of line break
algorithms for Japanese what the expectations for Japanese text should
be, in what contexts. It is then the job of the UTC and of the
platform and application vendors to negotiate the details of
which part of that expected behavior makes sense to try to
cover by tweaking the default line-breaking algorithm and the
Line_Break property values for Unicode characters, and which
part of that expected behavior makes sense to try to cover
by adjusting commonly accessible and agreed upon tailoring
behavior (or public standards like CSS), and finally which part of that
expected behavior should instead be addressed by value-added, proprietary
implementations of high end publishing software.

Regards,

--Ken
