On 5/3/2015 9:47 AM, Koji Ishii wrote:
Thank you so much, Ken and Asmus, for the detailed guidance and history. This helps me a lot.

In terms of time frame, I don't insist on anything specific; Unicode 9 is fine if that works for everyone.

I'm not sure how much history and postmortem has to be baked into the relevant section of UAX#14; I hope not much, because I'm not familiar with how it was defined beyond what Ken and Asmus kindly provided in this thread. But from that information, I feel more strongly than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In the '90s, the East Asian versions of Office and RichEdit were on my radar, and all of them handled halfwidth Katakana as ID for line breaking purposes. That's quite understandable given the number of code points to work on, the priority of halfwidth Katakana, and the difference between "what line breaking should be" and UAX#14 that Ken noted, but writing it up as a document doesn't look like an easy task.

Koji,

Kana are special in that they are not shared among languages. From that perspective, there's nothing wrong with having a "general purpose" algorithm support the rules of the target language (unless that would add undue complexity, which isn't a consideration here).

Based on the data presented informally here in postings, I find your conclusion (oversight) quite believable. The task would therefore be to present the same data in a more organized fashion as part of a formal proposal. Should be doable.

I think you'd want to focus on a survey of modern practice in implementations (and if you have data on some of them going back to the '90s, so much the better).

From the historical analysis it's clear that there was a desire to create assignments that didn't introduce random inconsistencies between the LB and EAW properties, but that kind of self-consistency check just makes sure that all characters in a group defined by the intersection of property subsets are treated the same (unless there's an overriding reason to differentiate within the group). It seems entirely plausible that this process misfired for the characters in question, all the more likely given that the earliest drafts of the tables were based on an implementation also being created by MS around the same time. That makes any difference from the behavior of other MS products even more likely to be an oversight.

I do want to help the UTC establish a precedent of getting changes like this endorsed by a representative sample of implementers and key external standards (where applicable; in this case that would be CSS), to avoid the chance of creating undue disruption (and to increase the chance that the resulting modified algorithm is actually usable off-the-shelf, for example for "default" or "unknown language" type scenarios).

Hence my insistence that you go out and drum up support. But it looks like this should be relatively easy, as there seems to be no strong case for maintaining the status quo, other than that it is the status quo.

A./



I agree that implementers and the CSS WG should be involved, but given that IE and FF have already tailored this, as have all MS products, I guess it should not be too hard. I'm on the Chrome team now, and the only problem with fixing it in Chrome is justifying why Chrome wants to tailor rather than fix UAX#14 (and the bug priority...)
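
(As an aside, here is a minimal sketch of what such a tailoring amounts to, assuming a hypothetical base_lb_class() lookup that returns the untailored UAX #14 class; Gecko, Trident, and Blink/ICU each do this differently in practice.)

    # Hypothetical tailoring: treat the halfwidth Katakana block forms
    # (U+FF66..U+FF9F, used here for illustration) as ID instead of AL
    # before the UAX #14 pair rules are evaluated.
    HALFWIDTH_KATAKANA = range(0xFF66, 0xFF9F + 1)

    def tailored_lb_class(ch, base_lb_class):
        """Line-break class for ch, with halfwidth Katakana remapped to ID."""
        if ord(ch) in HALFWIDTH_KATAKANA:
            return "ID"
        return base_lb_class(ch)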

Either Makoto or I can bring it up with the CSS WG and get back to you.

/koji


On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) <[email protected]> wrote:

    Thank you, Ken, for your dedicated archeological efforts.

    I would like to emphasize that, at the time, UAX#14 reflected
    observed behavior, in particular (but not exclusively) for MS
    products, some of which then used an LB algorithm that
    effectively matched an untailored UAX#14.

    However, recently, the W3C has spent considerable effort looking
    into different layout-related algorithms and specifications. If, in
    that context, a consensus approach is developed that would point
    to a better "default" behavior for untailored UAX#14-style line
    breaking, I would regard that as a critical mass of support to
    allow the UTC to consider tinkering with such a long-standing set
    of property assignments.

    This would be true, especially, if it can be demonstrated that
    (other than matching legacy behavior) there's no context that
    would benefit from the existing classification. I note that this
    was something several posters implied.

    So, if implementers of the legacy behavior are amenable to achieve
    this by tailoring, and if the change augments the number of
    situations where untailored UAX#14-style line breaking can be
    used, that would be a win that might offset the cost of a
    disruptive change.

    We've heard arguments why the proposed change is technically
    superior for Japanese. We now need to find out whether there are
    contexts where a change would adversely affect users/implementers.
    Following that, we would look for endorsements of the proposal
    from implementers or other standards organizations such as W3C
    (and, if at all possible, agreement from those implementers who
    use the untailored algorithm now). With these three preconditions
    in place, I would support an effort of the UTC to revisit this
    question.

    A./


    On 5/1/2015 9:48 AM, Ken Whistler wrote:
    Suzuki-san,

    On 5/1/2015 8:25 AM, suzuki toshiya wrote:

    Excuse me, is there any record of the discussion of how the UAX#14
    class for halfwidth katakana was decided 15 years ago? If there is,
    I would like to see a sample text (of halfwidth katakana) and the
    expected layout result for it.

    The *founding* document for the UTC discussion of the initial
    Line_Break property values 15 years ago was:

    http://www.unicode.org/L2/L1999/99179.pdf

    and the corresponding table draft (before approval and conversion
    into the final format that was published with UTR #14 -- later
    /UAX/ #14) was:

    http://www.unicode.org/L2/L1999/99180.pdf

    There is nothing different or surprising in terms of values there.
    The halfwidth katakana were lb=AL and the fullwidth katakana were
    lb=ID in that earliest draft, as of 1999.

    What is new information, perhaps, is the explicit correlation that
    can be found in those documents with the East_Asian_Width properties,
    and the explanation in L2/99-179 that the EAW property values were
    explicitly used to make distinctions for the initial LB values.
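
    (As an aside, not from the 1999 documents: the EAW side of this
    correlation is easy to check with Python's unicodedata module. A
    minimal sketch; the lb values in the comments are the draft
    assignments quoted above, added by hand, since unicodedata does
    not expose Line_Break.)

        import unicodedata

        # East_Asian_Width as reported by the UCD via unicodedata;
        # lb in the comments is the 1999 draft Line_Break assignment.
        samples = [
            "\u30AB",  # KATAKANA LETTER KA            EAW=W,  lb=ID
            "\uFF76",  # HALFWIDTH KATAKANA LETTER KA  EAW=H,  lb=AL
            "A",       # LATIN CAPITAL LETTER A        EAW=Na, lb=AL
        ]
        for ch in samples:
            print(f"U+{ord(ch):04X}  EAW={unicodedata.east_asian_width(ch)}")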

    There is no sample text or expected layout results from that time
    period, because that was not the basis for the original UTC
    decisions on any of this. Initial LB values were generated based on
    existing General_Category and EAW values, using general principles.
    They were not generated by examining and specifying in detail the
    line breaking behavior for every single script in the standard, and
    then working back from those detailed specifications to attempt to
    create a universal specification that would replicate all of that
    detailed behavior. Such an approach would have been nearly
    impossible, given the state of all the data, and might have taken a
    decade to complete.

    That said, Japanese line breaking was no doubt considered as part
    of the overall background, because the initial design for UTR #14
    was informed by experience in implementation of line breaking
    algorithms at Microsoft in the 90's.


    You commented that the UAX#14 class should not be changed, but
    that tailoring of the line breaking behaviour would solve the
    problem (as Firefox and IE11 did). However, some developers may
    wonder, "there might be a reason why the UTC put halfwidth
    katakana in AL - without understanding it, we could not determine
    whether the proposed tailoring should be enabled always, or
    enabled only for a specific environment (e.g. locale, surrounding
    text)".

    See above, in L2/99-179. *That* was the justification. It had nothing
    to do with specific environment, locale, or surrounding text.


    If the UTC can supply the "expected layout result for halfwidth
    katakana" (used to define the class in the current UAX#14), it
    would help developers evaluate the proposed tailoring algorithm.

    UAX #14 was never intended to be a detailed, script-by-script
    specification of line layout results. It is a default, generic,
    universal algorithm for line breaking that does a decent, generic
    job of line breaking in generic contexts without tailoring or
    specific knowledge of language, locale, or typographical
    conventions in use.

    UAX #14 is not a replacement for full specification of kinsoku
    rules for Japanese, in particular. Nor is it intended as any kind
    of replacement for JIS X 4051.

    Please understand this: UAX #14 does *NOT* tell anyone how
    Japanese text *should* line break. Instead, it is Japanese
    typographers, users and standardizers who tell implementers of
    line break algorithms for Japanese what the expectations for
    Japanese text should be, in what contexts. It is then the job of
    the UTC and of the platform and application vendors to negotiate
    the details of which part of that expected behavior makes sense
    to try to cover by tweaking the default line-breaking algorithm
    and the Line_Break property values for Unicode characters, which
    part of that expected behavior makes sense to try to cover by
    adjusting commonly accessible and agreed upon tailoring behavior
    (or public standards like CSS), and finally which part of that
    expected behavior should instead be addressed by value-added,
    proprietary implementations of high end publishing software.

    Regards,

    --Ken





