Re: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?

Peter Edberg Mon, 04 May 2015 14:24:07 -0700

I have been checking with various groups at Apple. The consensus here is that 
we would like to see the linebreak value for halfwidth katakana changed to ID.


- Peter E


> On May 3, 2015, at 12:53 PM, Asmus Freytag (t) <[email protected]> 
> wrote:
> 
> On 5/3/2015 9:47 AM, Koji Ishii wrote:
>> Thank you so much Ken and Asmus for the detailed guides and histories. This 
>> helps me a lot.
>> 
>> In terms of time frame, I don't insist on a specific one; Unicode 9 is 
>> fine if that works well for all.
>> 
>> I'm not sure how much history and postmortem has to be baked into the 
>> section of UAX#14; I hope not much, because I'm not familiar with how it 
>> was defined other than what Ken and Asmus kindly provided in this 
>> thread. But from that information, I feel more strongly than before that 
>> this was simply an unfortunate oversight. In the document Ken quoted, F and 
>> W are distinguished, but H and N are not. In the '90s, East Asian versions 
>> of Office and RichEdit were on my radar, and all of them handled halfwidth 
>> Katakana as ID for line breaking purposes. That's quite understandable given 
>> the number of code points to work on, given the priority of halfwidth 
>> Katakana, and given the difference between "what line breaking should be" 
>> and UAX#14, as Ken noted, but writing it up as a document doesn't look like 
>> an easy task.
> 
> Koji,
> 
> kana are special in that they are not shared among languages. From that 
> perspective, there's nothing wrong with having a "general purpose" algorithm 
> support the rules of the target language (unless that would add undue 
> complexity, which isn't a consideration here).
> 
> Based on the data presented informally here in postings, I find your 
> conclusion (oversight) quite believable. The task would therefore be to 
> present the same data in a more organized fashion as part of a formal 
> proposal. Should be doable.
> 
> I think you'd want to focus on a survey of modern practice in implementations 
> (and if you have data on some of them going back to the '90s, all the better).
> 
> From the historical analysis it's clear that there was a desire to create 
> assignments that didn't introduce random inconsistencies between LB and EAW 
> properties, but that kind of self-consistency check just makes sure that all 
> characters of some group defined by the intersection of property subsets are 
> treated the same (unless there's an overriding reason to differentiate 
> within). It seems entirely plausible that this process misfired for the 
> characters in question, more likely so, given that the earliest drafts of the 
> tables were based on an implementation also being created by MS around the 
> same time. That makes any difference to other MS products even more likely to 
> be an oversight.
> 
> I do want to help UTC establish a precedent of getting changes like that 
> endorsed by a representative sample of implementers and key external 
> standards (where applicable, in this case that would be CSS), to avoid the 
> chance of creating undue disruption (and to increase the chance that the 
> resulting modified algorithm is actually usable off-the-shelf, for example 
> for "default" or "unknown language" type scenarios).
> 
> Hence my insistence that you go out and drum up support. But it looks like 
> this should be relatively easy, as there seems to be no strong case for 
> maintaining the status quo, other than that it is the status quo.
> 
> A./
> 
> 
>> 
>> I agree that implementers and CSS WG should be involved, but given IE and FF 
>> have already tailored, and all MS products as well, I guess it should not be 
>> too hard. I'm in Chrome team now, and the only problem for me to fix it in 
>> Chrome is to justify why Chrome wants to tailor rather than fixing UAX#14 
>> (and the bug priority...)
>> 
>> Either Makoto or I can bring it up with the CSS WG and get back to you.
>> 
>> /koji
>> 
>> 
>> On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) <[email protected]> wrote:
>> Thank you, Ken, for your dedicated archeological efforts.
>> 
>> I would like to emphasize that, at the time, UAX#14 reflected observed 
>> behavior, in particular (but not exclusively) for MS products some of which 
>> (at the time) used an LB algorithm that effectively matched an untailored 
>> UAX#14.
>> 
>> However, recently, the W3C has spent considerable effort to look into 
>> different layout-related algorithms and specifications. If, in that context, 
>> a consensus approach is developed that would point to a better "default" 
>> behavior for untailored UAX#14-style line breaking, I would regard that as a 
>> critical mass of support to allow UTC to consider tinkering with such a 
>> long-standing set of property assignments.
>> 
>> This would be true, especially, if it can be demonstrated that (other than 
>> matching legacy behavior) there's no context that would benefit from the 
>> existing classification. I note that this was something several posters 
>> implied.
>> 
>> So, if implementers of the legacy behavior are amenable to achieving this by 
>> tailoring, and if the change augments the number of situations where 
>> untailored UAX#14-style line breaking can be used, that would be a win that 
>> might offset the cost of a disruptive change.
>> 
>> We've heard arguments why the proposed change is technically superior for 
>> Japanese. We now need to find out whether there are contexts where a change 
>> would adversely affect users/implementers. Following that, we would look for 
>> endorsements of the proposal from implementers or other standards 
>> organizations such as W3C (and, if at all possible, agreement from those 
>> implementers who use the untailored algorithm now). With these three 
>> preconditions in place, I would support an effort of the UTC to revisit this 
>> question.
>> 
>> A./
>> 
>> 
>> On 5/1/2015 9:48 AM, Ken Whistler wrote:
>>> Suzuki-san,
>>> 
>>> On 5/1/2015 8:25 AM, suzuki toshiya wrote:
>>>> 
>>>> Excuse me, is there any record of the discussion of how the UAX#14 class 
>>>> for halfwidth-katakana was decided 15 years ago? If there is, I would 
>>>> like to see a sample text (of halfwidth-katakana) and the expected 
>>>> layout result for it.
>>> 
>>> The *founding* document for the UTC discussion of the initial
>>> Line_Break property values 15 years ago was:
>>> 
>>> http://www.unicode.org/L2/L1999/99179.pdf
>>> 
>>> and the corresponding table draft (before approval and conversion
>>> into the final format that was published with UTR #14 -- later
>>> UAX #14) was:
>>> 
>>> http://www.unicode.org/L2/L1999/99180.pdf
>>> 
>>> There is nothing different or surprising in terms of values there. The
>>> halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in
>>> that earliest draft, as of 1999.
>>> 
>>> What is new information, perhaps, is the explicit correlation that can be
>>> found in those documents with the East_Asian_Width properties, and the
>>> explanation in L2/99-179 that the EAW property values were explicitly used
>>> to make distinctions for the initial LB values.
>>> 
>>> There is no sample text or expected layout results from that time period,
>>> because that was not the basis for the original UTC decisions on any of this.
>>> Initial LB values were generated based on existing General_Category
>>> and EAW values, using general principles. They were not generated by
>>> examining and specifying in detail the line breaking behavior for
>>> every single script in the standard, and then working back from those
>>> detailed specifications to attempt to create a universal specification
>>> that would replicate all of that detailed behavior. Such an approach
>>> would have been nearly impossible, given the state of all the data,
>>> and might have taken a decade to complete.
>>> 
>>> That said, Japanese line breaking was no doubt considered as part of
>>> the overall background, because the initial design for UTR #14 was informed
>>> by experience in implementation of line breaking algorithms at Microsoft
>>> in the 90's.
>>> 
>>>> 
>>>> You commented that the UAX#14 class should not be changed but 
>>>> the tailoring of the line breaking behaviour would solve 
>>>> the problem (as Firefox and IE11 did). However, some developers 
>>>> may wonder "there might be a reason why UTC put halfwidth-katakana 
>>>> to AL - without understanding it, we could not determine whether 
>>>> the proposed tailoring should be enabled always, or enabled 
>>>> only for a specific environment (e.g. locale, surrounding text)". 
>>> 
>>> See above, in L2/99-179. *That* was the justification. It had nothing
>>> to do with specific environment, locale, or surrounding text.
>>> 
>>>> 
>>>> If UTC can supply the "expected layout result for halfwidth- 
>>>> katakana (used to define the class in current UAX#14)", it 
>>>> would be helpful for the developers to evaluate the proposed 
>>>> tailoring algorithm.
>>> 
>>> UAX #14 was never intended to be a detailed, script-by-script
>>> specification of line layout results. It is a default, generic, universal
>>> algorithm for line breaking that does a decent, generic job of
>>> line breaking in generic contexts without tailoring or specific
>>> knowledge of language, locale, or typographical conventions in use.
>>> 
>>> UAX #14 is not a replacement for full specification of kinsoku
>>> rules for Japanese, in particular. Nor is it intended as any kind
>>> of replacement for JIS X 4051.
>>> 
>>> Please understand this: UAX #14 does *NOT* tell anyone how
>>> Japanese text *should* line break. Instead, it is Japanese typographers,
>>> users and standardizers who tell implementers of line break
>>> algorithms for Japanese what the expectations for Japanese text should
>>> be, in what contexts. It is then the job of the UTC and of the
>>> platform and application vendors to negotiate the details of
>>> which part of that expected behavior makes sense to try to
>>> cover by tweaking the default line-breaking algorithm and the
>>> Line_Break property values for Unicode characters, and which
>>> part of that expected behavior makes sense to try to cover
>>> by adjusting commonly accessible and agreed upon tailoring
>>> behavior (or public standards like CSS), and finally which part of that
>>> expected behavior should instead be addressed by value-added, proprietary
>>> implementations of high end publishing software.
>>> 
>>> Regards,
>>> 
>>> --Ken
>>>> 
>>>> 
>>> 
>> 
>> 
>
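
To make the effect of the change discussed in this thread concrete, here is a minimal sketch of how the Line_Break class of halfwidth katakana alters break opportunities. It is a toy under stated assumptions, not a conforming UAX #14 implementation: it knows only the classes AL and ID and only two rules (LB28, no break between alphabetics, and LB31, break everywhere else). The function names are hypothetical, and the code point range used for "halfwidth katakana letters" (U+FF66 and U+FF71..U+FF9D) is an assumption made for illustration.

```python
# Toy illustration only: two Line_Break classes and two UAX #14 rules.
# It is not a conforming line breaker.

def is_halfwidth_katakana_letter(ch: str) -> bool:
    # Assumption: the letters in question are U+FF66 and U+FF71..U+FF9D.
    # Small kana and sound marks are deliberately left out of this sketch.
    cp = ord(ch)
    return cp == 0xFF66 or 0xFF71 <= cp <= 0xFF9D


def line_break_class(ch: str, tailored: bool) -> str:
    """Return "AL" or "ID" for the few characters this sketch understands."""
    if is_halfwidth_katakana_letter(ch):
        # The thread describes the default as AL; the tailoring treats it as ID.
        return "ID" if tailored else "AL"
    if ch.isascii() and ch.isalpha():
        return "AL"
    raise ValueError(f"character U+{ord(ch):04X} is outside this sketch")


def break_opportunities(text: str, tailored: bool) -> list:
    """Indices i where a break is allowed between text[i-1] and text[i]."""
    classes = [line_break_class(ch, tailored) for ch in text]
    breaks = []
    for i in range(1, len(classes)):
        if classes[i - 1] == "AL" and classes[i] == "AL":
            continue  # LB28: do not break between alphabetics
        breaks.append(i)  # LB31: break everywhere else (a pair involving ID)
    return breaks


sample = "\uFF71\uFF72\uFF73"  # three halfwidth katakana letters
print(break_opportunities(sample, tailored=False))  # []
print(break_opportunities(sample, tailored=True))   # [1, 2]
```

With the untailored value, a run of halfwidth katakana behaves like a single Latin word and offers no break; with the ID treatment, a break is allowed between any two of the letters, which is the behavior the thread attributes to the tailored implementations.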
