I have been checking with various groups at Apple. The consensus here is that we would like to see the linebreak value for halfwidth katakana changed to ID.
- Peter E

> On May 3, 2015, at 12:53 PM, Asmus Freytag (t) <[email protected]> wrote:
>
> On 5/3/2015 9:47 AM, Koji Ishii wrote:
>> Thank you so much Ken and Asmus for the detailed guides and histories. This
>> helps me a lot.
>>
>> In terms of time frame, I don't insist on specific time frame, Unicode 9 is
>> fine if that works well for all.
>>
>> I'm not sure how much history and postmortem has to be baked into the
>> section of UAX#14, hope not much because I'm not familiar with how it was
>> defined so other than what Ken and Asmus kindly provided in this
>> thread. But from those information, I feel stronger than before that this
>> was simply an unfortunate oversight. In the document Ken quoted, F and W are
>> distinguished, but H and N are not. In '90, East Asian versions of Office
>> and RichEdit were in my radar and all of them handled halfwidth Katakana as
>> ID for the line breaking purposes. That's quite understandable given the
>> amount of code points to work on, given the priority of halfwidth Katakana,
>> and given the difference of "what line breaking should be" and UAX#14 as Ken
>> noted, but writing it up as a document doesn't look an easy task
>
> Koji,
>
> kana are special in that they are not shared among languages. From that
> perspective, there's nothing wrong with having a "general purpose" algorithm
> support the rules of the target language (unless that would add undue
> complexity, which isn't a consideration here).
>
> Based on the data presented informally here in postings, I find your
> conclusion (oversight) quite believable. The task would therefore be to
> present the same data in a more organized fashion as part of a formal
> proposal. Should be doable.
>
> I think you'd want to focus on survey of modern practice in implementations
> (and if you have data on some of them going back to the '90s the better).
>
> From the historical analysis it's clear that there was a desire to create
> assignments that didn't introduce random inconsistencies between LB and EAW
> properties, but that kind of self-consistency check just makes sure that all
> characters of some group defined by the intersection of property subsets are
> treated the same (unless there's an overriding reason to differentiate
> within). It seems entirely plausible that this process misfired for the
> characters in question, more likely so, given that the earliest drafts of the
> tables were based on an implementation also being created by MS around the
> same time. That makes any difference to other MS products even more likely to
> be an oversight.
>
> I do want to help UTC establish a precedent of getting changes like that
> endorsed by a representative sample of implementers and key external
> standards (where applicable, in this case that would be CSS), to avoid the
> chance of creating undue disruption (and to increase the chance that the
> resulting modified algorithm is actually usable off-the-shelf, for example
> for "default" or "unknown language" type scenarios.
>
> Hence my insistence that you go out and drum up support. But it looks like
> this should be relatively easy, as there seems to be no strong case for
> maintaining the status quo, other than that it is the status quo.
>
> A./
>
>>
>> I agree that implementers and CSS WG should be involved, but given IE and FF
>> have already tailored, and all MS products as well, I guess it should not be
>> too hard. I'm in Chrome team now, and the only problem for me to fix it in
>> Chrome is to justify why Chrome wants to tailor rather than fixing UAX#14
>> (and the bug priority...)
>>
>> Either Makoto or I can bring it up to CSS WG to get back to you.
>>
>> /koji
>>
>> On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) <[email protected]> wrote:
>> Thank you, Ken, for your dedicated archeological efforts.
>>
>> I would like to emphasize that, at the time, UAX#14 reflected observed
>> behavior, in particular (but not exclusively) for MS products some of which
>> (at the time) used an LB algorithm that effectively matched an untailored
>> UAX#14.
>>
>> However, recently, the W3C has spent considerable effort to look into
>> different layout-related algorithms and specification. If, in that context,
>> a consensus approach is developed that would point to a better "default"
>> behavior for untailored UAX#14-style line breaking, I would regard that as a
>> critical mass of support to allow UTC to consider tinkering with such a
>> long-standing set of property assignments.
>>
>> This would be true, especially, if it can be demonstrated that (other than
>> matching legacy behavior) there's no context that would benefit from the
>> existing classification. I note that this was something several posters
>> implied.
>>
>> So, if implementers of the legacy behavior are amenable to achieve this by
>> tailoring, and if the change augments the number of situations where
>> untailored UAX#14-style line breaking can be used, that would be a win that
>> might offset the cost of a disruptive change.
>>
>> We've heard arguments why the proposed change is technically superior for
>> Japanese. We now need to find out whether there are contexts where a change
>> would adversely affect users/implementers. Following that, we would look for
>> endorsements of the proposal from implementers or other standards
>> organizations such as W3C (and, if at all possible, agreement from those
>> implementers who use the untailored algorithm now). With these three
>> preconditions in place, I would support an effort of the UTC to revisit this
>> question.
>>
>> A./
>>
>> On 5/1/2015 9:48 AM, Ken Whistler wrote:
>>> Suzuki-san,
>>>
>>> On 5/1/2015 8:25 AM, suzuki toshiya wrote:
>>>>
>>>> Excuse me, there is any discussion record how UAX#14 class for
>>>> halfwidth-katakana in 15 years ago?
>>>> If there is such, I want to
>>>> see a sample text (of halfwidth-katakana) and expected layout
>>>> result for it.
>>>
>>> The *founding* document for the UTC discussion of the initial
>>> Line_Break property values 15 years ago was:
>>>
>>> http://www.unicode.org/L2/L1999/99179.pdf
>>>
>>> and the corresponding table draft (before approval and conversion
>>> into the final format that was published with UTR #14 -- later
>>> UAX #14) was:
>>>
>>> http://www.unicode.org/L2/L1999/99180.pdf
>>>
>>> There is nothing different or surprising in terms of values there. The
>>> halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in
>>> that earliest draft, as of 1999.
>>>
>>> What is new information, perhaps, is the explicit correlation that can be
>>> found in those documents with the East_Asian_Width properties, and the
>>> explanation in L2/99-179 that the EAW property values were explicitly used
>>> to make distinctions for the initial LB values.
>>>
>>> There is no sample text or expected layout results from that time period,
>>> because that was not the basis for the original UTC decisions on any of
>>> this. Initial LB values were generated based on existing General_Category
>>> and EAW values, using general principles. They were not generated by
>>> examining and specifying in detail the line breaking behavior for
>>> every single script in the standard, and then working back from those
>>> detailed specifications to attempt to create a universal specification
>>> that would replicate all of that detailed behavior. Such an approach
>>> would have been nearly impossible, given the state of all the data,
>>> and might have taken a decade to complete.
>>>
>>> That said, Japanese line breaking was no doubt considered as part of
>>> the overall background, because the initial design for UTR #14 was informed
>>> by experience in implementation of line breaking algorithms at Microsoft
>>> in the 90's.
>>>
>>>>
>>>> You commented that the UAX#14 class should not be changed but
>>>> the tailoring of the line breaking behaviour would solve
>>>> the problem (as Firefox and IE11 did). However, some developers
>>>> may wonder "there might be a reason why UTC put halfwidth-katakana
>>>> to AL - without understanding it, we could not determine whether
>>>> the proposed tailoring should be enabled always, or enabled
>>>> only for a specific environment (e.g. locale, surrounding text)".
>>>
>>> See above, in L2/99-179. *That* was the justification. It had nothing
>>> to do with specific environment, locale, or surrounding text.
>>>
>>>>
>>>> If UTC can supply the "expected layout result for
>>>> halfwidth-katakana (used to define the class in current UAX#14)", it
>>>> would be helpful for the developers to evaluate the proposed
>>>> tailoring algorithm.
>>>
>>> UAX #14 was never intended to be a detailed, script-by-script
>>> specification of line layout results. It is a default, generic, universal
>>> algorithm for line breaking that does a decent, generic job of
>>> line breaking in generic contexts without tailoring or specific
>>> knowledge of language, locale, or typographical conventions in use.
>>>
>>> UAX #14 is not a replacement for full specification of kinsoku
>>> rules for Japanese, in particular. Nor is it intended as any kind
>>> of replacement for JIS X 4051.
>>>
>>> Please understand this: UAX #14 does *NOT* tell anyone how
>>> Japanese text *should* line break. Instead, it is Japanese typographers,
>>> users and standardizers who tell implementers of line break
>>> algorithms for Japanese what the expectations for Japanese text should
>>> be, in what contexts.
>>> It is then the job of the UTC and of the
>>> platform and application vendors to negotiate the details of
>>> which part of that expected behavior makes sense to try to
>>> cover by tweaking the default line-breaking algorithm and the
>>> Line_Break property values for Unicode characters, and which
>>> part of that expected behavior makes sense to try to cover
>>> by adjusting commonly accessible and agreed upon tailoring
>>> behavior (or public standards like CSS), and finally which part of that
>>> expected behavior should instead be addressed by value-added, proprietary
>>> implementations of high end publishing software.
>>>
>>> Regards,
>>>
>>> --Ken

