Hi all, I'm not in sync with the publishing schedule, sorry about that, but is it possible to consider this change in the Unicode 9.0 time frame?
I believe all concerns were cleared in the discussion, but if any were left, I'd be happy to discuss further. And I hope I'm not too late this time?

/koji

On Tue, May 5, 2015 at 6:19 AM, Peter Edberg <[email protected]> wrote:

> I have been checking with various groups at Apple. The consensus here is that we would like to see the linebreak value for halfwidth katakana changed to ID.
>
> - Peter E
>
> On May 3, 2015, at 12:53 PM, Asmus Freytag (t) <[email protected]> wrote:
>
> On 5/3/2015 9:47 AM, Koji Ishii wrote:
>
> Thank you so much Ken and Asmus for the detailed guides and histories. This helps me a lot.
>
> In terms of time frame, I don't insist on specific time frame, Unicode 9 is fine if that works well for all.
>
> I'm not sure how much history and postmortem has to be baked into the section of UAX#14, hope not much because I'm not familiar with how it was defined so other than what Ken and Asmus kindly provided in this thread. But from those information, I feel stronger than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In '90, East Asian versions of Office and RichEdit were in my radar and all of them handled halfwidth Katakana as ID for the line breaking purposes. That's quite understandable given the amount of code points to work on, given the priority of halfwidth Katakana, and given the difference of "what line breaking should be" and UAX#14 as Ken noted, but writing it up as a document doesn't look an easy task
>
> Koji,
>
> kana are special in that they are not shared among languages. From that perspective, there's nothing wrong with having a "general purpose" algorithm support the rules of the target language (unless that would add undue complexity, which isn't a consideration here).
>
> Based on the data presented informally here in postings, I find your conclusion (oversight) quite believable.
> The task would therefore be to present the same data in a more organized fashion as part of a formal proposal. Should be doable.
>
> I think you'd want to focus on survey of modern practice in implementations (and if you have data on some of them going back to the '90s the better).
>
> From the historical analysis it's clear that there was a desire to create assignments that didn't introduce random inconsistencies between LB and EAW properties, but that kind of self-consistency check just makes sure that all characters of some group defined by the intersection of property subsets are treated the same (unless there's an overriding reason to differentiate within). It seems entirely plausible that this process misfired for the characters in question, more likely so, given that the earliest drafts of the tables were based on an implementation also being created by MS around the same time. That makes any difference to other MS products even more likely to be an oversight.
>
> I do want to help UTC establish a precedent of getting changes like that endorsed by a representative sample of implementers and key external standards (where applicable, in this case that would be CSS), to avoid the chance of creating undue disruption (and to increase the chance that the resulting modified algorithm is actually usable off-the-shelf, for example for "default" or "unknown language" type scenarios).
>
> Hence my insistence that you go out and drum up support. But it looks like this should be relatively easy, as there seems to be no strong case for maintaining the status quo, other than that it is the status quo.
>
> A./
>
> I agree that implementers and CSS WG should be involved, but given IE and FF have already tailored, and all MS products as well, I guess it should not be too hard.
> I'm in Chrome team now, and the only problem for me to fix it in Chrome is to justify why Chrome wants to tailor rather than fixing UAX#14 (and the bug priority...)
>
> Either Makoto or I can bring it up to CSS WG to get back to you.
>
> /koji
>
> On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) <[email protected]> wrote:
>
>> Thank you, Ken, for your dedicated archeological efforts.
>>
>> I would like to emphasize that, at the time, UAX#14 reflected observed behavior, in particular (but not exclusively) for MS products some of which (at the time) used an LB algorithm that effectively matched an untailored UAX#14.
>>
>> However, recently, the W3C has spent considerable effort to look into different layout-related algorithms and specification. If, in that context, a consensus approach is developed that would point to a better "default" behavior for untailored UAX#14-style line breaking, I would regard that as a critical mass of support to allow UTC to consider tinkering with such a long-standing set of property assignments.
>>
>> This would be true, especially, if it can be demonstrated that (other than matching legacy behavior) there's no context that would benefit from the existing classification. I note that this was something several posters implied.
>>
>> So, if implementers of the legacy behavior are amenable to achieve this by tailoring, and if the change augments the number of situations where untailored UAX#14-style line breaking can be used, that would be a win that might offset the cost of a disruptive change.
>>
>> We've heard arguments why the proposed change is technically superior for Japanese. We now need to find out whether there are contexts where a change would adversely affect users/implementers.
>> Following that, we would look for endorsements of the proposal from implementers or other standards organizations such as W3C (and, if at all possible, agreement from those implementers who use the untailored algorithm now). With these three preconditions in place, I would support an effort of the UTC to revisit this question.
>>
>> A./
>>
>> On 5/1/2015 9:48 AM, Ken Whistler wrote:
>>
>> Suzuki-san,
>>
>> On 5/1/2015 8:25 AM, suzuki toshiya wrote:
>>
>> Excuse me, there is any discussion record how UAX#14 class for halfwidth-katakana in 15 years ago? If there is such, I want to see a sample text (of halfwidth-katakana) and expected layout result for it.
>>
>> The *founding* document for the UTC discussion of the initial Line_Break property values 15 years ago was:
>>
>> http://www.unicode.org/L2/L1999/99179.pdf
>>
>> and the corresponding table draft (before approval and conversion into the final format that was published with UTR #14 -- later *UAX* #14) was:
>>
>> http://www.unicode.org/L2/L1999/99180.pdf
>>
>> There is nothing different or surprising in terms of values there. The halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in that earliest draft, as of 1999.
>>
>> What is new information, perhaps, is the explicit correlation that can be found in those documents with the East_Asian_Width properties, and the explanation in L2/99-179 that the EAW property values were explicitly used to make distinctions for the initial LB values.
>>
>> There is no sample text or expected layout results from that time period, because that was not the basis for the original UTC decisions on any of this. Initial LB values were generated based on existing General_Category and EAW values, using general principles.
>> They were not generated by examining and specifying in detail the line breaking behavior for every single script in the standard, and then working back from those detailed specifications to attempt to create a universal specification that would replicate all of that detailed behavior. Such an approach would have been nearly impossible, given the state of all the data, and might have taken a decade to complete.
>>
>> That said, Japanese line breaking was no doubt considered as part of the overall background, because the initial design for UTR #14 was informed by experience in implementation of line breaking algorithms at Microsoft in the 90's.
>>
>> You commented that the UAX#14 class should not be changed but the tailoring of the line breaking behaviour would solve the problem (as Firefox and IE11 did). However, some developers may wonder "there might be a reason why UTC put halfwidth-katakana to AL - without understanding it, we could not determine whether the proposed tailoring should be enabled always, or enabled only for a specific environment (e.g. locale, surrounding text)".
>>
>> See above, in L2/99-179. *That* was the justification. It had nothing to do with specific environment, locale, or surrounding text.
>>
>> If UTC can supply the "expected layout result for halfwidth-katakana (used to define the class in current UAX#14)", it would be helpful for the developers to evaluate the proposed tailoring algorithm.
>>
>> UAX #14 was never intended to be a detailed, script-by-script specification of line layout results. It is a default, generic, universal algorithm for line breaking that does a decent, generic job of line breaking in generic contexts without tailoring or specific knowledge of language, locale, or typographical conventions in use.
>>
>> UAX #14 is not a replacement for full specification of kinsoku rules for Japanese, in particular.
>> Nor is it intended as any kind of replacement for JIS X 4051.
>>
>> Please understand this: UAX #14 does *NOT* tell anyone how Japanese text *should* line break. Instead, it is Japanese typographers, users and standardizers who tell implementers of line break algorithms for Japanese what the expectations for Japanese text should be, in what contexts. It is then the job of the UTC and of the platform and application vendors to negotiate the details of which part of that expected behavior makes sense to try to cover by tweaking the default line-breaking algorithm and the Line_Break property values for Unicode characters, and which part of that expected behavior makes sense to try to cover by adjusting commonly accessible and agreed upon tailoring behavior (or public standards like CSS), and finally which part of that expected behavior should instead be addressed by value-added, proprietary implementations of high end publishing software.
>>
>> Regards,
>>
>> --Ken

