At 08:20 01/10/23 -0700, Mark Davis wrote: (answering David Hopwood)

>I disagree with your assessment. For example, I believe that it is a good
>thing that the non-breaking feature is suppressed -- that is irrelevant to
>IDNs.

This is a detail that can go one way or another. The fact that the current DNS has survived a few years with a hyphen, but without the non-breaking hyphen being accepted, suggests that not including it won't hurt too much. But including it won't hurt too much, either.

>However, it would make your paper much easier to examine if you removed all
>the characters that end up getting disallowed -- they are not
>counter-examples.

I think they are very important. In terms of numbers, there are I think 3165 compatibility mappings in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt. According to David's analysis, which closely matches my understanding, something of the order of 50 of these really need to be mapped. This shows that using NFKC is serious overkill. Also, there are 1866 NFC mappings. Reducing the number of mappings from 5031 to 1866 is a saving of over 60%.
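In case it helps, these figures are easy to re-derive. Below is a rough Python sketch of how such a tally can be made from UnicodeData.txt (the exact totals depend on the Unicode version of the file): field 5 of each line holds the decomposition mapping, with compatibility decompositions carrying a tag such as <wide> or <compat> and canonical decompositions carrying none.

# Count canonical vs. compatibility decomposition mappings in UnicodeData.txt.
# Exact numbers vary with the Unicode version of the file.
canonical = 0
compatibility = 0
for line in open("UnicodeData.txt"):
    fields = line.split(";")
    decomposition = fields[5]
    if not decomposition:
        continue                      # no decomposition mapping at all
    if decomposition.startswith("<"):
        compatibility += 1            # tagged: compatibility only (NFKD/NFKC)
    else:
        canonical += 1                # untagged: canonical (NFD/NFC as well)

print(canonical, compatibility, canonical + compatibility)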
>Of the characters that are left that you feel are
>problematic, there are three possibilities that the committee has to judge:
>
>A. they don't matter if they are converted

Yes, most of them will show up so rarely that it indeed doesn't matter whether they get converted or prohibited.

>B. they do matter, but they can be remapped, or prohibited*
>C. they do matter, and enough to make us abandon NFKC
>
>The downside of abandoning NFKC is that we lose the equivalence between
>thousands of characters that do represent the same fundamental abstract
>characters, such as width variants, Arabic ligatures, etc.

Some of the width variants have been identified as exceptions. How many of the Arabic ligatures will be entered by users when they e.g. type in an Arabic domain name on a keyboard? Also, please note that the concept of 'fundamental abstract character' is not something the IDN WG has really been concerned about. As David has shown, many of the compatibility (i.e. NFKC) equivalences are disputable. On the other hand, for many characters not covered by NFKC you would easily find some people claiming that they represent the same 'fundamental abstract character'. As an example, many people on this list might argue that there should be an SC/TC mapping because the simplified and the traditional variants represent one and the same 'fundamental abstract character'. What this WG has been concerned about is to avoid excluding variants that could easily be input by the user in place of some equivalent character (as distinguished from a lookalike that represents something completely different). David's analysis, which coincides with my findings, has shown that about 50 out of about 3000 compatibility (i.e. NFKC) equivalents are relevant for user input. Nobody has claimed anything to the contrary, or brought up any evidence to the contrary. If anybody has such evidence, please send it in.

>Note: a character can be deleted by the remapping phase.

Which is not a good idea except for characters where deletion doesn't make much of a difference (e.g. Arabic Tatweel, non-spacing marks, ...). If somebody typed in fooBARbaz and ended up with foobaz because B, A, and R were mapped out, that would really be quite bad.

>It can also be
>effectively prohibited *before* NFKC by simply mapping it to an invalid
>character, like space, that is not affected by normalization and ends up
>being prohibited.

Yes. If there were just *a few* characters that we wanted to prohibit before doing NFKC, while keeping most of the NFKC mappings, then that would be a reasonable idea. But as it turns out, we want to ignore/prohibit most of the characters mapped by NFKC, while only mapping very few of them.
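To make concrete what kinds of characters NFKC maps and NFC leaves alone, here is a small illustration using Python's unicodedata module, with three hand-picked examples of the width variants and Arabic presentation forms mentioned above:

import unicodedata

samples = [
    "\uFF21",   # FULLWIDTH LATIN CAPITAL LETTER A
    "\uFF76",   # HALFWIDTH KATAKANA LETTER KA
    "\uFEFB",   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
]

for s in samples:
    print(repr(s),
          "NFC:", repr(unicodedata.normalize("NFC", s)),
          "NFKC:", repr(unicodedata.normalize("NFKC", s)))

NFC leaves all three unchanged; NFKC maps the fullwidth A to plain 'A', the halfwidth katakana to ordinary katakana KA, and the lam-alef ligature to the two-letter sequence LAM + ALEF.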
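As for the mechanism Mark describes -- deleting a character in the remapping step, or mapping it to an invalid character such as space so that it survives normalization and is then caught by the prohibition step -- here is a toy sketch of how such a pipeline fits together. The table entries are made up for illustration and are not taken from any actual preparation profile:

import unicodedata

# Toy remapping table: map to "" to delete, map to " " to force prohibition.
REMAP = {
    "\u0640": "",     # ARABIC TATWEEL: deletion hardly changes the label
    "\u2460": " ",    # CIRCLED DIGIT ONE: becomes space, which is prohibited
}
PROHIBITED = {" "}    # toy prohibited set; real tables are much larger

def prepare(label):
    remapped = "".join(REMAP.get(ch, ch) for ch in label)
    normalized = unicodedata.normalize("NFKC", remapped)
    for ch in normalized:
        if ch in PROHIBITED:
            raise ValueError("prohibited character %r in %r" % (ch, label))
    return normalized

print(prepare("foo\u0640bar"))    # tatweel deleted -> 'foobar'
try:
    prepare("foo\u2460bar")       # circled one -> space -> rejected
except ValueError as e:
    print(e)

Deleting is only harmless for characters like the tatweel, where nothing visible is lost; mapping a character to a prohibited one before NFKC does work, but it only makes sense when there are few such characters to single out.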
>BTW I have very little hope that this committee will ever produce a result
>if issues keep getting re-raised time and time and time again.

How many times has this issue (NFC vs. NFKC) been raised on this list? How much background material was provided in these discussions?

>For the
>committee to ever reach some kind of resolution, people have to ask
>themselves, *not* whether they think the current situation is absolutely
>optimal (in their view) -- since it *never* will be -- but instead, whether
>they can live with the result; whether there are really any results that
>will cause significant problems in practice.

This is the first time anybody has done such a careful analysis of NFC vs. NFKC. The choice between NFC and NFKC is rather fundamental, not least because IDN will most probably serve as an example for other, similar problems. The conclusions from this analysis are quite clear, at least to me. I agree that asking whether the solution is absolutely optimal in somebody's personal view is not a question that leads to consensus. But the question of whether some change is an overall improvement that helps many while not causing problems for anybody is a very relevant question.

Regards,   Martin.
