At 08:59 01/10/23 -0400, John C Klensin wrote:
>While reading David's NFC versus NFKC note, I had an odd thought.
>I've been dissatisfied, as have many others, with the notion that
>TC <-> SC mapping is analogous to case mapping in Roman-derived
>alphabets. Arguments about whether that analogy applies have
>helped to make the discussion of what is, to me, a very difficult
>topic even more obscure.
>
>To quote the Unicode standard, "Serbo-Croatian is a single
>language with paired alphabets". This is a definition with which
>native speakers of the language agree (although, when tensions in
>the Balkans are high, I assume some of them are not completely
>happy about it). Would it be constructive to think about Chinese
>as "one language, two alphabets"?

It is, but it has problems similar to those of the 'case' analogy: it
can explain some parts of the problem, and can point to some kinds of
solutions, but it won't explain or help with the full problem or the
full solution.

>If it is, then nameprep or a
>related process ought to be able to map back and forth between
>the Roman-based characters usually used in Croatian contexts and
>the Cyrillic characters usually used in Serbian ones (people do
>this all the time, and certainly expect the two to match).

On a good search engine, when searching for items in Serbo-Croatian,
expecting a match is very reasonable. On the other hand, it's very easy
for Serbo-Croatian speakers to understand when and where a match isn't
happening, and what to do to find things nevertheless (use the other
alphabet). Also, there won't be mix-ups inside single words, although I
could expect some domain names such as SERBO-croat or foo-FOO (uppercase
standing for Cyrillic) to turn up.

>Of course, the analogy is not exact (these things never are):
>perhaps partially because there are just fewer characters to deal
>with, there are no cases in which there are potential ambiguities
>in the mappings. On the other hand, one problem is more severe
>than in the Chinese case: in the general case, a Serbo-Croatian
>string written in Cyrillic cannot be distinguished, on a
>character string basis, from uses of Cyrillic for other languages
>(e.g., Russian), which should not be mapped and, similarly, a
>string written in Roman-based characters cannot be distinguished,
>on a character string basis, from the Roman-based characters of
>another language (English?) which, again, cannot be mapped.
>
>In either case, the mapping becomes readily plausible if the
>language, in addition to the content of the character string, is
>known, but is hard to think about without causing side-effects in
>other languages if not.

I agree. For Han ideographs, the other languages affected include
Japanese and Korean.
Also, adding some kind of language identifier is not an option, because
billboards, napkins, notebooks, ... don't carry language information
(except implicitly, maybe).

Regards, Martin.
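
To make the side-effect problem discussed above concrete, here is a minimal sketch of what a script-folding comparison might look like. It is purely illustrative: the table and the names CYR_TO_LAT, fold_to_latin and labels_match are hypothetical and are not part of nameprep or of any proposal in this thread. It assumes lowercase Serbian Cyrillic is folded to its usual Latin equivalents before comparison.

```python
# Hypothetical illustration only; not part of nameprep or any IDN draft.
# Serbian Cyrillic letters and their usual Latin counterparts.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def fold_to_latin(label: str) -> str:
    """Lowercase the label, then replace Serbian Cyrillic letters by Latin ones.

    Characters outside the table are passed through unchanged.
    """
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in label.lower())


def labels_match(a: str, b: str) -> bool:
    """Compare two labels after folding both to Latin."""
    return fold_to_latin(a) == fold_to_latin(b)


if __name__ == "__main__":
    # The intended effect: Serbian Cyrillic matches its Latin spelling.
    print(labels_match("Београд", "Beograd"))
    # The side effect: a Russian word gets folded as well, because the
    # character string alone does not say which language it is in.
    print(labels_match("Москва", "moskva"))
    # A mixed-script label of the foo-FOO kind collapses to one script.
    print(fold_to_latin("foo-ФОО"))
```

The fold cannot tell a Serbian string from a Russian one, so any Cyrillic label made up of letters in the table is mapped, whether or not that is appropriate for its language.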
