--On Tuesday, 18 June, 2002 08:05 -0700 Mark Davis <[EMAIL PROTECTED]> wrote:
>> The bad news is that, if this character [form] is observed in
>> a written string, or a non-Unicode coding system that does not
>> distinguish between Eszett and "small letter [Greek] beta",
>> one really needs to have language context in order to
>> determine which Unicode character to map it to, especially
>> because, given stringprep, the consequences of getting it
>> wrong are likely to be very significant.
>
> A few items here.
>
> 1. While there are many Unicode characters with very similar
> shapes, beta and eszed -- in normal fonts -- are really no
> more alike than y and gamma (or, for that matter, capital I,
> lowercase L, and 1!!). Beta typically has a descender, while
> eszed does not. See
> http://www.macchiato.com/unicode/beta-eszed.htm for a GIF
> image from three fonts.

As you, and I, and others have noted in the past, there are far better examples than beta and eszett (or eszed if you prefer). But there are certainly fonts (which I assume you would classify as "not normal") -- designed in isolation for the two languages -- whose interpretations of the two characters, if placed in text of the other language, would not catch the eye of a casual reader as being out of place. And I would assume that, in the long history of manual typesetting, there have been instances of Eszett being substituted for beta (probably in German texts containing mathematics, because a lazy typesetter was disinclined to walk across to a different font). None of this, of course, challenges your fundamental argument, with which I agree and which I hope I wrote my note carefully enough not to challenge.

> 2. There are no "non-Unicode coding systems" that unify beta
> and eszed; the language issue is irrelevant.

Sure there are. We call some of them "books". Transcription of a language into printed form involves a coding system.
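(To make the stringprep point above concrete, here is a small sketch using Python's standard `unicodedata` module. It shows that the two characters are distinct code points, and that the case-folding step which stringprep-based profiles apply treats them very differently -- so mapping a printed character to the wrong one of the two changes the resulting label substantially.)

```python
import unicodedata

# U+00DF LATIN SMALL LETTER SHARP S (Eszett) and U+03B2 GREEK SMALL
# LETTER BETA are distinct code points despite their visual similarity.
eszett = "\u00df"
beta = "\u03b2"

print(unicodedata.name(eszett))  # LATIN SMALL LETTER SHARP S
print(unicodedata.name(beta))    # GREEK SMALL LETTER BETA

# Case folding -- the kind of mapping stringprep profiles build on --
# sends Eszett to "ss" but leaves beta unchanged, so picking the wrong
# character when transcribing yields a very different folded string.
print(eszett.casefold())  # ss
print(beta.casefold())    # unchanged: the beta character itself
```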
And I have to assume, although I can claim no personal knowledge, that German schoolchildren, brought up looking at Eszett, have to be taught, when they encounter mathematical notation that uses Greek characters (if not sooner), that it is important to notice either the context or the descender -- that the two characters are not the same. These distinctions, including getting used to the variations and similarities of I-l-1 in different fonts, are bits of pattern recognition that lay people -- as distinct from font or character set experts -- rapidly learn to make, within their own language and script contexts, from context or by relatively subtle clues. I can't even spell out Arabic or Thai scripts because I don't have enough experience with the right set of clues -- my loss, but these are learned skills. But this isn't the point, so whether there are, or are not, coded character sets that unify the two is not the point either (I'll defer to your knowledge and experience on this subject, since I haven't studied the question, but statements that sound like universal negatives always scare me). Your third comment _is_ a key part of the point.

> 3. As I pointed out some time ago on this list, it is *not*
> rocket science to provide a user interface that makes it very
> clear to people if there are mixed scripts in a domain name;
> and also simple to extend it to other confusables. For a demo,
> see http://www.macchiato.com/utc/show_script.html.
>
> I have recommended and continue to recommend that the IDNA
> documents contain some wording on this, something like:
>
> "To help prevent confusion between characters that are visually
> similar, it is recommended that implementations provide visual
> indications where a domain name contains multiple scripts. Such
> mechanisms can also be used to show when a name contains a
> mixture of simplified and traditional characters, or to
> distinguish zero and one from O and l."
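(As a rough illustration of how little machinery your recommendation requires: the sketch below flags labels that mix scripts. Since the standard library does not expose the Unicode Script property, it uses the first word of each character's Unicode name as a crude stand-in; a real implementation would use the actual Script property data. The function names and the script list are my own illustrative choices, not anything from the IDNA documents.)

```python
import unicodedata

# Crude stand-in for the Unicode Script property: the first word of a
# character's Unicode name ("LATIN", "GREEK", "CYRILLIC", ...).
SCRIPT_WORDS = {"LATIN", "GREEK", "CYRILLIC", "ARABIC", "HEBREW",
                "CJK", "HIRAGANA", "KATAKANA", "HANGUL", "THAI"}

def rough_scripts(label):
    """Return the (approximate) set of scripts used in a label."""
    scripts = set()
    for ch in label:
        first_word = unicodedata.name(ch, "UNKNOWN").split()[0]
        if first_word in SCRIPT_WORDS:
            scripts.add(first_word)
    return scripts

def is_mixed_script(label):
    """True if the label appears to draw letters from multiple scripts."""
    return len(rough_scripts(label)) > 1

print(is_mixed_script("example"))    # False: all Latin
print(is_mixed_script("pa\u0440a"))  # True: Cyrillic ER amid Latin letters
```

A user interface along the lines Mark describes would only need a check like this to decide when to highlight a name.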
Mark, I ultimately have only three problems with IDNA and the IDN proposals taken as a group. The first is a technical one and applies to IDNA specifically; the others would probably extend to any in-DNS substitute:

(i) It makes assertions about applicability that I believe are over-broad and risky, to no good benefit. I think we are nearing consensus on that one and, in any event, that reviewing it again wouldn't accomplish much.

(ii) It is addressed to, and solves, a very narrow problem. We (for some definition of "we") have not been explicit, in an Internet context, about what that problem is. I believe that we should be explicit. Then, having carefully described that problem, we need to carefully evaluate whether the benefits of solving it outweigh the risks it might pose to the use of the DNS in the Internet community. If we conclude that we can't reasonably do that evaluation (e.g., because it isn't an IETF problem), then I think we are still obligated to delineate the issues and risks to the best of our ability -- at least to the extent of writing down the implications of problems and issues we already know about.

(iii) A number of items of knowledge and recommendations have surfaced in the working group -- of which your suggestion above is an excellent example -- that could be used to reduce or eliminate some of those risks to the DNS as a piece of usable Internet infrastructure. I think they need to be written down as part of WG output, if only because "this risk can be ameliorated if one does so-and-so" is a much more satisfactory statement than "there is this horrible problem and we should consider stopping progress until someone has a solution".

Of course, if "mixed scripts in domain names" are considered good things, warning when they occur won't help much.
But _that_ one, I would contend, is not an IETF problem, although I think it would be wise and responsible for us to point out that mixed-script labels pose challenges that homogeneous ones do not.

    regards,
    john
