> The bad news is that, if this character [form] is observed in a
> written string, or a non-Unicode coding system that does not
> distinguish between Eszett and "small letter [Greek] beta", one
> really needs to have language context in order to determine
> which Unicode character to map it to, especially because, given
> stringprep, the consequences of getting it wrong are likely to
> be very significant.

A few items here.

1. While there are many Unicode characters with very similar shapes, beta and eszed -- in normal fonts -- are really no more alike than y and gamma (or, for that matter, capital I, lowercase L, and 1!!). Beta typically has a descender, while eszed does not. See http://www.macchiato.com/unicode/beta-eszed.htm for a GIF image from three fonts.

2. There are no "non-Unicode coding systems" that unify beta and eszed; the language issue is irrelevant.

3. As I pointed out some time ago on this list, it is *not* rocket science to provide a user interface that makes it very clear to people if there are mixed scripts in a domain name; it is also simple to extend it to other confusables. For a demo, see http://www.macchiato.com/utc/show_script.html.

I have recommended and continue to recommend that the IDNA documents contain some wording on this, something like:

"To help prevent confusion between characters that are visually similar, it is recommended that implementations provide visual indications where a domain name contains multiple scripts. Such mechanisms can also be used to show when a name contains a mixture of simplified and traditional characters, or to distinguish zero and one from O and l."

Mark
__________
http://www.macchiato.com
◄ “Eppur si muove” ►

----- Original Message -----
From: "John C Klensin" <[EMAIL PROTECTED]>
To: "Dan Oscarsson" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, June 18, 2002 07:12
Subject: Re: [idn] Re: IDNA: is the specification proper, adequate, and complete?

> --On Tuesday, 18 June, 2002 11:41 +0200 Dan Oscarsson
> <[EMAIL PROTECTED]> wrote:
>
> > While there is no doubt about the above, I am not sure about
> > the nameprep specification that 00DF (small letter sharp s)
> > should be mapped to "ss". I am not sure how Germans handle
> > this character. Do they always replace double s with it? Or
> > only on some special words? If they do not generally do this,
> > the mapping should not be done.
> > It is somewhat like the fact
> > that the Greek version of Latin A is not mapped to the Roman
> > version of Latin A, even though their origin is the same Latin
> > A and they look alike.
>
> This was discussed at length early in the history of the WG. By
> convention, it is always possible to replace Eszett with "ss",
> and the upper-case form of Eszett is always "SS", but there are
> many words in which "ss" appears for which a substitution back
> to Eszett is not appropriate. In other words, this is a
> one-way, non-reversible (without word-context), mapping. The
> good news is that stringprep gets it right (or at least
> consistent with other WG decisions) given coding as 0x00DF,
> which clearly identifies the character as Eszett.
>
> The bad news is that, if this character [form] is observed in a
> written string, or a non-Unicode coding system that does not
> distinguish between Eszett and "small letter [Greek] beta", one
> really needs to have language context in order to determine
> which Unicode character to map it to, especially because, given
> stringprep, the consequences of getting it wrong are likely to
> be very significant.
>
> But this is not new news, or even a new example. We may see it
> differently as our sensitivities to the issues evolve, but the
> bottom line is that Unicode is not especially well adapted to
> coding of strings that appear without language, or even word,
> contexts in non-Unicode form. Whether that form is a
> pre-existing coding system, or a sign on the side of a bus,
> there are likely to be examples of problematic characters.
> Unfortunately, there are no standardized alternatives for a UCS
> with even near-global applicability. And it appears obvious to
> me that, while a hypothetical Unicode alternative could make
> different choices, it wouldn't eliminate these problems, but
> rather just create a different set of scary examples.
>
> I think there are only two ways out of this, and neither
> involves either changes in stringprep or more examples of this
> type. For the latter, I think almost everyone with a strong
> desire to understand the problem has done so and that the odds
> of convincing others are fairly low.
>
> The alternatives are:
>
> (1) To define the problem this working group is trying to solve
> in a way that causes these problems to be non-issues.
> Personally, every time I try to do that, I end up with what feel
> to me to be silly states, but it is clear that I'm in the
> minority of those speaking up in the WG. For example, one could
> say, and I think we essentially have, that the WG is solving the
> problem of getting things into and out of the DNS given that the
> Unicode coding form is accurately known. This implies that any
> applications which can't succeed in making the translations are
> going to be in very bad trouble and that we offer them no help
> or hope -- it isn't our problem. Personally, I think that, if
> the WG's position and recommendations are based on that model,
> we should be obligated to write it down and make it explicit in
> our documents before they go onto the standards track: we owe
> that much to those who think we are solving any of a number of
> more general internationalization problems.
>
> (2) To just give it up. The DNS effectively imposes "no
> language information" and "no script information" restrictions
> on us. Its octet-based comparison rules effectively prevent us
> from imposing conventions that would permit guessing anything
> from context, nor can we prohibit a string that contains 0x00DF
> in the middle of a string of Greek, or for that matter Arabic,
> characters. In the Arabic case, it would at least stand out as
> strange; in the Greek one, humans would almost certainly mistake
> it, in displayed form, for lower-case beta. Given the
> restrictions, these presentation-relationship problems have no
> in-DNS solution.
>
> My own personal position on this is presumably well-known at
> this point: we have tried very hard to solve a "DNS
> internationalization" problem and have ended up with a number of
> extremely convincing demonstrations (this example not least
> among them) that it is overconstrained and can't be solved. If
> one accepts that, then there is a strong case for saying "sorry,
> whatever it is people wish for, and can make work in restricted
> contexts, DNS labels for common, existing, RRs and applications
> are limited to ASCII-based protocol elements and we had best go
> solve the real problem and requirement in some less-constrained
> environment".
>
> While others clearly disagree (and are probably the majority), I
> also don't believe it is appropriate for IETF to adopt a
> protocol without clear evidence that it can be implemented and
> used by at least a few of the applications that are anticipated
> for it -- applications that need to deal with presentation
> forms, operating systems, and other aspects of the real world.
> But, again, if we are going to do it, or if we see some
> restricted applications for which this _does_ provide a useful
> part of a solution, I believe that we are obligated to explain
> exactly what problem we are solving, so as to not inadvertently
> surprise implementors and users with scenarios and problems that
> we understand. If we didn't know about these problems, it might
> be different, but we certainly do.
>
> But I do think those are the issues that the WG (and really, at
> this stage, the IESG) should be discussing. And, if there are
> new issues, I, at least, would like to understand them. But
> more examples or transcription or transcoding difficulties
> won't, I think, help: we already have enough examples to kill a
> dozen protocols if the WGs considered them relevant.
> More
> examples, or repeating of older ones, would not persuade anyone
> who is convinced that the ones documented so far are not
> relevant that they are suddenly relevant.
>
> Just my opinion.
>
>      john
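
For readers who want to try the two technical points in this thread, here is a rough Python sketch. It makes two assumptions that are not from either message: `str.casefold()` is used as a stand-in for the case-folding step of nameprep (it is not the full stringprep profile), and the first word of a character's Unicode name is used as a crude substitute for the real Unicode Script property. The helper names `script_of`, `scripts_in`, and `is_mixed_script` are hypothetical.

```python
import unicodedata

def script_of(ch):
    """Crude script guess: the first word of the character's Unicode name.

    Assumption: this only approximates the Unicode Script property, which
    the standard library does not expose directly.
    """
    if ch.isdigit() or ch in "-.":
        return "COMMON"  # digits, hyphen, and dot are shared across scripts
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def scripts_in(label):
    """Return the set of (approximate) scripts used in a label."""
    return {script_of(ch) for ch in label} - {"COMMON"}

def is_mixed_script(label):
    """True if the label draws on more than one script."""
    return len(scripts_in(label)) > 1

# One-way mapping: U+00DF folds to "ss" and upper-cases to "SS",
# but nothing maps "ss" back to U+00DF without word context.
print("\u00df".casefold(), "\u00df".upper())
print("stra\u00dfe".casefold() == "strasse".casefold())

# U+00DF placed between Greek alpha and gamma: the label is flagged as mixed,
# while the all-Greek label alpha-beta-gamma is not.
print(sorted(scripts_in("\u03b1\u00df\u03b3")), is_mixed_script("\u03b1\u00df\u03b3"))
print(is_mixed_script("\u03b1\u03b2\u03b3"))
```

A user interface along the lines suggested above could use a check like `is_mixed_script` to decide when to add a visual indication to a displayed domain name. A production implementation would presumably rely on proper script data and a confusables table rather than character names.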
