I wrote:

> If you know that a URI is occupying an IDN-aware slot, then you're
> free to put non-ASCII domain names (escaped) into the URI.  But
> otherwise IDNA section 3.1 requirement 2 applies, and the domain name
> must contain only ASCII characters.

I was assuming that the URI occupies a slot of some sort. If it is not in a slot at all (for example, it appears in a plain text message body), then requirement 2 does not apply, and you are free to put non-ASCII domain names in it.

I have a question about the rationale behind allowing non-ASCII domain names in URIs. A URI containing a non-ASCII domain name seems to be useful only for software that is IDN-aware but IRI-unaware. If the software is IDN-unaware, it will choke on the URI. If the software is both IDN-aware and IRI-aware, then you might as well use an IRI rather than a URI, to avoid the ugly %-escaped UTF-8.

My question is, do we really expect to see much, if any, software that is IDN-aware and IRI-unaware? IDNs and IRIs are being introduced at roughly the same time, and both deal with internationalization. Isn't it reasonable to assume that almost all web-related software will either support both, or support neither? Why complicate things by introducing a new IDN-supporting URI type if it's not useful?

URIs with ACE are ugly, but work with old software. IRIs with non-ASCII domain names are pretty, but require new software. URIs with %-escaped non-ASCII domain names are both ugly and require new software, so what's the point?

Here's another model to consider: All generic URIs (URIs beginning with scheme://) continue to be IDN-unaware, and therefore the host field must contain only ASCII characters. Generic IRI syntax (any IRI beginning with scheme://) is IDN-aware, and therefore non-ASCII names are allowed in the host field (escaped).

I don't think there will be any problem converting generic IRIs to generic URIs, even if domain names happen to appear in other places (the path or query-string). In the generic syntax, the host field is the only place a domain name can appear that has client-side semantics. If domain names happen to appear anywhere else in the URI, they must be interpreted by the server, right?
If non-ASCII domain names appear in a generic IRI outside the host field, that must mean the server is IDN-aware, and therefore those domain names don't need to be converted to ACE if the IRI is converted to a URI. The server will still be IDN-aware regardless of whether it is accessed via a URI or an IRI. The host field is the only domain name that needs to be ToASCII'd when converting a generic IRI to a URI.

For converting non-generic IRIs to URIs, if you know the scheme, you can extract the domain names and apply ToASCII to them. If you don't know the scheme, then you have no way of knowing if the URI you produce might contain non-ASCII domain names that old software will choke on.

Whether we call the URI valid or invalid doesn't change the fact of whether it breaks old software. Calling the URI valid is like blaming the old software for conforming to yesterday's standard. Calling the URI invalid is like blaming the convertor for performing a conversion without enough knowledge to do it safely. The latter makes more sense to me.

How many non-generic URI schemes containing domain names are there? Suppose we require that after the introduction of IRIs all new non-generic schemes containing domain names must either be IDN-aware for both URIs and IRIs, or IDN-unaware for both URIs and IRIs. Then IRI-to-URI convertors won't need to do anything special for those schemes. The number of schemes that convertors need to know about will never increase.

AMC
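
P.S. To make the generic conversion proposed above concrete, here is a rough sketch in Python. It is only an illustration under some assumptions of mine, not a reference implementation: the function name iri_to_uri and the SAFE character set are made up for this example, Python's built-in "idna" codec (which implements the RFC 3490 ToASCII operation) stands in for ToASCII, and the authority is assumed to contain no IPv6 literal. Only the host field gets ToASCII; everything else that is non-ASCII is simply %-escaped as UTF-8 and left for the server to interpret.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Characters left untouched when %-escaping. "%" is included so that
# escapes already present in the IRI are not escaped a second time.
# (This set is an assumption made for the sketch.)
SAFE = "/:@!$&'()*+,;=%?"


def iri_to_uri(iri: str) -> str:
    """Convert a generic (scheme://) IRI to a URI.

    Hypothetical helper: ToASCII is applied to the host field only.
    Non-ASCII characters anywhere else are %-escaped as UTF-8.
    Assumes the authority contains no IPv6 literal.
    """
    parts = urlsplit(iri)

    # Split the authority into userinfo, host, and port.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")

    # The host is the only field with client-side domain name semantics,
    # so it is the only field converted to ACE.
    ascii_host = host.encode("idna").decode("ascii") if host else host

    netloc = quote(userinfo, safe=SAFE) + at + ascii_host + colon + port
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    # The domain name in the query string is escaped but not ToASCII'd.
    print(iri_to_uri("http://bücher.example/päth?site=bücher.example"))
```

Note that the domain name in the query string comes out %-escaped rather than as ACE, which is the behavior argued for above: if the server put a non-ASCII name there, the server is the one that has to understand it.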
