Re: [whatwg] URL spec and IDN

2014-03-17 Thread Anne van Kesteren
On Wed, Feb 19, 2014 at 11:53 PM, Joshua Cranmer pidgeo...@verizon.net wrote:
 I've noted that the URL specification is currently rather vague when it
 comes to IDN, and has some uncertain comments about issues related to
 IDNA2003, IDNA2008, and UTS #46.

Yeah, it is a clusterfuck. I'm working with the guys behind UTS #46 on
cleaning it up, but due to vacation et al it's taking some time.


 Roughly speaking, in my experience, there are three kinds of labels:
 A-labels, U-labels, and displayable U-labels. A-labels are the
 Punycode-encoded version of the labels used for DNS (normalized to ASCII
 lower-case, naturally). U-labels are the results of converting an A-label to
 Unicode. Displayable U-labels are A-labels converted to U-labels only if
 they do not contain a Unicode homograph attack. My drawing a distinction
 between the displayable U-label and the regular kind is out of concern
 that the definition of displayable may change over time (e.g., certain
 script combinations are newly permitted/prohibited), whereas the U-label
 derived from an A-label should be constant.

Agreed. At some point we should make this clearer in the specification.


 Given these three kinds of labels, it ought to be possible (IMO) to convert
 a generic domain in any (i.e., unnormalized) format. The specification
 currently provides for a domainToASCII and a domainToUnicode function which
 naturally map to producing A-labels and U-labels, but contains a note
 suggesting that they shouldn't be implemented due to the IDNA clusterfuck.
 The way to a displayable U-label would seem to me to come most naturally
 via |new URL("http://" + domain).host|.

No, we should have a dedicated domainToUI() or some such. A parsed URL
contains A-labels. We might want to have something similar for URLs
themselves, to convert percent-encoding and such.


 Looking at the spec, it's not clear if the host, href, and other methods are
 supposed to return U-labels or A-labels (or some potential mix of the two).

It's A-labels, see http://url.spec.whatwg.org/#concept-host-parser for details.
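
A minimal sketch of that behaviour, assuming a runtime that implements the URL Standard's parser (Node.js or a current browser); the snowman domain is the one used elsewhere in this thread:

```javascript
// Sketch: a parsed URL exposes its host as A-labels, per the host parser.
const parsed = new URL("http://☃.net/");

console.log(parsed.host); // "xn--n3h.net"
console.log(parsed.href); // "http://xn--n3h.net/"
```

Implementations that predate or diverge from the standard may return something else, which is the inconsistency the original message describes.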


 I'm guessing the reason the domainTo* methods are unspecified is the
 inconsistent handling of IDNA2008 by current web browsers, ...

Right.


 * Chrome's documentation calls out ignoring STD3 rules (i.e., permitting
 more ASCII characters) and disallowing unassigned code points. IE's
 documentation does not suggest what they do here.

You want to allow e.g. _ as that is used by subdomains. However, if
you ignore STD3, you need additional checks later on to prevent
reparsing issues. The URL Standard calls out the specific code points
that are problematic here.
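
To illustrate, here is a rough sketch assuming a URL Standard compliant parser such as the one in Node.js. The helper name hostnameOrNull is made up for this example. An underscore label is accepted, while a host containing a space, or a percent-encoded slash that would decode into the host, is rejected rather than kept in a form that would reparse differently:

```javascript
// Hypothetical helper: returns the parsed hostname, or null if parsing fails.
function hostnameOrNull(input) {
  try {
    return new URL(input).hostname;
  } catch (err) {
    return null;
  }
}

// "_" is not allowed under STD3 rules but is accepted here.
const underscore = hostnameOrNull("http://_dmarc.example.com/");
// A space is one of the code points the parser refuses in a host.
const withSpace = hostnameOrNull("http://exa mple.com/");
// "%2F" percent-decodes to "/", which must not end up inside a host.
const encodedSlash = hostnameOrNull("http://exa%2Fmple.com/");

console.log(underscore, withSpace, encodedSlash);
```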


 1. Expressly identify how to normalize and process an IDN address under
 IDNA2008 + UTR #46 + other modifications that reflects reality. I'm not
 qualified to know what happens at precise edge cases here.

Yeah, this is the plan, once UTR #46 picks up some changes I proposed. See
http://www.unicode.org/review/pri264/ if you're interested.


 2. Resolve that URL should reflect U-labels as much as possible while
 placing the burden of avoiding Unicode homograph attacks on the browser
 implementors rather than JS consumers of the API.

Currently it's A-labels. Mostly because all other parts of the URL are
ASCII too.


-- 
http://annevankesteren.nl/


[whatwg] URL spec and IDN

2014-02-19 Thread Joshua Cranmer
I've noted that the URL specification is currently rather vague when it 
comes to IDN, and has some uncertain comments about issues related to 
IDNA2003, IDNA2008, and UTS #46.


Roughly speaking, in my experience, there are three kinds of labels: 
A-labels, U-labels, and displayable U-labels. A-labels are the 
Punycode-encoded version of the labels used for DNS (normalized to ASCII 
lower-case, naturally). U-labels are the results of converting an 
A-label to Unicode. Displayable U-labels are A-labels converted to 
U-labels only if they do not contain a Unicode homograph attack. My 
drawing a distinction between the displayable U-label and the regular 
kind is out of concern that the definition of displayable may change 
over time (e.g., certain script combinations are newly 
permitted/prohibited), whereas the U-label derived from an A-label 
should be constant.
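
As a rough illustration of the A-label/U-label round trip: Node.js exposes functions named domainToASCII and domainToUnicode in its url module. Treating them as equivalents of the spec's functions is an assumption of this sketch, and as far as I can tell they apply no homograph check, so the result is a plain U-label, not a displayable one.

```javascript
// Sketch using Node.js's url module (an assumption; browsers may differ).
const { domainToASCII, domainToUnicode } = require("url");

const aLabelDomain = domainToASCII("☃.net");        // Unicode -> A-labels
const uLabelDomain = domainToUnicode(aLabelDomain); // A-labels -> U-labels

console.log(aLabelDomain); // "xn--n3h.net"
console.log(uLabelDomain); // "☃.net"
```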


Given these three kinds of labels, it ought to be possible (IMO) to 
convert a generic domain in any (i.e., unnormalized) format. The 
specification currently provides for a domainToASCII and a 
domainToUnicode function which naturally map to producing A-labels and 
U-labels, but contains a note suggesting that they shouldn't be 
implemented due to the IDNA clusterfuck. The way to a displayable 
U-label would seem to me to come most naturally via |new URL("http://" + 
domain).host|.


Looking at the spec, it's not clear if the host, href, and other methods 
are supposed to return U-labels or A-labels (or some potential mix of 
the two). Running some tests appears to indicate that Firefox, Chrome, 
and IE actually all do different things:
* Firefox appears to use displayable U-labels (i.e., the result 
matches what is seen in the address bar)
* Chrome appears to always use A-labels
* IE appears to always use U-labels (on http://☃.net, the address bar 
displays http://xn--n3h.net but location.href returns http://☃.net).


I'm guessing the reason the domainTo* methods are unspecified is the 
inconsistent handling of IDNA2008 by current web browsers, 
although it appears to me that a consensus has emerged on 
implementation. Chrome and IE appear to roughly implement UTR #46 
transitional mode:

* Uses Unicode $RELATIVELY_NEW for case folding/mapping
* Processes eszett, final sigma, ZWJ, and ZWNJ according to IDNA2003 
(UTR #46's transitional mode) instead of IDNA2008
* Symbols and punctuation are allowed, in violation of IDNA2008.
* The BiDi check rules appear to follow IDNA2008 instead of IDNA2003. 
Neither Chrome's documentation nor IE's documentation is explicit about 
the changes (sufficiently so, at least, for someone like me who doesn't 
know all that much about how IDNA* is implemented but rather wants to 
make sure things work correctly and safely).
* They do not appear to enforce IDNA2008's contextual rules (again, the 
documentation is slightly unclear).
* Chrome's documentation calls out ignoring STD3 rules (i.e., permitting 
more ASCII characters) and disallowing unassigned code points. IE's 
documentation does not suggest what they do here.
Firefox still implements IDNA2003, but this is only because the person 
who would be implementing IDNA2008 lacks the time.
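
One way to probe which eszett handling a given implementation uses is sketched below. It assumes Node.js's url.domainToASCII and deliberately does not predict the answer: transitional processing maps "ß" to "ss", while nontransitional processing keeps it and Punycode-encodes the label.

```javascript
// Sketch: classify an implementation by how it converts "faß.de".
const { domainToASCII } = require("url");

const converted = domainToASCII("faß.de");

let mode;
if (converted === "fass.de") {
  mode = "transitional";    // eszett mapped to "ss", as in IDNA2003
} else if (converted === "xn--fa-hia.de") {
  mode = "nontransitional"; // eszett preserved, as in IDNA2008
} else {
  mode = "unknown";
}

console.log(converted, mode);
```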


Given these facts, I'd like to propose changes to the URL spec to better 
specify and more reliably reflect IDN processing as is currently done:
1. Expressly identify how to normalize and process an IDN address under 
IDNA2008 + UTR #46 + other modifications that reflects reality. I'm not 
qualified to know what happens at precise edge cases here.
2. Resolve that URL should reflect U-labels as much as possible while 
placing the burden of avoiding Unicode homograph attacks on the browser 
implementors rather than JS consumers of the API.
3. The comments about avoiding implementation on domainTo* methods 
should be dropped.
4. Tests should be added to ensure that domain labels are processed 
correctly. There is already a testsuite for UTR #46 processing available 
at http://www.unicode.org/Public/idna, which I suspect could be adapted 
into a testsuite for processing domain labels.
5. Browser vendors should implement domainToASCII and domainToUnicode 
at least as well as they internally implement these methods, even in 
the absence of precise definitions of how to handle IDN. In any case, I'd 
rather have IDN handling that matches how my browser implements it 
internally than one that matches an official, blessed spec, if the two 
diverge. Note that this includes deciding when U-labels are 
safe to display (assuming that browsers are competent enough to 
already think about this).
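
For point 4, a table-driven check might look roughly like the sketch below. The cases are hand-picked examples, not taken from the Unicode test files, and the use of Node.js's url.domainToASCII as the function under test is an assumption.

```javascript
// Sketch of a table-driven domain test; the cases are illustrative only.
const { domainToASCII } = require("url");

const cases = [
  { input: "☃.net", expected: "xn--n3h.net" },       // Unicode label
  { input: "EXAMPLE.com", expected: "example.com" }, // case mapping
  { input: "xn--n3h.net", expected: "xn--n3h.net" }, // already A-labels
];

// Collect every case whose actual result differs from the expectation.
const failures = cases
  .map((c) => ({ ...c, actual: domainToASCII(c.input) }))
  .filter((c) => c.actual !== c.expected);

console.log(failures.length === 0 ? "all cases pass" : failures);
```

A real suite would load its cases from the published test data instead of a hard-coded table.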


--
Beware of bugs in the above code; I have only proved it correct, not tried it. 
-- Donald E. Knuth