Soobok Lee <[EMAIL PROTECTED]> wrote:

> I have a punycode label of length 63 octets:
> L1: zq--o39AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
> L2=ToUnicode(L1) produces: U+AC00 x 56 times ( Hangul "KA" repeated 56 times)
>
> But this L2 can be encoded in various unicode/legacy encodings into
> various lengths of octets:
>
> UTF8 : 3 x 56 = 168 octets
> UCS2 : 2 x 56 = 112 octets
> UCS4 : 4 x 56 = 224 octets
> KSX1001/EUC-KR : 2 x 56 = 112 octets
>
> Many internet applications impose/assumes the 63-octets-limit of
> label lengths.

IDN-unaware applications use this simple 63-octet limit. These
applications also assume that the domain label is ASCII. IDN-aware
applications will be careful to use the ASCII form when talking to
IDN-unaware applications. Applications that use non-ASCII
representations will know the more complex syntax rule for non-ASCII
labels (namely, that the label is valid if and only if ToASCII can be
applied to it without failing).

> From implementators' point of view, more precise specificiation is
> needed about whether IDN label/FQDN has *NEW* length restrictions in
> various char encodings

Section 2 defines "internationalized label" as a label to which the
ToASCII operation can be applied without failing. There is no other
restriction on IDN label syntax.

> the implementors have practical security-related need to impose some
> limits on the iDN lables in non-ACE encodings. (for example, to avoid
> buffer overflow errors due to expanded ToUnicode labels)

That's true. A cursory examination of the Punycode algorithm reveals
that each ASCII character can represent at most one code point;
therefore an internationalized label can represent at most 63 code
points, whether it's ACE or not. A given encoding uses a bounded
number of octets per code point, so you can allocate your buffers
based on that.

> The unit of length restriction matters: # of code points or # of
> octets ? That should be made clearer. RFC1035 uses "octets", not a
> character/code point.

RFC 1035 limits domain labels to 63 octets, but RFC 1035 predates
IDNA, and it speaks under the explicit assumption that text is ASCII.
Because DNS is IDN-unaware, all internationalized labels in DNS are in
their ASCII forms. For these reasons, the 63-octet limit applies only
to the ASCII forms of internationalized labels.

IDNA does not introduce any new length restrictions. The 63-octet
limit on ASCII labels is the only length restriction on
internationalized labels.
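
To make the buffer-sizing point above concrete, here is a minimal
Python sketch. The table of worst-case octets per code point and the
helper name max_label_octets are illustrative assumptions, not part of
any specification. The utf-16-be and utf-32-be codecs stand in for
UCS2 and UCS4; for U+AC00 they produce the same octet counts.

```python
# Upper bound on code points in an internationalized label, taken from
# the argument above: at most one code point per ASCII octet, 63 octets.
MAX_LABEL_CODE_POINTS = 63

# Worst-case octets per code point (assumed table, for illustration only;
# extend it for whichever encodings an application actually supports).
MAX_OCTETS_PER_CODE_POINT = {
    "utf-8": 4,
    "utf-16-be": 4,  # a surrogate pair for code points above U+FFFF
    "utf-32-be": 4,
    "euc_kr": 2,
}


def max_label_octets(encoding: str) -> int:
    """Return a safe buffer size for one label in the given encoding."""
    return MAX_LABEL_CODE_POINTS * MAX_OCTETS_PER_CODE_POINT[encoding]


# The label from this thread: U+AC00 repeated 56 times.
label = "\uac00" * 56
sizes = {enc: len(label.encode(enc)) for enc in MAX_OCTETS_PER_CODE_POINT}

for enc, size in sizes.items():
    print(f"{enc}: {size} octets (buffer bound {max_label_octets(enc)})")
```

The measured sizes agree with the figures quoted earlier in the thread
(168, 112, 224 and 112 octets), and each fits within the bound computed
for its encoding.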

> Then, U+AC00 x 56 times (in my previous posting) is a valid label
> conforming to RFC1035 ?

No, it's not, and that's why IDNA requires that it be converted to its
ASCII form before being passed into an IDN-unaware protocol like DNS.

> UTF8-encoded IDN labels are not governed by RFC1035 length
> restrictions ?

Not directly. The 63-octet limit applies to the ASCII form, not the
UTF-8 form. It would be absurd to apply the 63-octet limit to every
possible encoding form. You'd have to transcode a label into every
possible encoding just to check whether it's valid.

> IDNA contains brand new length restrictions for 8bit labels which
> obsoletes RFC1035 ?

No, it contains no new length restrictions. The RFC 1035 restriction
on the ASCII form is still the only restriction on the length of
internationalized labels.

AMC
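
As a footnote to the validity rule discussed in this thread (a label
is valid if and only if ToASCII can be applied to it without failing),
the check can be sketched in Python with the standard library's
encodings.idna.ToASCII. Two caveats: that function implements the
published IDNA (RFC 3490) and therefore uses the "xn--" ACE prefix
rather than the "zq--" prefix shown in the example label, and the
helper name is_internationalized_label is made up for this sketch.

```python
from encodings.idna import ToASCII


def is_internationalized_label(label: str) -> bool:
    """Return True if ToASCII succeeds on the label, False otherwise.

    The only length check happens inside ToASCII, on the ASCII form;
    the length of the label in any other encoding is never examined.
    """
    try:
        ToASCII(label)
    except UnicodeError:
        return False
    return True


# 56 repetitions of U+AC00 fit: the ASCII form is exactly 63 octets.
ok_label = "\uac00" * 56
# One more repetition pushes the ASCII form past the 63-octet limit.
too_long_label = "\uac00" * 57

print(is_internationalized_label(ok_label), len(ToASCII(ok_label)))
print(is_internationalized_label(too_long_label))
```

Note that the 56-character label is 168 octets in UTF-8 and still
passes, because only the ASCII form is measured.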
