I have looked at the parsing of hostnames and domain names, and there are some areas that I think may cause problems.
Today a hostname can be made of A-Z, 0-9 and "-", and a domain name of the same characters plus ".". When we go over to UCS we will have many more characters.

Looking at the handling of combining code points in UCS, Unicode does not handle them in a way that will be easy for many programmers to deal with. For example: SPACE, which is not allowed in an ASCII hostname and should probably not be allowed in a UCS hostname, can easily be checked for and parsed as a separator in ASCII. But in UCS it is possible to represent a spacing accent as SPACE + combining accent. This means that the UTF-8 form may contain a SPACE code point which does not represent the SPACE character. That will make parsing much more difficult.

Looking at the Unicode normalisation forms and trying IBM's ICU, NFC does not normalise SPACE + combining accent into the "spacing accent" code point. NFKC decomposes spacing accents instead of composing them.

To make things a little easier for software handling hostnames we could forbid all accents, or we could allow spacing accents but not "SPACE + combining accent". In short: the SPACE code point is only allowed when not followed by a combining character.

While NFKC may be a good idea for matching names, it is not a good idea as the normalised form of a name. NFKC removes duplicate forms of single characters (like wide A and circled A), which is good. But it also replaces a code point that represents several characters with the sequence of those characters. In some cases that may be fine (it makes no semantic difference), but in others the resulting name is not the same (for example: superscript 2 (U+00B2) is replaced by the character 2 (U+0032)).

From what I have found out, the best normalised form of a domain name is NFC with two additions: for those LETTERS that have more than one code point, the alternative code points are forbidden; and every code point sequence that can be represented by a single code point is combined into that code point.

(Note: the IDNA nameprepped name is a form used for domain name matching. It is not the same as the normalised form above.)

Dan
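
To illustrate the normalisation behaviour described above, here is a minimal sketch in Python. It uses the standard unicodedata module rather than ICU, so the results depend on the Unicode data bundled with the interpreter. The helper name space_before_combining is made up for this example, and it assumes that "combining character" means any code point whose general category is a Mark (Mn, Mc or Me). That is one possible reading of the rule, not a settled definition.

```python
import unicodedata

SPACE = "\u0020"


def space_before_combining(name):
    """Return the indexes of SPACE code points directly followed by a mark.

    Hypothetical helper. Assumption: a "combining character" is any code
    point whose general category starts with "M" (Mn, Mc or Me).
    """
    return [
        i
        for i in range(len(name) - 1)
        if name[i] == SPACE
        and unicodedata.category(name[i + 1]).startswith("M")
    ]


spacing_acute = "\u00b4"       # ACUTE ACCENT, a spacing accent
space_plus_acute = " \u0301"   # SPACE + COMBINING ACUTE ACCENT

# NFC leaves SPACE + combining accent alone; it does not compose the pair.
nfc_pair = unicodedata.normalize("NFC", space_plus_acute)

# NFKC goes the other way and decomposes the spacing accent.
nfkc_accent = unicodedata.normalize("NFKC", spacing_acute)

# NFKC also folds compatibility variants, which can change the meaning.
nfkc_wide_a = unicodedata.normalize("NFKC", "\uff21")       # FULLWIDTH A
nfkc_superscript = unicodedata.normalize("NFKC", "\u00b2")  # SUPERSCRIPT TWO

print(nfc_pair == space_plus_acute)      # is the pair unchanged by NFC?
print(nfkc_accent == space_plus_acute)   # does NFKC yield SPACE + combining?
print(nfkc_wide_a, nfkc_superscript)
print(space_before_combining("a \u0301b"), space_before_combining("a b"))
```

A parser that wants to apply the "SPACE only when not followed by a combining character" rule could reject any name for which space_before_combining returns a non-empty list, before treating the remaining SPACE code points as separators.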
