James has been needling me to put together a second summary of the naming rules in time for the IETF meeting. I have been extremely busy lately (it is 3 am in a strange hotel room), but I wanted to at least scratch together enough material for the concepts to be tangible.
The following text is by no means complete. It's hardly just begun. However, it illustrates what the scope will be, exposes some of the open issues, and may be usable as a touchstone to see if this whole IDN thing is going to work or not. What I mean by that is, if systems are going to work with IDNs in their raw form (not encoded form, but raw form) these are the rules they will have to work with. If these rules are too complex, the whole approach has to be reconsidered. That will affect a lot of other things, including ACE.

The basic idea here is to declare formal data-types for labels, and to incorporate the data-types into syntaxes for applications and protocols to use when they need to interact with domain names.

1. Summary

This memo describes two sets of definitions which are necessary for the consistent and reliable use of internationalized domain names across the Internet.

First and foremost, this memo specifies the rules which govern the structure and syntax of internationalized domain names in various scenarios, and also describes their legitimate characters and any normalization which may be required.

Secondarily, this memo also clarifies and extends usage rules of common resource records so that internationalized domain names can be stored and exchanged (either as resource record owner domain names or as resource record data) in a form which is consistent across all usage environments.

2. Introduction

There are many issues which affect the characters that are desirable for use in DNS domain names. Among these considerations are obvious aspects such as breadth, as well as less-obvious aspects such as normalized forms of particular character sequences, comparison efficiencies, and more.
The general consensus of the IDN working group is that domain names should use a mildly-restricted subset of the character codes and arrangement sequences which are documented in the UCS for use with languages, as this subset excludes non-verbal symbols and spurious punctuation which are likely to be problematic, while still allowing international domain names to be created. Furthermore, the consensus is that these character sequences should be normalized and converted to lowercase [in that order?] wherever this is possible, since this will provide the tightest syntactical representation of the supported characters with the least amount of ambiguity.

While both of those objectives are highly desirable (and are met in most of the scenarios), there are many instances where these objectives are incompatible with existing practice. For example, existing (STD13-compliant) DNS implementations are allowed to use domain names which contain any eight-bit character code (0x00 through 0xFF). Some protocol models specifically require the use of punctuation (SRV requires underscore, for example). Some resource records can contain domain names that combine both of these elements (SOA and RP both provide email addresses as domain name labels, and those can contain punctuation or case-specific US-ASCII letters).

In order to accommodate these divergent requirements, this memo describes multiple types of domain name labels, including their valid characters, any case-conversions and/or normalizations which may be required, and so forth. Furthermore, in order to ensure that these rules are consistently implemented (and to minimize damage when they are not), this memo also states which label data-types are valid for use with many of the common resource records.
Cumulatively, this means that a system which attempts to use an internationalized domain name for a specific purpose will have to be aware of the rules which govern the resource record which provides that service, and will have to be aware of the rules which govern the domain name data-types which are valid for that resource record. For example, if an application knows that an internationalized domain name will be used for a forward lookup, it will have to be aware of the label data-types that are usable with A (or AAAA) resource records, and must ensure that the domain name is processed (normalized and lower-cased, in this example) before it is used.

NOTE: Legacy systems which use a backwards-compatible encoding scheme for access to resources with internationalized domain names will not be required to perform any of these tests. However, systems which embrace internationalized domain names as specific data (e.g., any system which encodes or decodes an internationalized domain name as explicit data) will need to be aware of these issues and will likely be required to enforce their usage.

3. Domain Names and Label Data-Types

An internationalized domain name is a sequence of labels which are encapsulated in a message. The message may provide the labels as separate units of data (as is the case with DNS), or may provide them as a series of dot-separated textual strings (as is the case when domain names are "written-out" in protocol or application data streams).

In global terms, an internationalized domain name has the following characteristics:

* Series of labels (1*label)

* Maximum cumulative length of 255 UCS character codes (not necessarily codes with matching characters, and most definitely not octets or any encoded representation). This limit includes any separators which may be provided (such as the full-stop character commonly used as a separator when the domain name is written), and also includes one character for the root domain (the trailing dot).
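The global characteristics above can be sketched as a small check. This is only an illustration under stated assumptions: the names MAX_NAME_CODES, split_labels, name_length, and check_domain_name are hypothetical and do not come from this memo, a Python string is taken to be a sequence of UCS character codes, and the name is assumed to be written out with full-stop separators and no escaped dots. Rejecting an empty string between two dots is also an assumption, based on the 1*label requirement.

```python
from typing import List

MAX_NAME_CODES = 255  # counted in UCS character codes, not octets


def split_labels(name: str) -> List[str]:
    """Split a written-out domain name into labels, dropping one trailing dot."""
    if name.endswith("."):
        name = name[:-1]
    return name.split(".")


def name_length(labels: List[str]) -> int:
    """Count label codes, one separator between labels, and one for the root."""
    return sum(len(label) for label in labels) + (len(labels) - 1) + 1


def check_domain_name(name: str) -> bool:
    """Return True if the name has at least one label and fits the 255 limit."""
    labels = split_labels(name)
    if any(len(label) == 0 for label in labels):
        return False  # assumption: an empty string is not a label
    return name_length(labels) <= MAX_NAME_CODES
```

Because the count is taken over character codes, a label such as "b\u00fccher" contributes six to the total, regardless of how many octets any encoded form would need.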
The labels that make up a domain name will vary according to the contextual use of the domain name.

3.1. Opaque Labels

Some functions can use domain names which consist of unstructured or unknown labels. For example, a TXT resource record can describe anything, and as such, it can use any sequence of UCS characters for its owner domain name.

Opaque labels require no processing on the part of the application which is using the domain name. It is the responsibility of the user to provide the domain name to the application in its correct case and/or normalization form.

Opaque labels have the following characteristics:

* Any valid UCS character code (not necessarily a valid UCS character).

* Minimum length of one UCS character code.

* Maximum length of 63 UCS character codes.

NOTE: Even though a domain name may sometimes consist of a variable number of opaque labels, most domain names will also contain at least some host labels. In those cases, the entire domain name should be provided as a series of opaque labels, and the host labels should be determined beforehand. For example, a CNAME resource record can reference anything, including an A RR that consists entirely of host labels, or a TXT RR that consists of a mixture of opaque and host labels. As such, it will depend on the formats in use by the alias target, and will inherit those attributes.

3.2. Host Identifier Labels

Most functions will use domain names to identify a host, either directly or indirectly. For example, a host may be identified by a relative domain name which consists of only a local label, or by an FQDN which contains a series of host labels. Since all forms must be supportable, all namespace delegation functions also use the host label syntax.

The UCS characters provided in host labels are required to be converted to lowercase and normalized according to the rules in [nameprep] before they are processed.
Servers are likely to treat such labels as exact matches of the encoded data, so it is imperative that applications perform this work before they encode the label into a DNS query.

Host labels are used for any lookups, protocol actions, or message formats which specifically make use of internationalized domain names for host identification purposes.

Host labels have the following characteristics:

* UCS characters from the following ranges:

    "letters" [need a property]
    characters with number property [?]
    characters with diacritical mark property [?]
    hyphen-minus (U+002D)

* MUST be converted to lowercase according to [nameprep].

* MUST be normalized according to [nameprep].

* First and last characters in the label MUST NOT be a diacritical mark or hyphen-minus.

* Minimum length of two characters.

* Maximum length of 63 characters.

3.3. ASCII Labels

Some functions require labels that contain extended punctuation, but which also require case-neutral comparisons. The most readily apparent of these usages is the SRV resource record, which makes use of the underscore character (U+005F) and case-neutral US-ASCII in the owner labels.

ASCII labels have the following characteristics:

* Any printable character from US-ASCII (0x21 through 0x7E, inclusive).

* SHOULD be converted to lowercase as specified in [nameprep] (note that servers are required to perform case-neutral comparisons, but certain tools will likely prefer to generate and use lower-case wherever possible, so lowercase is the preferred form). All comparison operations on these domain names MUST be performed in a case-neutral form.

* Minimum length of one character.

* Maximum length of 63 characters.

NOTE: some resource records may define tighter restrictions.

NOTE: Even though a domain name may sometimes consist of a variable number of ASCII labels, most domain names will also contain at least some host labels.
In those cases, the entire domain name should be provided as a series of opaque labels, and the ASCII and host labels should be determined beforehand.

3.4. Mailbox Labels

Some functions provide SMTP mailboxes as labels within domain names. For example, the SOA and RP resource records both provide email addresses, with the first label providing the mailbox (local-part) of the address, and with the remainder of the labels providing the delivery domain of the address.

In order for these resources to be accessible, applications must process labels which are known to contain email addresses through these rules. This means that data must be provided in a non-normalized, non-lowercased form, and must be restricted to the range of characters which are valid, as specified in section XX of RFC 2822.

Until RFC 2822 is deprecated or until such a time as UCS characters can be stored in the mailbox portion of Internet standard email addresses, the mailbox label is to be processed according to the rules set forth in RFC 2822.

There are two additional rules which govern this data-type:

* Minimum length of one character.

* Maximum length of 63 characters.

NOTE: Mailbox labels can contain a large number of special characters such as spaces or full-stop. These characters may require escaping as described in section XX of this document.

NOTE: Mailbox labels are NOT a subset of the ASCII labels. Mailbox labels are case-sensitive, while ASCII labels are case-neutral.

4. Resource Records

The following structure is used to describe resource records and their usage of internationalized domain names and labels.
    <owner domain name labels> <mnemonic> <[data] [data] [...]>

A, always provides a host identifier

    <1*host> <A> <[IPv4 address]>

AAAA, always provides a host identifier

    <1*host> <AAAA> <[IPv6 address]>

CNAME, can reference anything, can target anything

    <1*opaque> <CNAME> <[1*opaque]>

NS, references a host, provides a host identifier

    <1*host> <NS> <[1*host]>

SOA, references a host (delegation), provides host identifier, email address, and custom data

    <1*host> <SOA> <[1*host] [1mailbox (*host)] [serial] [refresh] [retry] [expire] [ttl]>

WKS, always provides a host identifier

    <1*host> <WKS> <[XX] [XX]>

PTR, can reference anything, must inherit target attributes

    <1*opaque> <PTR> <[1*opaque]>

HINFO, references a host, provides RR-specific data

    <1*host> <HINFO> <[hardware] [opsys]>

MX, references a host, provides a preference and a host identifier

    <1*host> <MX> <[preference] [1*host]>

TXT, can reference anything, provides free-text data

    <1*host> <TXT> <[text]>

RP, can reference anything, provides email address and a pointer to a TXT RR

    <1*opaque> <RP> <[1mailbox (*host)] [1*opaque]>

SRV, references a protocol (which is specified using the ASCII data-type), provides preference values and a host identifier

    <1*ASCII> <SRV> <[priority] [weight] [port] [1*host]>

[NOTE: cannot define <2ASCII *HOST> because not all SRV protocol labels are just _service._transport]
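One way an application might carry these rules around is as a lookup from mnemonic to label data-types, paired with a checker per data-type. The following is a rough sketch under several assumptions, not a definitive implementation. All names in it (is_opaque_label, is_ascii_label, is_host_label, VALIDATORS, RR_LABEL_TYPES, owner_is_valid) are hypothetical. The Unicode general categories L*, N*, and M* stand in for the "letters", number, and diacritical mark properties that section 3.2 still leaves open. Host labels are assumed to have already been lowercased and normalized per [nameprep], and mailbox labels are only named, since the RFC 2822 rules are not implemented here. The table is transcribed from the entries as listed above.

```python
import unicodedata


def is_opaque_label(label):
    """Opaque: any UCS character codes, one to 63 of them."""
    return 1 <= len(label) <= 63


def is_ascii_label(label):
    """ASCII: printable US-ASCII only (0x21 through 0x7E), one to 63 characters."""
    return 1 <= len(label) <= 63 and all(0x21 <= ord(ch) <= 0x7E for ch in label)


def _host_kind(ch):
    """Classify a character for host labels; None means it is not allowed."""
    if ch == "\u002d":
        return "hyphen"
    major = unicodedata.category(ch)[0]  # assumption: general category as the property
    return {"L": "letter", "N": "number", "M": "mark"}.get(major)


def is_host_label(label):
    """Host: letters, numbers, marks, hyphen-minus; two to 63 characters."""
    if not 2 <= len(label) <= 63:
        return False
    kinds = [_host_kind(ch) for ch in label]
    if None in kinds:
        return False
    # First and last characters must not be a diacritical mark or hyphen-minus.
    return kinds[0] not in ("mark", "hyphen") and kinds[-1] not in ("mark", "hyphen")


VALIDATORS = {"opaque": is_opaque_label, "ASCII": is_ascii_label, "host": is_host_label}

# mnemonic -> (owner label type, label types of domain names carried in the data)
RR_LABEL_TYPES = {
    "A": ("host", []),
    "AAAA": ("host", []),
    "CNAME": ("opaque", ["opaque"]),
    "NS": ("host", ["host"]),
    "SOA": ("host", ["host", "mailbox+host"]),
    "WKS": ("host", []),
    "PTR": ("opaque", ["opaque"]),
    "HINFO": ("host", []),
    "MX": ("host", ["host"]),
    "TXT": ("host", []),
    "RP": ("opaque", ["mailbox+host", "opaque"]),
    "SRV": ("ASCII", ["host"]),
}


def owner_is_valid(mnemonic, name):
    """Check every label of a written-out owner name against the RR's owner type."""
    labels = (name[:-1] if name.endswith(".") else name).split(".")
    check = VALIDATORS[RR_LABEL_TYPES[mnemonic][0]]
    return all(check(label) for label in labels)
```

With this sketch, owner_is_valid("SRV", "_http._tcp.example.com") passes because every label is printable US-ASCII, while the same name fails as an A owner because underscore is not a host label character.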
