Patrik Fältström wrote:
> On 16 Jun 2015, at 22:45, Robert Edmonds wrote:
> >John Levine wrote:
> >>Can you give an example of URI RDATA where it would make sense to
> >>interpret it other than as ASCII?
> >
> >This is the FTP example from the URI RR RFC, to which the UTF-8 byte order
> >mark has been gratuitously added:
>
> Hmm...what RFC are you referring to? I can not find this in RFC 7553.

Sorry, it's not from an RFC.  I took one of the examples from RFC 7553
and modified it to show a pathological example for John's question.

> The RFC says this:
>
>    This field holds the URI of the target, enclosed in double-quote
>    characters ('"'), where the URI is as specified in RFC 3986
>    [RFC3986].  Resolution of the URI is according to the definitions for
>    the Scheme of the URI.
>
> >>I suppose to be perfectly clear we might either say "percent encode
> >>everything" or we might say "unencoded UTF-8 is allowed."  They're
> >>both unambiguous, and I expect most parsers can handle both.
> >
> >It would be very nice indeed if application developers did not have to
> >guess at the encoding of the bytes.
>
> Earlier versions of the I-D did say explicitly that UTF-8 encoded characters
> is how the Target is to be interpreted, but feedback gave that it is better
> to just reuse the same specification as URIs. I.e. the interpretation is
> according to RFC 3986 (which implies unclear where 3986 might be unclear).

I don't see anywhere in RFC 3986 where it says how to interpret an
arbitrary octet sequence as a URI.  In fact, it repeatedly emphasizes
that the sequence of characters forming a URI is decoupled from possible
encodings of that sequence into octets.  RFC 3986 §2:

    2.  Characters

    The URI syntax provides a method of encoding data, presumably for
    the sake of identifying a resource, as a sequence of characters.
    The URI characters are, in turn, frequently encoded as octets for
    transport or presentation.  This specification does not mandate any
    particular character encoding for mapping between URI characters and
    the octets used to store or transmit those characters.  When a URI
    appears in a protocol element, the character encoding is defined by
    that protocol; without such a definition, a URI is assumed to be in
    the same character encoding as the surrounding text.

    The ABNF notation defines its terminal values to be non-negative
    integers (codepoints) based on the US-ASCII coded character set
    [ASCII].  Because a URI is a sequence of characters, we must invert
    that relation in order to understand the URI syntax.  Therefore, the
    integer values used by the ABNF must be mapped back to their
    corresponding characters via US-ASCII in order to complete the
    syntax rules.

There are two encoding steps described here.  The first is the
production of a URI from its components into "URI characters", which
uses the percent-encoding scheme that everyone's familiar with to escape
URI components.  These "URI characters" are ABNF terminal values.  The
second encoding step is the conversion of these ABNF values into a
concrete octet stream.  Only this second encoding step is relevant for
the URI DNS RR, because serialized URIs have already undergone
%-encoding.

"Network ASCII" is a very common encoding for ABNF terminal values, but
not the only possible encoding.  RFC 5234 (ABNF):

    2.3.  Terminal Values

    Rules resolve into a string of terminal values, sometimes called
    characters.  In ABNF, a character is merely a non-negative integer.
    In certain contexts, a specific mapping (encoding) of values into a
    character set (such as ASCII) will be specified.

    [...]

    2.4.  External Encodings

    External representations of terminal value characters will vary
    according to constraints in the storage or transmission environment.
    Hence, the same ABNF-based grammar may have multiple external
    encodings, such as one for a 7-bit US-ASCII environment, another for
    a binary octet environment, and still a different one when 16-bit
    Unicode is used.  Encoding details are beyond the scope of ABNF,
    although Appendix B provides definitions for a 7-bit US-ASCII
    environment as has been common to much of the Internet.

    By separating external encoding from the syntax, it is intended that
    alternate encoding environments can be used for the same syntax.

    [...]

    Appendix B.  Core ABNF of ABNF

    This appendix contains some basic rules that are in common use.
    Basic rules are in uppercase.  Note that these rules are only valid
    for ABNF encoded in 7-bit ASCII or in characters sets that are a
    superset of 7-bit ASCII.

    [...]

    B.2.  Common Encoding

    Externally, data are represented as "network virtual ASCII" (namely,
    7-bit US-ASCII in an 8-bit field), with the high (8th) bit set to
    zero.  A string of values is in "network byte order", in which the
    higher-valued bytes are represented on the left-hand side and are
    sent over the network first.

RFC 3986 specifically declines to specify a particular concrete encoding
for URI characters:

    "This specification does not mandate any particular character
    encoding for mapping between URI characters and the octets used to
    store or transmit those characters."

and leaves it up to the protocol that embeds URIs to define the
encoding:

    "When a URI appears in a protocol element, the character encoding is
    defined by that protocol; without such a definition, a URI is
    assumed to be in the same character encoding as the surrounding
    text."

Unlike, say, an email message or webpage, there's no "surrounding text"
in a binary DNS response message that a consumer of URI RRs can rely on
to inform its choice of character encoding.  In fact, there are
alternative encodings of URIs contemplated in an example later in 3986:

    6.2.1.  Simple String Comparison

    [...]

    This character comparison requires that each pair of characters be
    put in comparable form.  For example, should one URI be stored in a
    byte array in EBCDIC encoding and the second in a Java String object
    (UTF-16), bit-for-bit comparisons applied naively will produce
    errors.  It is better to speak of equality on a
    character-for-character basis rather than on a byte-for-byte or
    bit-for-bit basis.  In practical terms, character-by-character
    comparisons should be done codepoint-by-codepoint after conversion
    to a common character encoding.

    [...]
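
To make the point of 6.2.1 concrete, here is a rough Python sketch.  It
is only an illustration, not something taken from either RFC: the URI,
the choice of cp037 as the EBCDIC code page, the prepended UTF-8 byte
order mark, and the try_decode helper are all assumptions made up for
the example.  It shows the same URI characters producing different
octets under different external encodings, and a BOM-prefixed octet
string reading differently depending on which encoding the consumer
guesses.

```python
# Illustrative URI only; not copied from any real URI RR.
uri = "ftp://ftp.example.com/public"

# The same URI characters under three external encodings.
ascii_octets = uri.encode("ascii")
utf16_octets = uri.encode("utf-16-be")
ebcdic_octets = uri.encode("cp037")  # assumption: cp037 stands in for "EBCDIC"

# A naive byte-for-byte comparison reports a mismatch...
octets_equal = ascii_octets == ebcdic_octets

# ...while comparing characters, after decoding each octet string with
# the encoding that produced it, reports equality.
chars_equal = (
    ascii_octets.decode("ascii")
    == ebcdic_octets.decode("cp037")
    == utf16_octets.decode("utf-16-be")
)

# Hypothetical Target octets: the ASCII form with a UTF-8 BOM prepended.
target = b"\xef\xbb\xbf" + ascii_octets


def try_decode(octets, encoding):
    """Return the decoded text, or None if the octets are invalid."""
    try:
        return octets.decode(encoding)
    except UnicodeDecodeError:
        return None


as_ascii = try_decode(target, "ascii")      # fails: 0xEF is not 7-bit ASCII
as_utf8 = try_decode(target, "utf-8")       # succeeds, keeps a leading U+FEFF
as_latin1 = try_decode(target, "latin-1")   # succeeds, three junk characters

for name, value in [("ascii", as_ascii), ("utf-8", as_utf8), ("latin-1", as_latin1)]:
    print(name, repr(value))
```

None of the three readings of the BOM-prefixed octets is obviously the
intended one, and a consumer has nothing in the DNS message itself to
tell it which to pick.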

So, IMO, RFC 7553 is underspecified: it doesn't define an unambiguous
meaning for the octets appearing in a URI RR's Target field, so it's up
to applications to decide what character encoding to apply.  Was that
the intention?

-- 
Robert Edmonds
_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop