Re: [DNSOP] Character encoding of URI Target RDATA?

Robert Edmonds Wed, 17 Jun 2015 11:14:16 -0700

Patrik Fältström wrote:
> On 16 Jun 2015, at 22:45, Robert Edmonds wrote:
> >John Levine wrote:
> >>Can you give an example of URI RDATA where it would make sense to
> >>interpret it other than as ASCII?
> >
> >This is the FTP example from the URI RR RFC, to which the UTF-8 byte order
> >mark has been gratuitously added:
> 
> Hmm...what RFC are you referring to? I can not find this in RFC 7553.


Sorry, it's not from an RFC.  I took one of the examples from RFC 7553,
and modified it to show a pathological example for John's question.

> The RFC says this:
> 
> This field holds the URI of the target, enclosed in double-quote
> characters ('"'), where the URI is as specified in RFC 3986
> [RFC3986].  Resolution of the URI is according to the definitions for
> the Scheme of the URI.
> 
> >>I suppose to be perfectly clear we might either say "percent encode
> >>everything" or we might say "unencoded UTF-8 is allowed."  They're
> >>both unambiguous, and I expect most parsers can handle both.
> >
> >It would be very nice indeed if application developers did not have to
> >guess at the encoding of the bytes.
> 
> Earlier versions of the I-D did say explicitly that the Target is to be
> interpreted as UTF-8 encoded characters, but the feedback was that it is
> better to just reuse the same specification as URIs. I.e. the interpretation
> is according to RFC 3986 (which implies it is unclear wherever 3986 is unclear).

I don't see anywhere in RFC 3986 where it says how to interpret an
arbitrary octet sequence as a URI.  In fact, it repeatedly emphasizes
that the sequence of characters forming a URI is decoupled from possible
encodings of that sequence into octets.

RFC 3986 §2:

    2.  Characters

       The URI syntax provides a method of encoding data, presumably for the
       sake of identifying a resource, as a sequence of characters.  The URI
       characters are, in turn, frequently encoded as octets for transport
       or presentation.  This specification does not mandate any particular
       character encoding for mapping between URI characters and the octets
       used to store or transmit those characters.  When a URI appears in a
       protocol element, the character encoding is defined by that protocol;
       without such a definition, a URI is assumed to be in the same
       character encoding as the surrounding text.

       The ABNF notation defines its terminal values to be non-negative
       integers (codepoints) based on the US-ASCII coded character set
       [ASCII].  Because a URI is a sequence of characters, we must invert
       that relation in order to understand the URI syntax.  Therefore, the
       integer values used by the ABNF must be mapped back to their
       corresponding characters via US-ASCII in order to complete the syntax
       rules.

There are two encoding steps described here.  The first assembles a URI
from its components into "URI characters", using the percent-encoding
scheme that everyone's familiar with to escape the components.  These
"URI characters" are ABNF terminal values.  The second encoding step is
the conversion of these ABNF values into a concrete octet stream.  Only
this second step is relevant for the URI DNS RR, because serialized URIs
have already undergone %-encoding.
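
As a rough illustration of the two steps, here is a minimal Python sketch.
The host name, the path component, and the choice of external encodings
are assumptions made up for the example; none of them come from RFC 7553
or RFC 3986.

```python
from urllib.parse import quote

# Step 1: percent-encode a component into "URI characters".
# quote() percent-encodes non-ASCII text as UTF-8 octets by default.
path_component = "Fältström notes.txt"            # hypothetical component
uri_chars = "ftp://ftp.example.com/" + quote(path_component, safe="")

# Step 2: map the URI characters to octets.  The character sequence is
# identical in all three cases; only the external encoding differs.
as_ascii = uri_chars.encode("ascii")       # "network ASCII"
as_utf16 = uri_chars.encode("utf-16-be")   # 16-bit Unicode
as_ebcdic = uri_chars.encode("cp037")      # one EBCDIC code page

print(uri_chars)
print(as_ascii == as_ebcdic)   # False: same characters, different octets
```

A consumer that is handed only the octets has to know which of these
mappings was used before it can recover the URI characters.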

"Network ASCII" is a very common encoding for ABNF terminal values, but
not the only possible encoding.  RFC 5234 (ABNF):

    2.3.  Terminal Values

       Rules resolve into a string of terminal values, sometimes called
       characters.  In ABNF, a character is merely a non-negative integer.
       In certain contexts, a specific mapping (encoding) of values into a
       character set (such as ASCII) will be specified.

    [...]

    2.4.  External Encodings

       External representations of terminal value characters will vary
       according to constraints in the storage or transmission environment.
       Hence, the same ABNF-based grammar may have multiple external
       encodings, such as one for a 7-bit US-ASCII environment, another for
       a binary octet environment, and still a different one when 16-bit
       Unicode is used.  Encoding details are beyond the scope of ABNF,
       although Appendix B provides definitions for a 7-bit US-ASCII
       environment as has been common to much of the Internet.

       By separating external encoding from the syntax, it is intended that
       alternate encoding environments can be used for the same syntax.

    [...]

    Appendix B.  Core ABNF of ABNF

       This appendix contains some basic rules that are in common use.
       Basic rules are in uppercase.  Note that these rules are only valid
       for ABNF encoded in 7-bit ASCII or in characters sets that are a
       superset of 7-bit ASCII.

    [...]

    B.2.  Common Encoding

       Externally, data are represented as "network virtual ASCII" (namely,
       7-bit US-ASCII in an 8-bit field), with the high (8th) bit set to
       zero.  A string of values is in "network byte order", in which the
       higher-valued bytes are represented on the left-hand side and are
       sent over the network first.
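
The "network virtual ASCII" constraint quoted above is easy to test for.
A minimal sketch, in which the function name and the sample octets are
assumptions for illustration only:

```python
def is_network_virtual_ascii(octets: bytes) -> bool:
    """True if every octet has its high (8th) bit set to zero."""
    return all((b & 0x80) == 0 for b in octets)

plain = b"ftp://ftp.example.com/public"   # hypothetical Target octets
with_bom = b"\xef\xbb\xbf" + plain        # same octets behind a UTF-8 BOM

print(is_network_virtual_ascii(plain))
print(is_network_virtual_ascii(with_bom))
```

Such a check only applies if the embedding protocol actually says that
this is the encoding in use.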

RFC 3986 specifically declines to specify a particular concrete encoding
for URI characters:

    "This specification does not mandate any particular character
    encoding for mapping between URI characters and the octets used to
    store or transmit those characters."

and leaves it up to the protocol that embeds URIs to define the
encoding:

    "When a URI appears in a protocol element, the character encoding is
    defined by that protocol; without such a definition, a URI is
    assumed to be in the same character encoding as the surrounding
    text."

Unlike, say, an email message or webpage, there's no "surrounding text"
in a binary DNS response message that a consumer of URI RRs can rely on
to inform its choice of character encoding.
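
To make the ambiguity concrete, here is a small Python sketch of what a
consumer sees when it guesses at the encoding.  The Target octets are a
made-up example with a UTF-8 byte order mark prepended, and the list of
candidate codecs is an assumption, not something any RFC prescribes.

```python
# Hypothetical Target octets: a UTF-8 BOM followed by an ASCII URI.
target = b"\xef\xbb\xbf" + b"ftp://ftp.example.com/public"

def interpretations(octets: bytes) -> dict:
    """Decode the same octets under several candidate encodings."""
    results = {}
    for codec in ("ascii", "utf-8", "utf-8-sig", "latin-1"):
        try:
            results[codec] = octets.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None   # not valid in this encoding
    return results

results = interpretations(target)
for codec, text in results.items():
    print(codec, repr(text))
```

Strict ASCII rejects the octets, "utf-8" yields a string that starts with
U+FEFF, "utf-8-sig" silently strips the mark, and "latin-1" yields three
accented characters before "ftp".  Only one of those results starts with
a valid URI scheme.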

In fact, alternative encodings of URIs are contemplated in an example
later in RFC 3986:

    6.2.1.  Simple String Comparison

       [...]

       This character comparison requires that each pair of characters be
       put in comparable form.  For example, should one URI be stored in a
       byte array in EBCDIC encoding and the second in a Java String object
       (UTF-16), bit-for-bit comparisons applied naively will produce
       errors.  It is better to speak of equality on a character-for-
       character basis rather than on a byte-for-byte or bit-for-bit basis.
       In practical terms, character-by-character comparisons should be done
       codepoint-by-codepoint after conversion to a common character
       encoding.

       [...]
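
The comparison described in that excerpt can be sketched in Python as
follows.  The URI, the EBCDIC code page (cp037), and the helper name are
assumptions chosen for the example; this covers simple string comparison
only, with no URI normalization.

```python
uri = "ftp://ftp.example.com/public"      # hypothetical URI

stored_ebcdic = uri.encode("cp037")       # byte array in an EBCDIC code page
stored_utf16 = uri.encode("utf-16-le")    # UTF-16, as in a Java String

# Naive byte-for-byte comparison of the two stored forms.
naive_equal = stored_ebcdic == stored_utf16

def same_uri_characters(a: bytes, a_codec: str, b: bytes, b_codec: str) -> bool:
    """Compare codepoint by codepoint after decoding both to characters."""
    a_points = [ord(ch) for ch in a.decode(a_codec)]
    b_points = [ord(ch) for ch in b.decode(b_codec)]
    return a_points == b_points

char_equal = same_uri_characters(stored_ebcdic, "cp037",
                                 stored_utf16, "utf-16-le")
print(naive_equal, char_equal)
```

The helper only works because the caller supplies both codecs, which is
exactly the information a URI RR consumer is not given.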

So, IMO, RFC 7553 is underspecified: it doesn't define an unambiguous
meaning for the octets appearing in a URI RR's Target field, so it's up
to applications to decide what character encoding to apply.  Was that
the intention?

-- 
Robert Edmonds

