Hi Ted,
In reviewing the JOSE drafts in preparation for them being approved, I was
looking at
https://datatracker.ietf.org/doc/draft-ietf-jose-json-web-key/ballot/#ted-lemon
and saw that you'd filed a NO OBJECTION ballot (with COMMENT) that for some
reason wasn't delivered to my e-mail. Since I hadn't seen it until today, I
hadn't previously responded. My apologies! Your comment was:
Comment (2014-10-02 for -33)
I'm not sure whether I need to complain about this, but the following seems
underspecified:
UTF8(STRING) denotes the octets of the UTF-8 [RFC3629] representation
of STRING.
ASCII(STRING) denotes the octets of the ASCII [USASCII]
representation of STRING.
The issue is that we don't know what STRING is. Is it 32-bit unicode? Is it
ASCII? What does it mean to have ASCII(unicode string)? Is ASCII(STRING) an
assertion that STRING is representable as ASCII?
These are fair questions. The STRING in this notation is always a sequence of
characters with an unspecified representation. The notations UTF8(STRING) and
ASCII(STRING) are used to represent the character string as an octet sequence
with a particular character encoding.
You're right that ASCII(Unicode string) isn't meaningful in the general case;
it's only used when STRING is constrained to contain only ASCII characters. I
suppose you could think of ASCII(STRING) as an assertion that STRING is
representable in ASCII, but it means more than that: it specifies a particular
octet sequence that represents those characters.
For instance, while both ASCII("Abc") and UTF8("Abc") result in the octet
sequence [65, 98, 99], if we were to have related UTF16BigEndian() and EBCDIC()
functions (which we don't), UTF16BigEndian("Abc") would represent the octet
sequence [0, 65, 0, 98, 0, 99] and EBCDIC("Abc") would represent the octet
sequence [193, 130, 131]. But now I'm off into esoterica... ;-)
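To make the octet sequences above concrete, here is a minimal Python sketch. The function names are made up for illustration and appear in none of the drafts, and I'm assuming Python's "cp037" codec as the EBCDIC variant (EBCDIC has several code pages).

```python
# Hypothetical helpers mirroring the notation; not defined by the drafts.

def ascii_octets(string):
    """ASCII(STRING): raises UnicodeEncodeError if STRING contains
    any character outside the ASCII range."""
    return list(string.encode("ascii"))

def utf8_octets(string):
    """UTF8(STRING): defined for any sequence of Unicode characters."""
    return list(string.encode("utf-8"))

def utf16_big_endian_octets(string):
    """Not part of the notation; shown only for contrast."""
    return list(string.encode("utf-16-be"))

def ebcdic_octets(string):
    """Not part of the notation; assumes EBCDIC code page 037."""
    return list(string.encode("cp037"))

print(ascii_octets("Abc"))             # [65, 98, 99]
print(utf8_octets("Abc"))              # [65, 98, 99]
print(utf16_big_endian_octets("Abc"))  # [0, 65, 0, 98, 0, 99]
print(ebcdic_octets("Abc"))            # [193, 130, 131]
```

In this sketch, ascii_octets() fails on a string containing a non-ASCII character such as "é", which matches the point that ASCII(STRING) only makes sense when STRING is restricted to ASCII characters.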
Back to the topic at hand, the notation UTF8(STRING) was adopted to replace the
much more verbose notation "the octets of the UTF-8 representation of STRING",
which used to appear repeatedly throughout the drafts. In particular, the
notation BASE64URL(UTF8(STRING)) replaces the previously very common notation
"the Base64url encoding of the octets of the UTF-8 representation of STRING".
This was an improvement suggested by Jim Schaad in one of his review comments.
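As an illustration of how the composed notation reads, here is a minimal Python sketch of BASE64URL(UTF8(STRING)). The helper names are hypothetical, and the sketch assumes the JOSE convention of using the URL-safe alphabet with trailing "=" padding omitted.

```python
import base64

def utf8(string):
    """UTF8(STRING): the octets of the UTF-8 representation of STRING."""
    return string.encode("utf-8")

def base64url(octets):
    """BASE64URL(OCTETS): URL-safe base64 with trailing '=' padding
    removed (assumed here to match the JOSE usage)."""
    return base64.urlsafe_b64encode(octets).rstrip(b"=").decode("ascii")

print(base64url(utf8("Abc")))  # QWJj
print(base64url(utf8("A")))    # QQ
```

The nesting makes the order of operations explicit: the character string is first turned into octets, and only then are those octets base64url-encoded.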
If you think that the current notation is unclear, we should sort out how to
clarify it. The best I've come up with is to add the phrase ", where STRING is
a sequence of zero or more Unicode characters" to these definitions. (The
language "sequence of zero or more Unicode characters" comes from the
introduction to RFC 7159.) Do you think that would address your questions, or
do you have an alternate suggestion?
Sorry again that you didn't receive a reply to this until now!
Best wishes,
-- Mike