Hi there.

In Prague, we had a few hallway conversations about the default encoding of text/* media types.

Below are my notes (references to the relevant spec sections, information about a recent change in HTTPbis, and a rough proposal about how to proceed).

I'm posting this here because Alexey thought the audience might fit...

Best regards, Julian

-- snip --


1) RFC 2046 says that the default is US-ASCII

"Note that the character set used, if anything other than US-ASCII, must always be explicitly specified in the Content-Type field." -- <http://greenbytes.de/tech/webdav/rfc2046.html#rfc.section.4.1.2.p.18>

2) RFC 2616 says it's ISO-8859-1

"The "charset" parameter is used with some media types to define the character set (Section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See Section 3.4.1 for compatibility problems." -- <http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.3.7.1.p.4>

3) For text/xml, RFC 3023 says it's US-ASCII, no matter what 2616 says :-)

"Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)" -- <http://tools.ietf.org/html/rfc3023#section-3.1>
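
To make the conflict between the three quotes concrete, here is a minimal sketch of what a recipient would assume if it followed the specs literally. The function name and its parameters are hypothetical, chosen only for illustration; the returned values come from the passages quoted above.

```python
def spec_default_charset(media_type, via_http):
    """Return the charset the quoted specs assume when no charset parameter is sent.

    Hypothetical helper for illustration; returns None for non-text types,
    for which the quoted passages define no default.
    """
    media_type = media_type.strip().lower()
    if not media_type.startswith("text/"):
        return None
    if media_type == "text/xml":
        # RFC 3023, section 3.1: us-ascii, even when transmitted via HTTP
        return "us-ascii"
    if via_http:
        # RFC 2616, section 3.7.1: ISO-8859-1 for text/* received via HTTP
        return "iso-8859-1"
    # RFC 2046, section 4.1.2: US-ASCII unless explicitly specified
    return "us-ascii"


if __name__ == "__main__":
    for mt, http in [("text/plain", False), ("text/plain", True), ("text/xml", True)]:
        print(mt, "via HTTP" if http else "via MIME", "->", spec_default_charset(mt, http))
```

So the same text/plain payload gets a different default depending on the transport, and text/xml ignores the transport entirely.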

The problem

Recipients do not implement these defaults; they treat the absence of encoding information as a signal to inspect the payload. This is at least true for text/xml and text/html (see <http://www.w3.org/TR/REC-xml/#sec-guessing> and <http://www.w3.org/TR/2011/WD-html5-20110405/parsing.html#determining-the-character-encoding>).
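
As a rough illustration of what "inspect the payload" means in practice, the sketch below looks for a byte order mark and then for an XML encoding declaration. It is deliberately simplified and is not the detection procedure from either of the specs linked above (for example, it ignores UTF-32 and non-ASCII-compatible declarations); the function name and the fallback parameter are assumptions made for this example.

```python
import codecs
import re

# Byte order marks checked first; UTF-32 is left out of this simplified sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches an ASCII-compatible XML declaration at the very start of the payload.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(payload, fallback=None):
    """Guess an encoding from the payload bytes when no charset parameter was sent."""
    for bom, name in _BOMS:
        if payload.startswith(bom):
            return name
    match = _XML_DECL.match(payload)
    if match:
        return match.group(1).decode("ascii").lower()
    # Nothing found inline: only now would a spec-defined default apply.
    return fallback
```

Note that the spec-defined default only comes into play as the fallback, which is the opposite of what the RFCs quoted above prescribe.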

Current development: HTTPbis P3 has dropped the default and delegates to the relevant media type definitions (see <http://trac.tools.ietf.org/wg/httpbis/trac/ticket/20>, <http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p3-payload-14.html>).

Left to do:

a) Revise RFC 2046: allow text/* types that carry encoding information inline to do the expected thing (overriding the US-ASCII default); warn against doing so in new registrations (recommend supporting only UTF-8, and require that the charset parameter always be included explicitly, as text/vcard is going to do?)

b) Revise RFC 3023 to delegate text/xml charset defaults to the revision of RFC 2046?

Best regards, Julian


_______________________________________________
yam mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/yam
