Re: PaceMustBeWellFormed status

Robert Sayre Tue, 25 Jan 2005 14:49:26 -0800


Walter Underwood wrote:

--On Tuesday, January 25, 2005 04:21:29 PM -0500 Robert Sayre <[EMAIL PROTECTED]> wrote:

It's required for interop over HTTP. That's off-topic in the format draft, which mentions HTTP once, in passing.
So you suggest that we have additional format requirements in the
protocol spec?

I suggest saying nothing about it. I will explain why below.

The headers-ueber-alles rule in HTTP means that a legal Atom feed
can become illegal when served as text/xml. That is going to
surprise people and cause breakage.

How about something along the wording of the XML spec's section 4.3.3:

In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is an error for ...

Point 1: It's already in the XML spec. This means we are targeting implementors who will understand section six, but who haven't read the XML spec. That is not a set of people worth adding a whole section for. This is really just a tour of Apache's mime.types file that inserts many damaging requirements along the way. I will detail them:

6. Client processing requirements
Atom feeds served over HTTP MUST be well-formed XML 1.0, as defined in Section 2.1 of the XML specification <http://www.w3.org/TR/REC-xml/#sec-well-formed>. Furthermore, the concept of XML well-formedness relies on first determining the character encoding of the XML document. RFC 3023 defines how to determine the character encoding of XML documents served over HTTP.

The first sentence is redundant because all Atom feeds must be well-formed. The second sentence is plainly false. The two concepts are unrelated.

6.1 Determining the character encoding of an Atom feed
The rules for determining the character encoding of an Atom feed are the same as determining the character encoding of any XML document served over HTTP. The rules are wholly defined by RFC 3023, but they are summarized here because there has been widespread confusion over how RFC 3023 should be interpreted:

The text then goes on to state many requirements that are not in RFC 3023.

1. When serving an Atom feed, it is RECOMMENDED that publishers include the charset parameter along with the media type in the Content-type HTTP header. If the charset parameter is present, clients MUST parse the Atom feed in that charset, ignoring any charset declared in the encoding attribute of the XML declaration.

2. Publishers SHOULD serve all Atom feeds with the media type "application/atom+xml" (registered in Section 8 of this document). Clients MUST treat "application/atom+xml" as "application/xml" and determine the character encoding as per RFC 3023 or its successor.

Publishers should serve their documents with the MIME type they want clients to use.

3. If a publisher wishes to serve an Atom feed over HTTP, but for some reason they are unable to use the "application/atom+xml" media type, the publisher SHOULD use "application/xml", and clients MUST determine the character encoding as per RFC 3023 or its successor.

Publishers should serve their documents with the MIME type they want clients to use.

4. If a publisher is unable to serve their Atom feed with a Content-Type of "application/atom+xml" or "application/xml", they MAY use "text/xml". According to RFC 3023, XML documents served as "text/xml" with no charset parameter have a character encoding of "us-ascii".

Of course they can serve it as text/xml. They should do that if they want people to view source. It's not appropriate to send the content to an Atom processor.

5. Publishers MUST NOT serve Atom feeds with a media type other than "application/atom+xml" (registered in Section 8 of this document) or one of the XML media types defined in RFC 3023 or its successor. In particular, "text/plain" is never an appropriate media type for an Atom feed. When retrieving an Atom feed served with a non-XML media type, clients MUST reject it as non-well-formed.

We have no business stating this. I will serve Atom feeds as text/plain if I want them processed as text documents. Clients shouldn't send them to the XML processor at all. Well-formedness errors come from XML processors, not passive-aggressive applications.
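
For illustration only, here is a minimal Python sketch of how a client might choose an encoding from the Content-Type header alone, following the quoted items 1 through 4. The function name `http_encoding` and its return shape are made up for this example. It covers only the three XML media types named in this thread and is not a full RFC 3023 implementation. Anything else, text/plain included, is simply never handed to the XML processor.

```python
from email.message import Message

# Only the XML media types discussed in this thread (an assumption of the
# sketch; RFC 3023 defines more).
XML_TYPES = {"text/xml", "application/xml", "application/atom+xml"}


def http_encoding(content_type):
    """Return (handler, encoding) for a Content-Type header value.

    handler is "xml" or "not-xml". encoding is a charset name, or None
    when the XML processor should decide from the BOM / XML declaration.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()   # lowercased type/subtype
    charset = msg.get_param("charset")    # None if the parameter is absent

    if media_type not in XML_TYPES:
        # Not served as XML: do not send it to the XML processor at all.
        return "not-xml", None
    if charset is not None:
        # The charset parameter overrides the XML declaration.
        return "xml", charset.lower()
    if media_type == "text/xml":
        # text/xml without a charset parameter defaults to us-ascii.
        return "xml", "us-ascii"
    # application/xml and application/atom+xml: defer to the document.
    return "xml", None
```

Under these assumptions, a feed that declares UTF-8 in its XML declaration but arrives as plain `text/xml` gets `("xml", "us-ascii")`, which is exactly the breakage Walter describes above.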

6.2 Handling well-formedness errors
After determining the character encoding by the rules in section 6.1 of this document, clients MUST use a conforming XML parser to parse an Atom feed. In particular, clients MUST stop processing at the first well-formedness error, although they MAY display any information they have parsed before the first well-formedness error.

The second sentence is incorrect. The XML spec allows a processor to keep going after a fatal error in order to find and report further errors; what it forbids is continuing normal processing.
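
As a rough illustration of what a conforming parser already does with no help from the Atom spec, here is a small Python sketch using the standard library's expat binding. The helper name `parse_until_fatal_error` is hypothetical. It records the element names seen before the fatal error and returns the error itself, so the application can decide what, if anything, to display.

```python
import xml.parsers.expat


def parse_until_fatal_error(data):
    """Parse data, returning (element names seen, error or None)."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Handlers fire as the parser advances, so elements that start before
    # the error are recorded even though the parse ultimately fails.
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as exc:
        return seen, exc
    return seen, None
```

The well-formedness error here comes from the XML processor, not from the application. The application only chooses what to do with `seen` and the reported error.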

Here is a non-comprehensive list of things clients have been known to do after encountering a well-formedness error, which this document specifically prohibits:
• Clients MUST NOT reparse the feed in any other character encoding.
• Clients MUST NOT "tidy" the feed to attempt to fix mismatched start and end tags.
• Clients MUST NOT guess at the meaning of undefined entities, including entities defined in the HTML specification.

We have no business making these demands, since the documents in question are not Atom documents.

Robert Sayre
