text/html with mode="xml" in Atom 0.3

2006-03-23 Thread James Holderness


I've been seeing a number of feeds recently using Atom 0.3 with a content 
type of "text/html" and no mode attribute (i.e. the equivalent of 
mode="xml"). However, the markup in that content is wrapped in a CDATA 
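section, for example something like this:

   <content type="text/html">
   <![CDATA[<div xmlns="http://www.w3.org/1999/xhtml">Content goes here</div>]]>
   </content>
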

If it had been marked as "escaped" you would obviously unescape the CDATA 
before interpreting the markup. However, since the mode is technically 
"xml", I was under the impression that it should be treated as inline XML 
and no unescaping was necessary. But that would result in the literal text 
"<div xmlns="http://www.w3.org/1999/xhtml">Content goes here</div>" 
being displayed to the user, which is obviously not what is intended.


So is this a bug in the content generator (all the feeds I've seen appear to 
be using TypePad) or are you supposed to ignore the mode attribute when the 
content type is set to "text/html" and always treat it as escaped? I know 
Atom 0.3 is deprecated and I shouldn't be having to deal with this, but the 
reality of the situation is that there are a whole lot of Atom 0.3 feeds 
still out there (probably more than Atom 1.0) and I need to be able to 
support them.


Some feeds where you can see the problem (not all entries though):

http://feeds.feedburner.com/Flickrblog
http://dilbertblog.typepad.com/the_dilbert_blog/atom.xml
http://blog.cymfony.com/atom.xml

Regards
James
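
The parsing behaviour described in the message above can be checked with a short Python sketch. The fragment is a hypothetical stand-in for the feeds mentioned (type="text/html", no mode attribute, markup inside a CDATA section); it is not copied from any of them. A conforming XML parser reports the CDATA section as character data, so the content element has no child elements and its text is the literal markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical Atom 0.3 fragment: type="text/html", no mode attribute
# (so the mode defaults to "xml"), markup wrapped in a CDATA section.
SAMPLE = (
    '<content type="text/html">'
    '<![CDATA[<div xmlns="http://www.w3.org/1999/xhtml">Content goes here</div>]]>'
    '</content>'
)

content = ET.fromstring(SAMPLE)

# The CDATA section yields character data, not child elements.
child_count = len(content)
literal_text = content.text

print(child_count)
print(literal_text)
```

A consumer that honours the default mode therefore sees one text node containing angle brackets, which is exactly the literal string James describes.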



atom:name ... text or html?

2006-03-23 Thread Eric Scheid

If I have an author with the name "Bertrand Café", is it acceptable to put
that into atom:author like this;



or should I be using the unicode numeric entity instead?

e.




Re: atom:name ... text or html?

2006-03-23 Thread Anne van Kesteren


Quoting Eric Scheid <[EMAIL PROTECTED]>:

If I have an author with the name "Bertrand Café", is it acceptable to put
that into atom:author like this;

   

or should I be using the unicode numeric entity instead?


Even if it was "HTML" you couldn't really use the entity, could you? I think
you have to use a character reference or the actual character instead, yes.


--
Anne van Kesteren





Re: atom:name ... text or html?

2006-03-23 Thread James M Snell

+1 to what Anne says.  If I received that Atom author name, I would
display it exactly as presented: "Bertrand Caf&eacute;".

- James

Anne van Kesteren wrote:
> 
> Quoting Eric Scheid <[EMAIL PROTECTED]>:
>> If I have an author with the name "Bertrand Café", is it acceptable to put
>> that into atom:author like this;
>>
>>
>>
>> or should I be using the unicode numeric entity instead?
> 
> Even if it was "HTML" you couldn't really use the entity, could you? I think
> you have to use a character reference or the actual character instead, yes.
> 
> 



Re: atom:name ... text or html?

2006-03-23 Thread James Holderness


Hahaha! It's RSS all over again. In the words of Mark Pilgrim: "Here's 
something that might be HTML. Or maybe not. I can't tell you, and you can't 
guess." :-)


Seriously though, the atom:name element is described as "a human-readable 
name", so unless your name really is "Bertrand Caf&eacute;" that can't be 
right. If RFC4287 had intended to allow markup in the element it would have 
used atomTextConstruct.


Regards
James

Eric Scheid wrote:

If I have an author with the name "Bertrand Café", is it acceptable to put
that into atom:author like this;

   




Re: atom:name ... text or html?

2006-03-23 Thread Eric Scheid

On 24/3/06 3:21 AM, "Anne van Kesteren" <[EMAIL PROTECTED]> wrote:

>> 
>> 
> Even if it was "HTML" you couldn't really use the entity, could you? I think
> you have to use a character reference or the actual character instead, yes.
> 

It's true that XML has only a half dozen or so entities defined, meaning
most interesting entities from html can't exist in XML ... unless maybe they
are wrapped in a CDATA block like above?

I'm getting the data by scraping an html page, so I'm expecting it to be
acceptable html code, including html entities.

e. 



Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Eric Scheid <[EMAIL PROTECTED]> [2006-03-23 17:30]:
>If I have an author with the name "Bertrand Café", is it
>acceptable to put that into atom:author like this;
>
>

No. That means the author’s name is Bertrand Caf&eacute; (he must
have had very cruel parents), not Bertrand Café.

>or should I be using the unicode numeric entity instead?

Yes. Or use a literal é as you did in this mail, provided you
emit the feed as UTF-8 (or ISO-8859-1, if you must).

Regards,
-- 
Aristotle Pagaltzis // 
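
The two options mentioned above (a numeric character reference, or the literal character in a Unicode encoding) can be compared with a small Python sketch. The element name and the sample value are illustrative only; the standard library serializer is used as one example of an XML writer, not as the only way to do this.

```python
import xml.etree.ElementTree as ET

# Build the element from a real Unicode string, not from entity text.
name = ET.Element("name")
name.text = "Bertrand Caf\u00e9"

# Serialized as UTF-8, the accented character is written as literal bytes.
utf8_bytes = ET.tostring(name, encoding="utf-8")

# Serialized as US-ASCII, the writer falls back to a numeric character reference.
ascii_bytes = ET.tostring(name, encoding="us-ascii")

# Both forms parse back to the same string.
round_trip_utf8 = ET.fromstring(utf8_bytes).text
round_trip_ascii = ET.fromstring(ascii_bytes).text

print(utf8_bytes)
print(ascii_bytes)
```

Either output is plain XML; a consumer cannot tell the two apart after parsing.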



Re: text/html with mode="xml" in Atom 0.3

2006-03-23 Thread A. Pagaltzis

* James Holderness <[EMAIL PROTECTED]> [2006-03-23 17:30]:
>So is this a bug in the content generator (all the feeds I've
>seen appear to be using TypePad)

Yes.

>or are you supposed to ignore the mode attribute when the
>content type is set to "text/html" and always treat it as
>escaped?

No.

In 0.3, the `mode` attribute was the final arbiter for the form
of the content. In Atom 1.0, its role was subsumed by switching
on the `type` value because consumer developers reported that
this sort of layering was unnecessarily hard to support and
provided no discernible benefit.

Regards,
-- 
Aristotle Pagaltzis // 
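
As a rough illustration of the layering described above, here is a Python sketch of a consumer that lets the Atom 0.3 `mode` attribute decide how the payload is decoded. The function name `decode_atom03_content` is hypothetical, the sample elements are made up, and decoding base64 payloads as UTF-8 text is an assumption made only to keep the return type uniform.

```python
import base64
import xml.etree.ElementTree as ET


def decode_atom03_content(content):
    """Return the payload of an Atom 0.3 content element as a markup string."""
    mode = content.get("mode", "xml")  # "xml" is the default when absent
    if mode == "escaped":
        # The XML parser has already removed one level of escaping,
        # so the element text is the markup itself.
        return content.text or ""
    if mode == "base64":
        # Assumption: the decoded bytes are UTF-8 text.
        return base64.b64decode(content.text or "").decode("utf-8")
    if mode == "xml":
        # Inline XML: serialize the child nodes back to markup.
        parts = [content.text or ""]
        parts.extend(ET.tostring(child, encoding="unicode") for child in content)
        return "".join(parts)
    raise ValueError(f"unknown mode: {mode!r}")


escaped = ET.fromstring(
    '<content type="text/html" mode="escaped">&lt;p&gt;Hello&lt;/p&gt;</content>'
)
inline = ET.fromstring('<content type="application/xhtml+xml"><p>Hello</p></content>')

print(decode_atom03_content(escaped))
print(decode_atom03_content(inline))
```

Note that `type` plays no part in the dispatch here, which is the point being made about 0.3; an Atom 1.0 consumer would switch on `type` instead.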



Re: atom:name ... text or html?

2006-03-23 Thread Sylvain Hellegouarch





> Seriously though, the atom:name element is described as "a
> human-readable name",

Do you mean that "human-readable" is equivalent to solely English? 
Because as a French, having accents in names is so natural that I see it 
as "human readable" too ;)


- Sylvain




Re: atom:name ... text or html?

2006-03-23 Thread James Holderness


Sylvain Hellegouarch wrote:
Do you mean that "human-readable" is equivalent to solely English? Because 
as a French, having accents in names is so natural that I see it as "human 
readable" too ;)


No. I mean that the literal sequence of characters "& e a c u t e ;" is not 
human-readable (or at least isn't intended to be).


Regards
James



Re: atom:name ... text or html?

2006-03-23 Thread Stephane Bortzmeyer

On Fri, Mar 24, 2006 at 03:16:18AM +1100,
 Eric Scheid <[EMAIL PROTECTED]> wrote 
 a message of 10 lines which said:

> or should I be using the unicode numeric entity instead?

Or the character itself, in UTF-8 or any other encoding (but UTF-8 is
the most widely implemented, so you limit the risks).

(That's what I do with http://www.bortzmeyer.org/feed.atom and it
seems OK in every aggregator and it validates.)



Re: atom:name ... text or html?

2006-03-23 Thread David Powell


Thursday, March 23, 2006, 4:57:11 PM, you wrote:

> On 24/3/06 3:21 AM, "Anne van Kesteren" <[EMAIL PROTECTED]> wrote:

>>> 
>>> 
>> Even if it was "HTML" you couldn't really use the entity, could you? I think
>> you have to use a character reference or the actual character instead, yes.
>> 

> It's true that XML has only a half dozen or so entities defined, meaning
> most interesting entities from html can't exist in XML ... unless maybe they
> are wrapped in a CDATA block like above?

atom:name is not intended to contain HTML, the spec for it doesn't
mention HTML, it is no more correct to put HTML in it, than it is to
put base64'd PDF in there.

> I'm getting the data by scraping an html page, so I'm expecting it to be
> acceptable html code, including html entities.

Your HTML parser should decode the entities for you and return a
string. Your Atom generator should encode or escape the string using
numeric entities.

If you really need to use HTML entities directly, then you could put:


]>

at the top of your feed and get rid of that CDATA. XML processors are
REQUIRED [1] to process internal DTD subsets.

[Hmm, internal DTD subsets completely fail in IE7's feed reader,
throwing up a "friendly error message"]

[1] 

-- 
Dave
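
For readers who want to try the internal DTD subset approach mentioned above, here is a Python sketch. The feed is a made-up minimal document, and the single `eacute` declaration is an assumption about what such a subset might contain; it is not the declaration from the original message. The standard library parser is used only as one example of an XML processor that reads the internal subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed that declares one HTML-style entity in its internal
# DTD subset and then uses it inside the author name.
FEED = (
    '<!DOCTYPE feed [\n'
    '  <!ENTITY eacute "&#233;">\n'
    ']>\n'
    '<feed xmlns="http://www.w3.org/2005/Atom">\n'
    '  <author><name>Bertrand Caf&eacute;</name></author>\n'
    '</feed>\n'
)

ATOM = "{http://www.w3.org/2005/Atom}"

feed = ET.fromstring(FEED)

# The parser expands the declared entity, so the consumer sees plain text.
author_name = feed.find(f"{ATOM}author/{ATOM}name").text

print(author_name)
```

As the bracketed remark above notes, not every feed reader accepts a DTD, so this is probably the less portable option compared with emitting the character itself.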



Re: atom:name ... text or html?

2006-03-23 Thread Stephane Bortzmeyer

On Thu, Mar 23, 2006 at 05:01:03PM +0100,
 Sylvain Hellegouarch <[EMAIL PROTECTED]> wrote 
 a message of 11 lines which said:

> Because as a French, having accents in names is so natural that I
> see it as "human readable" too ;)

As I wrote and used and tested on my blog, there is no problem in Atom
to have a first name with accent like mine. Atom is XML and therefore
Unicode rules.



Re: text/html with mode="xml" in Atom 0.3

2006-03-23 Thread James Holderness


A. Pagaltzis wrote:

>> So is this a bug in the content generator (all the feeds I've
>> seen appear to be using TypePad)
>
> Yes.
>
>> or are you supposed to ignore the mode attribute when the
>> content type is set to "text/html" and always treat it as
>> escaped?
>
> No.


Thanks for the confirmation. I was beginning to think I was wrong. I tested 
this in 15 different aggregators and all but one ignored the mode and 
unescaped the content anyway. I have a horrible feeling I'm going to have to 
add code to emulate this behaviour.


Regards
James
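
One possible shape for the emulation code mentioned above is sketched below in Python. This is a guess at a workable heuristic, not a description of what any of the tested aggregators actually do; the function name `looks_like_escaped_html` and the detection rule are both hypothetical.

```python
import xml.etree.ElementTree as ET


def looks_like_escaped_html(content):
    """Heuristic: inline-XML mode and an HTML type, no child elements,
    yet the text itself contains angle brackets."""
    mode = content.get("mode", "xml")
    text = content.text or ""
    return (
        mode == "xml"
        and content.get("type") == "text/html"
        and len(content) == 0
        and "<" in text
        and ">" in text
    )


# CDATA-wrapped markup with the default mode: flagged by the heuristic.
wrapped = ET.fromstring(
    '<content type="text/html"><![CDATA[<p>Content goes here</p>]]></content>'
)
# Genuine inline XML: child elements are present, so it is left alone.
inline = ET.fromstring('<content type="text/html"><p>Content goes here</p></content>')

print(looks_like_escaped_html(wrapped))
print(looks_like_escaped_html(inline))
```

The obvious weakness is plain text that legitimately contains both characters, such as a sentence about "a < b > c", which this check would misclassify.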



Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Eric Scheid <[EMAIL PROTECTED]> [2006-03-23 18:05]:
>It's true that XML has only a half dozen or so entities defined,
>meaning most interesting entities from html can't exist in XML
>... unless maybe they are wrapped in a CDATA block like
>above?

No, a CDATA block simply means that characters like <, & and >
stand for themselves.

>I'm getting the data by scraping an html page, so I'm expecting
>it to be acceptable html code, including html entities.

Then decode the entities to a Unicode string and emit the feed as
Unicode. Simplest thing that will work reliably.

Regards,
-- 
Aristotle Pagaltzis // 
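
The "decode, then emit as Unicode" advice above might look like this in Python. The helper name `scraped_to_atom_name` and the sample strings are hypothetical; `html.unescape` stands in for whatever entity decoding the scraper's HTML parser provides.

```python
import html
import xml.etree.ElementTree as ET


def scraped_to_atom_name(scraped):
    """Turn text scraped from an HTML page into a serialized name element."""
    # Step 1: decode HTML entities into a plain Unicode string.
    text = html.unescape(scraped)
    # Step 2: let the XML writer apply whatever escaping XML itself needs.
    name = ET.Element("name")
    name.text = text
    return ET.tostring(name, encoding="unicode")


print(scraped_to_atom_name("Bertrand Caf&eacute;"))
print(scraped_to_atom_name("Smith &amp; Sons"))
```

The second call shows why the two steps are kept separate: the ampersand is decoded from its HTML form and then escaped again by the XML writer, so nothing HTML-specific leaks into the feed.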



Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Sylvain Hellegouarch <[EMAIL PROTECTED]> [2006-03-23 18:15]:
>Do you mean that "human-readable" is equivalent to solely
>English? Because as a French, having accents in names is so
>natural that I see it as "human readable" too ;)

Even as a French, you probably write é, not &eacute;. :-)

Regards,
-- 
Aristotle Pagaltzis // 



Re: atom:name ... text or html?

2006-03-23 Thread Antone Roundy


On Mar 23, 2006, at 9:48 AM, James Holderness wrote:
> Hahaha! It's RSS all over again. In the words of Mark Pilgrim:
> "Here's something that might be HTML. Or maybe not. I can't tell
> you, and you can't guess." :-)
>
> Seriously though, the atom:name element is described as "a
> human-readable name", so unless your name really is "Bertrand
> Caf&eacute;" that can't be right. If RFC4287 had intended to allow
> markup in the element it would have used atomTextConstruct.


I agree with James here--if we had intended for the name to be able  
to include markup, we should have used the construct we created to  
allow that.  This from RFC 4287 (section 3.2):


   element atom:name { text }

would have been this:

   element atom:name { atomTextConstruct }

if we had intended for it to be able to contain anything but literal  
text after XML un-escaping, right?


On Mar 23, 2006, at 9:57 AM, Eric Scheid wrote:
> It's true that XML has only a half dozen or so entities defined, meaning
> most interesting entities from html can't exist in XML ... unless maybe they
> are wrapped in a CDATA block like above?

If they're wrapped in a CDATA block, then they don't trigger an XML
parsing error, but wrapping something in CDATA isn't a license to
enter data in a format other than what the RFC allows.

> I'm getting the data by scraping an html page, so I'm expecting it to be
> acceptable html code, including html entities.

You, the producer, are getting the data from an HTML page, so you
should certainly be prepared to handle HTML entities in it. But you
the Atom publisher are responsible for making sure that you've made
any changes to the data that are necessary for it to be proper Atom
before you publish it. The consumer of the Atom feed doesn't know
where you got the data, and thus can't be expected to decide how to
process it based on where you got it.




Re: text/html with mode="xml" in Atom 0.3

2006-03-23 Thread A. Pagaltzis

* James Holderness <[EMAIL PROTECTED]> [2006-03-23 18:40]:
>I tested this in 15 different aggregators and all but one
>ignored the mode and unescaped the content anyway.

Good thing this rule was changed in Atom 1.0, then…

What I really don’t get is what that `xmlns` attribute is doing
there in the CDATA block of your data sample. Sometimes I wonder
if CDATA should not have been left out of the XML spec; it seems
to create far too much confusion to be worthwhile.

Regards,
-- 
Aristotle Pagaltzis // 



Re: atom:name ... text or html?

2006-03-23 Thread James Holderness


David Powell wrote:

[Hmm, internal DTD subsets completely fail in IE7's feed reader,
throwing up a "friendly error message"]


If I remember correctly they considered that a feature. Something to do with 
DTDs being a security risk. I'm not sure if this also meant they were 
incapable of processing Netscape RSS 0.91 feeds. All I know is that if I 
ever have a blog, I'll be sure to include a DTD at the top of my feed.


Regards
James



Does xml:base apply to type="html" content?

2006-03-23 Thread David Powell


xml:base applies to type="xhtml" content, but I'm not sure whether it
is supposed to apply to escaped type="html" content? I reckon that it
does.

Anybody come across this? Any opinions?

-- 
Dave



Re: text/html with mode="xml" in Atom 0.3

2006-03-23 Thread James Holderness


A. Pagaltzis wrote:

What I really don’t get is what that `xmlns` attribute is doing
there in the CDATA block of your data sample. Sometimes I wonder
if CDATA should not have been left out of the XML spec; it seems
to create far too much confusion to be worthwhile.


Well if you look at some of those feeds I listed, many of the entries are 
type="application/xhtml+xml" with a namespaced div element as you would 
expect. It looks like they may have taken the exact same code (or template, 
or however it is they do this stuff) and reused it for type="text/html". 
Only with the html they decided they should wrap everything in a CDATA block 
just to be safe.


Regards
James



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray



On Mar 23, 2006, at 8:01 AM, Sylvain Hellegouarch wrote:






>> Seriously though, the atom:name element is described as "a
>> human-readable name",
>
> Do you mean that "human-readable" is equivalent to solely English?
> Because as a French, having accents in names is so natural that I
> see it as "human readable" too ;)


You can have accents, you just can't use HTML entities to get them. -Tim



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray



On Mar 23, 2006, at 8:57 AM, Eric Scheid wrote:



> On 24/3/06 3:21 AM, "Anne van Kesteren" <[EMAIL PROTECTED]> wrote:
>
>> Even if it was "HTML" you couldn't really use the entity, could you? I think
>> you have to use a character reference or the actual character instead, yes.
>
> It's true that XML has only a half dozen or so entities defined


To be precise, 5: &lt; &amp; &gt; &apos; &quot; -Tim



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray


On Mar 23, 2006, at 8:16 AM, Eric Scheid wrote:

If I have an author with the name "Bertrand Café", is it acceptable to put
that into atom:author like this;



or should I be using the unicode numeric entity instead?


The key point is that the atom:name element, described in RFC4287  
3.2.1, is not a "Text Construct", as defined in 3.1, so you can't say  
; so no markup allowed.  So just say "Bertrand  
Café".  -Tim





Re: Does xml:base apply to type="html" content?

2006-03-23 Thread Tim Bray



On Mar 23, 2006, at 10:03 AM, David Powell wrote:




xml:base applies to type="xhtml" content, but I'm not sure whether it
is supposed to apply to escaped type="html" content? I reckon that it
does.


RFC4287, section 2:

   Any element defined by this specification MAY have an xml:base
   attribute [W3C.REC-xmlbase-20010627].  When xml:base is used in an
   Atom Document, it serves the function described in section 5.1.1 of
   [RFC3986], establishing the base URI (or IRI) for resolving any
   relative references found within the effective scope of the xml:base
   attribute.

Seems pretty clear to me.  Yes, the base URI of that HTML is now  
whatever xml:base said it was -Tim
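
To make the reading above concrete, here is a Python sketch that resolves a relative link found inside escaped type="html" content against the xml:base in scope. The entry, the URLs, and the `HrefCollector` helper are all hypothetical, and the sketch only looks at an xml:base on the entry element itself; a real consumer would also have to handle inherited and nested values.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
ATOM = "{http://www.w3.org/2005/Atom}"


class HrefCollector(HTMLParser):
    """Collect href values from anchor tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical entry: escaped HTML with a relative link, xml:base on the entry.
ENTRY = (
    '<entry xmlns="http://www.w3.org/2005/Atom" xml:base="http://example.org/blog/">'
    '<content type="html">&lt;a href="2006/03/post"&gt;read more&lt;/a&gt;</content>'
    '</entry>'
)

entry = ET.fromstring(ENTRY)
base = entry.get(XML_BASE)
content = entry.find(f"{ATOM}content")

collector = HrefCollector()
collector.feed(content.text)
collector.close()

# Resolve each relative reference against the base URI from xml:base.
resolved = [urljoin(base, href) for href in collector.hrefs]
print(resolved)
```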




Re: Atom Thread Feed syntax

2006-03-23 Thread James M Snell

Just wanted to follow through on this for everyone.  Given that there
are vendors getting ready to ship code based on the current rev of the
spec, I'm *not* going to rename the "id" attribute to "ref".  Yes, I
know that "id" is confusing to some folks, but we're just talking the
name of a single attribute and not a critical functional bug.  From this
point forward, only critical spec bugs will be fixed and I will be
submitting the spec for consideration as a standards track RFC in the
not too distant future.

- James

Sylvain Hellegouarch wrote:
> 
> Hi everyone,
> 
> I was reading the Atom Feed Thread draft [1] yesterday and I ran into a
> problem as I described in my blog [2]. To recap the 'in-reply-to'
> element defined in that specification takes an 'id' attribute that
> specifies /the universally unique identifier of the resource being
> responded to/.
> 
> Calling such an attribute 'id' is a mistake in my opinion as it confuses
> with the actual ID of the element itself within the XML document it
> belongs to and it makes impossible for another element within the
> document to have the same value as an 'id'. I would rather move the
> content of that attribute as a text element of the 'in-reply-to' element
> (as does the atom:id element).
> 
> Thoughts?
> - Sylvain
> 
> [1]
> http://www.ietf.org/internet-drafts/draft-snell-atompub-feed-thread-05.txt
> [2] http://www.defuze.org/archives/2006/03/14/about-atom-feed-threads
> 
> 



Re: atom:name ... text or html?

2006-03-23 Thread Eric Scheid

On 24/3/06 4:42 AM, "A. Pagaltzis" <[EMAIL PROTECTED]> wrote:

>> I'm getting the data by scraping an html page, so I'm expecting
>> it to be acceptable html code, including html entities.
> 
> Then decode the entities to a Unicode string and emit the feed as
> Unicode. Simplest thing that will work reliably.

I figured as much. Oh well, now to track down a list of html entities and
their corresponding unicodes ...

e.



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray



On Mar 23, 2006, at 2:20 PM, Eric Scheid wrote:


Oh well, now to track down a list of html entities and
their corresponding unicodes ...


http://www.google.com/search?q=xhtml%20entities



Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Eric Scheid <[EMAIL PROTECTED]> [2006-03-23 23:30]:
>Oh well, now to track down a list of html entities and their
>corresponding unicodes ...

That would be in the spec.
http://www.w3.org/TR/REC-html40/sgml/entities.html

But you shouldn’t have to. Any self-respecting language has a
library for that somewhere.

Regards,
-- 
Aristotle Pagaltzis // 
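
As one example of such a library, Python ships the HTML 4 entity table in its standard library. The helper name `numeric_reference` below is made up for illustration.

```python
from html.entities import name2codepoint


def numeric_reference(entity_name):
    """Map an HTML entity name (without & and ;) to a numeric character reference."""
    return f"&#{name2codepoint[entity_name]};"


eacute_codepoint = name2codepoint["eacute"]
eacute_reference = numeric_reference("eacute")

print(eacute_codepoint)
print(eacute_reference)
print(chr(eacute_codepoint))
```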



Re: Atom Thread Feed syntax

2006-03-23 Thread David Powell


Thursday, March 23, 2006, 9:39:09 PM, James M Snell wrote:

> Just wanted to follow through on this for everyone.  Given that there
> are vendors getting ready to ship code based on the current rev of the
> spec, I'm *not* going to rename the "id" attribute to "ref".  Yes, I
> know that "id" is confusing to some folks, but we're just talking the
> name of a single attribute and not a critical functional bug.  From this
> point forward, only critical spec bugs will be fixed and I will be
> submitting the spec for consideration as a standards track RFC in the
> not too distant future.

I'm more bothered about the use of undefined markup on the link
element. I know, I know, I keep going on and on about this, but I keep
seeing more drafts that do the same thing and it isn't just a
theoretical problem: Windows Feed Platform does not preserve arbitrary
markup other than proper extension elements. Other feed stores and
servers are likely to do the same (justifiably IMO).

The abandonment of extension constructs in favour of undefined markup
by this draft, and other draft-*-atompub-* drafts would be an
interoperability concern if these drafts were deployed. If you want to
extend Atom, use Extension Elements.

-- 
Dave



Re: Atom Thread Feed syntax

2006-03-23 Thread A. Pagaltzis

* David Powell <[EMAIL PROTECTED]> [2006-03-24 02:20]:
>The abandonment of extension constructs in favour of undefined
>markup by this draft, and other draft-*-atompub-* drafts would
>be an interoperability concern if these drafts were deployed. If
>you want to extend Atom, use Extension Elements.

I don’t follow. Please explain how these drafts fail to satisfy
the criteria in Section 6.4.2, Structured Extension Elements.

Regards,
-- 
Aristotle Pagaltzis // 



Re: Atom Thread Feed syntax

2006-03-23 Thread James M Snell

I believe the concern is over the thr:count and thr:when attributes for
the replies link relation, both of which are optional, and both of which
provide what I consider to be extra information.  In other words, it's
ok if an implementation drops them.  The important bit is the
in-reply-to element and the replies link rel, both of which fall within
the bounds of the Atom extension model.

- James

A. Pagaltzis wrote:
> * David Powell <[EMAIL PROTECTED]> [2006-03-24 02:20]:
>> The abandonment of extension constructs in favour of undefined
>> markup by this draft, and other draft-*-atompub-* drafts would
>> be an interoperability concern if these drafts were deployed. If
>> you want to extend Atom, use Extension Elements.
> 
> I don’t follow. Please explain how these drafts fail to satisfy
> the criteria in Section 6.4.2, Structured Extension Elements.
> 
> Regards,



Re: Atom Thread Feed syntax

2006-03-23 Thread James M Snell


David Powell wrote:
>[snip]
> The abandonment of extension constructs in favour of undefined markup
> by this draft, and other draft-*-atompub-* drafts would be an
> interoperability concern if these drafts were deployed. If you want to
> extend Atom, use Extension Elements.
> 

I'm most certainly not abandoning the extension constructs.  One of the
motivations for walking these extension specs through the I-D and
eventually standards-track process is so that they get their own RFC
number.  Implementations that choose to support the extension can point
to RFC4287 *and* RFCwhatever and say, "I support both".  If an
implementation only says "I support RFC4287" and doesn't say anything
about RFCwhatever, it's pretty clear what the result would be.

The most an RFC4287 implementation should be expected to do is adhere to
the defined extension model.  If that implementation also chooses to
support other RFCs that go beyond that extension model, so be it.

That said, the critical parts of the Feed Thread draft (the in-reply-to
element and the replies link rel) follow the guidelines of the Atom
extension model.  That is, any RFC4287 implementation *should* be able
to do something with those elements (even if it's just preserving them).
The optional parts of the extension (thr:count and thr:when) fall
outside of the Atom extension model.  That's ok.  Implementations can
choose to ignore those things, even completely drop them.

As for the other extension drafts I put out, keep in mind that most
should be considered strictly experimental at this time.  That said,
there is really only one that falls outside the extension model:
the Link Extensions draft [1], which by definition cannot adhere to
the extension model given the fact that Atom link elements are actually
not extensible.

[1]
http://www.ietf.org/internet-drafts/draft-snell-atompub-link-extensions-02.txt

- James