The subject of “Fat Pings” or full content streaming from blogs has come up on the FeedMesh list and in a proposal by Brad Fitzpatrick of LiveJournal. I’ve responded to the FeedMesh list suggesting that the best way to move forward is to simply use Atom feeds rather than invent new formats. See my response at: http://groups.yahoo.com/group/feedmesh/message/451


The problem being addressed here is that of increasing the efficiency with which feed search and/or monitoring services (like PubSub, Feedster, IceRocket, Technorati, BlogDigger, etc.) obtain posts from the major blog hosting platforms. In the past, most publishing platforms have limited themselves to simply sending pings. However, while the pinging mechanism works fine for low-volume publishers, it simply doesn’t scale to the requirements of high-volume publishers like the major blog hosting platforms (LiveJournal, TypePad, Blogspot, Bryght, etc.). The problem is that when a service is pinged, it must reach back to the pinging site and retrieve an RSS or Atom file that probably contains many duplicate entries. The service must then filter out the dupes before indexing, publishing, or matching the “new” or “changed” items discovered.
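
For contrast, remember that a traditional ping carries no content at all. It is typically just a tiny XML-RPC call, in the weblogUpdates.ping form popularized by Weblogs.com, that names a blog and its URL and says nothing about what changed. (The values below are, of course, placeholders.)

<?xml version="1.0"?>
<methodCall>
  <!-- A content-free ping: it announces only that the blog
       changed, forcing the service to fetch the feed and
       filter out duplicates itself. -->
  <methodName>weblogUpdates.ping</methodName>
  <params>
    <param><value>Some Journal Title</value></param>
    <param><value>http://www.livejournal.com/users/username/</value></param>
  </params>
</methodCall>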

LiveJournal has led for some time in showing a more efficient and effective way for search engines to obtain new and changed postings. What they do is produce an aggregate feed that contains copies of all entries written on any of their public blogs. This feed typically carries 200-300 new entries per minute. But, while that might sound like a great many entries to process each minute, it is vastly fewer than the number that would need to be processed if LiveJournal were to rely on a simple pinging process. The reason is that search engines can focus their ingestion processes on the aggregate feed alone and thus never pay the wasted bandwidth and processing cost that comes from fetching duplicate entries. Of course, LiveJournal benefits as well, since the bandwidth and processing cost of serving external search and/or monitoring systems is drastically reduced. At PubSub, because of the way LiveJournal publishes updates, we find that the cost of processing LiveJournal updates is very much lower than the cost of processing entries from other blog hosting services that use traditional content-free ping formats.

But, as with most RSS-based systems, the LiveJournal system has been based on a polling model. Given the speed with which the feed updates, services like PubSub have been forced to read the LiveJournal feed at least once a minute, if not more frequently. Because the entire (massive) feed must be downloaded that often, and because LiveJournal does not currently support RFC3229+feed, duplicates inevitably appear in the feed. Also, the polling services never have any idea what the publishing rate in the feed is and thus can’t slacken the frequency with which they poll during “slow” periods. The result is that as LiveJournal’s users post less during “slow” periods, the percentage of duplicate entries retrieved increases. Clearly, the solution to the problem is to move to a push feed. In this case, LiveJournal would push the data updates to services that were interested in them rather than forcing those services to poll LiveJournal.
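
For what it’s worth, RFC3229+feed would at least blunt the duplicate problem for pollers: a client that remembers the entity tag from its last fetch can ask for only the entries it has not yet seen. A sketch of such an exchange follows; the path and ETag values are invented for illustration.

GET /aggregate.atom HTTP/1.1
Host: www.livejournal.com
A-IM: feed
If-None-Match: "etag-from-last-poll"

HTTP/1.1 226 IM Used
IM: feed
ETag: "etag-for-this-poll"
Content-Type: application/atom+xml

(an Atom document containing only the entries added since the last poll)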

Brad has proposed a somewhat bent and extended version of Atom which would be streamed over a TCP/IP connection in much the same way that we currently stream FeedMesh data. (I’ve included a snapshot of his proposed format below.) He defines an “AtomStream” and suggests that individual posts from the various LiveJournal-hosted blogs would be included in the stream as a sequence of single-entry feeds. This is a solution that would work… However, I suggest that there is actually no need to do anything other than “vanilla flavored” Atom in order to address the needs here. A stream which began with an atom:feed element and continued with a series of atom:entry elements containing atom:source elements would be a much more natural solution than the “stream of feeds” that Brad proposes. (A sample of what I think a “proper Atom” rendering of Brad’s sample would look like appears below.)

The problem being addressed by “Fat Pings” is very much like the one addressed by the “Atom over XMPP” protocol, and very much like the service that we provide at PubSub.com. I believe it will be an important test of Atom to determine whether it is adequate to handle this sort of problem. I would greatly appreciate comments from others on this use of Atom.
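
For comparison, in the XMPP approach a new post is delivered as an ordinary Atom entry wrapped in a pubsub event notification, roughly as sketched below. (The JIDs and node name are placeholders.)

<message from='pubsub.example.org' to='subscriber@example.com'>
  <event xmlns='http://jabber.org/protocol/pubsub#event'>
    <items node='blogs/username'>
      <item id='entry-32397'>
        <entry xmlns='http://www.w3.org/2005/Atom'>
          <title>some entry title</title>
          <id>tag:livejournal.org,2005:entry-username-32397</id>
          <updated>2005-08-21T16:30:02Z</updated>
        </entry>
      </item>
    </items>
  </event>
</message>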

It should be noted that “Fat Pings” should probably only be generated by large, trusted blog hosting platforms. One of the essential elements of controlling spam in feeds is the ability to trace back to an actual network resource which can be used to verify the data in a “ping” and, to some extent, to identify the publisher of the data. For a service like PubSub to forgo actual verification that an entry exists as claimed by a ping, we would have to be able to trust the pinger. Normally, creating such trust relationships is very expensive. However, given that the vast majority of posts are made on the large services, we can drastically increase the efficiency of the overall system by permitting just a few of these hosters/publishers the privilege of publishing Fat Pings. It is my hope that in the future we’ll be able to rely on Atom’s support for Digital Signatures to drastically expand the number of publishers who could be trusted to publish Fat Pings.
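
Atom allows an enveloped XML Digital Signature to be attached to a feed or to an individual entry. A signed Fat Ping entry would look roughly like the sketch below; the signature body is elided, since the point here is only its placement.

<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
  <title>some entry title</title>
  <id>tag:livejournal.org,2005:entry-username-32397</id>
  <updated>2005-08-21T16:30:02Z</updated>
  <!-- An enveloped signature would let a consumer verify the
       publisher of an entry without reaching back to the
       source blog. -->
  <ds:Signature>...</ds:Signature>
</entry>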


Brad Proposes:


<?xml version='1.0' encoding='utf-8' ?>
<atomStream>
<time>1124247941</time>
<feed xmlns='http://www.w3.org/2005/Atom'>
<title type='text'>some journal title</title>
<link href='http://www.livejournal.com/users/username/' />
<author><name>some name</name></author>
<entry>
<title>some entry title</title>
<link href='http://www.livejournal.com/users/username/12345.html' />
<content type='html'>
content
</content>
</entry>
</feed>


I believe that the sample feed above would be better represented as a “simple” Atom feed whose entries carry source elements. Note: My sample is a bit bigger than Brad’s since I’ve included various bits that are required in Atom but that Brad’s proposal omits. He readily admits in his postings that he has not yet gone to the effort of ensuring that he is issuing compliant data.


I propose the following as an equivalent to Brad’s sample:


<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
   <title>LiveJournal Aggregate Feed</title>
   <link href=""/>
   <updated>2005-08-21T16:30:02Z</updated>
   <author><name>Brad</name></author>
   <id>tag:livejournal.org,2005:aggregatefeed-1</id>

<entry>
   <source>
     <title type="text">Example Feed</title>
     <link href=""/>
     <link rel="self" type="application/atom+xml"
         href=""/>
     <id>tag:livejournal.org,2005:feed-username</id>
     <updated>2005-08-21T16:30:02Z</updated>
     <author><name>John Doe</name></author>
   </source>
   <title>some entry title</title>
   <link rel="alternate" type="text/html"
        href=""/>
   <id>tag:livejournal.org,2005:entry-username-32397</id>
   <published>2005-08-21T16:30:02Z</published>
   <updated>2005-08-21T16:30:02Z</updated>
   <content type="html">
         This is some &lt;b&gt;content&lt;/b&gt;.
   </content>
</entry>

. . .

</feed>


What do you think? Is there any conceptual problem with streaming basic Atom over TCP/IP, HTTP continuous sessions (probably using chunked content), etc.? Is there any really good reason not just to use Atom as defined?
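
To make the HTTP variant concrete: the server would open the atom:feed element once at the start of the session and then write each new entry as it occurs, holding the connection open indefinitely. The sketch below omits the chunk-size framing for readability.

HTTP/1.1 200 OK
Content-Type: application/atom+xml
Transfer-Encoding: chunked

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LiveJournal Aggregate Feed</title>
  <!-- feed-level metadata as in the sample above -->

  <entry>
    <!-- one entry, carrying its atom:source element -->
  </entry>
  <entry>
    <!-- ...and so on, for as long as the connection lasts; the
         closing </feed> tag is sent only if the stream is
         deliberately shut down -->
  </entry>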


            bob wyman

