The subject of “Fat Pings”
or full content streaming from blogs has come up on the FeedMesh list and in a proposal by Brad
Fitzpatrick of LiveJournal. I’ve responded to the FeedMesh list
suggesting that the best way to move forward is to simply use Atom feeds rather
than invent new formats. See my response at:
http://groups.yahoo.com/group/feedmesh/message/451

The problem being addressed here is
that of increasing the efficiency with which feed search and/or monitoring
services (like PubSub, Feedster, IceRocket, Technorati, BlogDigger, etc.) obtain
posts from the major blog hosting platforms. In the past, most publishers have
limited what they do to simply sending pings. However, while the pinging
mechanism works fine in the case of low volume publishers, it simply doesn’t
scale to the requirements of high volume publishers like the major blog hosting
platforms (LiveJournal, TypePad, Blogspot, Bryght, etc.). The problem is that when
a service is pinged, it must reach back to the pinging site and retrieve an RSS
or Atom file that probably contains many duplicate entries. The service must
then filter out the dupes before indexing, publishing, or matching the “new”
or “changed” items discovered.

LiveJournal has led for some time
in showing a more efficient and effective way for search engines to obtain new
and changed postings. What they do is produce an aggregate feed that contains copies
of all entries written on any of their public blogs. This feed typically
contains as many as 200-300 new entries per minute. But, while that might sound
like a great many entries to process each minute, it is massively less than the
number of entries that would need to be processed if LiveJournal were to rely
on a simple pinging process. The reason is that search engines can focus their
ingestion processors only on the aggregate feed and thus never need to deal
with the wasted bandwidth and processing that comes from duplicate entries. Of
course, LiveJournal benefits as well since the bandwidth and processing cost of
serving external search and/or monitoring systems is drastically reduced. At
PubSub, because of the way that LiveJournal publishes updates, we find that the
cost of processing LiveJournal updates is very much lower than the cost to
process entries from other blog hosting services that use traditional content-free
ping formats.

But, as with most RSS-based systems,
the LiveJournal system has been based on a polling model. Given the speed with
which the feed updates, services like PubSub have been forced to read the LiveJournal
feed at least once a minute if not more frequently. Given that the entire
(massive) feed must be downloaded very frequently and given that LiveJournal
does not currently support RFC3229+feed,
there are inevitably duplicates that appear in the feed. Also, the polling
services never have any idea what the publishing rate is in the feed and thus
can’t slack off the frequency with which they poll during “slow”
periods. The result is that as the rate at which LiveJournal’s users post
begins to slacken during “slow” periods, the percentage of duplicate entries
increases.

Clearly, the solution to the problem is to move to a push feed. In
this case, LiveJournal would push the data updates to services that were
interested in them rather than forcing those services to poll LiveJournal.

Brad has proposed a somewhat bent
and extended version of Atom which would be streamed over a TCP/IP connection in
much the same way that we currently stream FeedMesh data. (I’ve included
a snapshot of his proposed format below.) He defines an “AtomStream”
and suggests that individual posts from the various LiveJournal-hosted blogs would
be included in the stream as a sequence of single-entry feeds. This is a
solution that would work… However, I suggest that there is actually no
need to do anything other than “vanilla flavored” Atom in order to
address the needs here. A stream which began with an atom:feed element and continued
with a series of atom:entry elements that contained atom:source elements would be
a much more natural solution than the “stream of feeds” that Brad
proposes. (A sample of what I think a “proper Atom” format for Brad’s
sample appears below.)

The problem being addressed by “Fat Pings” is very much like the one addressed
by the “Atom over XMPP” protocol, and like the service that we provide at
PubSub.com. I believe it will be an important
test of Atom to determine if it is adequate to handle this sort of problem. I would
greatly appreciate comments from others on this use of Atom.

It should be noted that “Fat
Pings” are probably only properly generated by large, trusted blog hosting
platforms. One of the essential elements of controlling spam in feeds is the
ability to trace back to an actual network resource which can be used to verify
the data in a “ping” and can be used, to some extent, to identify
the publisher of the data. For a service like PubSub to forgo actual verification
that an entry exists as claimed by a ping, we would have to be able to trust the
pinger. Normally, creating such trust relationships is very expensive. However,
given that the vast majority of posts are made on the large services, we can
drastically increase the efficiency of the overall system by granting just a few
of these hosters/publishers the privilege of publishing Fat Pings. It is my hope
that in the future we’ll be able to rely on Atom’s
support for Digital Signatures to expand drastically the number of publishers
who could be trusted to publish Fat Pings.

Brad Proposes:

<?xml version='1.0' encoding='utf-8' ?>

I believe that the sample feed above would be better
represented as a “simple” Atom feed which contains entries having
source elements. Note: My sample is a bit bigger than Brad’s since I’ve
included various bits that are required in Atom but that Brad’s proposal
omits. He readily admits in his postings that he has not yet gone to the effort
of ensuring that he is issuing compliant data. I propose the following as an
equivalent to Brad’s sample:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LiveJournal Aggregate Feed</title>
  <link href=""/>
  <updated>2005-08-21T16:30:02Z</updated>
  <author><name>Brad</name></author>
  <id>tag:livejournal.org,2005:aggregatefeed-1</id>
  <entry xmlns="http://www.w3.org/2005/Atom">
    <source>
      <title type="text">Example Feed</title>
      <link href=""/>
      <link rel="self" type="application/atom+xml" href=""/>
      <id>tag:livejournal.org,2005:feed-username</id>
      <updated>2005-08-21T16:30:02Z</updated>
      <author><name>John Doe</name></author>
    </source>
    <title>some entry title</title>
    <link rel="alternate" type="text/html" href=""/>
    <id>tag:livejournal.org,2003:entry-username-32397</id>
    <published>2005-08-21T16:30:02Z</published>
    <updated>2005-08-21T16:30:02Z</updated>
    <content type="html">This is some &lt;b&gt;content&lt;/b&gt;.</content>
  </entry>
  . . .
</feed>

What do you think? Is there any conceptual
problem with streaming basic Atom over TCP/IP, HTTP continuous sessions (probably
using chunked content), etc.? Is there any really good reason not just to use
Atom as defined?

bob wyman
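P.S. As a rough sketch of what consuming such a stream might look like on the
receiving end: a client can feed arriving chunks of one long-lived Atom
document into an incremental XML parser and act on each atom:entry as soon as
its closing tag arrives, without ever needing the document to "finish." (This
is purely illustrative; the entry ids and chunk boundaries below are made up,
and nothing here is part of Brad's proposal or any LiveJournal interface.)

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def entries_from_stream(chunks):
    """Yield each completed atom:entry element as chunks of a single,
    never-ending Atom document arrive (e.g. over a chunked HTTP
    response or a raw TCP connection)."""
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        # An "end" event fires when an element's closing tag is parsed,
        # so entries become available as soon as they are complete.
        for _event, elem in parser.read_events():
            if elem.tag == ATOM + "entry":
                yield elem

# Simulated stream: the feed element is opened once, then entries
# trickle in; the enclosing document is never closed.
chunks = [
    "<feed xmlns='http://www.w3.org/2005/Atom'>",
    "<entry><id>tag:example.org,2005:entry-1</id>",
    "<title>first</title></entry>",
    "<entry><id>tag:example.org,2005:entry-2</id>",
    "<title>second</title></entry>",
]

ids = [e.find(ATOM + "id").text for e in entries_from_stream(chunks)]
print(ids)
# → ['tag:example.org,2005:entry-1', 'tag:example.org,2005:entry-2']
```

Note that nothing special is needed beyond an ordinary pull parser: the
"stream" is just one well-formed Atom feed whose end tag simply never arrives.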