Roger Benningfield wrote:
> We've got a mechanism that allows any user with his own domain
> and a text editor to tell us whether or not he wants us messing with
> his stuff. I think it's foolish to ignore that.
        The problem is that we have *many* such mechanisms; robots.txt is
only one. Others have been mentioned on this list in the past, and still
others are buried in obscure posts that you really have to dig to find. How
do we decide which mechanisms to honor? Also, since I don't think robots.txt
was ever intended for services like the aggregators we're discussing, I
believe that encouraging people to use it in the way you suggest would be an
abuse of the robots.txt system.
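
        To be concrete, the kind of opt-out Roger is suggesting would look
something like the robots.txt fragment below. (The "PubSub" user-agent token
is my own hypothetical example here, not one we actually claim or honor
today.)

# Hypothetical: ask one specific aggregator to stay away entirely
User-agent: PubSub
Disallow: /

# Everyone else may fetch anything
User-agent: *
Disallow: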

> Bob: What about FeedMesh? If I ping blo.gs, they pass that ping
> along to you, and PubSub fetches my feed, then PubSub is doing
> something a desktop client doesn't do.
        Wrong. Some desktop clients *do* work like FeedMesh. Consider the
Shrook distributed checking system[1]. FeedMesh and PubSub work very much
like Shrook's desktop clients do. In the Shrook system, all the desktop
clients report the updates they find back to a central service, which then
distributes the update info to the other clients. The result is that the
total amount of polling is drastically reduced and the freshness of data is
increased, since every client benefits from the polling of all the others.
Although no single client might poll a site more frequently than once an
hour, if you have 60 Shrook clients each polling once an hour on staggered
schedules, each client gets the effect of polling every minute. The Shrook model is
basically the same as the FeedMesh model except that in FeedMesh you
typically ask for info on ALL sites whereas in Shrook, you typically only
get updates for a smaller, enumerated set of feeds. However, the number of
feeds you monitor does not change the basic nature of the distributed
checking system. Shrook and FeedMesh are, as far as I'm concerned, largely
indistinguishable in this area. (There are some detail differences of
course. For instance, Shrook worries about client privacy issues that aren't
relevant in the FeedMesh case.)
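
        For anyone who hasn't looked at how such a system fits together,
here is a rough Python sketch of the distributed-checking pattern. It is
purely illustrative: the class and method names are mine, not Shrook's or
FeedMesh's actual interfaces.

import time
from collections import defaultdict

class CheckingHub:
    """A central service in a Shrook/FeedMesh-style distributed checking
    system: clients report the updates they discover while polling, and the
    hub fans the news out to every other client watching that feed."""

    def __init__(self):
        self.watchers = defaultdict(set)   # feed_url -> set of client ids
        self.last_seen = {}                # feed_url -> newest timestamp reported

    def subscribe(self, client_id, feed_url):
        self.watchers[feed_url].add(client_id)

    def report_update(self, client_id, feed_url, timestamp):
        """Called by a client that polled feed_url and found new content.
        Returns the other clients that should be told without polling."""
        if timestamp <= self.last_seen.get(feed_url, 0):
            return set()   # stale; another client already reported this
        self.last_seen[feed_url] = timestamp
        return self.watchers[feed_url] - {client_id}

# 60 clients, each polling once an hour on staggered schedules, give the
# whole group the effect of roughly one fresh check per minute.
hub = CheckingHub()
for i in range(60):
    hub.subscribe("client-%d" % i, "http://example.org/feed.xml")
told = hub.report_update("client-7", "http://example.org/feed.xml", time.time())
print("%d other clients learned of the update without polling" % len(told))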

        Remember, PubSub only deals with data from Pings and from sites that
have been manually added to our system. We don't do any web scraping and we
don't follow links to find other blogs. Also, we filter out of our system
any feeds that originate with services known to scrape web pages and inject
data that the original publisher never intended to appear in feeds. (Often,
such services try to get around partial feeds by "filling in" the missing
bits with content scraped from the blogs' websites.) Thus, we filter out
any feed that comes from a service like Technorati, since they scrape blogs
and inject scraped content into feeds without the explicit approval or
consent of the publishers of the sites they scrape.
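
        In rough pseudo-code terms, the admission rule amounts to something
like the sketch below (the function and the blocklist contents are
illustrative only, not our actual code):

from urllib.parse import urlparse

# Illustrative blocklist of hosts known to inject scraped content into feeds.
SCRAPER_HOSTS = {"technorati.com"}

def accept_feed(feed_url, came_from_ping_or_manual_add):
    # Only feeds we learned about from a ping or a manual addition are
    # considered at all; no crawling, no link-following.
    if not came_from_ping_or_manual_add:
        return False
    host = urlparse(feed_url).hostname or ""
    # Drop feeds served by services known to scrape and inject content.
    return not any(host == h or host.endswith("." + h) for h in SCRAPER_HOSTS)

print(accept_feed("http://feeds.technorati.com/blog", True))   # False
print(accept_feed("http://example.org/index.xml", True))       # True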

                bob wyman

[1] http://www.fondantfancies.com/apps/shrook/distfaq.php

