Neil -

To prevent over-polling and, as Roger pointed out, potentially getting
your IP blocked, consider the ETag/If-None-Match headers as well as the
Last-Modified/If-Modified-Since headers:

1.  When you retrieve a feed, store the ETag and Last-Modified response headers.
2.  When you next poll, send those values back in the If-None-Match
and If-Modified-Since request headers, so you only re-download feeds
that have actually been updated:

<cfhttp url="#variables.feedURL#"
        method="GET"
        useragent="feedsquirrel.com (or whatever)"
        throwonerror="yes">
    <cfhttpparam type="header"
                 name="If-None-Match"
                 value="#variables.storedEtagValue#" />
    <cfhttpparam type="header"
                 name="If-Modified-Since"
                 value="#variables.storedLastModifiedValue#" />
</cfhttp>
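
If the feed hasn't changed, the server answers 304 Not Modified with
an empty body, so there's nothing to parse.  A rough sketch of the
response side (assuming the two stored values are persisted per feed
between polls; the variable names just mirror the call above):

<!--- A 304 status means the stored validators still match:
      the feed hasn't changed, so skip parsing entirely. --->
<cfif cfhttp.statusCode CONTAINS "304">
    <cfset variables.feedChanged = false>
<cfelse>
    <cfset variables.feedChanged = true>
    <!--- Parse cfhttp.fileContent as usual, then store the new
          validators for the next poll.  Servers may send one,
          both, or neither header, so default to empty strings. --->
    <cfset variables.storedEtagValue = "">
    <cfset variables.storedLastModifiedValue = "">
    <cfif structKeyExists(cfhttp.responseHeader, "ETag")>
        <cfset variables.storedEtagValue = cfhttp.responseHeader["ETag"]>
    </cfif>
    <cfif structKeyExists(cfhttp.responseHeader, "Last-Modified")>
        <cfset variables.storedLastModifiedValue = cfhttp.responseHeader["Last-Modified"]>
    </cfif>
</cfif>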

A nice way to reduce bandwidth consumption and be respectful of the
host server and feed author.  A few additional suggestions:

1.  Provide a user agent that lets a host server know where the
request is coming from and, if they feel it necessary, block that
request.
2.  Respect the feed author's TTL value (in the case of an RSS 2.0
feed).  Don't update the feed any more often than requested in this
value (if there is one).
3.  Again, in the case of RSS 2.0 feeds, respect any skipDays and
skipHours values.  Don't poll on Sundays if the author has told you
that the feed won't be updated on Sundays.  (There's a sketch of
points 2 and 3 after this list.)
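
For points 2 and 3, something along these lines (a sketch only --
run the skipDays/skipHours check against the channel elements from
your previous fetch, just before each poll; note the RSS 2.0 spec
expresses skipHours in GMT):

<!--- Parse the fetched feed (RSS 2.0 assumed) --->
<cfset variables.feedXml = xmlParse(cfhttp.fileContent)>

<!--- ttl: minimum number of minutes the feed may be cached.
      Never poll more often than the author asks. --->
<cfset variables.pollMinutes = 60>
<cfset variables.ttlNodes = xmlSearch(variables.feedXml, "/rss/channel/ttl")>
<cfif arrayLen(variables.ttlNodes) AND isNumeric(variables.ttlNodes[1].xmlText)>
    <cfset variables.pollMinutes = max(variables.pollMinutes, variables.ttlNodes[1].xmlText)>
</cfif>
<cfset variables.nextPollAt = dateAdd("n", variables.pollMinutes, now())>

<!--- skipDays/skipHours: don't poll inside a window the author
      has declared dead --->
<cfset variables.okToPoll = true>
<cfset variables.skipDays = xmlSearch(variables.feedXml, "/rss/channel/skipDays/day")>
<cfloop from="1" to="#arrayLen(variables.skipDays)#" index="i">
    <cfif variables.skipDays[i].xmlText EQ dayOfWeekAsString(dayOfWeek(now()))>
        <cfset variables.okToPoll = false>
    </cfif>
</cfloop>
<cfset variables.skipHours = xmlSearch(variables.feedXml, "/rss/channel/skipHours/hour")>
<cfloop from="1" to="#arrayLen(variables.skipHours)#" index="i">
    <cfif variables.skipHours[i].xmlText EQ hour(dateConvert("local2Utc", now()))>
        <cfset variables.okToPoll = false>
    </cfif>
</cfloop>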

I know there is a TTL equivalent in RSS 1.0/Atom 1.0, but I honestly
can't remember what it is.  If you look at the specs, it should jump
out.  It's been a while since I wrote the feed aggregator that is
embedded in the product I build.  I don't recall either format having
a decent equivalent for skipDays and skipHours.

On 4/19/06, Roger Benningfield <[EMAIL PROTECTED]> wrote:
> >Currently the site is aggregating ~500 RSS feeds, but checking these feeds
> >is growing to be a pain in the butt.  Having to get CF to check each of
> >these feeds regularly (ideally every 15 minutes) is more difficult than it
> >sounds.
>
> Neil: Polling every fifteen minutes is an enormous waste of CPU and 
> bandwidth... for both you and the source sites. For example, if you're 
> aggregating individual blogs, once every 24 hours will cover the vast 
> majority just fine. Ideally, you'd either opt for some middle ground (once an 
> hour or so), or come up with adaptive code that spaces out polling based upon 
> observed update periods.
>
> But even if you're gonna stick with over-polling (a good way to get your IP 
> blocked), there are places to optimize:
>
> * Use Conditional GET... since 90% of feeds won't have seen an update in the 
> last fifteen minutes, you've saved nearly 90% of your server's effort.
>
> * Make your spider compatible with RFC 3229. It won't help in most cases, but 
> some high-flow publishers (Microsoft, etc.) will send you deltas of their 
> sliding-window feeds. That'll cut down on parsing time.
>
> * Try CFX_HTTP5 in async mode.
>
> --
> Roger Benningfield
> http://admin.mxblogspace.journurl.com/
>
> 
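
For what it's worth, Roger's adaptive-polling idea doesn't have to be
elaborate.  A crude sketch, assuming variables.feedChanged comes from
the conditional GET above and pollMinutes is persisted per feed:

<!--- Tighten the interval when a feed updates, widen it while
      it's quiet.  Clamped between 15 minutes and 24 hours. --->
<cfif variables.feedChanged>
    <cfset variables.pollMinutes = max(15, int(variables.pollMinutes / 2))>
<cfelse>
    <cfset variables.pollMinutes = min(1440, int(variables.pollMinutes * 1.5))>
</cfif>
<cfset variables.nextPollAt = dateAdd("n", variables.pollMinutes, now())>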
