Hi, I have a blog that I crawl every day. This works fine, but I don't want to crawl/download everything again each time, just the new pages.
I thought about generating a list of downloaded URLs and checking in the Downloader Middleware, for every request, whether the URL has already been downloaded. The problem is that the list is huge, so the lookup takes some time, and it happens on every request. Any better ideas? Is there a good way, or maybe some Scrapy functionality I don't know about?
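
For reference, here is a minimal sketch of what I had in mind. The middleware name, the file name, and the helper logic are all my own placeholders, not anything built into Scrapy; it keeps the seen URLs in a set for fast membership checks and drops duplicates with IgnoreRequest:

    import os
    from scrapy.exceptions import IgnoreRequest

    class SeenUrlsMiddleware:
        """Downloader middleware that skips URLs downloaded on earlier runs.
        (Hypothetical sketch; 'seen_urls.txt' is a placeholder path.)"""

        def __init__(self, seen_file='seen_urls.txt'):
            self.seen_file = seen_file
            # Load previously downloaded URLs into a set for O(1) lookups.
            self.seen = set()
            if os.path.exists(seen_file):
                with open(seen_file) as f:
                    self.seen = set(line.strip() for line in f)

        def process_request(self, request, spider):
            # Raising IgnoreRequest tells Scrapy to drop this request.
            if request.url in self.seen:
                raise IgnoreRequest('already downloaded: %s' % request.url)
            return None  # returning None lets the request proceed normally

        def process_response(self, request, response, spider):
            # Record the URL only after a successful download.
            if response.status == 200 and request.url not in self.seen:
                self.seen.add(request.url)
                with open(self.seen_file, 'a') as f:
                    f.write(request.url + '\n')
            return response

I would then enable it in settings.py via DOWNLOADER_MIDDLEWARES (the project path is a placeholder):

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.SeenUrlsMiddleware': 543,
    }

This works, but it still means loading and holding the whole URL set in memory, which is what I'd like to avoid if Scrapy already offers something better.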
