It's a blog, so each post has a date and time. On the first crawl, save the date and time of the newest post; on every recrawl, fetch only until you reach that timestamp.
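A framework-agnostic sketch of that idea (the URLs, timestamps, and the `filter_new_posts` helper are hypothetical, just to illustrate; posts are assumed newest-first with ISO timestamps, as most blog index pages list them):

```python
from datetime import datetime

def filter_new_posts(posts, last_seen):
    """Return only the posts newer than the saved timestamp.

    posts: list of (url, iso_timestamp) tuples, newest first.
    last_seen: ISO timestamp of the newest post from the previous
    crawl, or None on the first run.
    """
    if last_seen is None:
        return list(posts)
    cutoff = datetime.fromisoformat(last_seen)
    new = []
    for url, ts in posts:
        if datetime.fromisoformat(ts) <= cutoff:
            break  # newest-first order: everything after this is old
        new.append((url, ts))
    return new

# Hypothetical example data:
posts = [
    ("https://blog.example/post-3", "2014-11-19T09:00:00"),
    ("https://blog.example/post-2", "2014-11-18T09:00:00"),
    ("https://blog.example/post-1", "2014-11-17T09:00:00"),
]
print(filter_new_posts(posts, "2014-11-18T09:00:00"))
# only post-3 is newer than the saved timestamp
```

In a spider you would call something like this on each index page and stop following pagination as soon as it returns fewer posts than the page contains, then persist the newest timestamp for the next run.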
On Wednesday, November 19, 2014 at 12:31:58 UTC-2, john smith wrote:
>
> Hi,
>
> I have a blog which I'd like to crawl every day. This works fine, but I
> don't want to crawl/download everything again, just the new pages.
>
> I thought about generating a list of downloaded URLs and checking in the
> Downloader Middleware whether each URL was downloaded before. The problem
> is that the list is huge, so the lookup takes some time, and it runs for
> every request.
>
> Any better ideas? Is there a good way, or maybe some Scrapy functionality
> I don't know about?
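On the lookup-cost worry in the quoted question: keeping the seen URLs in a hash set makes each membership check effectively constant time, regardless of how many URLs have been stored, so the size of the history stops mattering. (Scrapy's built-in duplicate filter takes a similar fingerprint-set approach.) A minimal sketch, with a hypothetical `SeenUrls` class and example URLs:

```python
import hashlib

class SeenUrls:
    """Constant-time membership checks for already-downloaded URLs.

    Stores fixed-size SHA-1 fingerprints in a set, so a lookup does
    not scan the whole history the way a linear search over a list
    or file would.
    """
    def __init__(self):
        self._seen = set()

    def _fingerprint(self, url):
        return hashlib.sha1(url.encode("utf-8")).digest()

    def add(self, url):
        self._seen.add(self._fingerprint(url))

    def __contains__(self, url):
        return self._fingerprint(url) in self._seen

seen = SeenUrls()
seen.add("https://blog.example/post-1")
print("https://blog.example/post-1" in seen)  # True
print("https://blog.example/post-2" in seen)  # False
```

To persist it between daily runs, dump the fingerprint set to disk at close and reload it at start; the date-cutoff approach above avoids even issuing the requests, while this filters the ones that slip through.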
