It's a blog, so each post has a date and time. On the first crawl, save the date and time of the newest post; on every recrawl, fetch only until you reach that timestamp.
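A framework-agnostic sketch of that idea (the URLs, timestamps, and the `filter_new_posts` helper are hypothetical, just to illustrate; posts are assumed newest-first with ISO timestamps, as most blog index pages list them):

```python
from datetime import datetime

def filter_new_posts(posts, last_seen):
    """Return only the posts newer than the saved timestamp.

    posts: list of (url, iso_timestamp) tuples, newest first.
    last_seen: ISO timestamp of the newest post from the previous
    crawl, or None on the first run.
    """
    if last_seen is None:
        return list(posts)
    cutoff = datetime.fromisoformat(last_seen)
    new = []
    for url, ts in posts:
        if datetime.fromisoformat(ts) <= cutoff:
            break  # newest-first order: everything after this is old
        new.append((url, ts))
    return new

# Hypothetical example data:
posts = [
    ("https://blog.example/post-3", "2014-11-19T09:00:00"),
    ("https://blog.example/post-2", "2014-11-18T09:00:00"),
    ("https://blog.example/post-1", "2014-11-17T09:00:00"),
]
print(filter_new_posts(posts, "2014-11-18T09:00:00"))
# only post-3 is newer than the saved timestamp
```

In a spider you would call something like this on each index page and stop following pagination as soon as it returns fewer posts than the page contains, then persist the newest timestamp for the next run.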
On Wednesday, November 19, 2014 at 12:31:58 UTC-2, john smith wrote:
>
> Hi,
>
> I have a blog which I'd like to crawl every day. This works fine, but I
> don't want to crawl/download everything again, just the new pages.
>
> I thought about generating a list of downloaded URLs and checking in the
> Downloader Middleware whether each URL was downloaded before. The problem
> is that the list is huge, so the lookup takes some time, and it runs for
> every request.
>
> Any better ideas? Is there a good way, or maybe some Scrapy functionality
> I don't know about?
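On the lookup-cost worry in the quoted question: keeping the seen URLs in a hash set makes each membership check effectively constant time, regardless of how many URLs have been stored, so the size of the history stops mattering. (Scrapy's built-in duplicate filter takes a similar fingerprint-set approach.) A minimal sketch, with a hypothetical `SeenUrls` class and example URLs:

```python
import hashlib

class SeenUrls:
    """Constant-time membership checks for already-downloaded URLs.

    Stores fixed-size SHA-1 fingerprints in a set, so a lookup does
    not scan the whole history the way a linear search over a list
    or file would.
    """
    def __init__(self):
        self._seen = set()

    def _fingerprint(self, url):
        return hashlib.sha1(url.encode("utf-8")).digest()

    def add(self, url):
        self._seen.add(self._fingerprint(url))

    def __contains__(self, url):
        return self._fingerprint(url) in self._seen

seen = SeenUrls()
seen.add("https://blog.example/post-1")
print("https://blog.example/post-1" in seen)  # True
print("https://blog.example/post-2" in seen)  # False
```

To persist it between daily runs, dump the fingerprint set to disk at close and reload it at start; the date-cutoff approach above avoids even issuing the requests, while this filters the ones that slip through.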
