Hey,
1. You don't have to re-crawl everything if you already have the pages.
Scrapy has a built-in duplicate filter, but it only applies within a single
run: while the spider is running, the URLs you have crawled are kept in a
set() in memory.
In your case, since you are going to crawl only once a day, I suggest you
build your own duplicate filter. It's not hard (see the sketch below):
1) save the URLs you have already crawled to a file
2) when you re-run the crawler, say the next day, load that list of URLs
into a set() and filter out the already-seen URLs before making requests
3) update the URL list at the end of the run
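
Here is a minimal sketch of steps 1)-3) as a Scrapy spider. The file name
seen_urls.txt, the start URL and the CSS selector are just placeholders, so
adapt them to your blog:

import os
import scrapy


class DailyBlogSpider(scrapy.Spider):
    name = "daily_blog"
    start_urls = ["http://www.example.com/"]  # placeholder blog index
    seen_file = "seen_urls.txt"               # placeholder file name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # steps 1) + 2): load the URLs crawled on previous days into a set()
        self.seen = set()
        if os.path.exists(self.seen_file):
            with open(self.seen_file) as f:
                self.seen = set(line.strip() for line in f if line.strip())

    def parse(self, response):
        # placeholder selector for links to individual posts
        for href in response.css("a.post-link::attr(href)").extract():
            url = response.urljoin(href)
            if url in self.seen:
                continue              # already crawled on an earlier run
            self.seen.add(url)
            yield scrapy.Request(url, callback=self.parse_post)

    def parse_post(self, response):
        yield {"url": response.url, "title": response.css("title::text").extract_first()}

    def closed(self, reason):
        # step 3): persist the updated URL list for the next day's run
        with open(self.seen_file, "w") as f:
            f.write("\n".join(sorted(self.seen)))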
2. If you are asking whether it's possible to automatically discover newly
posted URLs, that depends on the site structure.
You cannot tell from the outside which pages have just been added unless
the URLs are in chronological order (index, date, etc.) or the main page has
pointers to the new URLs.
What I can suggest is:
1) Imagine you are the owner of the site: as the owner, you probably want
your subscribers to reach the new pages easily, so all the links are likely
posted on the main page.
2) If it's really a blog, why not take advantage of the RSS feed? (see the
sketch below)
3) You may want to use Google to get the links. Mix and match filters such
as keyword site:www.foo.com, searching within a specific period, etc.
Using Google search results sometimes comes in really handy.
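
For suggestion 2), here is a rough sketch that pulls only the new post URLs
from the feed using the feedparser package (the feed URL is a placeholder
and you need to install feedparser separately):

import feedparser

FEED_URL = "http://www.foo.com/rss"   # placeholder feed URL of the blog

def new_post_urls(seen):
    """Return feed entry links that are not in the already-crawled set."""
    feed = feedparser.parse(FEED_URL)
    return [entry.link for entry in feed.entries if entry.link not in seen]

if __name__ == "__main__":
    already_crawled = set()            # e.g. loaded from seen_urls.txt
    for url in new_post_urls(already_crawled):
        print(url)                     # feed these URLs to your spider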
Good luck
On Sunday, November 23, 2014 7:43:55 AM UTC-8, john smith wrote:
>
> Hi Nicolás,
>
> thanks for your answer. I think your solution would work if the new posts
> don't have any links to old posts. But let's take a page like
> http://www.bbc.com/news/ as an example. I'd like to crawl it daily and only
> want to download new articles. Is there a Scrapy function for that, or must
> I program it myself?
>
> On Wednesday, 19 November 2014 16:45:53 UTC+1, Nicolás Alejandro
> Ramírez Quiros wrote:
>>
>> It's a blog; the posts have a date and time. Crawl it, save the date and
>> time of the last post, and when you re-crawl, fetch only up to that point.
>>
>> On Wednesday, 19 November 2014 12:31:58 UTC-2, john smith wrote:
>>>
>>> Hi,
>>>
>>> I have a blog which I'd like to crawl every day. This works fine, but I
>>> don't want to crawl/download everything again, just the new pages.
>>>
>>> I thought about generating a list of downloaded URLs and checking in the
>>> downloader middleware each time whether the URL has already been
>>> downloaded. The problem is that the list is huge, the lookup takes some
>>> time, and it happens for every request.
>>>
>>> Any better ideas? Is there a good way to do this, or maybe some Scrapy
>>> functionality I don't know about?
>>>
>>