Hi Sungmin, I have a question about your comment: "*Scrapy supports a duplicate filter, but only for the running session; all the URLs you crawled are stored in a Set() if it is activated.*"
I use the following command to pause/resume the spider:

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

But after I resume the spider, the duplicate filter does not seem to work and it still crawls duplicate URLs, even though dont_filter is already set to False on my Requests.

Jack

On Monday, November 24, 2014 at 11:28:30 PM UTC-8, Sungmin Lee wrote:
>
> Hey,
>
> 1. You don't have to re-crawl everything if you already have it. Scrapy supports a duplicate filter, but only for the running session; all the URLs you crawled are stored in a Set() if it is activated.
> In your case, since you are only going to crawl once a day, I suggest you build your own duplicate filter. It's not hard:
> 1) Save the URLs you have already crawled to a file.
> 2) When you re-run the crawler, say the next day, load that list of URLs as a Set() and filter URLs out before requesting them.
> 3) Update the URL list.
>
> 2. If you are asking whether it's possible to automatically get the newly posted URLs, that depends on the site structure.
> You cannot know from the outside which pages have been newly added unless the URLs are in chronological order (index, date, etc.) or the main page has pointers to the new URLs.
> What I can suggest is:
> 1) Imagine you are the owner of the web page: if you were the owner, you would probably want your subscribers to reach the new pages easily, so all the links are likely posted on the main page.
> 2) If it's really a blog, why not take advantage of its RSS feed?
> 3) You may want to use Google to get the links; mix and match filters such as a keyword plus site:www.foo.com, a specific time period, etc.
> Google search results sometimes come in really handy.
>
> Good luck
>
> On Sunday, November 23, 2014 7:43:55 AM UTC-8, john smith wrote:
>>
>> Hi Nicolás,
>>
>> Thanks for your answer. I think your solution would work if the new posts didn't contain any links to old posts. But take, for example, a page like http://www.bbc.com/news/. I would like to crawl it daily and only download new articles. Is there a Scrapy function for that, or do I have to program it myself?
>>
>> On Wednesday, November 19, 2014 at 16:45:53 UTC+1, Nicolás Alejandro Ramírez Quiros wrote:
>>>
>>> It's a blog, and posts have a date and time. Crawl, save the date and time of the last post, and when you recrawl, fetch up to that point.
>>>
>>> On Wednesday, November 19, 2014 at 12:31:58 UTC-2, john smith wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a blog that I would like to crawl every day. This works fine, but I don't want to crawl/download everything again, just the new pages.
>>>>
>>>> I thought about keeping a list of downloaded URLs and checking in the downloader middleware, for each request, whether the URL has already been downloaded. The problem is that the list is huge, so the lookup takes some time, and it happens for every request.
>>>>
>>>> Any better ideas? Is there a good way to do this, or maybe some Scrapy functionality I don't know about?
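
For reference, here is a minimal sketch of the file-backed filter Sungmin describes above (save crawled URLs to a file, reload them as a set() on the next run, and skip them before requesting). The spider name, start URL, CSS selectors, and the seen_urls.txt path are placeholders I made up, not anything from the thread. Because the URLs are held in a set(), the per-request membership check is effectively constant time, so the size of the list should not matter much.

    import os

    import scrapy


    class IncrementalBlogSpider(scrapy.Spider):
        """Sketch of the file-backed duplicate filter: persist crawled URLs
        to a file and reload them as a set() on the next run."""
        name = "incremental_blog"                      # placeholder name
        start_urls = ["http://www.example.com/blog/"]  # placeholder URL
        seen_file = "seen_urls.txt"                    # placeholder path

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # 1) load the URLs crawled on previous runs into a set()
            self.seen = set()
            if os.path.exists(self.seen_file):
                with open(self.seen_file) as f:
                    self.seen = set(line.strip() for line in f)

        def parse(self, response):
            # 2) filter out URLs that were already crawled before requesting them
            for href in response.css("a.post-link::attr(href)").extract():
                url = response.urljoin(href)
                if url in self.seen:  # set lookup is O(1)
                    continue
                yield scrapy.Request(url, callback=self.parse_post)

        def parse_post(self, response):
            # 3) update the URL list so the next run skips this page
            self.seen.add(response.url)
            with open(self.seen_file, "a") as f:
                f.write(response.url + "\n")
            yield {"url": response.url,
                   "title": response.css("title::text").extract_first()}

As for the pause/resume question: if I remember correctly, Scrapy's built-in filter persists its request fingerprints under the JOBDIR (a requests.seen file), so it should survive a resume as long as the same JOBDIR is reused and the spider is stopped gracefully. The file-based approach above is independent of that and may be easier to reason about.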
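
And here is a rough sketch of Nicolás's date/time idea, under the assumption that the listing page shows posts newest-first and each post exposes a machine-readable timestamp (the selectors, the date format, and the last_crawl.txt state file are all assumptions for illustration):

    import os
    from datetime import datetime

    import scrapy


    class NewestFirstSpider(scrapy.Spider):
        """Sketch of the date/time approach: remember the timestamp of the
        newest post seen so far and stop once older posts are reached."""
        name = "newest_first"                          # placeholder name
        start_urls = ["http://www.example.com/blog/"]  # placeholder URL
        state_file = "last_crawl.txt"                  # placeholder path
        date_format = "%Y-%m-%dT%H:%M:%S"              # assumed timestamp format

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # timestamp of the newest post seen on the previous run
            self.last_seen = datetime.min
            if os.path.exists(self.state_file):
                with open(self.state_file) as f:
                    self.last_seen = datetime.strptime(f.read().strip(),
                                                       self.date_format)
            self.newest = self.last_seen

        def parse(self, response):
            # assumes posts are listed newest-first with a <time datetime="..."> tag
            for post in response.css("article.post"):
                stamp = post.css("time::attr(datetime)").extract_first()
                posted = datetime.strptime(stamp, self.date_format)
                if posted <= self.last_seen:
                    return  # everything from here on was fetched on the last run
                self.newest = max(self.newest, posted)
                url = response.urljoin(post.css("a::attr(href)").extract_first())
                yield scrapy.Request(url, callback=self.parse_post)

        def parse_post(self, response):
            yield {"url": response.url}

        def closed(self, reason):
            # persist the newest timestamp for the next run
            with open(self.state_file, "w") as f:
                f.write(self.newest.strftime(self.date_format))

As Sungmin points out, this only helps when the content is chronologically ordered; for a front page like http://www.bbc.com/news/, the URL-set approach above is probably the safer bet.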
