Hi Sungmin, I have a question about your comment: "*Scrapy supports a duplicate filter, but only for the running session; all the URLs you crawled are stored in a Set() if it is activated.*"
I use the following command to pause/resume the spider:

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

But after I resume the spider, the duplicate filter does not seem to work and it still crawls duplicate URLs, even though dont_filter is already set to False on my Requests.

Jack

On Monday, November 24, 2014 at 11:28:30 PM UTC-8, Sungmin Lee wrote:
>
> Hey,
>
> 1. You don't have to re-crawl everything if you already have it. Scrapy supports a duplicate filter, but only for the running session; all the URLs you crawled are stored in a Set() if it is activated.
> In your case, since you are only going to crawl once a day, I suggest you build your own duplicate filter. It's not hard:
> 1) Save the URLs you have already crawled to a file.
> 2) When you re-run the crawler, say the next day, load that list of URLs as a Set() and filter URLs out before requesting them.
> 3) Update the URL list.
>
> 2. If you are asking whether it's possible to automatically get the newly posted URLs, that depends on the site structure.
> You cannot know from the outside which pages have been newly added unless the URLs are in chronological order (index, date, etc.) or the main page has pointers to the new URLs.
> What I can suggest is:
> 1) Imagine you are the owner of the web page: if you were the owner, you would probably want your subscribers to reach the new pages easily, so all the links are likely posted on the main page.
> 2) If it's really a blog, why not take advantage of its RSS feed?
> 3) You may want to use Google to get the links; mix and match filters such as a keyword plus site:www.foo.com, a specific time period, etc.
> Google search results sometimes come in really handy.
>
> Good luck
>
> On Sunday, November 23, 2014 7:43:55 AM UTC-8, john smith wrote:
>>
>> Hi Nicolás,
>>
>> Thanks for your answer. I think your solution would work if the new posts didn't contain any links to old posts. But take, for example, a page like http://www.bbc.com/news/. I would like to crawl it daily and only download new articles. Is there a Scrapy function for that, or do I have to program it myself?
>>
>> On Wednesday, November 19, 2014 at 16:45:53 UTC+1, Nicolás Alejandro Ramírez Quiros wrote:
>>>
>>> It's a blog, and posts have a date and time. Crawl, save the date and time of the last post, and when you recrawl, fetch up to that point.
>>>
>>> On Wednesday, November 19, 2014 at 12:31:58 UTC-2, john smith wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a blog that I would like to crawl every day. This works fine, but I don't want to crawl/download everything again, just the new pages.
>>>>
>>>> I thought about keeping a list of downloaded URLs and checking in the downloader middleware, for each request, whether the URL has already been downloaded. The problem is that the list is huge, so the lookup takes some time, and it happens for every request.
>>>>
>>>> Any better ideas? Is there a good way to do this, or maybe some Scrapy functionality I don't know about?
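
For reference, here is a minimal sketch of the file-backed filter Sungmin describes above (save crawled URLs to a file, reload them as a set() on the next run, and skip them before requesting). The spider name, start URL, CSS selectors, and the seen_urls.txt path are placeholders I made up, not anything from the thread. Because the URLs are held in a set(), the per-request membership check is effectively constant time, so the size of the list should not matter much.

    import os

    import scrapy


    class IncrementalBlogSpider(scrapy.Spider):
        """Sketch of the file-backed duplicate filter: persist crawled URLs
        to a file and reload them as a set() on the next run."""
        name = "incremental_blog"                      # placeholder name
        start_urls = ["http://www.example.com/blog/"]  # placeholder URL
        seen_file = "seen_urls.txt"                    # placeholder path

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # 1) load the URLs crawled on previous runs into a set()
            self.seen = set()
            if os.path.exists(self.seen_file):
                with open(self.seen_file) as f:
                    self.seen = set(line.strip() for line in f)

        def parse(self, response):
            # 2) filter out URLs that were already crawled before requesting them
            for href in response.css("a.post-link::attr(href)").extract():
                url = response.urljoin(href)
                if url in self.seen:  # set lookup is O(1)
                    continue
                yield scrapy.Request(url, callback=self.parse_post)

        def parse_post(self, response):
            # 3) update the URL list so the next run skips this page
            self.seen.add(response.url)
            with open(self.seen_file, "a") as f:
                f.write(response.url + "\n")
            yield {"url": response.url,
                   "title": response.css("title::text").extract_first()}

As for the pause/resume question: if I remember correctly, Scrapy's built-in filter persists its request fingerprints under the JOBDIR (a requests.seen file), so it should survive a resume as long as the same JOBDIR is reused and the spider is stopped gracefully. The file-based approach above is independent of that and may be easier to reason about.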
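
And here is a rough sketch of Nicolás's date/time idea, under the assumption that the listing page shows posts newest-first and each post exposes a machine-readable timestamp (the selectors, the date format, and the last_crawl.txt state file are all assumptions for illustration):

    import os
    from datetime import datetime

    import scrapy


    class NewestFirstSpider(scrapy.Spider):
        """Sketch of the date/time approach: remember the timestamp of the
        newest post seen so far and stop once older posts are reached."""
        name = "newest_first"                          # placeholder name
        start_urls = ["http://www.example.com/blog/"]  # placeholder URL
        state_file = "last_crawl.txt"                  # placeholder path
        date_format = "%Y-%m-%dT%H:%M:%S"              # assumed timestamp format

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # timestamp of the newest post seen on the previous run
            self.last_seen = datetime.min
            if os.path.exists(self.state_file):
                with open(self.state_file) as f:
                    self.last_seen = datetime.strptime(f.read().strip(),
                                                       self.date_format)
            self.newest = self.last_seen

        def parse(self, response):
            # assumes posts are listed newest-first with a <time datetime="..."> tag
            for post in response.css("article.post"):
                stamp = post.css("time::attr(datetime)").extract_first()
                posted = datetime.strptime(stamp, self.date_format)
                if posted <= self.last_seen:
                    return  # everything from here on was fetched on the last run
                self.newest = max(self.newest, posted)
                url = response.urljoin(post.css("a::attr(href)").extract_first())
                yield scrapy.Request(url, callback=self.parse_post)

        def parse_post(self, response):
            yield {"url": response.url}

        def closed(self, reason):
            # persist the newest timestamp for the next run
            with open(self.state_file, "w") as f:
                f.write(self.newest.strftime(self.date_format))

As Sungmin points out, this only helps when the content is chronologically ordered; for a front page like http://www.bbc.com/news/, the URL-set approach above is probably the safer bet.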
