Hi,

Last night I was trying to use RFPDupeFilter to discard duplicate URLs. I implemented a class inheriting from RFPDupeFilter, overrode its request_seen() method, and pointed DUPEFILTER_CLASS in settings.py at the custom class.
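Stripped down to the essentials, it looks roughly like this (module and class names simplified; the real filter does a bit more):

    # dupefilters.py
    from scrapy.dupefilters import RFPDupeFilter

    class CustomDupeFilter(RFPDupeFilter):
        def request_seen(self, request):
            # Debug print so I can see when the filter actually runs,
            # then fall back to the default fingerprint-based check.
            print('request_seen: %s' % request.url)
            return super(CustomDupeFilter, self).request_seen(request)

and in settings.py:

    DUPEFILTER_CLASS = 'myproject.dupefilters.CustomDupeFilter'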
I tested the code, but the crawler still scraped all the duplicate URLs. After some investigation I realized that the overridden request_seen() method is never called, and that this happens because dont_filter is set to True on the requests. This is weird; according to the Scrapy documentation it is supposed to default to False:

    - *dont_filter* (*boolean*) – indicates that this request should not
      be filtered by the scheduler. This is used when you want to perform
      an identical request multiple times, to ignore the duplicates
      filter. Use it with care, or you will get into crawling loops.
      Default to False.

Just to test, I ended up changing a bit of the Scrapy code at
https://github.com/scrapy/scrapy/blob/master/scrapy/core/scheduler.py#L48
from

    if not request.dont_filter and self.df.request_seen(request):

to

    if self.df.request_seen(request):

and the dupefilter finally started to work.

Why is this happening? Why is dont_filter set to True by default when the documentation says it defaults to False? And is there a neater solution than changing the Scrapy source itself?
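In case it is relevant, the spider I am testing with is roughly this (URLs and names are placeholders for the real ones):

    # myspider.py
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = [
            'http://example.com/page/1',
            'http://example.com/page/1',  # deliberate duplicate
        ]

        def parse(self, response):
            yield {'url': response.url}

With unpatched Scrapy, both copies get downloaded, which matches request_seen() never being reached.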
