Hi,

Last night, I was trying to use RFPDupeFilter to discard duplicate URLs.

I implemented a class inheriting from RFPDupeFilter and overrode its 
request_seen() method.
After pointing DUPEFILTER_CLASS in settings.py at the custom class, I 
tested the code, but the crawler still scraped all the duplicate URLs.
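
Roughly, the setup looked like this (a minimal sketch; CustomDupeFilter 
and the myproject module path are placeholder names from my project, and 
on older Scrapy versions the import is scrapy.dupefilter instead of 
scrapy.dupefilters):

    # dupefilters.py -- minimal sketch of the subclass
    from scrapy.dupefilters import RFPDupeFilter

    class CustomDupeFilter(RFPDupeFilter):
        def request_seen(self, request):
            # never printed during the crawl
            print('request_seen: %s' % request.url)
            return super(CustomDupeFilter, self).request_seen(request)

with the corresponding line in settings.py:

    DUPEFILTER_CLASS = 'myproject.dupefilters.CustomDupeFilter'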

After some investigation, I realized that the overridden request_seen() 
method is never called, and this is happening because dont_filter is set 
to True on the requests.
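
A quick check like this one, printing the flag from the spider callback, 
makes that easy to see (parse() here stands in for whatever callback the 
spider uses):

    def parse(self, response):
        # prints True for requests that should have been filtered
        print(response.request.dont_filter)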

Which is weird: according to the Scrapy documentation, it is supposed to 
default to False:

   - *dont_filter* (*boolean*) – indicates that this request should not be 
   filtered by the scheduler. This is used when you want to perform an 
   identical request multiple times, to ignore the duplicates filter. Use it 
   with care, or you will get into crawling loops. Default to False.
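
And indeed, a Request constructed directly gets False, as documented:

    >>> from scrapy.http import Request
    >>> Request('http://example.com/').dont_filter
    False

So something else must be turning the flag on for my requests.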

Just to test, I changed a bit of Scrapy code at
https://github.com/scrapy/scrapy/blob/master/scrapy/core/scheduler.py#L48
from

    if not request.dont_filter and self.df.request_seen(request):

to

    if self.df.request_seen(request):

and the dupefilter finally started working.


Why is this happening? Why is dont_filter set to True on these requests 
when the documented default is False?

Is there a neater solution than changing the original Scrapy library?
