If a Request's dont_filter argument is True, or the spider's allowed_domains is empty, OffsiteMiddleware does nothing.
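A minimal sketch of both cases (the domains and spider name below are just placeholders, not from your project):

import scrapy

class OffsiteDemoSpider(scrapy.Spider):
    name = "offsite_demo"
    # With allowed_domains non-empty, OffsiteMiddleware drops requests
    # whose host is not example.com or a subdomain of it.
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Dropped by OffsiteMiddleware: the host is not in allowed_domains.
        yield scrapy.Request("https://other.example.org/page")
        # Not dropped: dont_filter=True bypasses the off-site check
        # (and the duplicate filter) entirely.
        yield scrapy.Request("https://other.example.org/other-page",
                             dont_filter=True)

If allowed_domains were left out (or empty), both requests above would go through.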
On 15-12-15 2:52 AM, somewhatofftheway wrote:

I'm trying to implement a spider which will:
a. Pull URLs from a queue of some sort
b. Only crawl those sites

It's essentially a broad crawl in that it is designed to look at any site, but I want to limit it to the queued sites rather than letting it crawl the whole web.

I had experimented with a RabbitMQ-based solution, but have recently been trying scrapy-redis. This generally works very well. However, it attempts to crawl sites other than those specified, because self.allowed_domains never gets set and therefore the OffsiteMiddleware does not trigger.

I implemented a workaround for this; I wanted to present it both in case it is useful and to see whether anybody has found a better solution to this problem. What I did was:
1. Modify the parse_start_url function to add the domain in question
2. Use a filter_links callback to only allow links from that domain

I guess I could also override make_requests_from_url or similar to achieve the same thing? In any case, any comments on this approach are welcome, as are suggestions on how to achieve the above.

Thanks,
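For reference, a rough sketch of the workaround described above. It assumes scrapy-redis's RedisCrawlSpider and wires the filtering in through a Rule's process_links hook (the spider name, redis_key and filter_links method are illustrative, not an established API); treat it as untested:

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class QueueLimitedSpider(RedisCrawlSpider):
    name = "queue_limited"
    redis_key = "queue_limited:start_urls"

    rules = (
        Rule(LinkExtractor(), callback="parse_item",
             process_links="filter_links", follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.allowed = set()  # domains seen on URLs pulled from the queue

    def parse_start_url(self, response):
        # Step 1: remember the domain of each start URL from the queue.
        self.allowed.add(urlparse(response.url).netloc)
        return self.parse_item(response)

    def filter_links(self, links):
        # Step 2: only follow links whose host matches a remembered domain.
        return [link for link in links
                if urlparse(link.url).netloc in self.allowed]

    def parse_item(self, response):
        yield {"url": response.url}

Since allowed_domains stays empty, OffsiteMiddleware never fires; the filtering happens entirely in filter_links.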
