That's interesting, thank you. It does seem, though, that this would mean keeping hundreds of thousands of URLs in Redis, and checking against them on every request could get expensive.

Would it be feasible to implement a middleware that only accepts a link if it comes from the same site? That is, does a middleware have access to the referrer / original request?
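Something along these lines is what I have in mind (a rough, untested sketch; the class name is made up, and it compares bare hostnames, so www.example.com and example.com would count as different sites). A spider middleware's process_spider_output() receives the response the links were extracted from, so the origin is available even without a Referer header:

    from urllib.parse import urlparse

    from scrapy import Request

    class SameSiteSpiderMiddleware:
        def process_spider_output(self, response, result, spider):
            # "response" is the page the spider callback just processed,
            # so its host is the origin of every request yielded from it.
            origin = urlparse(response.url).netloc
            for item_or_request in result:
                # Items pass through untouched; requests are kept only
                # when they point at the same host, otherwise dropped.
                if not isinstance(item_or_request, Request):
                    yield item_or_request
                elif urlparse(item_or_request.url).netloc == origin:
                    yield item_or_request

It would be enabled through the SPIDER_MIDDLEWARES setting.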
On Friday, December 18, 2015 at 5:03:43 AM UTC, lnxpgn wrote:
> You could implement a spider middleware which fetches a domain whitelist
> from Redis during initialization; when the whitelist changes later, use
> Redis Pub/Sub to update it. If a URL's domain isn't in the whitelist,
> discard the Request, no matter where it comes from.
>
> 2015-12-17 22:03 GMT+08:00 somewhatofftheway:
>> Exactly, that's what I'm trying to work around. My solution does work; I
>> was just interested in whether anybody had tried other approaches.
>>
>> On Thursday, December 17, 2015 at 2:59:58 AM UTC, lnxpgn wrote:
>>> If the dont_filter argument in Request is True or the spider's
>>> allowed_domains is empty, OffsiteMiddleware does nothing.
>>>
>>> On 15-12-15 2:52 AM, somewhatofftheway wrote:
>>>> I'm trying to implement a spider which will:
>>>>
>>>> a. Pull URLs from a queue of some sort
>>>> b. Only crawl those sites
>>>>
>>>> It's essentially a broad crawl in that it is designed to look at any
>>>> site, but I want to be able to limit the sites rather than letting it
>>>> crawl the whole web.
>>>>
>>>> I had experimented with a RabbitMQ-based solution, but have recently
>>>> been trying scrapy-redis. This generally works very well. However, it
>>>> attempts to crawl sites other than those specified, as
>>>> self.allowed_domains does not get set and therefore the
>>>> OffsiteMiddleware does not trigger.
>>>>
>>>> I implemented a workaround for this; I wanted to present it both in
>>>> case it is useful and to see whether anybody has found better
>>>> solutions to this problem. What I did was:
>>>>
>>>> 1. Modify the parse_start_url function to add the domain in question
>>>> 2. Use a filter_links callback to only allow links from that domain
>>>>
>>>> I guess I could also override make_requests_from_url or similar to
>>>> achieve the same thing?
>>>>
>>>> In any case, any comments on this approach or suggestions on how to
>>>> achieve the above are welcome.
>>>>
>>>> Thanks
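For reference, the workaround described in the quoted post above might look roughly like this, assuming scrapy-redis's RedisCrawlSpider (the spider name, redis_key, and method names are illustrative, and this is untested):

    from urllib.parse import urlparse

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from scrapy_redis.spiders import RedisCrawlSpider

    class LimitedBroadSpider(RedisCrawlSpider):
        name = 'limited_broad'
        redis_key = 'limited_broad:start_urls'  # queue the start URLs are pulled from
        allowed = set()  # domains collected from queued URLs

        rules = (
            Rule(LinkExtractor(), callback='parse_page',
                 process_links='filter_links', follow=True),
        )

        def parse_start_url(self, response):
            # Step 1: record the domain of every URL pulled from the queue.
            self.allowed.add(urlparse(response.url).netloc)
            return []

        def filter_links(self, links):
            # Step 2: only keep links whose host was recorded above.
            return [link for link in links
                    if urlparse(link.url).netloc in self.allowed]

        def parse_page(self, response):
            pass  # actual extraction would go here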

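lnxpgn's whitelist suggestion could be sketched roughly as below, assuming redis-py; the domain_whitelist key, the whitelist_updates channel, and the REDIS_URL setting are placeholders, and the Pub/Sub handling is untested:

    from urllib.parse import urlparse

    import redis
    from scrapy import Request

    class RedisWhitelistSpiderMiddleware:
        def __init__(self, redis_url):
            self.client = redis.Redis.from_url(redis_url)
            self._load()
            # Any message published on the channel triggers a reload of
            # the whitelist; run_in_thread() polls in a background thread.
            pubsub = self.client.pubsub()
            pubsub.subscribe(**{'whitelist_updates': self._reload})
            pubsub.run_in_thread(sleep_time=1, daemon=True)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.get('REDIS_URL', 'redis://localhost:6379'))

        def _load(self):
            # The whitelist lives in a Redis set of allowed domains.
            self.allowed = {d.decode() for d in self.client.smembers('domain_whitelist')}

        def _reload(self, message):
            self._load()

        def process_spider_output(self, response, result, spider):
            for item_or_request in result:
                # Drop any request whose domain is not in the whitelist,
                # no matter where it came from; pass items through.
                if (not isinstance(item_or_request, Request)
                        or urlparse(item_or_request.url).netloc in self.allowed):
                    yield item_or_request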