If the dont_filter argument of the Request is True, or the spider's
allowed_domains is empty, OffsiteMiddleware does nothing.
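
To illustrate (a made-up spider, not taken from the original post): with
allowed_domains left empty nothing is filtered at all, and a Request created
with dont_filter=True skips the check even when allowed_domains is set.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Leave this empty or out entirely and OffsiteMiddleware filters nothing.
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Dropped by OffsiteMiddleware: the host is not in allowed_domains.
        yield scrapy.Request("https://other.example.org/a", callback=self.parse)
        # Kept: dont_filter=True makes OffsiteMiddleware skip the check
        # (it also bypasses the duplicate filter).
        yield scrapy.Request("https://other.example.org/b", callback=self.parse,
                             dont_filter=True)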

On 15-12-15 2:52 AM, somewhatofftheway wrote:

I'm trying to implement a spider which will:

a. Pull URLs from a queue of some sort
b. Only crawl those sites

It's essentially a broad crawl in that it is designed to look at any site,
but I want to be able to limit the sites rather than letting it crawl the
whole web.

I had experimented with a RabbitMQ-based solution, but have recently been
trying scrapy-redis. This generally seems to work very well. However, it
attempts to crawl sites other than those specified, as self.allowed_domains
does not get set and therefore the OffsiteMiddleware does not trigger.

I implemented a workaround for this; I wanted to present it both in case it
is useful and to see if anybody has found better solutions to this problem.

What I did was:

1. Modify the parse_start_url function to add the domain in question
2. Use a filter_links callback to only allow links from that domain (sketched below)
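
Roughly, it looks like this (a simplified, untested sketch: the real spider
inherits from scrapy-redis's RedisCrawlSpider rather than plain CrawlSpider,
the "filter_links callback" is wired in through the Rule's process_links hook,
and the class name and self.allowed set are just placeholders):

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class QueueFedSpider(CrawlSpider):
    name = "queue_fed"
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True,
             process_links="filter_links"),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.allowed = set()  # domains collected from the queued start URLs

    def parse_start_url(self, response):
        # Step 1: record the domain of each start URL pulled from the queue.
        self.allowed.add(urlparse(response.url).netloc)
        return self.parse_item(response)

    def filter_links(self, links):
        # Step 2: only keep links whose host matches a recorded domain.
        return [link for link in links
                if urlparse(link.url).netloc in self.allowed]

    def parse_item(self, response):
        yield {"url": response.url}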

I guess I could also override make_requests_from_url or similar to achieve
the same thing?
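
Roughly (again an untested sketch, dropped into the spider above), that would
record the domain when the request is first built instead of when the response
arrives:

    def make_requests_from_url(self, url):
        # Record the domain as soon as a URL is taken from the queue, then
        # build the request as usual via the default implementation.
        self.allowed.add(urlparse(url).netloc)
        return super().make_requests_from_url(url)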

In any case, any comments on this approach are welcome, as are suggestions on
other ways to achieve the above.

Thanks,
