My code already contains that argument. For some reason, my original
message was cut off; here's the rest of it:
However, in my crawler this does not happen: the spider never enters the
parse method. I overrode the start_requests method as follows:
from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True, callback=self.parse,
                      errback=self.handle_errors)
and handle_errors gets called, where I can see that Scrapy raised an
IgnoreRequest.
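For reference, a minimal errback along these lines is enough to see it
(this is a sketch, not my exact handle_errors; failure.check and
failure.request are the standard Twisted/Scrapy errback API):

from scrapy.exceptions import IgnoreRequest

def handle_errors(self, failure):
    # failure wraps the exception that aborted the request
    if failure.check(IgnoreRequest):
        self.logger.warning("IgnoreRequest for %s", failure.request.url)
    else:
        self.logger.error(repr(failure))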
My spider is a simple extension of scrapy.Spider, and it is run as advised
in
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
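For completeness, the runner script follows the pattern from that page
(MySpider is a placeholder for my spider class):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(MySpider)  # MySpider is my scrapy.Spider subclass
process.start()          # blocks until the crawl finishes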
What I think is going on is that:
- The spider gets the 302 redirect to http://mhs.mt.gov/
- Puts http://mhs.mt.gov/ into the queue
- Puts http://mhs.mt.gov/ into the set of visited pages
- Asks for the next page
- Sees http://mhs.mt.gov/
- Raises IgnoreRequest because it has already seen the URL (see the
  fingerprint check below)
- Ends
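A quick check of that dedup step: two requests for the same URL produce
the same fingerprint, which is what the default dupefilter compares
(request_fingerprint is Scrapy's standard helper in scrapy.utils.request):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

a = Request('http://mhs.mt.gov/')
b = Request('http://mhs.mt.gov/')
print(request_fingerprint(a) == request_fingerprint(b))  # True -> the second is filtered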
However, this does not happen in the shell.
How can I replicate the shell's behavior in my spider?
Can I tell the crawler to revisit the page once (but not more, otherwise
I'll be stuck in a loop forever)?
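One idea I sketched (untested) is to re-issue the ignored request from the
errback with dont_filter=True, capped by a meta flag so it can only happen
once; 'retried_once' is just a name I made up, and this assumes Scrapy
schedules requests yielded from an errback:

from scrapy.exceptions import IgnoreRequest

def handle_errors(self, failure):
    if failure.check(IgnoreRequest):
        request = failure.request
        # 'retried_once' is an ad-hoc meta key that caps this at one retry
        if not request.meta.get('retried_once'):
            retry = request.replace(dont_filter=True)
            retry.meta['retried_once'] = True
            yield retry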
I searched the web, but people seem more interested in avoiding following
a 302 than in actually following one.