Found the problem. I have a custom middleware that checks the MIME type of each response. A 302 response wasn't matching any allowed type, so the request was being discarded. I added a check to my middleware: if the response status is 302, do nothing and pass the response through. Now it works.
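For anyone hitting the same issue, here is a minimal sketch of what that fix could look like. The names (`MimeFilterMiddleware`, `ALLOWED_TYPES`) are illustrative, not from the original post; the `try`/`except` on the import just lets the sketch run without Scrapy installed:

```python
# Hypothetical downloader middleware that filters responses by MIME type
# but lets redirect responses through untouched, as described above.
try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # stand-in so the sketch runs without Scrapy installed
    class IgnoreRequest(Exception):
        pass

# Illustrative whitelist; adjust to whatever your crawl actually needs.
ALLOWED_TYPES = (b"text/html", b"application/xhtml+xml")

class MimeFilterMiddleware:
    def process_response(self, request, response, spider):
        # Redirects carry no useful Content-Type for filtering; pass them
        # through so RedirectMiddleware can follow them to the target page.
        if response.status in (301, 302, 303, 307, 308):
            return response
        content_type = response.headers.get(b"Content-Type", b"")
        if not any(t in content_type for t in ALLOWED_TYPES):
            raise IgnoreRequest(f"Unwanted MIME type: {content_type!r}")
        return response
```

The key point is the early `return response` for redirect statuses: the MIME check only makes sense once the final, non-redirect response arrives.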
Thanks! Michele C

On Tuesday, November 18, 2014 at 14:34:16 UTC-5, Michele Coscia wrote:
> My code already contains that argument. For some reason, my original
> message was cut. Here's the rest of it:
>
> However, in my crawler this does not happen. The spider never enters the
> parse method. I overrode the start_requests method as
>
>     def start_requests(self):
>         for url in self.start_urls:
>             yield Request(url, dont_filter=True, callback=self.parse,
>                           errback=self.handle_errors)
>
> and handle_errors gets called, where I can see that Scrapy raised an
> IgnoreRequest.
>
> My spider is a simple extension of scrapy.Spider. It is invoked as advised in
> http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script.
>
> What I think is going on:
> - The spider gets the 302 redirect to http://mhs.mt.gov/
> - Puts http://mhs.mt.gov/ into the queue
> - Puts http://mhs.mt.gov/ into the set of visited pages
> - Asks for the next page
> - Sees http://mhs.mt.gov/
> - Raises IgnoreRequest because it has already seen the URL
> - Ends
>
> However, this does not happen in the shell.
>
> How can I get the shell's behavior into my spider?
> Can I tell the crawler to revisit the page once (but not more, otherwise
> I'll be stuck forever)?
>
> I searched the web, but people are more interested in avoiding
> following a 302 than in actually following one.