My code already contains that argument. For some reason, my original
message was cut off; here's the rest of it:
However, in my crawler this does not happen: the spider never enters the
parse method. I overrode the start_requests method as follows:
from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True, callback=self.parse,
                      errback=self.handle_errors)
and handle_errors gets called, where I can see that Scrapy raised an
IgnoreRequest.
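For reference, a minimal errback along these lines is enough to see it
(this is a sketch, not my exact handle_errors; failure.check and
failure.request are the standard Twisted/Scrapy errback API):

from scrapy.exceptions import IgnoreRequest

def handle_errors(self, failure):
    # failure wraps the exception that aborted the request
    if failure.check(IgnoreRequest):
        self.logger.warning("IgnoreRequest for %s", failure.request.url)
    else:
        self.logger.error(repr(failure))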
My spider is a simple extension of scrapy.Spider, and it is run as advised
in
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
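For completeness, the runner script follows the pattern from that page
(MySpider is a placeholder for my spider class):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(MySpider)  # MySpider is my scrapy.Spider subclass
process.start()          # blocks until the crawl finishes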
What I think is going on is that:
- The spider gets the 302 redirect to http://mhs.mt.gov/
- Puts http://mhs.mt.gov/ into the queue
- Puts http://mhs.mt.gov/ into the set of visited pages
- Asks for the next page
- Sees http://mhs.mt.gov/
- Raises IgnoreRequest because it has already seen the URL (see the
  fingerprint check below)
- Ends
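A quick check of that dedup step: two requests for the same URL produce
the same fingerprint, which is what the default dupefilter compares
(request_fingerprint is Scrapy's standard helper in scrapy.utils.request):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

a = Request('http://mhs.mt.gov/')
b = Request('http://mhs.mt.gov/')
print(request_fingerprint(a) == request_fingerprint(b))  # True -> the second is filtered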
However, this does not happen in the shell.
How can I replicate the shell's behavior in my spider?
Can I tell the crawler to revisit the page once (but not more, otherwise
I'll be stuck in a loop forever)?
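One idea I sketched (untested) is to re-issue the ignored request from the
errback with dont_filter=True, capped by a meta flag so it can only happen
once; 'retried_once' is just a name I made up, and this assumes Scrapy
schedules requests yielded from an errback:

from scrapy.exceptions import IgnoreRequest

def handle_errors(self, failure):
    if failure.check(IgnoreRequest):
        request = failure.request
        # 'retried_once' is an ad-hoc meta key that caps this at one retry
        if not request.meta.get('retried_once'):
            retry = request.replace(dont_filter=True)
            retry.meta['retried_once'] = True
            yield retry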
I searched the web, but people seem more interested in avoiding following
a 302 than in actually following one.