Hi!
I'm writing a spider that uses a random proxy (picked from a list of
proxies) to retrieve URLs.
Sometimes, even without a proxy, the response received is not an
HtmlResponse but a plain Response, so body_as_unicode() can't be called
and an exception is raised.
I know for sure that the response must be plain HTML, so I wrote an
extension that checks whether the response has the correct type and, if
not, re-schedules the request:
from scrapy import log
from scrapy.http import HtmlResponse

class CheckNonHTMLResponse(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def process_response(self, request, response, spider):
        # Anything that is not an HtmlResponse gets re-scheduled.
        if not isinstance(response, HtmlResponse) or not hasattr(response, 'body_as_unicode'):
            self.stats.inc_value('rescheduled_non_html_response')
            spider.log('Re-scheduling request of <%s> - non HTML response (%r)'
                       % (request.url, type(response)), level=log.ERROR)
            # Ask the scheduler not to drop this request as a duplicate.
            request.dont_filter = True
            return request
        else:
            spider.log('Got HTML response on <%s> (%r)' % (request.url, type(response)))
            return response
Question: does the Scheduler filter requests that are re-scheduled this
way? The logs showed "Filtered duplicate request ..." for URLs processed by
the extension, so I added request.dont_filter = True.
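
To illustrate what I think is happening, here is a simplified model of a
scheduler with a duplicate filter. It is not Scrapy's actual code:
ToyRequest and ToyScheduler are hypothetical stand-ins, and I'm assuming
the filter simply remembers URLs it has already seen and skips the check
when dont_filter is set.

```python
class ToyRequest:
    """Hypothetical stand-in for a request: a URL plus a dont_filter flag."""

    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


class ToyScheduler:
    """Simplified model: drops already-seen URLs unless dont_filter is set."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, request):
        # A repeated URL is rejected, which would correspond to the
        # "Filtered duplicate request" log line.
        if not request.dont_filter and request.url in self.seen:
            return False
        self.seen.add(request.url)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
request = ToyRequest("http://example.com/page")

accepted_first = scheduler.enqueue(request)   # first time: accepted
accepted_again = scheduler.enqueue(request)   # same URL again: filtered

request.dont_filter = True
accepted_forced = scheduler.enqueue(request)  # flag set: accepted again
```

If the real scheduler behaves like this model, returning the same request
without dont_filter = True would explain the filtered duplicates I saw.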
Thanks