Hi!

I'm writing a spider that uses a random proxy (picked from a list of 
proxies) to retrieve URLs. 
Sometimes, even without a proxy, the response received is not an 
HtmlResponse but a plain Response, so body_as_unicode() can't be called and 
an exception is raised.
I know for sure that the response must be plain HTML, so I wrote an 
extension that checks whether the response has the correct type and, if 
not, re-schedules the request:

import logging

from scrapy.http import HtmlResponse


class CheckNonHTMLResponse(object):
    """Re-schedules requests whose response is not an HtmlResponse."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def process_response(self, request, response, spider):
        if not isinstance(response, HtmlResponse) or not hasattr(response, 'body_as_unicode'):
            self.stats.inc_value('rescheduled_non_html_response')
            spider.log('Re-scheduling request of <%s> - non HTML response (%r)'
                       % (request.url, type(response)), level=logging.ERROR)
            # Bypass the duplicate filter so the retry is not dropped.
            request.dont_filter = True
            return request
        spider.log('Got HTML response on <%s> (%r)' % (request.url, type(response)))
        return response

Question: does the Scheduler filter requests that are re-scheduled this way? 
The logs showed "Filtered duplicate request ..." for URLs processed by the 
extension, so I added request.dont_filter = True.
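
To make the question concrete, here is a toy model of what I think is 
happening. It is not Scrapy's code: ToyRequest and ToyScheduler are 
hypothetical names, and I am assuming the duplicate filter simply remembers 
URLs it has already scheduled and skips requests that have dont_filter set.

```python
# Toy model of a scheduler with a duplicate filter. This is NOT Scrapy's
# code; ToyRequest and ToyScheduler are hypothetical names for illustration.

class ToyRequest:
    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


class ToyScheduler:
    def __init__(self):
        self.seen = set()    # URLs that have already been scheduled
        self.queue = []      # requests waiting to be downloaded

    def enqueue(self, request):
        # A request is dropped only if filtering is enabled for it
        # and its URL has been seen before.
        if not request.dont_filter and request.url in self.seen:
            return False
        self.seen.add(request.url)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
first = ToyRequest('http://example.com/page')
accepted_first = scheduler.enqueue(first)    # new URL, accepted

# Returning the same request again, as process_response does above:
accepted_again = scheduler.enqueue(first)    # already seen, dropped

first.dont_filter = True
accepted_retry = scheduler.enqueue(first)    # filter bypassed, accepted
```

If this model is right, dont_filter = True is needed for the retry to go 
through, but a URL that never returns HTML would then be retried 
indefinitely, so a retry counter kept in request.meta might be a sensible 
addition.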

Thanks
