I am trying to write a customised RobotsTxtMiddleware; the following code
snippets are from it. I need to fetch robots.txt through an nginx proxy first.
My problem is with the errback callback: it seems to be triggered only for
some errors, like 4xx responses, but it is not called at all for errors like
DNS lookup failures, connection refused, or general network failure. The
funny thing is that the same errback works perfectly when I use it in a
spider, where it captures all kinds of errors. If I look at the stats
middleware for the scenario where my proxy is disabled, I see a
connection-refused stat, but somehow this is never picked up by the errback.
I am using scrapy v0.24. Am I missing something?

"downloader/exception_count": 1, 

"downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError":
 
1, 
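
For comparison, this is roughly the shape of the errback that works for me in
a spider, where it catches HTTP and transport errors alike (a minimal sketch;
_spider_errback is just an illustrative name, and the HttpError import path is
the 0.24-era one):

from twisted.internet.error import ConnectionRefusedError, DNSLookupError
from scrapy.contrib.spidermiddleware.httperror import HttpError  # 0.24-era path

def _spider_errback(self, failure):
    # failure.check() returns the first matching exception class, or None
    if failure.check(HttpError):
        print 'http error %d' % failure.value.response.status
    elif failure.check(DNSLookupError, ConnectionRefusedError):
        print 'transport error: %s' % failure.getErrorMessage()
    else:
        print 'other failure: %s' % failure.getErrorMessage()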


def _download_robots(self, robots_url, req_netloc, spider):
    print "requested download: %s" % robots_url
    robots_req = Request(
        robots_url,
        priority=self.DOWNLOAD_PRIORITY,
        # bypass_robots keeps this request from being checked against
        # robots.txt itself; req_netloc identifies the cache entry
        meta={'bypass_robots': True, 'req_netloc': req_netloc}
    )
    dfd = self.crawler.engine.download(robots_req, spider)
    dfd.addCallback(self._download_success)
    dfd.addErrback(self._download_error, robots_req, spider)
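
To see what the deferred from engine.download actually fires with when the
proxy is down, I could hook both sides of the chain (debugging sketch;
_debug_result is a hypothetical helper):

def _debug_result(self, result):
    # addBoth fires for success and failure alike, so this shows whether
    # the deferred ever sees the ConnectionRefusedError as a Failure
    print 'engine.download fired with %r' % (result,)
    return result  # pass the result/failure through unchanged

# wired up right after the callback/errback pair above:
# dfd.addBoth(self._debug_result)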

 

def _download_error(self, failure, request, spider):
    # Not called for non-HTTP errors :(
    netloc = request.meta.get('req_netloc')
    url = urlparse_cached(request)
    print 'download error ' + url.geturl()

    # Check if we have failed via nginx; if so, we try directly
    if self._downloading_robots[netloc][0] == CacheLocation.nginx:
        self._downloading_robots[netloc] = (CacheLocation.nginx,
                                            DownloadingStatus.downloaded)
        print 'download error nginx'
    else:
        print 'download error direct'
        # We have failed directly too; check response codes and act accordingly
        if isinstance(failure.value, HttpError):
            http_status = failure.value.response.status
            status = http_status if http_status in (401, 403, 404) else 404
            print 'status http %d' % status
        else:
            # Rest of the failures, allow fetching ;)
            status = 404
            print 'status failure %d' % status

        # Make a reppy rule and add it to the cache
        rules = Rules(netloc, status, '', time() + self._cache_lifespan)
        self._robots_cache.add(rules)

        # Remove from downloading robots
        self._downloading_robots.pop(netloc, None)
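
For what it's worth, the callback/errback wiring itself seems fine: on a bare
twisted Deferred the errback fires as expected (standalone sketch):

from twisted.internet import defer
from twisted.internet.error import ConnectionRefusedError

def on_error(failure):
    print 'errback got: %s' % failure.getErrorMessage()

d = defer.Deferred()
d.addErrback(on_error)
d.errback(ConnectionRefusedError())  # wrapped in a Failure; on_error fires

So the problem seems specific to how engine.download delivers download
exceptions, not to the deferred chain itself.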
