I am trying to write a customised RobotsTxtMiddleware; the following code
snippets are from it. I need to fetch robots.txt through an nginx proxy first.
My problem is with the errback callback: it seems to be triggered only for
some errors, such as 4xx responses, but it is never called for errors like DNS
lookup failures, connection refused, internet failure, etc. The funny thing is
that the exact same code works perfectly when I use it in a spider, where it
captures all kinds of errors. If I look at the stats middleware for the
scenario where my proxy is disabled, I see a connection-refused stat, but
somehow this is never picked up by the errback. I am using Scrapy v0.24. Am I
missing something?
"downloader/exception_count": 1,
"downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError":
1,
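For comparison, here is roughly how the same thing looks in a spider, where
the errback does catch connection-level failures as well (a minimal sketch;
the spider name and URL are just placeholders):

from scrapy.spider import Spider
from scrapy.http import Request

class ProbeSpider(Spider):
    name = 'probe'  # placeholder

    def start_requests(self):
        # Same kind of request, but scheduled through the spider instead
        # of crawler.engine.download()
        yield Request('http://example.com/robots.txt',
                      callback=self.parse_robots,
                      errback=self.on_error)

    def parse_robots(self, response):
        print 'got robots: %s' % response.url

    def on_error(self, failure):
        # Here this fires for DNS failures, connection refused, timeouts, etc.
        print 'request failed: %r' % failure.value

And here are the middleware snippets: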
# imports used by these snippets (Scrapy 0.24 module paths)
from time import time

from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached
from scrapy.contrib.spidermiddleware.httperror import HttpError
from reppy.parser import Rules  # reppy's rules class; path may vary by version

def _download_robots(self, robots_url, req_netloc, spider):
    print "requested download: %s" % robots_url
    robots_req = Request(
        robots_url,
        priority=self.DOWNLOAD_PRIORITY,
        meta={'bypass_robots': True, 'req_netloc': req_netloc},
    )
    dfd = self.crawler.engine.download(robots_req, spider)
    dfd.addCallback(self._download_success)
    dfd.addErrback(self._download_error, robots_req, spider)
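To see what the deferred from engine.download actually carries, one could hang
a pass-through handler on it right after the errback (a sketch; _debug_result
is a made-up helper):

from twisted.python.failure import Failure

def _debug_result(self, result, robots_url):
    # addBoth fires this for success and failure alike
    if isinstance(result, Failure):
        print 'robots dfd errored for %s: %r' % (robots_url, result.value)
    else:
        print 'robots dfd fired for %s' % robots_url
    return result  # pass the result through unchanged

# in _download_robots, after the addErrback:
# dfd.addBoth(self._debug_result, robots_url)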
def _download_error(self, failure, request, spider):
    # Not called for non-HTTP errors :(
    netloc = request.meta.get('req_netloc')
    url = urlparse_cached(request)
    print 'download error %s' % url.geturl()
    # Check whether we failed through nginx; if so, a direct fetch is tried next
    if self._downloading_robots[netloc][0] == CacheLocation.nginx:
        self._downloading_robots[netloc] = (CacheLocation.nginx,
                                            DownloadingStatus.downloaded)
        print 'download error nginx'
    else:
        print 'download error direct'
        # We have failed directly too; check response codes and act accordingly
        if isinstance(failure.value, HttpError):
            http_status = failure.value.response.status
            status = http_status if http_status in (401, 403, 404) else 404
            print 'status http %d' % status
        else:
            # Rest of the failures: allow fetching ;)
            status = 404
            print 'status failure %d' % status
        # Make a reppy rule and add it to the cache
        rules = Rules(netloc, status, '', time() + self._cache_lifespan)
        self._robots_cache.add(rules)
        # Remove from downloading robots
        self._downloading_robots.pop(netloc, None)
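If the errback did fire for network-level failures, Twisted's Failure.check
could be used to tell them apart from HTTP errors; a sketch of what I would
expect to work there (_classify_failure is an illustrative helper):

from twisted.internet.error import (ConnectionRefusedError, DNSLookupError,
                                    TCPTimedOutError, TimeoutError)

def _classify_failure(self, failure):
    # Return a short label for the kind of failure the errback received
    if failure.check(DNSLookupError):
        return 'dns'
    if failure.check(ConnectionRefusedError):
        return 'refused'
    if failure.check(TimeoutError, TCPTimedOutError):
        return 'timeout'
    return 'other'

# e.g. at the top of _download_error:
# print 'failure kind: %s' % self._classify_failure(failure)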