I'm having an issue with the XMLFeedSpider that I've been trying to wrap my head around for a week. It's got to be something simple that I'm doing. The file at https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml declares an encoding of ISO-8859-1. I didn't include it below, but I verified that response.body was indeed the full XML file body; I did a quick print on it in a def adapt_response(self, response) override. I also threw some printouts into scrapy/utils/iterators.py and verified that nodetext is <cvrfdoc><Vulnerability>.........</Vulnerability></cvrfdoc> and that nodename is Vulnerability. However, it bombs in the Selector with "exceptions.IndexError: list index out of range".
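One guess I've been kicking around: an empty selector result like this often happens when the document declares a default XML namespace, so an unqualified query like //Vulnerability matches nothing and the [0] index blows up. A minimal stdlib sketch of that effect (the snippet and the CVRF namespace URI here are my stand-ins, not the real feed):

```python
import xml.etree.ElementTree as ET

# Toy fragment shaped like the CVRF feed; the namespace URI is an
# assumption on my part -- check the actual file's root element.
doc = """<cvrfdoc xmlns="http://www.icasi.org/CVRF/schema/cvrf/1.1">
  <Vulnerability><CVE>CVE-2014-0001</CVE></Vulnerability>
</cvrfdoc>"""

root = ET.fromstring(doc)

# An unqualified search comes back empty -- the same kind of empty
# result that makes xpath('//Vulnerability')[0] raise IndexError.
print(root.findall('Vulnerability'))   # []

# Qualifying the tag with the namespace finds the node.
ns = {'cvrf': 'http://www.icasi.org/CVRF/schema/cvrf/1.1'}
print(len(root.findall('cvrf:Vulnerability', ns)))   # 1
```

If that's the cause here, it would explain why the node text looks fine in the printouts but the selector still finds nothing.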
Obviously it never gets to the parse_node function; it bombs during the itertag processing. Any help would be greatly appreciated, because I've been floundering, and this isn't the only XML file I've come across that didn't parse well. That said, I can open the XML file in Firefox, select all, copy into Notepad, and save it as an XML file. That copy won't render in a browser, but Scrapy has no problem steamrolling through and parsing it. What a puzzle!

Environment: Python 2.7 with Scrapy 0.22

Spider:

    from scrapy.contrib.spiders import XMLFeedSpider
    from scrapy.http import XmlResponse
    from vulnerability.items import VulnerabilityItem

    class CveSpider(XMLFeedSpider):
        name = 'cve'
        allowed_domains = ['cve.mitre.org']
        start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
        iterator = 'iternodes'
        itertag = 'Vulnerability'

        def parse_node(self, response, node):
            item = VulnerabilityItem()
            vulnerabilityId = node.xpath('CVE/text()').extract()
            item['id'] = vulnerabilityId
            return item

Trace:

    2014-07-10 11:19:48-0500 [cve] ERROR: Spider error processing <GET https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml>
    Traceback (most recent call last):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
        for x in result:
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 37, in process_spider_output
        for item_or_request in self._process_requests(result):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 52, in _process_requests
        for request in iter(items_or_requests):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 61, in parse_nodes
        for selector in nodes:
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 87, in _iternodes
        for node in xmliter(response, self.itertag):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/iterators.py", line 31, in xmliter
        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
    exceptions.IndexError: list index out of range

    2014-07-10 11:19:48-0500 [cve] INFO: Closing spider (finished)
    2014-07-10 11:19:48-0500 [cve] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 318,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 3880685,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 7, 10, 16, 19, 48, 933372),
         'log_count/DEBUG': 3,
         'log_count/ERROR': 1,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'spider_exceptions/IndexError': 1,
         'start_time': datetime.datetime(2014, 7, 10, 16, 19, 37, 768318)}
    2014-07-10 11:19:48-0500 [cve] INFO: Spider closed (finished)
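Since re-saving the file through Notepad (which re-encodes the bytes) makes Scrapy happy, I'd also double-check that the body really is ISO-8859-1 as the declaration claims. A quick sanity-check sketch (Python 3 syntax for brevity; the bytes here are a stand-in for the real downloaded file, and the regex is just a rough heuristic for the XML prolog):

```python
import re

def declared_encoding(raw):
    """Return the encoding named in the XML declaration, or None."""
    m = re.match(rb'<\?xml[^>]*encoding=["\']([\w.-]+)["\']', raw)
    return m.group(1).decode('ascii') if m else None

# Stand-in bytes; in practice read the downloaded feed instead, e.g.
# raw = open('allitems-cvrf-year-2014.xml', 'rb').read()
raw = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<cvrfdoc/>'

enc = declared_encoding(raw)
print(enc)  # ISO-8859-1

# If the declared encoding lies, decoding will either raise or produce
# mojibake; try the declared one first, then UTF-8 as a cross-check.
text = raw.decode(enc)
assert text.startswith('<?xml')
```

A mismatch between the declared and actual encoding would fit the symptoms: the raw bytes trip up the parser, while the re-encoded Notepad copy sails through.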
