I'm having an issue with XMLFeedSpider that I've been trying to wrap my 
head around for a week. It's got to be something stupid I'm doing. The file at 
https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml declares 
an encoding of ISO-8859-1. I didn't include it here, but I verified that 
response.body was indeed the full XML document by printing it from an 
adapt_response(self, response) override. I also put some print statements in 
scrapy/utils/iterators.py and verified that nodetext was 
<cvrfdoc><Vulnerability>.........</Vulnerability></cvrfdoc> and that 
nodename was 'Vulnerability'. However, it bombs in the Selector with 
"exceptions.IndexError: list index out of range".
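One thing I started wondering about while poking at this: CVRF documents declare a default XML namespace, and a namespace-aware parser won't match a bare element name like Vulnerability. A minimal stdlib sketch of that effect (the namespace URI here is just illustrative, not necessarily what's in the real file):

```python
# Sketch: a default namespace makes a bare element-name lookup come back
# empty, which would explain an empty selector list and an IndexError on [0].
import xml.etree.ElementTree as ET

doc = ('<cvrfdoc xmlns="http://www.icasi.org/CVRF/schema/cvrf/1.1">'
       '<Vulnerability><CVE>CVE-2014-0001</CVE></Vulnerability>'
       '</cvrfdoc>')
root = ET.fromstring(doc)

# Bare name finds nothing, because every element lives in the namespace:
print(root.findall('Vulnerability'))  # []

# Qualifying the name with the namespace URI does find it:
ns = {'cvrf': 'http://www.icasi.org/CVRF/schema/cvrf/1.1'}
print(len(root.findall('cvrf:Vulnerability', ns)))  # 1
```

If the real file behaves the same way, that would match the symptom of xmliter's `.xpath('//' + nodename)` returning an empty list.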

Obviously, it never gets to the parse_node function, but bombs during the 
itertag processing. Any help would be greatly appreciated because I've been 
floundering. This isn't the only XML file I've come across that didn't 
parse well.

With that stated, I can open the XML file in Firefox, select all, copy it into 
Notepad, and save it as an XML file. The copy won't render in a browser, but 
Scrapy has no problem steamrolling through and parsing it. What a puzzle.
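For what it's worth, here's a quick stdlib sketch of why the declared ISO-8859-1 encoding could matter: bytes that are legal ISO-8859-1 can be illegal UTF-8, so anything decoding the body with the wrong codec would choke, and the Firefox/Notepad round trip would silently re-encode the file. (The sample bytes are my own, nothing from the actual feed.)

```python
# 0xE9 is 'e-acute' in ISO-8859-1 but an invalid byte sequence in UTF-8.
raw = b'Caf\xe9'

# Decoding with the declared encoding works:
print(raw.decode('iso-8859-1'))  # Café

# Decoding the same bytes as UTF-8 blows up:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 decode failed:', exc)
```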

Environment: Python 2.7 with Scrapy 0.22

Spider:
from scrapy.contrib.spiders import XMLFeedSpider
from vulnerability.items import VulnerabilityItem

class CveSpider(XMLFeedSpider):
    name = 'cve'
    # allowed_domains takes bare domains, not URLs with a scheme
    allowed_domains = ['cve.mitre.org']
    start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']

    iterator = 'iternodes'
    itertag = 'Vulnerability'

    def parse_node(self, response, node):
        item = VulnerabilityItem()
        item['id'] = node.xpath('CVE/text()').extract()
        return item
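To sanity-check the XPath in parse_node outside of Scrapy, I've been using a little stdlib helper like this (the sample node and its namespace are made up for illustration, and find_text is my own helper, not a Scrapy API):

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced node, shaped like the ones the feed iterates over.
node = ET.fromstring(
    '<Vulnerability xmlns="urn:example:cvrf">'
    '<CVE>CVE-2014-0001</CVE>'
    '</Vulnerability>')

def find_text(elem, local):
    """Namespace-agnostic lookup: match on the local part of each tag."""
    for child in elem.iter():
        if child.tag.split('}')[-1] == local:
            return child.text
    return None

print(find_text(node, 'CVE'))  # CVE-2014-0001
```

A bare 'CVE' XPath against a namespaced node would come back empty, which is why I've been stripping the namespace part of the tag before comparing.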

Trace:
2014-07-10 11:19:48-0500 [cve] ERROR: Spider error processing <GET https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml>
    Traceback (most recent call last):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
        for x in result:
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 37, in process_spider_output
        for item_or_request in self._process_requests(result):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 52, in _process_requests
        for request in iter(items_or_requests):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 61, in parse_nodes
        for selector in nodes:
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/contrib/spiders/feed.py", line 87, in _iternodes
        for node in xmliter(response, self.itertag):
      File "/home/SlappySquirrel/workspace/vigenv/local/lib/python2.7/site-packages/scrapy/utils/iterators.py", line 31, in xmliter
        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
    exceptions.IndexError: list index out of range
    
2014-07-10 11:19:48-0500 [cve] INFO: Closing spider (finished)
2014-07-10 11:19:48-0500 [cve] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 318,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 3880685,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 7, 10, 16, 19, 48, 933372),
     'log_count/DEBUG': 3,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'spider_exceptions/IndexError': 1,
     'start_time': datetime.datetime(2014, 7, 10, 16, 19, 37, 768318)}
2014-07-10 11:19:48-0500 [cve] INFO: Spider closed (finished)
