Re: XMLFeedSpider parsing issue with xml file that 8859-1 encoded

Paul Tremberth Fri, 11 Jul 2014 10:40:30 -0700

Hello,

You are right, the "iternodes" iterator has an issue with namespaces.
The problem is in scrapy.utils.iterators.xmliter(), which uses regular expressions to locate the matching tags.

When matching tag patterns are found, a new XML document is created, with
all the XML headers (those containing the namespace declarations).
However, the Selector created for this new XML snippet has no namespaces
registered, so the XPath //<value_of_itertag> will not match anything in your case.
See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
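
To make the failure mode concrete, here is a minimal sketch of the same effect. It uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector, and a tiny made-up document (only the vuln namespace URI comes from the real feed; the CVE id is a placeholder). An unqualified tag name finds nothing until the prefix is mapped to the namespace:

```python
import xml.etree.ElementTree as ET

VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

# Made-up stand-in for the feed: the element lives in the vuln namespace.
SAMPLE = (
    '<doc>'
    '<Vulnerability xmlns="%s"><CVE>CVE-XXXX-0001</CVE></Vulnerability>'
    '</doc>' % VULN_NS
)

root = ET.fromstring(SAMPLE)

# No namespace mapping: the bare name does not match the qualified element.
unqualified = root.findall(".//Vulnerability")

# With a prefix -> URI mapping, the same element is found.
ns = {"vuln": VULN_NS}
qualified = root.findall(".//vuln:Vulnerability", ns)
cve_ids = [v.findtext("vuln:CVE", namespaces=ns) for v in qualified]
```

Here `unqualified` comes back empty while `qualified` holds the one element, which is the same situation the unregistered Selector runs into.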

I was able to make "iternodes" work with a custom xmliter that uses XPath's local-name():

yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]

but that's not really pretty.
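
For illustration only, the idea behind local-name() (match on the tag name and ignore the namespace) can be sketched with the standard library. This is not the custom xmliter itself: ElementTree does not support the local-name() XPath function, so the comparison is done in Python. The helper names, the urn:example:vuln namespace, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def iter_by_local_name(xml_text, nodename):
    """Yield every element whose local name equals nodename, in any namespace."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == nodename:
            yield element


# Made-up document with a prefixed namespace.
SAMPLE = (
    '<doc xmlns:v="urn:example:vuln">'
    '<v:Vulnerability><v:CVE>CVE-XXXX-0001</v:CVE></v:Vulnerability>'
    '<v:Vulnerability/>'
    '</doc>'
)

matches = list(iter_by_local_name(SAMPLE, "Vulnerability"))
```

The trade-off is the same as with local-name(): it works without registering anything, but it will also match same-named elements from unrelated namespaces.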

I suggest you use the "xml" iterator and register the
"http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:

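from scrapy.contrib.spiders import XMLFeedSpider  # scrapy.spiders in Scrapy >= 1.0
# VulnerabilityItem is your own Item class; import it from your project's items module.

class CveSpider(XMLFeedSpider):
    name = 'cve'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['cve.mitre.org']
    start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']

    # with the "xml" iterator, the namespaces below are registered on the Selector
    iterator = 'xml'
    namespaces = [
        ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1"),
    ]
    itertag = 'vuln:Vulnerability'

    def parse_node(self, response, node):
        item = VulnerabilityItem()
        # extract() returns a list of matching strings
        vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
        item['id'] = vulnerabilityId
        return item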

Hope this helps.

Paul.


On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>
> Upgrading to 0.24.2 since scrapy exposes the selector off the response, 
> but this doesn't yield the desired effect.
>
> This has to be a BUG.
>
