Hello,

You are right, the "iternodes" iterator has an issue with namespaces.
The problem is in scrapy.utils.iterators.xmliter(), which finds nodes with 
regular expressions.

When a matching tag pattern is found, a new XML document is created for it, 
including all the XML headers (those containing the namespace declarations).

But the Selector created for this new XML snippet has no namespaces 
registered, so the XPath //<value_of_itertag> will not match anything in 
your case.
See 
https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
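
To illustrate why a query without a registered namespace comes back empty, here is a minimal sketch using the standard library's xml.etree.ElementTree instead of Scrapy's Selector. The SAMPLE document is a made-up, heavily trimmed stand-in for the CVRF feed, and the variable names are only for illustration; the point is just that the same tag matches once the prefix is mapped to its URI.

```python
import xml.etree.ElementTree as ET

VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

# Hypothetical, trimmed-down document in the shape of a CVRF feed.
SAMPLE = """<cvrfdoc xmlns:vuln="http://www.icasi.org/CVRF/schema/vuln/1.1">
  <vuln:Vulnerability><vuln:CVE>CVE-2014-0001</vuln:CVE></vuln:Vulnerability>
  <vuln:Vulnerability><vuln:CVE>CVE-2014-0002</vuln:CVE></vuln:Vulnerability>
</cvrfdoc>"""

root = ET.fromstring(SAMPLE)

# No namespace given: the elements live in the vuln namespace, so a bare
# tag name does not match them.
unprefixed = root.findall(".//Vulnerability")

# Prefix mapped to the namespace URI: the same query now matches.
prefixed = root.findall(".//vuln:Vulnerability", {"vuln": VULN_NS})

print(len(unprefixed), len(prefixed))
```

The Selector built inside xmliter() is in the first situation: the namespace is declared in the snippet, but nothing tells the XPath engine which prefix refers to it.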

I was able to make "iternodes" work with a custom xmliter that uses XPath's 
local-name():

yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]

but that's not really pretty.
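
The idea behind local-name() is simply to compare tag names with the namespace stripped off. A rough standard-library analogue, assuming ElementTree's "{uri}tag" naming and using hypothetical helper names (local_name, iter_by_local_name) plus a made-up SAMPLE document, could look like this. It is not the custom xmliter mentioned above, only a sketch of the same matching rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down document in the shape of a CVRF feed.
SAMPLE = """<cvrfdoc xmlns:vuln="http://www.icasi.org/CVRF/schema/vuln/1.1">
  <vuln:Vulnerability><vuln:CVE>CVE-2014-0001</vuln:CVE></vuln:Vulnerability>
  <vuln:Vulnerability><vuln:CVE>CVE-2014-0002</vuln:CVE></vuln:Vulnerability>
</cvrfdoc>"""


def local_name(tag):
    """Return the tag without ElementTree's '{namespace-uri}' prefix."""
    return tag.rsplit("}", 1)[-1]


def iter_by_local_name(xml_text, nodename):
    """Yield every element whose local name equals nodename, in any namespace."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == nodename:
            yield element


for node in iter_by_local_name(SAMPLE, "Vulnerability"):
    # The first child of each Vulnerability in SAMPLE is its CVE element.
    print(node[0].text)
```

Like the XPath version, this matches the tag in any namespace, which is why it works without registering anything and also why it is less precise.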

I suggest you use the "xml" iterator and register the 
"http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:

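from scrapy.contrib.spiders import XMLFeedSpider  # scrapy.spiders in Scrapy >= 1.0

# VulnerabilityItem is the Item class from your own project; import it here.


class CveSpider(XMLFeedSpider):
    name = 'cve'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['cve.mitre.org']
    start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']

    iterator = 'xml'
    # (prefix, uri) pairs registered on the selector before itertag is applied
    namespaces = [
        ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1"),
    ]
    itertag = 'vuln:Vulnerability'

    def parse_node(self, response, node):
        item = VulnerabilityItem()
        # extract() returns a list of matching strings
        vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
        item['id'] = vulnerabilityId
        return item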

Hope this helps.

Paul.


On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>
> Upgrading to 0.24.2 since scrapy exposes the selector off the response, 
> but this doesn't yield the desired affect.
>
> This has to be a BUG.
>
