Hello,

You are right, the "iternodes" iterator has an issue with namespaces. The problem is in scrapy.utils.iterators.xmliter(), which uses regular expressions.
When a matching tag pattern is found, a new XML document is created with all the XML headers (those containing the namespace declarations), but the Selector created for this new XML snippet does not have the namespaces registered, so the XPath //<value_of_itertag> will not match anything in your case.

See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31

I was able to make "iternodes" work with a custom xmliter that uses XPath's local-name():

    yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]

but that's not really pretty.

I suggest you use the "xml" iterator and register the "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:

    # scrapy.contrib.spiders in 0.24; scrapy.spiders in Scrapy 1.0+
    from scrapy.contrib.spiders import XMLFeedSpider

    # VulnerabilityItem is the Item class from your own project

    class CveSpider(XMLFeedSpider):
        name = 'cve'
        # allowed_domains takes bare domain names, not URLs
        allowed_domains = ['cve.mitre.org']
        start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']

        iterator = 'xml'
        # (prefix, URI) pairs registered on the selector for each node
        namespaces = [
            ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
        ]
        itertag = 'vuln:Vulnerability'

        def parse_node(self, response, node):
            item = VulnerabilityItem()
            # extract() returns a list of matching strings
            vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
            item['id'] = vulnerabilityId
            return item

Hope this helps.

Paul.

On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>
> Upgrading to 0.24.2 since scrapy exposes the selector off the response,
> but this doesn't yield the desired affect.
>
> This has to be a BUG.
>
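
P.S. The namespace mismatch described above can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector, so it only illustrates the lookup behaviour, not xmliter() itself. The SAMPLE document is a hypothetical, heavily trimmed stand-in for the CVRF feed, and the CVE value in it is a placeholder.

```python
import xml.etree.ElementTree as ET

VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

# Hypothetical stand-in for the real feed: the Vulnerability element
# declares the vuln namespace as its default namespace.
SAMPLE = (
    '<cvrfdoc xmlns="http://www.icasi.org/CVRF/schema/cvrf/1.1">'
    '<Vulnerability xmlns="' + VULN_NS + '">'
    '<CVE>CVE-0000-0000</CVE>'
    '</Vulnerability>'
    '</cvrfdoc>'
)

root = ET.fromstring(SAMPLE)

# An unprefixed name means "element in no namespace", so nothing matches.
plain = root.findall(".//Vulnerability")

# With a prefix mapped to the namespace URI, the same element is found.
ns = {"vuln": VULN_NS}
qualified = root.findall(".//vuln:Vulnerability", ns)
cve_ids = [v.findtext("vuln:CVE", namespaces=ns) for v in qualified]

# Rough analogue of the local-name() workaround: match any namespace.
# The {*} wildcard requires Python 3.8 or later.
any_ns = root.findall(".//{*}Vulnerability")

print(len(plain), len(qualified), len(any_ns), cve_ids)
```

The prefix-to-URI mapping passed to findall() plays the same role as the namespaces attribute on the spider: the prefix itself is arbitrary, only the URI has to match the document.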
