Yes, like I said, "iternodes" has issues with namespaces in your case.
But the "xml" iterator works when registering the vulnerability namespace.
Have you tried my example?

On 13 Jul 2014 16:58, "SlappySquirrel" <[email protected]> wrote:
> Andrew, that is a good question, but I'm not sure of the answer to that.
>
> Paul, using local-name() is interesting. I've never tried that.
> Also, something I omitted from my original post is that I tried using the
> namespace within the spider; it simply doesn't work. That's what led to my
> pleas for ideas. Just like you said, we're out of luck until Scrapy
> supports what we need.
>
> Thanks for the input.
>
> On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>>
>> Hello,
>>
>> You are right, the "iternodes" iterator has an issue with namespaces.
>> The problem is in scrapy.utils.iterators.xmliter(), which uses regular
>> expressions.
>>
>> When matching tag patterns are found, a new XML document is created, with
>> all the XML headers (those containing namespace declarations),
>> but the Selector created for this new XML snippet is not registered with
>> namespaces, so the XPath //<value_of_itertag> will not match anything
>> in your case.
>> See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>>
>> I was able to make "iternodes" work with a custom xmliter that uses
>> XPath's local-name():
>>
>>     yield Selector(text=nodetext, type='xml').xpath(
>>         '//*[local-name()="%s"]' % nodename)[0]
>>
>> but that's not really pretty.
>>
>> I suggest you use the "xml" iterator and register the
>> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>>
>>     class CveSpider(XMLFeedSpider):
>>         name = 'cve'
>>         # domain names only, no URL scheme
>>         allowed_domains = ['cve.mitre.org']
>>         start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>>
>>         iterator = 'xml'
>>         # (prefix, URI) pairs made available to XPath expressions
>>         namespaces = [
>>             ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>>         ]
>>         itertag = 'vuln:Vulnerability'
>>
>>         def parse_node(self, response, node):
>>             item = VulnerabilityItem()
>>             # extract() returns a list of matching text nodes
>>             vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>>             item['id'] = vulnerabilityId
>>             return item
>>
>> Hope this helps.
>>
>> Paul.
>>
>> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>>
>>> Upgrading to 0.24.2 since scrapy exposes the selector off the response,
>>> but this doesn't yield the desired effect.
>>>
>>> This has to be a BUG.
>>>
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
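
For anyone reading this thread later, here is a minimal sketch of the namespace behaviour discussed above. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree on a made-up, trimmed-down CVRF-like document, so SAMPLE, NS and extract_cve_ids are illustrative names only. It shows that a bare tag name matches nothing once the elements live in a namespace, that mapping a prefix to the namespace URI fixes the lookup, and that the `{*}` wildcard (Python 3.8+) plays a role roughly comparable to XPath's local-name().

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the thread; the "vuln" prefix is only a local alias.
NS = {"vuln": "http://www.icasi.org/CVRF/schema/vuln/1.1"}

# Made-up, trimmed-down CVRF-like document (NOT the real MITRE feed).
SAMPLE = """\
<cvrfdoc>
  <Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1" Ordinal="1">
    <CVE>CVE-2014-0001</CVE>
  </Vulnerability>
  <Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1" Ordinal="2">
    <CVE>CVE-2014-0002</CVE>
  </Vulnerability>
</cvrfdoc>
"""

root = ET.fromstring(SAMPLE)

# 1. Bare tag name: the elements are namespaced, so nothing matches.
no_ns = root.findall(".//Vulnerability")

# 2. Prefix mapped to the namespace URI: both elements match.
with_ns = root.findall(".//vuln:Vulnerability", NS)

# 3. Namespace wildcard (Python 3.8+): match on the local name only.
wildcard = root.findall(".//{*}Vulnerability")


def extract_cve_ids(nodes):
    """Return the text of each node's vuln:CVE child element."""
    return [node.findtext("vuln:CVE", namespaces=NS) for node in nodes]


cve_ids = extract_cve_ids(with_ns)
print(len(no_ns), len(with_ns), len(wildcard), cve_ids)
```

The second lookup corresponds to the "register the namespace" approach suggested in the thread; the third is closer in spirit to the local-name() workaround and ignores which namespace the element belongs to.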
