Paul, I tried that before my initial post. When it gets to iterators.py, it still errors out with "index out of range". What a puzzle!
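For anyone following along, the "index out of range" is at least consistent with Paul's explanation quoted below: if the XPath for the itertag matches nothing, indexing the empty result with [0] fails. Here is a minimal sketch of that failure mode. It uses only the standard library's xml.etree.ElementTree as a stand-in, so it is an analogue and not Scrapy's actual code; the XML snippet and the CVE id in it are made-up placeholders.

```python
import xml.etree.ElementTree as ET

VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

# Hypothetical, trimmed-down stand-in for one node of the CVRF feed.
SNIPPET = (
    '<cvrfdoc xmlns:vuln="%s">'
    "<vuln:Vulnerability><vuln:CVE>CVE-0000-0000</vuln:CVE></vuln:Vulnerability>"
    "</cvrfdoc>" % VULN_NS
)

root = ET.fromstring(SNIPPET)
ns = {"vuln": VULN_NS}

# 1. Unqualified name: the element lives in a namespace, so nothing matches.
#    Indexing this empty list with [0] would raise IndexError.
unqualified = root.findall(".//Vulnerability")

# 2. Prefix mapped to the namespace URI: the node is found.
qualified = root.findall(".//vuln:Vulnerability", ns)

# 3. Compare local names only, a rough analogue of XPath's local-name().
by_local_name = [
    el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "Vulnerability"
]

cve_ids = [el.findtext("vuln:CVE", namespaces=ns) for el in qualified]
print(len(unqualified), len(qualified), len(by_local_name), cve_ids)
```

Whether this is exactly what happens inside iterators.py in my case I can't say for sure, but it shows why a missing namespace registration turns into an index error rather than a clearer message.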
On Sunday, July 13, 2014 10:41:31 AM UTC-5, Paul Tremberth wrote:
>
> Yes, like I said, "iternodes" has issues with namespaces in your case.
> But the "xml" iterator works when registering the vulnerability namespace.
> Have you tried my example?
>
> On 13 Jul 2014 16:58, "SlappySquirrel" <[email protected]> wrote:
>
>> Andrew, that is a good question, but I'm not sure of the answer to that.
>>
>> Paul, using local-name() is interesting. I've never tried that.
>> Also, something I omitted from my original post is that I tried using the
>> namespace within the spider; it simply doesn't work. That's what led to my
>> pleas for ideas. Just like you said, we're out of luck until Scrapy
>> supports what we need.
>>
>> Thanks for the input.
>>
>> On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>>>
>>> Hello,
>>>
>>> You are right, the "iternodes" iterator has an issue with namespaces.
>>> The problem is in scrapy.utils.iterators.xmliter(), which uses regular
>>> expressions.
>>>
>>> When matching tag patterns are found, a new XML document is created,
>>> with all the XML headers (those containing namespace declarations),
>>> but the Selector created for this new XML snippet is not registered with
>>> namespaces, so the XPath //<value_of_itertag> will not match anything in
>>> your case.
>>> See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>>>
>>> I was able to make "iternodes" work with a custom xmliter that uses
>>> XPath's local-name():
>>>
>>>     yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]
>>>
>>> but that's not really pretty.
>>>
>>> I suggest you use the "xml" iterator and register the
>>> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>>>
>>>     class CveSpider(XMLFeedSpider):
>>>         name = 'cve'
>>>         # allowed_domains takes bare domain names, not URLs
>>>         allowed_domains = ['cve.mitre.org']
>>>         start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>>>
>>>         iterator = 'xml'
>>>         namespaces = [
>>>             ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>>>         ]
>>>         itertag = 'vuln:Vulnerability'
>>>
>>>         def parse_node(self, response, node):
>>>             item = VulnerabilityItem()
>>>             vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>>>             item['id'] = vulnerabilityId
>>>             return item
>>>
>>> Hope this helps.
>>>
>>> Paul.
>>>
>>> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>>>
>>>> I upgraded to 0.24.2, since Scrapy exposes the selector off the response,
>>>> but this doesn't yield the desired effect.
>>>>
>>>> This has to be a BUG.
>>>>

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
