That's weird. Please find my test spider here: https://gist.github.com/redapple/b4f677640861a77ecae5
The console.log shows it extracts 4522 items.

Could you share your spider and console output using the "xml" iterator?

Thanks.
Paul.

On Monday, July 14, 2014 3:35:50 AM UTC+2, SlappySquirrel wrote:
>
> Paul,
>
> I tried that before initially posting. When it gets to the iterators.py,
> it still errors out with index out of range. What a puzzle!
>
> On Sunday, July 13, 2014 10:41:31 AM UTC-5, Paul Tremberth wrote:
>>
>> Yes, like I said, "iternodes" has issues with namespaces in your case.
>> But the "xml" iterator works when registering the vulnerability namespace.
>> Have you tried my example?
>> On 13 Jul 2014 16:58, "SlappySquirrel" <[email protected]> wrote:
>>
>>> Andrew, that is a good question, but I'm not sure on the answer to that.
>>>
>>> Paul, now using the local-name() is interesting. I've never tried that.
>>> Also, something I omitted from my original post is that I tried using the
>>> namespace within the spider, it simply doesn't work. That's what led to my
>>> pleas for ideas. Just like you said, we're out of luck until Scrapy
>>> supports what we need.
>>>
>>> Thanks for the input.
>>>
>>> On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>>>>
>>>> Hello,
>>>>
>>>> You are right, the "iternodes" iterator has an issue with namespaces.
>>>> The problem is in scrapy.utils.iterators.xmliter(), which uses regular
>>>> expressions.
>>>>
>>>> When matching tag patterns are found, a new XML document is created,
>>>> with all the XML headers
>>>> (those containing namespace declarations),
>>>>
>>>> but the Selector created for this new XML snippet is not registered
>>>> with namespaces,
>>>> so the XPath //<value_of_itertag> will not match anything in your case.
>>>> See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>>>>
>>>> I was able to make "iternodes" work with a custom xmliter that uses
>>>> XPath's local-name()
>>>>
>>>>     yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]
>>>>
>>>> but that's not really pretty.
>>>>
>>>> I suggest you use the "xml" iterator and register the
>>>> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>>>>
>>>> class CveSpider(XMLFeedSpider):
>>>>     name = 'cve'
>>>>     allowed_domains = ['cve.mitre.org']
>>>>     start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>>>>
>>>>     iterator = 'xml'
>>>>     namespaces = [
>>>>         ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>>>>     ]
>>>>     itertag = 'vuln:Vulnerability'
>>>>
>>>>     def parse_node(self, response, node):
>>>>         item = VulnerabilityItem()
>>>>         vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>>>>         item['id'] = vulnerabilityId
>>>>         return item
>>>>
>>>> Hope this helps.
>>>>
>>>> Paul.
>>>>
>>>>
>>>> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>>>>
>>>>> Upgrading to 0.24.2 since scrapy exposes the selector off the
>>>>> response, but this doesn't yield the desired effect.
>>>>>
>>>>> This has to be a BUG.
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "scrapy-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
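
For anyone following the thread, the namespace behaviour discussed above can be reproduced without Scrapy. What follows is only a minimal sketch: it uses Python's standard-library xml.etree.ElementTree instead of Scrapy's Selector, so it illustrates the idea and is not Scrapy's own code path. The two-element sample feed and the CVE ids in it are made-up placeholders; only the vuln namespace URI is taken from the messages above.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the thread.
VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

# Hypothetical two-entry feed; the ids are placeholders, not real CVE data.
SAMPLE = f"""<feed>
  <Vulnerability xmlns="{VULN_NS}"><CVE>CVE-0000-0001</CVE></Vulnerability>
  <Vulnerability xmlns="{VULN_NS}"><CVE>CVE-0000-0002</CVE></Vulnerability>
</feed>"""

root = ET.fromstring(SAMPLE)

# 1. An unqualified name does not match elements that live in a namespace.
#    This mirrors the case where no namespace is registered on the selector.
unqualified = root.findall(".//Vulnerability")

# 2. Registering a prefix for the namespace makes the same lookup succeed.
ns = {"vuln": VULN_NS}
nodes = root.findall(".//vuln:Vulnerability", ns)
ids = [node.findtext("vuln:CVE", namespaces=ns) for node in nodes]

# 3. Comparing only the local part of each tag ignores the namespace,
#    which is the same idea as XPath's local-name() workaround.
by_local_name = [
    el for el in root.iter()
    if el.tag.rsplit("}", 1)[-1] == "Vulnerability"
]

print(len(unqualified), ids, len(by_local_name))
```

In the spider, step 2 corresponds to setting `namespaces` together with `itertag = 'vuln:Vulnerability'`, and step 3 corresponds to the `local-name()` XPath used in the custom xmliter.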
