Re: XMLFeedSpider parsing issue with xml file that 8859-1 encoded

Paul Tremberth Tue, 15 Jul 2014 02:32:12 -0700

That's weird.
Please find my test spider 
here: https://gist.github.com/redapple/b4f677640861a77ecae5


The console.log shows that it extracts 4522 items.

Could you share your spider and console output using the "xml" iterator?
Thanks.

Paul.

On Monday, July 14, 2014 3:35:50 AM UTC+2, SlappySquirrel wrote:
>
> Paul,
>
> I tried that before my initial post. When it gets to iterators.py, 
> it still errors out with "index out of range". What a puzzle!
>
> On Sunday, July 13, 2014 10:41:31 AM UTC-5, Paul Tremberth wrote:
>>
>> Yes, like I said, "iternodes" has issues with namespaces in your case. 
>> But the "xml" iterator works when registering the vulnerability namespace.
>> Have you tried my example ?
>> On 13 Jul 2014 at 16:58, "SlappySquirrel" <[email protected]> wrote:
>>
>>> Andrew, that is a good question, but I'm not sure of the answer.
>>>
>>> Paul, using local-name() is interesting; I've never tried that. 
>>> Also, something I omitted from my original post is that I tried using the 
>>> namespace within the spider, and it simply doesn't work. That's what led to my 
>>> plea for ideas. Just like you said, we're out of luck until Scrapy 
>>> supports what we need.
>>>
>>> Thanks for the input.
>>>
>>> On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>>>>
>>>> Hello,
>>>>
>>>> You are right, the "iternodes" iterator has an issue with namespaces.
>>>> The problem is in scrapy.utils.iterators.xmliter(), which uses regular 
>>>> expressions.
>>>>
>>>> When text matching the tag pattern is found, a new XML document is created 
>>>> with all the XML headers (those containing namespace declarations),
>>>> but the Selector created for this new XML snippet has no namespaces 
>>>> registered, so the XPath //<value_of_itertag> will not match anything 
>>>> in your case.
>>>> See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>>>>
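The mismatch described above can be reproduced outside Scrapy. What follows is a minimal sketch using Python's standard-library ElementTree rather than Scrapy's Selector, so it illustrates the general namespace behaviour and not the exact code path in xmliter(). The sample document and the CVE placeholders are made up for illustration; only the vuln namespace URI comes from this thread.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in this thread.
VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

# Hypothetical, heavily trimmed stand-in for the CVRF feed (not the real file).
SAMPLE = (
    "<cvrfdoc>"
    '<Vulnerability xmlns="%s"><CVE>CVE-EXAMPLE-1</CVE></Vulnerability>'
    '<Vulnerability xmlns="%s"><CVE>CVE-EXAMPLE-2</CVE></Vulnerability>'
    "</cvrfdoc>"
) % (VULN_NS, VULN_NS)

root = ET.fromstring(SAMPLE)

# A path without a namespace only matches elements that are in no namespace,
# so the namespaced <Vulnerability> elements are not found.
plain_matches = root.findall(".//Vulnerability")

# Binding a prefix to the namespace URI makes the same elements reachable.
prefixes = {"vuln": VULN_NS}
ns_matches = root.findall(".//vuln:Vulnerability", prefixes)
cve_ids = [node.findtext("vuln:CVE", namespaces=prefixes) for node in ns_matches]

print(len(plain_matches), cve_ids)
```

The un-prefixed lookup comes back empty while the prefixed one finds both nodes, which is the same effect as an itertag XPath run against a selector with no namespaces registered.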
>>>> I was able to make "iternodes" work with a custom xmliter that uses 
>>>> XPath's local-name():
>>>>
>>>>     yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]
>>>>
>>>> but that's not really pretty.
>>>>
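For comparison, a namespace-agnostic match in the spirit of local-name() can be sketched with the standard library alone. This assumes Python 3.8 or later, where ElementTree paths accept the {*} wildcard. It is an analogue of the XPath trick, not Scrapy's API, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same local tag name appears in a namespace.
SAMPLE = (
    "<cvrfdoc>"
    '<Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1">'
    "<CVE>CVE-EXAMPLE-1</CVE>"
    "</Vulnerability>"
    '<Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1">'
    "<CVE>CVE-EXAMPLE-2</CVE>"
    "</Vulnerability>"
    "</cvrfdoc>"
)

root = ET.fromstring(SAMPLE)

# "{*}tag" matches the tag in any namespace (or none), much like
# //*[local-name()="tag"] does in XPath. Requires Python 3.8+.
any_ns = root.findall(".//{*}Vulnerability")

# ElementTree stores tags as "{uri}local"; strip the URI to get the local name.
local_names = [el.tag.rsplit("}", 1)[-1] for el in any_ns]

print(len(any_ns), local_names)
```

Matching on the local name alone avoids registering prefixes, at the cost of also matching same-named elements from unrelated namespaces.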
>>>> I suggest you use the "xml" iterator and register the 
>>>> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>>>>
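>>>> from scrapy.contrib.spiders import XMLFeedSpider  # scrapy.spiders in Scrapy 1.0+
>>>>
>>>> class CveSpider(XMLFeedSpider):
>>>>     name = 'cve'
>>>>     # allowed_domains takes bare domain names, not URLs
>>>>     allowed_domains = ['cve.mitre.org']
>>>>     start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>>>>
>>>>     iterator = 'xml'
>>>>     # (prefix, URI) pairs made available to XPath queries on each node
>>>>     namespaces = [
>>>>         ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>>>>     ]
>>>>     itertag = 'vuln:Vulnerability'
>>>>
>>>>     def parse_node(self, response, node):
>>>>         # VulnerabilityItem is the Item class defined in your own project
>>>>         item = VulnerabilityItem()
>>>>         vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>>>>         item['id'] = vulnerabilityId
>>>>         return item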
>>>>
>>>> Hope this helps.
>>>>
>>>> Paul.
>>>>
>>>>
>>>> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>>>>
>>>>> I upgraded to 0.24.2, since Scrapy exposes the selector on the 
>>>>> response, but this doesn't yield the desired effect.
>>>>>
>>>>> This has to be a BUG.
>>>>>
>>>>  -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "scrapy-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

