Re: XMLFeedSpider parsing issue with xml file that 8859-1 encoded

SlappySquirrel Sun, 13 Jul 2014 18:36:12 -0700

Paul,

I tried that before my initial post. When it gets to iterators.py, it 
still errors out with "index out of range". What a puzzle!
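
For anyone following along, here is a minimal sketch of how an unregistered namespace can end in "index out of range". It uses the standard library's xml.etree.ElementTree rather than Scrapy's Selector, and the tiny sample document and the first_or_none helper are made up for illustration, so treat it as an analogy and not as what iterators.py does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the CVRF feed: <Vulnerability> declares the vuln
# namespace as its default, so it and its children are namespace-qualified.
VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"
SAMPLE = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<cvrfdoc>'
    '<Vulnerability xmlns="%s"><CVE>CVE-2014-0001</CVE></Vulnerability>'
    '</cvrfdoc>' % VULN_NS
)

# Parse from bytes so the ISO-8859-1 declaration is honoured by the parser.
root = ET.fromstring(SAMPLE.encode("iso-8859-1"))

# A bare tag name does not match a namespace-qualified element ...
unqualified = root.findall(".//Vulnerability")


def first_or_none(nodes):
    """Return the first match, or None instead of raising IndexError."""
    return nodes[0] if nodes else None


# ... and indexing the empty result with [0] is what raises IndexError.
try:
    unqualified[0]
    error = None
except IndexError as exc:
    error = exc

# Mapping the "vuln" prefix to the namespace URI makes the query match.
ns = {"vuln": VULN_NS}
qualified = root.findall(".//vuln:Vulnerability", ns)
cve_id = qualified[0].find("vuln:CVE", ns).text
```

If the same thing is happening in the spider, guarding the [0] only hides the symptom; the query itself still has to know about the namespace.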


On Sunday, July 13, 2014 10:41:31 AM UTC-5, Paul Tremberth wrote:
>
> Yes, like I said, "iternodes" has issues with namespaces in your case. 
> But the "xml" iterator works when registering the vulnerability namespace.
> Have you tried my example?
> On 13 Jul 2014 at 16:58, "SlappySquirrel" <[email protected]> wrote:
>
>> Andrew, that is a good question, but I'm not sure of the answer.
>>
>> Paul, using local-name() is interesting. I've never tried that. 
>> Also, something I omitted from my original post is that I tried using the 
>> namespace within the spider, and it simply doesn't work. That's what led to 
>> my pleas for ideas. Just like you said, we're out of luck until Scrapy 
>> supports what we need.
>>
>> Thanks for the input.
>>
>> On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>>>
>>> Hello,
>>>
>>> You are right, the "iternodes" iterator has an issue with namespaces.
>>> The problem is in scrapy.utils.iterators.xmliter(), which uses regular 
>>> expressions to find nodes.
>>>
>>> When a matching tag pattern is found, a new XML document is created 
>>> with all the XML headers (those containing namespace declarations), 
>>> but the Selector created for this new XML snippet does not have the 
>>> namespaces registered, so the XPath //<value_of_itertag> will not 
>>> match anything in your case.
>>> See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>>>
>>> I was able to make "iternodes" work with a custom xmliter that uses 
>>> XPath's local-name():
>>>
>>>     yield Selector(text=nodetext, type='xml').xpath(
>>>         '//*[local-name()="%s"]' % nodename)[0]
>>>
>>> but that's not really pretty.
>>>
>>> I suggest you use the "xml" iterator and register the 
>>> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>>>
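>>> # Import path for Scrapy 0.24; it is scrapy.spiders in Scrapy >= 1.0
>>> from scrapy.contrib.spiders import XMLFeedSpider
>>>
>>> class CveSpider(XMLFeedSpider):
>>>     name = 'cve'
>>>     # allowed_domains takes bare domain names, not URLs
>>>     allowed_domains = ['cve.mitre.org']
>>>     start_urls = [
>>>         'https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml'
>>>     ]
>>>
>>>     iterator = 'xml'
>>>     # (prefix, URI) pairs; the prefix is what itertag and XPaths refer to
>>>     namespaces = [
>>>         ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>>>     ]
>>>     itertag = 'vuln:Vulnerability'
>>>
>>>     def parse_node(self, response, node):
>>>         # VulnerabilityItem is assumed to come from the project's items module
>>>         item = VulnerabilityItem()
>>>         vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>>>         item['id'] = vulnerabilityId
>>>         return item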
>>>
>>> Hope this helps.
>>>
>>> Paul.
>>>
>>>
>>> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>>>
>>>> I upgraded to 0.24.2, since Scrapy exposes the selector on the response, 
>>>> but this doesn't yield the desired effect.
>>>>
>>>> This has to be a BUG.
>>>>
>>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
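
The local-name() workaround quoted above matches on the local part of the tag and ignores the namespace. A rough standard-library analogue, assuming Python 3.8 or later (where ElementTree accepts the {*} wildcard), might look like the sketch below. The sample document and the iter_local helper are hypothetical, and this is not Scrapy's xmliter.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry feed using the vuln prefix from the thread.
SAMPLE = (
    '<cvrfdoc xmlns:vuln="http://www.icasi.org/CVRF/schema/vuln/1.1">'
    '<vuln:Vulnerability><vuln:CVE>CVE-2014-0001</vuln:CVE></vuln:Vulnerability>'
    '<vuln:Vulnerability><vuln:CVE>CVE-2014-0002</vuln:CVE></vuln:Vulnerability>'
    '</cvrfdoc>'
)


def iter_local(root, local_name):
    """Return an iterator over descendants with this local name, in any namespace."""
    # "{*}name" matches the tag in any namespace or in none (Python 3.8+).
    return root.iterfind(".//{*}%s" % local_name)


root = ET.fromstring(SAMPLE)
cve_ids = [node.findtext("{*}CVE") for node in iter_local(root, "Vulnerability")]
```

Like local-name(), the wildcard also matches a same-named element from an unrelated namespace, so registering the namespace explicitly is the stricter choice when the feed mixes vocabularies.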

