Re: XMLFeedSpider parsing issue with xml file that 8859-1 encoded

Paul Tremberth Sun, 13 Jul 2014 08:41:39 -0700

Yes, like I said, "iternodes" has issues with namespaces in your case.
But the "xml" iterator works when registering the vulnerability namespace.
Have you tried my example?
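
For anyone following along without the feed at hand, here is a minimal sketch of why registering the namespace matters. It uses only Python's standard-library ElementTree rather than Scrapy, and the two-entry document below (the cvrf namespace on the root and the placeholder CVE IDs) is made up for illustration, not real CVE data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the layout of the CVRF feed.
# Passed as bytes so the ISO-8859-1 declaration is honoured by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<cvrfdoc xmlns="http://www.icasi.org/CVRF/schema/cvrf/1.1">
  <Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1">
    <CVE>CVE-0000-0001</CVE>
  </Vulnerability>
  <Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1">
    <CVE>CVE-0000-0002</CVE>
  </Vulnerability>
</cvrfdoc>"""

# Prefix-to-URI map, the same pair the spider registers in `namespaces`.
NS = {"vuln": "http://www.icasi.org/CVRF/schema/vuln/1.1"}

root = ET.fromstring(SAMPLE)

# Without a namespace the path looks for an un-namespaced tag and finds nothing.
unqualified = root.findall(".//Vulnerability")

# With the prefix registered, the same path matches both entries.
qualified = root.findall(".//vuln:Vulnerability", NS)
cve_ids = [v.findtext("vuln:CVE", namespaces=NS) for v in qualified]

print(len(unqualified), cve_ids)
```

The "xml" iterator in the spider does the equivalent of the second lookup once the "vuln" prefix is listed in `namespaces`.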
On 13 Jul 2014 at 16:58, "SlappySquirrel" <[email protected]> wrote:


> Andrew, that is a good question, but I'm not sure of the answer.
>
> Paul, using local-name() is interesting. I've never tried that.
> Also, something I omitted from my original post is that I tried using the
> namespace within the spider, and it simply doesn't work. That's what led to
> my pleas for ideas. Just like you said, we're out of luck until Scrapy
> supports what we need.
>
> Thanks for the input.
>
> On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>>
>> Hello,
>>
>> You are right, the "iternodes" iterator has an issue with namespaces.
>> The problem is in scrapy.utils.iterators.xmliter(), which uses regular
>> expressions.
>>
>> When matching tag patterns are found, a new XML document is created with
>> all the XML headers (those containing namespace declarations),
>> but the Selector created for this new XML snippet is not registered with
>> those namespaces, so the XPath //<value_of_itertag> will not match
>> anything in your case.
>> See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>>
>> I was able to make "iternodes" work with a custom xmliter that uses
>> XPath's local-name():
>>
>>     yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]
>>
>> but that's not really pretty.
>>
>> I suggest you use the "xml" iterator and register the
>> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>>
>> class CveSpider(XMLFeedSpider):
>>     name = 'cve'
>>     allowed_domains = ['cve.mitre.org']  # domain names only, not URLs
>>     start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>>
>>     iterator = 'xml'
>>     namespaces = [
>>        ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>>     ]
>>     itertag = 'vuln:Vulnerability'
>>
>>     def parse_node(self, response, node):
>>         item = VulnerabilityItem()
>>         vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>>         item['id'] = vulnerabilityId
>>         return item
>>
>> Hope this helps.
>>
>> Paul.
>>
>>
>> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>>
>>> I upgraded to 0.24.2, since Scrapy exposes the selector on the response,
>>> but this doesn't yield the desired effect.
>>>
>>> This has to be a BUG.
>>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
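
The "iternodes" behaviour described in the quoted message can be reproduced outside Scrapy. The sketch below is an assumption-laden imitation, not Scrapy's actual xmliter(): `iter_nodes` is a hypothetical helper that splits the document with a regular expression and re-wraps each match in the original header and footer. Since the standard-library ElementTree has no local-name(), it swaps in the `{*}` wildcard (Python 3.8+) to get the same namespace-agnostic match. The sample document and CVE IDs are invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed: the default namespace is declared once, on the root.
DOC = """<feed xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1">
<Vulnerability><CVE>CVE-0000-0001</CVE></Vulnerability>
<Vulnerability><CVE>CVE-0000-0002</CVE></Vulnerability>
</feed>"""


def iter_nodes(text, nodename):
    """Yield one parsed element per <nodename> block found by a regex."""
    name = re.escape(nodename)
    pattern = re.compile(r"<%s[\s>].*?</%s\s*>" % (name, name), re.S)
    matches = list(pattern.finditer(text))
    if not matches:
        return
    # Everything before the first match carries the namespace declarations;
    # everything after the last match closes the enclosing elements.
    header = text[:matches[0].start()]
    footer = text[matches[-1].end():]
    for match in matches:
        snippet = ET.fromstring(header + match.group() + footer)
        # "{*}name" matches on the local name in any namespace,
        # standing in for XPath's local-name() test.
        yield snippet.find(".//{*}%s" % nodename)


# A plain, un-namespaced lookup fails, which is the symptom from the thread.
plain = ET.fromstring(DOC).find(".//Vulnerability")

nodes = list(iter_nodes(DOC, "Vulnerability"))
ids = [node.findtext("{*}CVE") for node in nodes]
print(plain, ids)
```

Matching on the local name sidesteps the missing registration, at the cost of also matching same-named elements from any other namespace, which is part of why registering the namespace with the "xml" iterator is the cleaner route.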

