Re: XMLFeedSpider parsing issue with xml file that 8859-1 encoded

SlappySquirrel Sun, 13 Jul 2014 07:59:16 -0700

Andrew, that is a good question, but I'm not sure of the answer.

Paul, using local-name() is interesting; I've never tried that. 
Also, something I omitted from my original post: I tried using the 
namespace within the spider, and it simply didn't work. That's what led to 
my plea for ideas. Just like you said, we're out of luck until Scrapy 
supports what we need.


Thanks for the input.

On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>
> Hello,
>
> You are right, the "iternodes" iterator has an issue with namespaces.
> The problem is in scrapy.utils.iterators.xmliter(), which uses regular 
> expressions.
>
> When matching tag patterns are found, a new XML document is created with 
> all the XML headers (those containing namespace declarations), but the 
> Selector created for this new XML snippet is not registered with 
> namespaces, so the XPath //<value_of_itertag> will not match anything in 
> your case. See 
> https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>
> I was able to make "iternodes" work with a custom xmliter that uses 
> XPath's local-name():
>
>     yield Selector(text=nodetext, type='xml').xpath(
>         '//*[local-name()="%s"]' % nodename)[0]
>
> but that's not really pretty.
>
> I suggest you use the "xml" iterator and register the 
> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>
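> # Import path for Scrapy 0.24; it is scrapy.spiders in Scrapy 1.0+.
> from scrapy.contrib.spiders import XMLFeedSpider
> # VulnerabilityItem is the Item class defined in your own project.
>
> class CveSpider(XMLFeedSpider):
>     name = 'cve'
>     # allowed_domains takes bare domain names, not URLs.
>     allowed_domains = ['cve.mitre.org']
>     start_urls = [
>         'https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml'
>     ]
>
>     # The "xml" iterator honours the registered namespaces.
>     iterator = 'xml'
>     namespaces = [
>         ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>     ]
>     itertag = 'vuln:Vulnerability'
>
>     def parse_node(self, response, node):
>         item = VulnerabilityItem()
>         # extract() returns a list of matching text nodes.
>         vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>         item['id'] = vulnerabilityId
>         return item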
>
> Hope this helps.
>
> Paul.
>
>
> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>
>> I upgraded to 0.24.2, since Scrapy exposes the selector on the response, 
>> but this doesn't have the desired effect.
>>
>> This has to be a BUG.
>>
>
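
For anyone following along, here is a minimal standalone sketch of the
namespace behaviour Paul describes above. It is not Scrapy code: it uses
Python's standard-library xml.etree.ElementTree (3.8 or later, for the {*}
wildcard) in place of Scrapy's Selector, and the sample document, the CVE
ids, and the helper function names are all made up for illustration. It
shows that an unqualified tag name matches nothing, while either registering
the "vuln" prefix or matching on the local name alone finds the nodes.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the thread.
VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

# Made-up miniature of a CVRF feed, declared as ISO-8859-1 like the real one.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<cvrfdoc xmlns="http://www.icasi.org/CVRF/schema/cvrf/1.1">
  <Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1">
    <CVE>CVE-2014-0001</CVE>
  </Vulnerability>
  <Vulnerability xmlns="http://www.icasi.org/CVRF/schema/vuln/1.1">
    <CVE>CVE-2014-0002</CVE>
  </Vulnerability>
</cvrfdoc>
"""


def find_unqualified(root):
    # No namespace information: compares against the bare tag name only.
    return root.findall("Vulnerability")


def find_with_prefix(root):
    # Same idea as registering ("vuln", VULN_NS) on the spider.
    return root.findall("vuln:Vulnerability", {"vuln": VULN_NS})


def find_by_local_name(root):
    # Rough ElementTree counterpart of XPath's local-name() test:
    # {*} matches the tag in any namespace (Python 3.8+).
    return root.findall("{*}Vulnerability")


def cve_ids(nodes):
    # The CVE child inherits the vuln namespace from its parent element.
    return [n.findtext("vuln:CVE", namespaces={"vuln": VULN_NS}) for n in nodes]


root = ET.fromstring(SAMPLE)
print(len(find_unqualified(root)))
print(cve_ids(find_with_prefix(root)))
print(cve_ids(find_by_local_name(root)))
```

The prefix-based lookup corresponds to the "xml" iterator with registered
namespaces; the wildcard lookup corresponds to the local-name() workaround,
which ignores the namespace entirely and so can over-match if two
namespaces share a tag name.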
