Re: XMLFeedSpider parsing issue with xml file that 8859-1 encoded

SlappySquirrel Tue, 29 Jul 2014 09:01:22 -0700

Paul,

I'm truly sorry for not checking earlier. I think we're on to something 
here, since your code example just brought something to light for me. So let 
this be a lesson learned for everyone using Scrapy 0.22 and 0.24 as they 
currently stand.


*You cannot register and use namespaces when defining an iterator as 
'iternodes' for an XMLFeedSpider. It simply doesn't work. You may, however, 
register and use namespaces when defining your iterator as 'xml' in your 
XMLFeedSpider.*

The difference is that 'iternodes' walks the document node by node without 
building the whole DOM, while 'xml' and 'html' build the whole DOM before 
reading.
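
To make that difference concrete, here is a rough sketch using only Python's 
standard-library ElementTree. It is an analogy, not Scrapy's code (Paul notes 
below that Scrapy's 'iternodes' is regex-based). The sample document and the 
CVE-EXAMPLE ids are made-up placeholders; only the vuln namespace URI comes 
from this thread.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI quoted in this thread; the document below is a made-up sample.
VULN_NS = "http://www.icasi.org/CVRF/schema/vuln/1.1"

SAMPLE = (
    '<cvrfdoc xmlns:vuln="%s">'
    '<vuln:Vulnerability><vuln:CVE>CVE-EXAMPLE-1</vuln:CVE></vuln:Vulnerability>'
    '<vuln:Vulnerability><vuln:CVE>CVE-EXAMPLE-2</vuln:CVE></vuln:Vulnerability>'
    '</cvrfdoc>' % VULN_NS
).encode("utf-8")


def cves_from_dom(data):
    """Build the whole tree first, then query it with a registered prefix."""
    root = ET.fromstring(data)
    ns = {"vuln": VULN_NS}  # prefix -> URI map, analogous to `namespaces`
    return [e.text for e in root.findall("./vuln:Vulnerability/vuln:CVE", ns)]


def cves_streaming(data):
    """Visit elements one at a time; each tag carries its namespace URI."""
    wanted = "{%s}Vulnerability" % VULN_NS
    found = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == wanted:
            found.append(elem.findtext("{%s}CVE" % VULN_NS))
            elem.clear()  # drop the finished node's children to save memory
    return found


print(cves_from_dom(SAMPLE))
print(cves_streaming(SAMPLE))
```

Both functions return the same ids. The DOM version needs the prefix map to 
resolve "vuln:", which is the step that goes missing with 'iternodes'.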

A quick workaround is to override adapt_response(self, response) and replace 
every namespace declaration and prefix in the response body with an empty 
string, so that 'iternodes' only ever sees unprefixed tags.
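
A minimal sketch of that workaround, assuming a regex pass is good enough for 
your feed. strip_namespaces is a hypothetical helper name, not part of Scrapy, 
and the regexes do not know about comments or CDATA sections, so treat it as a 
starting point rather than a general XML tool.

```python
import re

# xmlns="..." and xmlns:prefix="..." declarations (single or double quotes).
_XMLNS_DECL = re.compile(r'\s+xmlns(?::[\w.-]+)?\s*=\s*(?:"[^"]*"|\'[^\']*\')')
# The "prefix:" part of an opening or closing tag, e.g. <vuln:CVE>, </vuln:CVE>.
_TAG_PREFIX = re.compile(r'(</?)[\w.-]+:(?=[\w.-])')


def strip_namespaces(xml_text):
    """Return xml_text without namespace declarations or tag prefixes."""
    without_decls = _XMLNS_DECL.sub('', xml_text)
    return _TAG_PREFIX.sub(r'\1', without_decls)


# Hypothetical hook inside an XMLFeedSpider subclass (not executed here).
# It assumes the response is a text response that exposes `encoding`:
#
#     def adapt_response(self, response):
#         text = response.body.decode(response.encoding)
#         cleaned = strip_namespaces(text).encode(response.encoding)
#         return response.replace(body=cleaned)
```

With the prefixes gone, itertag becomes plain 'Vulnerability' and the XPath in 
parse_node becomes 'CVE/text()'. Prefixed attributes are left untouched.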

I'd like to point back to an original post that discusses no namespace 
support for 'iternodes' so credit can be given: 
https://groups.google.com/forum/#!topic/scrapy-users/xdt_BDj_208




On Tuesday, July 15, 2014 4:31:40 AM UTC-5, Paul Tremberth wrote:
>
> That's weird.
> Please find my test spider here: 
> https://gist.github.com/redapple/b4f677640861a77ecae5
>
> The console.log shows that it extracts 4522 items.
>
> Could you share your spider and console output using the "xml" iterator?
> Thanks.
>
> Paul.
>
> On Monday, July 14, 2014 3:35:50 AM UTC+2, SlappySquirrel wrote:
>>
>> Paul,
>>
>> I tried that before initially posting. When it gets to iterators.py, 
>> it still fails with an "index out of range" error. What a puzzle!
>>
>> On Sunday, July 13, 2014 10:41:31 AM UTC-5, Paul Tremberth wrote:
>>>
>>> Yes, like I said, "iternodes" has issues with namespaces in your case. 
>>> But the "xml" iterator works when registering the vulnerability 
>>> namespace.
>>> Have you tried my example ?
>>> On 13 Jul 2014 at 16:58, "SlappySquirrel" <[email protected]> wrote:
>>>
>>>> Andrew, that is a good question, but I'm not sure on the answer to that.
>>>>
>>>> Paul, now using the local-name() is interesting. I've never tried that. 
>>>> Also, something I omitted from my original post is that I tried using the 
>>>> namespace within the spider, it simply doesn't work. That's what led to my 
>>>> pleas for ideas. Just like you said, we're out of luck until Scrapy 
>>>> supports what we need.
>>>>
>>>> Thanks for the input.
>>>>
>>>> On Friday, July 11, 2014 12:39:24 PM UTC-5, Paul Tremberth wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> You are right, the "iternodes" iterator has an issue with namespaces.
>>>>> The problem is in scrapy.utils.iterators.xmliter(), which uses regular 
>>>>> expressions.
>>>>>
>>>>> When matching tag patterns are found, a new XML document is created 
>>>>> with all the XML headers (those containing namespace declarations), 
>>>>> but the Selector created for this new XML snippet is not registered 
>>>>> with namespaces, so the XPath //<value_of_itertag> will not match 
>>>>> anything in your case.
>>>>> See https://github.com/scrapy/scrapy/blob/master/scrapy/utils/iterators.py#L31
>>>>>
>>>>> I was able to make "iternodes" work with a custom xmliter that uses 
>>>>> XPath's local-name():
>>>>> yield Selector(text=nodetext, type='xml').xpath('//*[local-name()="%s"]' % nodename)[0]
>>>>>
>>>>> but that's not really pretty.
>>>>>
>>>>> I suggest you use the "xml" iterator and register the 
>>>>> "http://www.icasi.org/CVRF/schema/vuln/1.1" namespace:
>>>>>
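>>>>> # XMLFeedSpider lives in scrapy.contrib.spiders on Scrapy 0.24
>>>>> # (scrapy.spiders on 1.0 and later).
>>>>> from scrapy.contrib.spiders import XMLFeedSpider
>>>>>
>>>>> class CveSpider(XMLFeedSpider):
>>>>>     name = 'cve'
>>>>>     # allowed_domains takes bare domain names, not URLs
>>>>>     allowed_domains = ['cve.mitre.org']
>>>>>     start_urls = ['https://cve.mitre.org/data/downloads/allitems-cvrf-year-2014.xml']
>>>>>
>>>>>     iterator = 'xml'
>>>>>     # (prefix, URI) pairs registered on the selector
>>>>>     namespaces = [
>>>>>        ("vuln", "http://www.icasi.org/CVRF/schema/vuln/1.1")
>>>>>     ]
>>>>>     itertag = 'vuln:Vulnerability'
>>>>>
>>>>>     def parse_node(self, response, node):
>>>>>         # VulnerabilityItem is the Item class from your own project
>>>>>         item = VulnerabilityItem()
>>>>>         vulnerabilityId = node.xpath('vuln:CVE/text()').extract()
>>>>>         item['id'] = vulnerabilityId
>>>>>         return item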
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Paul.
>>>>>
>>>>>
>>>>> On Friday, July 11, 2014 6:11:06 PM UTC+2, SlappySquirrel wrote:
>>>>>>
>>>>>> I upgraded to 0.24.2, since Scrapy exposes the selector on the 
>>>>>> response, but that doesn't yield the desired effect.
>>>>>>
>>>>>> This has to be a BUG.
>>>>>>
>>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "scrapy-users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>

