Thanks for the sample.

I would suggest that you loop over <race> elements, not <meeting>s.
Something like this will let you work on and extract from individual race 
snippets:

import scrapy
from scrapy.selector import Selector
from conv_xml.items import ConvXmlItem


class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = [
        "file:///home/sayth/Downloads/20160123RAND0.xml",
    ]

    def parse(self, response):
        sel = Selector(response)
        # one item per <race>, so each race becomes its own row in the feed
        for race in sel.xpath('//meeting//race'):
            item = ConvXmlItem()
            item['id'] = race.xpath('@id').extract_first()
            item['num'] = race.xpath('@number').extract_first()
            item['dist'] = race.xpath('@distance').extract_first()
            yield item
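
With one item yielded per <race>, the default CSV feed exporter writes one 
row per race, so the command from your first message:

scrapy runspider myxml.py -o items.csv -t csv

should now come out as:

id,num,dist
209165,1,1000
209166,2,1000
...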

Hope this helps.

Paul.

On Monday, January 25, 2016 at 1:11:39 PM UTC+1, Sayth Renshaw wrote:
>
>
> Hi
>
> > Looks like the XPath selectors you are using are returning more than 
> > one item for each page, e.g. site.xpath('.//race/@id'). 
>
> Yes it does; for that selector it is mostly 8 occurrences, though it can 
> vary (it often wouldn't). Other selectors can have upwards of 24 items in 
> them. 
>
> An excerpt may be messy, so I will try to edit it down to a small copy. 
> The originals are posted on a public website; this is a link (don't click 
> unless you accept the download, as it's a direct link): 
> http://old.racingnsw.com.au/Site/_content/racebooks/20160130RHIL0.xml
>
> This is an id by itself, and yes, they love attributes. As in the example 
> above, though, for my output I am trying to filter so that for each 
> .//race/@id I extract I can also output the desired attributes; the goal 
> is a csv or json file which has all the ids and the descriptors from the 
> attributes.
>
> <race id="209165" number="1" nomnumber="2" division="0" name="SCHWEPPES 
> QUALITY" mediumname="WILKES" shortname="WILKES" stage="Acceptances" 
> distance="1000" minweight="55" raisedweight="1" class="~         " age="3   
>       " grade="4" weightcondition="QLT       " trophy="0" owner="0" 
> trainer="0" jockey="0" strapper="0" totalprize="85000" first="48750" 
> second="16750" third="8350" fourth="4150" fifth="2000" 
> time="2016-01-23T12:40:00" bonustype="BOB7      " nomsfee="0" acceptfee="0" 
> trackcondition="          " timingmethod="          " fastesttime="         
>  " sectionaltime="          " formavailable="0" racebookprize="Of $85000. 
> First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth 
> $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000">
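>
> For each such element, I'd want to pick off just the attributes I need, 
> e.g. (only a sketch of what I'm after; the extra 'prize' field name is 
> made up):
>
> for race in sel.xpath('//meeting//race'):
>     row = {
>         'id': race.xpath('./@id').extract(),
>         'num': race.xpath('./@number').extract(),
>         'dist': race.xpath('./@distance').extract(),
>         'prize': race.xpath('./@totalprize').extract(),
>     }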
>
> Thanks
> Sayth
>
>> Looks like the XPath selectors you are using are returning more than 
>> one item for each page, e.g. site.xpath('.//race/@id'). The extract() 
>> method returns a list with all the matching results inside.
>>
>> Can you paste an excerpt of the XML file that you are parsing?
>>
>> On Sun, Jan 24, 2016 at 4:02 AM, Sayth Renshaw <[email protected]> 
>> wrote:
>>
>>>
>>> Hi all
>>>
>>> Currently, when I output to csv with scrapy runspider myxml.py -o 
>>> ~/items.csv -t csv, I get the header fields I defined in the feed export 
>>> settings; however, the values are collected as lists and each list is 
>>> dumped whole into a single row.
>>>
>>> Where do I define that each element of those lists should be its own line?
>>>
>>> So at the moment this is my output
>>>
>>> id,num,dist
>>>
>>> "209165,209166,209167,209168,209169,209170,209171,209172,209173","1,2,3,4,5,6,7,8,9","1000,1000,1400,1200,1200,1600,1600,1000,2000"
>>>
>>> I would want it as
>>>
>>> id,num,dist
>>> 209165,1,1000
>>> 209166,2,1000
>>> ...
>>>
>>> I've been looking at feed exporters in the docs for info, but I have a 
>>> feeling I should just be creating a custom function to tidy it up. Is 
>>> that what I should do, and if so, where? Scrapy seems to have thought of 
>>> most things, so I expect it's already done; I am just not sure what it's 
>>> called.
>>>
>>> My current code.
>>>
>>> # -*- coding: utf-8 -*-
>>> import scrapy
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> from scrapy.selector import XmlXPathSelector
>>> from conv_xml.items import ConvXmlItem
>>> # http://stackoverflow.com/a/27391649/461887
>>> import json
>>>
>>>
>>> class MyxmlSpider(scrapy.Spider):
>>>     name = "myxml"
>>>
>>>     start_urls = (
>>>         ["file:///home/sayth/Downloads/20160123RAND0.xml"]
>>>     )
>>>
>>>     def parse(self, response):
>>>         sel = Selector(response)
>>>         sites = sel.xpath('//meeting')
>>>         items = []
>>>
>>>         for site in sites:
>>>             item = ConvXmlItem()
>>>             # item['venue'] = site.xpath('.//@venue').extract()
>>>             item['id'] = site.xpath('.//race/@id').extract()
>>>             item['num'] = site.xpath('.//race/@number').extract()
>>>             item['dist'] = site.xpath('.//race/@distance').extract()
>>>             items.append(item)
>>>
>>>         return items
>>>
>>>
>>> Thanks Sayth
>>>
>>
>>
>>
>> -- 
>> Valdir Stumm Junior 
>> Developer Evangelist, Scrapinghub <https://scrapinghub.com> 
>> Skype: stummjr 
>> Twitter <https://twitter.com/stummjr> | Github <https://github.com/stummjr> 
>> Scrapinghub: Twitter <https://twitter.com/scrapinghub> | LinkedIn 
>> <https://www.linkedin.com/company/scrapinghub> | Github 
>> <https://github.com/scrapinghub> 
>>
>> *We turn web content into structured data. Lead maintainers of Scrapy 
>> <http://scrapy.org>.*
>>
>
