HOLY TYPOS. Sorry. Revised:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            #   '//path/to/property/text()'
            # or:
            #   {'url': '//a[@id="Location"]/@href',
            #    'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
            if isinstance(parse_xpath, dict):
                # this place_property lives at another URL
                url = sel.xpath(parse_xpath['url']).extract()[0]  # Request needs a single URL string
                yield Request(url, callback=self.get_url_property, meta={
                    'loader': bl,
                    'parse_xpath': parse_xpath,
                    'place_property': place_property,
                })
            else:  # process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
        return loader
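
The hand-off I'm attempting can be reduced to a dependency-free sketch (plain Python, no Scrapy; the `fetch` helper and the dict-based "responses" below are made-up stand-ins, not Scrapy APIs): carry the partially-built item through the request's `meta`, and only emit the finished item from the last callback in the chain.

```python
# Dependency-free sketch of request chaining. `fetch` and the PAGES
# dicts are hypothetical stand-ins for Scrapy's Request/Response.

PAGES = {
    "http://example.com/place": {
        "url": "http://example.com/place",
        "directions_link": "http://example.com/place/directions",
        "name": "Joe's Diner",
    },
    "http://example.com/place/directions": {
        "url": "http://example.com/place/directions",
        "address": "123 Main St",
    },
}

def fetch(url, callback, meta):
    """Stand-in for scheduling a Request: look up the page and invoke the callback."""
    response = dict(PAGES[url], meta=meta)
    return callback(response)

def parse(response):
    # Build as much of the item as this page allows...
    item = {"origin": response["url"], "name": response["name"]}
    # ...then hand the partial item to the next callback via meta,
    # instead of emitting it now.
    return fetch(response["directions_link"], callback=parse_directions,
                 meta={"item": item})

def parse_directions(response):
    item = response["meta"]["item"]
    item["location"] = response["address"]
    return item  # the finished item, emitted exactly once

final_item = parse(PAGES["http://example.com/place"])
```

The point of the sketch is only the shape of the control flow: nothing is emitted from the first callback when a property lives on a second page.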
--
Joey "JoeLinux" Espinosa
<http://therealjoelinux.blogspot.com/>
<http://twitter.com/therealjoelinux>
<http://about.me/joelinux>
On Wednesday, March 5, 2014 8:41:12 AM UTC-5, Joey Espinosa wrote:
>
> Hey guys,
>
> Disclaimer: I'm new to this group, and fairly new to Scrapy as well (but
> certainly not Python).
>
> Here is the issue I'm having. In my Scrapy project, I point to a page and
> hopefully grab everything I need for the item. However, some domains (I'm
> scraping a significant number of separate domains) have certain item
> properties located on a secondary page linked from the initial page (for
> example, "location" might only be found by clicking the "Get Directions"
> link on the page). I can't seem to get those "secondary" pages to work:
> the initial item goes through the pipelines without those properties, and
> I never see another item with those properties come through.
>
> class SiteSpider(Spider):
>     site_loader = SiteLoader
>     ...
>
>     def parse(self, response):
>         item = Place()
>         sel = Selector(response)
>         bl = self.site_loader(item=item, selector=sel)
>         bl.add_value('domain', self.parent_domain)
>         bl.add_value('origin', response.url)
>         for place_property in item.fields:
>             parse_xpath = template.get(place_property)
>             # parse_xpath will look like either:
>             #   '//path/to/property/text()'
>             # or:
>             #   {'url': '//a[@id="Location"]/@href',
>             #    'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
>             if isinstance(parse_xpath, dict):
>                 # if True, then this place_property is in another URL
>                 url = sel.xpath(parse_xpath['url_elem']).extract()
>                 yield Request(url, callback=self.get_url_property, meta={
>                     'loader': bl,
>                     'parse_xpath': parse_xpath,
>                     'place_property': place_property,
>                 })
>             else:  # process normally
>                 bl.add_xpath(event_property, template.get(event_property))
>         yield bl.load_item()
>
>     def get_url_property(self, response):
>         loader = response.meta['loader']
>         parse_xpath = response.meta['parse_xpath']
>         place_property = response.meta['place_property']
>         sel = Selector(response)
>         loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
>         return loader
>
>
> Basically, the part I'm confused about is where you see "yield Request". I
> only put it there to illustrate where the problem lies; I know that this
> will cause the item to get processed without the properties found at that
> Request. So in my example, if the Place().location property is located at
> another link on the page, I'd like to load that page and fill that property
> with the appropriate value. Even if a single loader can't do it, that's
> fine, maybe I can use loader.item or something. I don't know, that's pretty
> much where my Google trail has ended.
>
> Is what I want possible? I would prefer to keep the request asynchronous
> somehow, but if I really have to, making a synchronous request would
> suffice. Can someone kinda lead me in the right direction? I'd appreciate
> it. Thanks!
>
>