Process Multiple Requests For Single Item

Joey Espinosa Wed, 05 Mar 2014 05:43:13 -0800

Hey guys,

Disclaimer: I'm new to this group, and fairly new to Scrapy as well (but 
certainly not to Python).


Here is the issue I'm having. In my Scrapy project, I point to a page and 
hopefully grab everything I need for the item. However, some domains (I'm 
scraping a large number of separate domains) keep certain item 
properties on a second page linked from the initial page (for example, 
"location" might only be found by following the "Get Directions" link on 
the page). I can't seem to get those "secondary" pages to work: the initial 
item goes through the pipelines without those properties, and I never see 
another item with those properties come through.

class SiteSpider(Spider):
    site_loader = SiteLoader
    ...
    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = template.get(place_property)

            # parse_xpath will look like either:
            #   '//path/to/property/text()'
            # or:
            #   {'url': '//a[@id="Location"]/@href',
            #    'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
            if isinstance(parse_xpath, dict):
                # This place_property is in another URL.
                # extract() returns a list, so take the first match.
                url = sel.xpath(parse_xpath['url']).extract()[0]
                yield Request(url, callback=self.get_url_property, meta={
                    'loader': bl,
                    'parse_xpath': parse_xpath,
                    'place_property': place_property,
                })
            else:  # process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()


    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property,
                         sel.xpath(parse_xpath['xpath']).extract())
        return loader


Basically, the part I'm confused about is where you see "yield Request". I 
only put it there to illustrate where the problem lies; I know that it 
causes the item to get processed without the properties found by that 
Request. So in my example, if the Place().location property is located at 
another link on the page, I'd like to load that page and fill that property 
with the appropriate value. If a single loader can't do it, that's fine; 
maybe I can use loader.item or something. That's pretty much where my 
Google trail has ended.
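
To make the flow I'm after concrete, here is a rough sketch in plain Python 
with no Scrapy involved. FakeRequest, FakeResponse, PAGES, next_step and 
crawl are all made-up stand-ins (for Scrapy's Request/Response, the 
downloaded pages, and the scheduler), and the URLs and values are invented. 
The idea is to carry the half-built item along in meta and only emit it from 
the last callback. I'm only assuming something similar can be done with real 
Requests:

```python
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical stand-ins, NOT Scrapy classes: they only mirror the
# url/callback/meta shape of a request and the meta carried by a response.
@dataclass
class FakeRequest:
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    url: str
    body: str
    meta: dict


# Invented page contents keyed by URL, used instead of real downloads.
PAGES = {
    "http://example.com/place": "Joe's Diner",
    "http://example.com/place/directions": "123 Main St",
    "http://example.com/place/hours": "9-5",
}


def next_step(item, pending):
    """Return a request for the next secondary page, or the finished item."""
    if pending:
        place_property, url = pending[0]
        return FakeRequest(url, fill_property, meta={
            "item": item,
            "pending": pending[1:],
            "place_property": place_property,
        })
    return item


def parse(response):
    # Fill what the first page offers, then queue the secondary pages.
    item = {"origin": response.url, "name": response.body}
    pending = [
        ("location", "http://example.com/place/directions"),
        ("hours", "http://example.com/place/hours"),
    ]
    yield next_step(item, pending)


def fill_property(response):
    # Add one property from a secondary page, then move on.
    item = response.meta["item"]
    item[response.meta["place_property"]] = response.body
    yield next_step(item, response.meta["pending"])


def crawl(start_url):
    """Tiny stand-in scheduler: follow requests until items come out."""
    queue = [FakeRequest(start_url, parse)]
    items = []
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request.url, PAGES[request.url], request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("http://example.com/place")
```

Here crawl(...) hands back a single item with both "location" and "hours" 
filled in, which is the behaviour I want from the spider: nothing reaches 
the pipelines until every secondary page has been visited.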

Is what I want possible? I would prefer to keep the request asynchronous 
somehow, but if I really have to, making a synchronous request would 
suffice. Can someone kinda lead me in the right direction? I'd appreciate 
it. Thanks!

--
Joey "JoeLinux" Espinosa
 <http://therealjoelinux.blogspot.com/> 
<http://twitter.com/therealjoelinux><http://about.me/joelinux>
