HOLY TYPOS. Sorry. Revised:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...

    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            #   '//path/to/property/text()'
            # or:
            #   {'url': '//a[@id="Location"]/@href',
            #    'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
            if isinstance(parse_xpath, dict):
                # this place_property lives at another URL
                url = sel.xpath(parse_xpath['url']).extract()[0]  # Request needs a single URL string
                yield Request(url, callback=self.get_url_property, meta={
                    'loader': bl,
                    'parse_xpath': parse_xpath,
                    'place_property': place_property,
                })
            else:  # process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']).extract())
        return loader
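
The hand-off I'm attempting can be reduced to a dependency-free sketch (plain Python, no Scrapy; the `fetch` helper and the dict-based "responses" below are made-up stand-ins, not Scrapy APIs): carry the partially-built item through the request's `meta`, and only emit the finished item from the last callback in the chain.

```python
# Dependency-free sketch of request chaining. `fetch` and the PAGES
# dicts are hypothetical stand-ins for Scrapy's Request/Response.

PAGES = {
    "http://example.com/place": {
        "url": "http://example.com/place",
        "directions_link": "http://example.com/place/directions",
        "name": "Joe's Diner",
    },
    "http://example.com/place/directions": {
        "url": "http://example.com/place/directions",
        "address": "123 Main St",
    },
}

def fetch(url, callback, meta):
    """Stand-in for scheduling a Request: look up the page and invoke the callback."""
    response = dict(PAGES[url], meta=meta)
    return callback(response)

def parse(response):
    # Build as much of the item as this page allows...
    item = {"origin": response["url"], "name": response["name"]}
    # ...then hand the partial item to the next callback via meta,
    # instead of emitting it now.
    return fetch(response["directions_link"], callback=parse_directions,
                 meta={"item": item})

def parse_directions(response):
    item = response["meta"]["item"]
    item["location"] = response["address"]
    return item  # the finished item, emitted exactly once

final_item = parse(PAGES["http://example.com/place"])
```

The point of the sketch is only the shape of the control flow: nothing is emitted from the first callback when a property lives on a second page.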
--
Joey "JoeLinux" Espinosa
<http://therealjoelinux.blogspot.com/>
<http://twitter.com/therealjoelinux>
<http://about.me/joelinux>
On Wednesday, March 5, 2014 8:41:12 AM UTC-5, Joey Espinosa wrote:
>
> Hey guys,
>
> Disclaimer: I'm new to this group, and fairly new to Scrapy as well (but
> certainly not Python).
>
> Here is the issue I'm having. In my Scrapy project, I point to a page and
> hopefully grab everything I need for the item. However, some domains (I'm
> scraping a significant number of separate domains) have certain item
> properties located on a secondary page linked from the initial page (for
> example, "location" might only be found by clicking the "Get Directions"
> link on the page). I can't seem to get those "secondary" pages to work:
> the initial item goes through the pipelines without those properties, and
> I never see another item with those properties come through.
>
> class SiteSpider(Spider):
>     site_loader = SiteLoader
>     ...
>
>     def parse(self, response):
>         item = Place()
>         sel = Selector(response)
>         bl = self.site_loader(item=item, selector=sel)
>         bl.add_value('domain', self.parent_domain)
>         bl.add_value('origin', response.url)
>         for place_property in item.fields:
>             parse_xpath = template.get(place_property)
>             # parse_xpath will look like either:
>             #   '//path/to/property/text()'
>             # or:
>             #   {'url': '//a[@id="Location"]/@href',
>             #    'xpath': '//div[@class="directions"]/span[contains(@class, "address")]/text()'}
>             if isinstance(parse_xpath, dict):
>                 # if True, then this place_property is in another URL
>                 url = sel.xpath(parse_xpath['url_elem']).extract()
>                 yield Request(url, callback=self.get_url_property, meta={
>                     'loader': bl,
>                     'parse_xpath': parse_xpath,
>                     'place_property': place_property,
>                 })
>             else:  # process normally
>                 bl.add_xpath(event_property, template.get(event_property))
>         yield bl.load_item()
>
>     def get_url_property(self, response):
>         loader = response.meta['loader']
>         parse_xpath = response.meta['parse_xpath']
>         place_property = response.meta['place_property']
>         sel = Selector(response)
>         loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
>         return loader
>
>
> Basically, the part I'm confused about is where you see "yield Request". I
> only put it there to illustrate where the problem lies; I know that this
> will cause the item to get processed without the properties found at that
> Request. So in my example, if the Place().location property is located at
> another link on the page, I'd like to load that page and fill that property
> with the appropriate value. Even if a single loader can't do it, that's
> fine, maybe I can use loader.item or something. I don't know, that's pretty
> much where my Google trail has ended.
>
> Is what I want possible? I would prefer to keep the request asynchronous
> somehow, but if I really have to, making a synchronous request would
> suffice. Can someone kinda lead me in the right direction? I'd appreciate
> it. Thanks!
>
>