Can you show the actual scrapy full debug output? Best bet is with pastebin or the like.
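
In the meantime, one thing jumps out: everything after "#" in a URL is a fragment, which the browser keeps to itself, so "...os=win#sort_by=Released_DESC&page=2" sends the exact same request to the server as page 1 -- that would explain why both runs print identical ids. Below is a rough, untested sketch of one way to paginate by putting "page" in the real query string and following pages until one comes back empty. It assumes the Steam search endpoint honors a plain "page" query parameter, and the inline ItemscountItem is just a stand-in for whatever your project's items.py defines:

    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.selector import Selector
    from scrapy.spider import BaseSpider


    class ItemscountItem(Item):
        # stand-in for the item defined in your project's items.py
        GameID = Field()


    class ScrapePriceSpider(BaseSpider):
        name = 'UpdateGames'
        # allowed_domains takes bare domain names, not URLs with a scheme
        allowed_domains = ['store.steampowered.com']

        # "page" must sit in the real query string; everything after "#"
        # is a fragment, so the server never sees it
        base_url = ('http://store.steampowered.com/search/'
                    '?sort_by=Released_DESC&os=win&page=%d')
        start_urls = [base_url % 1]

        def parse(self, response):
            hxs = Selector(response)
            game_ids = hxs.xpath(
                ".//div[@id='search_result_container']//a/@data-ds-appid"
            ).extract()

            # yield one item per game id instead of one item
            # holding the whole stringified list
            for game_id in game_ids:
                item = ItemscountItem()
                item['GameID'] = game_id
                yield item

            # follow the next page until one comes back with no results
            page = response.meta.get('page', 1)
            if game_ids:
                yield Request(self.base_url % (page + 1),
                              meta={'page': page + 1},
                              callback=self.parse)

If the endpoint keeps echoing the last page instead of eventually returning an empty one, capping the counter at the 353 pages you mentioned is a safe fallback.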
On Mon, Dec 1, 2014 at 6:41 PM, Chetan Motamarri <[email protected]> wrote:

> Hi Travis,
>
> Thanks for your reply.
> The problem I am facing is that when the page 2 URL is given in
> start_urls, the spider does not extract the games on page 2. Instead it
> extracts the games on page 1 again.
>
> *Output when I gave only page 1 in start_urls:*
>
> 335410,333790,283080,334970,327490,333540,313680,314300,315396,314450,335210,292750,315080,334400,334401,334402,324331,335,70,321660,331910,318530,303830,328140,334080,279440
>
> *Output when I gave both page 1 and page 2 in start_urls:*
>
> 335410,333790,283080,334970,327490,333540,313680,314300,315396,314450,335210,292750,315080,334400,334401,334402,324331,335,70,321660,331910,318530,303830,328140,334080,279440
>
> 335410,333790,283080,334970,327490,333540,313680,314300,315396,314450,335210,292750,315080,334400,334401,334402,324331,335,70,321660,331910,318530,303830,328140,334080,279440
>
> The same output comes both times.
>
> On Monday, December 1, 2014 6:12:08 PM UTC-7, Travis Leleu wrote:
>>
>> Hi Chetan,
>>
>> What happens when you only have the URL for page 2 in your start_urls?
>> That page seems to load fine without javascript, so I'm not convinced
>> you need any sort of ajax support.
>>
>> Please provide the output you expect from the running script, and the
>> actual output -- that will help evaluate whether the bug is in your
>> understanding of scrapy's internals (something that happens a lot to
>> me! It's a confusing piece of software at times because there is so
>> much going on...) or if something else is occurring.
>>
>> Cheers,
>> Travis
>>
>> On Mon, Dec 1, 2014 at 5:07 PM, Chetan Motamarri <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I need to extract the *ids of the games* listed at
>>> "http://store.steampowered.com/search/?sort_by=Released_DESC&os=win#sort_by=Released_DESC".
>>>
>>> I was able to extract the game ids on the first page, but I have no
>>> idea how to move to the next page and extract the ids there.
>>> My code is:
>>>
>>> class ScrapePriceSpider(BaseSpider):
>>>
>>>     name = 'UpdateGames'
>>>     allowed_domains = ['http://store.steampowered.com']
>>>     start_urls = ['http://store.steampowered.com/search/?sort_by=Released_DESC&os=win#sort_by=Released_DESC&page=1']
>>>
>>>     def parse(self, response):
>>>         hxs = Selector(response)
>>>
>>>         path = hxs.xpath(".//div[@id='search_result_container']")
>>>         item = ItemscountItem()
>>>
>>>         for ids in path:
>>>             # extracting all game ids
>>>             gameIds = ids.xpath('.//a/@data-ds-appid').extract()
>>>             item["GameID"] = str(gameIds)
>>>         return item
>>>
>>> *My goal is to extract all the game ids across the 353 pages there.*
>>> I think Ajax is used for pagination. I was not able to extract game
>>> ids from the 2nd page onwards. I tried giving
>>> "http://store.steampowered.com/search/?sort_by=Released_DESC&os=win#sort_by=Released_DESC&page=2"
>>> in start_urls, but it made no difference.
>>>
>>> Please help me with this.
>>>
>>> Thanks,
>>> Chetan Motamarri
