Hi dude,
Thanks for the reply.
It should be enough to use BaseSpider in this case, right? I think there is
no need for CrawlSpider since there are no rules here.
But in that case, can BaseSpider and start_requests go together in one
spider? When I used both, I got the errors shown below. I have also
attached the spider.
2014-07-16 10:32:37-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-16 10:32:37-0700 [hai] DEBUG: Redirecting (302) to <GET http://store.steampowered.com/app/252490/> from <POST http://store.steampowered.com/agecheck/app/252490/>
2014-07-16 10:32:38-0700 [hai] DEBUG: Crawled (200) <GET http://store.steampowered.com/app/252490/> (referer: None)
2014-07-16 10:32:38-0700 [hai] ERROR: Spider error processing <GET http://store.steampowered.com/app/252490/>
        Traceback (most recent call last):
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
            self.runUntilCurrent()
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
            self._startRunCallbacks(result)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "C:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\spider.py", line 56, in parse
            raise NotImplementedError
        exceptions.NotImplementedError:
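
From the traceback I suspect the response of my FormRequest is being handed
to the default parse() callback, which BaseSpider does not implement (my
spider only defines parse_item). If that is the cause, would something like
the sketch below be the right fix? It passes an explicit callback and then
schedules the start_urls by hand (untested; after_agecheck is just a name I
made up, and it needs "from scrapy.http import Request, FormRequest"):

    def start_requests(self):
        # Submit the agecheck form first; give it an explicit callback so
        # the response does not fall through to BaseSpider.parse().
        return [FormRequest(
            'http://store.steampowered.com/agecheck/app/252490/',
            formdata={'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1980'},
            callback=self.after_agecheck,
        )]

    def after_agecheck(self, response):
        # The age cookie should now be set; crawl the real game pages.
        for url in self.start_urls:
            yield Request(url, callback=self.parse_item)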
On Tuesday, July 15, 2014 5:36:34 PM UTC-7, Lhassan Baazzi wrote:
>
> Hi,
>
> You can use a method that parses the CSV and generates a Request to crawl
> each game; in the response of that request, check for the presence of the
> agecheck form:
>
> => IF PRESENT, return a FormRequest that fills in all the form fields.
> => IF ABSENT, parse the response and extract the game information.
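>
> A minimal sketch of that flow (untested; the callback name, the form-field
> names, and the extract_game_item helper are just placeholders you will
> have to adapt to the real agecheck form):
>
>     def parse_game(self, response):
>         sel = Selector(response)
>         if sel.xpath('//form[contains(@action, "agecheck")]'):
>             # agecheck form present: submit it, then re-enter this callback
>             yield FormRequest.from_response(
>                 response,
>                 formdata={'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1980'},
>                 callback=self.parse_game,
>             )
>         else:
>             # no agecheck: extract the game information from this page
>             yield self.extract_game_item(response)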
>
>
> Regards.
>
> Merci.
> ---------
> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
> Email/Gtalk: [email protected] - Skype: baazzilhassan -
> Twitter: @baazzilhassan <http://twitter.com/baazzilhassan>
> Blog: http://blog.jbinfo.io/
> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>
>
> 2014-07-16 0:14 GMT+00:00 Chetan Motamarri <[email protected]>:
>
>> Hi Lhassan,
>>
>> Is it possible to take the input games from a csv, instead of crawling
>> the top 500 games from steamcharts.com?
>>
>> I have a csv file with the list of games whose data is to be scraped, and
>> some of the games in it also require the agecheck. I have attached the
>> csv file. Could you please help with this?
>>
>> I just want to scrape the game data from steam urls like
>> http://store.steampowered.com/app/570/ for all the games in the attached
>> csv.
>>
>>
>> On Wednesday, June 25, 2014 2:09:55 AM UTC-7, Lhassan Baazzi wrote:
>>
>>> Yeh you are right, I pushed a correction for this to the repository,
>>> along with the CSV output:
>>> https://github.com/jbinfo/scrapy_store.steampowered.com/blob/master/output.csv
>>> Perhaps only 476 games are exported; the missing games are due to the
>>> website itself. For example, the game ranked 257,
>>> http://store.steampowered.com/app/202480/, redirects you to the homepage
>>> if you open that link!
>>>
>>>
>>> Regards.
>>>
>>> Merci.
>>> ---------
>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>> Email/Gtalk: [email protected] - Skype: baazzilhassan - Twitter:
>>> @baazzilhassan <http://twitter.com/baazzilhassan>
>>> Blog: http://blog.jbinfo.io/
>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>
>>>
>>> 2014-06-25 2:16 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>
>>> Hi dude,
>>>>
>>>> The code works fine, but only 450 of the 500 games are retrieved. I
>>>> think it is not scraping two of the 20 pages on steamcharts.com. One
>>>> page it is ignoring is http://steamcharts.com/top/p.1; none of the
>>>> games from that url are present in the output csv.
>>>>
>>>> I tried editing the regex "allow=r'(top\/p\.)([1-9]|1[0-9]|20)$'" but
>>>> was not able to resolve it. Could you please help me?
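>>>>
>>>> As a sanity check I ran the pattern through plain re (snippet below),
>>>> and it does match .../top/p.1, so maybe the page-1 link on the site is
>>>> written differently, e.g. just "/top":
>>>>
>>>>     import re
>>>>     pattern = r'(top\/p\.)([1-9]|1[0-9]|20)$'
>>>>     print(re.search(pattern, 'http://steamcharts.com/top/p.1'))  # match
>>>>     print(re.search(pattern, 'http://steamcharts.com/top'))      # None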
>>>>
>>>>
>>>> On Tuesday, June 24, 2014 4:35:10 AM UTC-7, Lhassan Baazzi wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Complex task; look at the github repository, I pushed the script for
>>>>> the top 500 games. See also the output.csv.
>>>>>
>>>>> There are some minor bugs to fix.
>>>>>
>>>>> Regards.
>>>>>
>>>>> Merci.
>>>>> ---------
>>>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>>>> Email/Gtalk: [email protected] - Skype: baazzilhassan
>>>>> Blog: http://blog.jbinfo.io/
>>>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>>>
>>>>>
>>>>> 2014-06-24 8:10 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>>
>>>>> Hi,
>>>>>>
>>>>>> The game links which I want to scrape are the top 500 games extracted
>>>>>> from http://steamcharts.com/top.
>>>>>>
>>>>>> Some example urls are "http://store.steampowered.com/app/570/",
>>>>>> "http://store.steampowered.com/app/730/", and
>>>>>> "http://store.steampowered.com/app/440/"; these are the top 3 games
>>>>>> on steamcharts.com.
>>>>>>
>>>>>> These top 500 game urls don't have any unique structure; they look
>>>>>> like all the other game urls. So how can I scrape only these urls?
>>>>>>
>>>>>> On Monday, June 23, 2014 11:48:17 PM UTC-7, Lhassan Baazzi wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I need the structure of the links that you want to scrape. If you
>>>>>>> look at my code, you can see that I limit the links with this:
>>>>>>>
>>>>>>> rules = (
>>>>>>>     Rule(SgmlLinkExtractor(allow=r'genre/'), follow=True),
>>>>>>>     Rule(SgmlLinkExtractor(allow=r'app/\d+'), callback='parse_item'),
>>>>>>> )
>>>>>>>
>>>>>>> Only links matching app/\d+ and genre/ are followed; anything else
>>>>>>> is rejected.
>>>>>>> So, give me an example of these specific links.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>>
>>>>>>> Merci.
>>>>>>> ---------
>>>>>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>>>>>> Email/Gtalk: [email protected] - Skype: baazzilhassan
>>>>>>> Blog: http://blog.jbinfo.io/
>>>>>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>>>>>
>>>>>>>
>>>>>>> 2014-06-24 7:40 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>>>>
>>>>>>> Hi Lhassan,
>>>>>>>>
>>>>>>>> Thanks for your response. Your code works great, and I got what I
>>>>>>>> was looking for.
>>>>>>>> But I want to crawl only a specific set of urls, i.e. I don't want
>>>>>>>> to crawl all games, so I specified those urls in start_urls[]. But
>>>>>>>> I came to know that we can't use both "def start_requests(self)"
>>>>>>>> and "start_urls[]".
>>>>>>>>
>>>>>>>> Do you have any idea about this? I just want to scrape a specific
>>>>>>>> set of urls (some 500 urls), not all of them.
>>>>>>>>
>>>>>>>> On Friday, June 20, 2014 4:38:15 AM UTC-7, Lhassan Baazzi wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I created a github project containing a scrapy project that
>>>>>>>>> scrapes the data from this website; see the repository:
>>>>>>>>> https://github.com/jbinfo/scrapy_store.steampowered.com
>>>>>>>>> Look at it, clone the project locally, and fix any bugs.
>>>>>>>>>
>>>>>>>>> If you like it, you can make a donation; see my email signature.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>> ---------
>>>>>>>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>>>>>>>> Email/Gtalk: [email protected] - Skype: baazzilhassan
>>>>>>>>> Blog: http://blog.jbinfo.io/
>>>>>>>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-06-19 8:23 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Hi folks,
>>>>>>>>>>
>>>>>>>>>> I am new to scrapy and have an issue which I don't know how to
>>>>>>>>>> solve.
>>>>>>>>>>
>>>>>>>>>> I need to scrape game info from the url
>>>>>>>>>> http://store.steampowered.com/agecheck/app/252490/, but it
>>>>>>>>>> requires an agecheck before it will serve the game data, so I
>>>>>>>>>> need to fill this in once. The website seems to store the age as
>>>>>>>>>> a cookie, since it does not ask for the agecheck on subsequent
>>>>>>>>>> games; i.e. we only need to enter the age for the first game.
>>>>>>>>>>
>>>>>>>>>> So my problem is how to automatically send the drop-down values
>>>>>>>>>> in scrapy, store them as cookies, and use those cookies for the
>>>>>>>>>> subsequent start urls.
>>>>>>>>>>
>>>>>>>>>> Please help me, friends. Thanks in advance.
>>>>>>>>>>
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
import csv
import copy

from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

from ExtractPrice.items import ExtractpriceItem

# Read the game ids from the csv (AllGames1.csv holds one game id per row).
with open(r'D:\Python stuff\PythonWorkSpace2\AllGames1.csv', 'rU') as f:
    reader = csv.reader(f)
    allGames = []  # list to hold all rows in the csv
    for row in reader:
        strng = str(row)
        allGames.append(strng[:-2])  # strip the trailing "']" of the row repr


class SteampoweredComSpider(BaseSpider):
    name = 'hai'
    allowed_domains = ['store.steampowered.com']

    # Prepend the common store url to each id; str(i)[2:] strips the
    # leading "['" left over from the row repr above.
    game_urls = ['http://store.steampowered.com/app/' + str(i)[2:] for i in allGames]
    start_urls = copy.deepcopy(game_urls)  # copying the list to start_urls

    def start_requests(self):
        # Submit the agecheck form once so the age cookie gets set.
        # NOTE: no callback is given here, so the response is handed to the
        # default parse() method, which BaseSpider does not implement --
        # this is what raises the NotImplementedError above. Overriding
        # start_requests() also means start_urls is never scheduled.
        request = FormRequest(
            'http://store.steampowered.com/agecheck/app/252490/',
            formdata={'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1980'},
        )
        return [request]

    def parse_item(self, response):
        sel = Selector(response)
        i = ExtractpriceItem()
        i['name'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Title")]/following-sibling::text())').extract())
        # The trailer <script> carries src/data-hd-src/poster urls; use a
        # group here, not a character class, so whole attribute names match.
        videosvars = sel.xpath('normalize-space(//*[@id="highlight_player_area"]/*[1]/script)').re(r'(?:src|data-hd-src|poster)=("http\://.*?")')
        try:
            i['video'] = videosvars[0]
        except IndexError:
            i['video'] = ''
        try:
            i['video_hd'] = videosvars[1]
        except IndexError:
            i['video_hd'] = ''
        i['genres'] = sel.xpath('//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Developer")]/preceding-sibling::a//text()').extract()
        if not i['genres']:
            i['genres'] = sel.xpath('//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Publisher")]/preceding-sibling::a//text()').extract()
        i['developer'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Developer")]/following-sibling::a[1]/text())').extract())
        i['publisher'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Publisher")]/following-sibling::a[1]/text())').extract())
        i['release_date'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Release Date")]/following-sibling::text())').extract())
        i['languages'] = sel.xpath('//*[@id="main_content"]//*[@class="game_language_options"]//tr[position() > 1]/th[1]//text()').extract()
        macSel = sel.xpath('//*[@id="game_area_sys_req"]/h2[contains(., "Mac System")]/..')
        pcSel = sel.xpath('//*[@id="game_area_sys_req"]/h2[contains(., "PC System")]/..')
        if not macSel and not pcSel:
            pcSel = sel.xpath('//*[@id="game_area_sys_req"]/h2[contains(., "System")]/..')
        return i