Hi dude,
Thanks for the reply.
It should be enough to use BaseSpider in this case, right? I think there is
no need for CrawlSpider since there are no rules here.
But in that case, can BaseSpider and start_requests go together in one
spider? When I used both, I got the errors shown below. I have also
attached the spider.
2014-07-16 10:32:37-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-07-16 10:32:37-0700 [hai] DEBUG: Redirecting (302) to <GET http://store.steampowered.com/app/252490/> from <POST http://store.steampowered.com/agecheck/app/252490/>
2014-07-16 10:32:38-0700 [hai] DEBUG: Crawled (200) <GET http://store.steampowered.com/app/252490/> (referer: None)
2014-07-16 10:32:38-0700 [hai] ERROR: Spider error processing <GET http://store.steampowered.com/app/252490/>
        Traceback (most recent call last):
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
            self.runUntilCurrent()
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
            self._startRunCallbacks(result)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "C:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\spider.py", line 56, in parse
            raise NotImplementedError
        exceptions.NotImplementedError:
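
From the traceback I suspect the response of my FormRequest is being handed
to the default parse() callback, which BaseSpider does not implement (my
spider only defines parse_item). If that is the cause, would something like
the sketch below be the right fix? It passes an explicit callback and then
schedules the start_urls by hand (untested; after_agecheck is just a name I
made up, and it needs "from scrapy.http import Request, FormRequest"):

    def start_requests(self):
        # Submit the agecheck form first; give it an explicit callback so
        # the response does not fall through to BaseSpider.parse().
        return [FormRequest(
            'http://store.steampowered.com/agecheck/app/252490/',
            formdata={'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1980'},
            callback=self.after_agecheck,
        )]

    def after_agecheck(self, response):
        # The age cookie should now be set; crawl the real game pages.
        for url in self.start_urls:
            yield Request(url, callback=self.parse_item)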
On Tuesday, July 15, 2014 5:36:34 PM UTC-7, Lhassan Baazzi wrote:
>
> Hi,
>
> You can use a method that parses the CSV and generates a Request to crawl
> each game; in the response of that request, check for the presence of the
> agecheck form:
>
> => IF PRESENT, return a FormRequest that fills in all the form fields.
> => IF ABSENT, parse the response and extract the game information.
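>
> A minimal sketch of that flow (untested; the callback name, the form-field
> names, and the extract_game_item helper are just placeholders you will
> have to adapt to the real agecheck form):
>
>     def parse_game(self, response):
>         sel = Selector(response)
>         if sel.xpath('//form[contains(@action, "agecheck")]'):
>             # agecheck form present: submit it, then re-enter this callback
>             yield FormRequest.from_response(
>                 response,
>                 formdata={'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1980'},
>                 callback=self.parse_game,
>             )
>         else:
>             # no agecheck: extract the game information from this page
>             yield self.extract_game_item(response)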
>
>
> Regards.
>
> Merci.
> ---------
> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
> Email/Gtalk: [email protected] - Skype: baazzilhassan -
> Twitter: @baazzilhassan <http://twitter.com/baazzilhassan>
> Blog: http://blog.jbinfo.io/
> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>
>
> 2014-07-16 0:14 GMT+00:00 Chetan Motamarri <[email protected]>:
>
>> Hi Lhassan,
>>
>> Is it possible to take the input games from a csv, instead of crawling
>> the top 500 games from steamcharts.com?
>>
>> I have a csv file with the list of games whose data is to be scraped, and
>> some of the games in it also require the agecheck. I have attached the
>> csv file. Could you please help with this?
>>
>> I just want to scrape the game data from steam urls like
>> http://store.steampowered.com/app/570/ for all the games in the attached
>> csv.
>>
>>
>> On Wednesday, June 25, 2014 2:09:55 AM UTC-7, Lhassan Baazzi wrote:
>>
>>> Yeh you are right, I pushed a correction for this to the repository,
>>> along with the CSV output:
>>> https://github.com/jbinfo/scrapy_store.steampowered.com/blob/master/output.csv
>>> Perhaps only 476 games are exported; the missing games are due to the
>>> website itself. For example, the game ranked 257,
>>> http://store.steampowered.com/app/202480/, redirects you to the homepage
>>> if you open that link!
>>>
>>>
>>> Regards.
>>>
>>> Merci.
>>> ---------
>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>> Email/Gtalk: [email protected] - Skype: baazzilhassan - Twitter:
>>> @baazzilhassan <http://twitter.com/baazzilhassan>
>>> Blog: http://blog.jbinfo.io/
>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>
>>>
>>> 2014-06-25 2:16 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>
>>> Hi dude,
>>>>
>>>> The code works fine, but only 450 of the 500 games are retrieved. I
>>>> think it is not scraping two of the 20 pages on steamcharts.com. One
>>>> page it is ignoring is http://steamcharts.com/top/p.1; none of the
>>>> games from that url are present in the output csv.
>>>>
>>>> I tried editing the regex "allow=r'(top\/p\.)([1-9]|1[0-9]|20)$'" but
>>>> was not able to resolve it. Could you please help me?
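>>>>
>>>> As a sanity check I ran the pattern through plain re (snippet below),
>>>> and it does match .../top/p.1, so maybe the page-1 link on the site is
>>>> written differently, e.g. just "/top":
>>>>
>>>>     import re
>>>>     pattern = r'(top\/p\.)([1-9]|1[0-9]|20)$'
>>>>     print(re.search(pattern, 'http://steamcharts.com/top/p.1'))  # match
>>>>     print(re.search(pattern, 'http://steamcharts.com/top'))      # None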
>>>>
>>>>
>>>> On Tuesday, June 24, 2014 4:35:10 AM UTC-7, Lhassan Baazzi wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Complex task; look at the github repository, I pushed the script for
>>>>> the top 500 games. See also the output.csv.
>>>>>
>>>>> There are some minor bugs to fix.
>>>>>
>>>>> Regards.
>>>>>
>>>>> Merci.
>>>>> ---------
>>>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>>>> Email/Gtalk: [email protected] - Skype: baazzilhassan
>>>>> Blog: http://blog.jbinfo.io/
>>>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>>>
>>>>>
>>>>> 2014-06-24 8:10 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>>
>>>>> Hi,
>>>>>>
>>>>>> The game links which I want to scrape are the top 500 games extracted
>>>>>> from http://steamcharts.com/top.
>>>>>>
>>>>>> Some example urls are "http://store.steampowered.com/app/570/",
>>>>>> "http://store.steampowered.com/app/730/", and
>>>>>> "http://store.steampowered.com/app/440/"; these are the top 3 games
>>>>>> on steamcharts.com.
>>>>>>
>>>>>> These top 500 game urls don't have any unique structure; they look
>>>>>> like all the other game urls. So how can I scrape only these urls?
>>>>>>
>>>>>> On Monday, June 23, 2014 11:48:17 PM UTC-7, Lhassan Baazzi wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I need the structure of the links that you want to scrape. If you
>>>>>>> look at my code, you can see that I limit the links with this:
>>>>>>>
>>>>>>> rules = (
>>>>>>>     Rule(SgmlLinkExtractor(allow=r'genre/'), follow=True),
>>>>>>>     Rule(SgmlLinkExtractor(allow=r'app/\d+'), callback='parse_item'),
>>>>>>> )
>>>>>>>
>>>>>>> Only links matching app/\d+ and genre/ are followed; anything else
>>>>>>> is rejected.
>>>>>>> So, give me an example of these specific links.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>>
>>>>>>> Merci.
>>>>>>> ---------
>>>>>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>>>>>> Email/Gtalk: [email protected] - Skype: baazzilhassan
>>>>>>> Blog: http://blog.jbinfo.io/
>>>>>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>>>>>
>>>>>>>
>>>>>>> 2014-06-24 7:40 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>>>>
>>>>>>> Hi Lhassan,
>>>>>>>>
>>>>>>>> Thanks for your response. Your code works great, and I got what I
>>>>>>>> was looking for.
>>>>>>>> But I want to crawl only a specific set of urls, i.e. I don't want
>>>>>>>> to crawl all games, so I specified those urls in start_urls[]. But
>>>>>>>> I came to know that we can't use both "def start_requests(self)"
>>>>>>>> and "start_urls[]".
>>>>>>>>
>>>>>>>> Do you have any idea about this? I just want to scrape a specific
>>>>>>>> set of urls (some 500 urls), not all of them.
>>>>>>>>
>>>>>>>> On Friday, June 20, 2014 4:38:15 AM UTC-7, Lhassan Baazzi wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I created a github project containing a scrapy project that
>>>>>>>>> scrapes the data from this website; see the repository:
>>>>>>>>> https://github.com/jbinfo/scrapy_store.steampowered.com
>>>>>>>>> Look at it, clone the project locally, and fix any bugs.
>>>>>>>>>
>>>>>>>>> If you like it, you can make a donation; see my email signature.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>> ---------
>>>>>>>>> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
>>>>>>>>> Email/Gtalk: [email protected] - Skype: baazzilhassan
>>>>>>>>> Blog: http://blog.jbinfo.io/
>>>>>>>>> Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-06-19 8:23 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>>>>>>
>>>>>>>>>> Hi folks,
>>>>>>>>>>
>>>>>>>>>> I am new to scrapy and have an issue which I don't know how to
>>>>>>>>>> solve.
>>>>>>>>>>
>>>>>>>>>> I need to scrape game info from the url
>>>>>>>>>> http://store.steampowered.com/agecheck/app/252490/, but it
>>>>>>>>>> requires an agecheck before it will serve the game data, so I
>>>>>>>>>> need to fill this in once. The website seems to store the age as
>>>>>>>>>> a cookie, since it does not ask for the agecheck on subsequent
>>>>>>>>>> games; i.e. we only need to enter the age for the first game.
>>>>>>>>>>
>>>>>>>>>> So my problem is how to automatically send the drop-down values
>>>>>>>>>> in scrapy, store them as cookies, and use those cookies for the
>>>>>>>>>> subsequent start urls.
>>>>>>>>>>
>>>>>>>>>> Please help me, friends. Thanks in advance.
>>>>>>>>>>
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
import csv
import copy

from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

from ExtractPrice.items import ExtractpriceItem

# Read the game ids from the csv (AllGames1.csv holds one game id per row).
with open(r'D:\Python stuff\PythonWorkSpace2\AllGames1.csv', 'rU') as f:
    reader = csv.reader(f)
    allGames = []  # list to hold all rows in the csv
    for row in reader:
        strng = str(row)
        allGames.append(strng[:-2])  # strip the trailing "']" of the row repr


class SteampoweredComSpider(BaseSpider):
    name = 'hai'
    allowed_domains = ['store.steampowered.com']

    # Prepend the common store url to each id; str(i)[2:] strips the
    # leading "['" left over from the row repr above.
    game_urls = ['http://store.steampowered.com/app/' + str(i)[2:] for i in allGames]
    start_urls = copy.deepcopy(game_urls)  # copying the list to start_urls

    def start_requests(self):
        # Submit the agecheck form once so the age cookie gets set.
        # NOTE: no callback is given here, so the response is handed to the
        # default parse() method, which BaseSpider does not implement --
        # this is what raises the NotImplementedError above. Overriding
        # start_requests() also means start_urls is never scheduled.
        request = FormRequest(
            'http://store.steampowered.com/agecheck/app/252490/',
            formdata={'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1980'},
        )
        return [request]

    def parse_item(self, response):
        sel = Selector(response)
        i = ExtractpriceItem()
        i['name'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Title")]/following-sibling::text())').extract())
        # The trailer <script> carries src/data-hd-src/poster urls; use a
        # group here, not a character class, so whole attribute names match.
        videosvars = sel.xpath('normalize-space(//*[@id="highlight_player_area"]/*[1]/script)').re(r'(?:src|data-hd-src|poster)=("http\://.*?")')
        try:
            i['video'] = videosvars[0]
        except IndexError:
            i['video'] = ''
        try:
            i['video_hd'] = videosvars[1]
        except IndexError:
            i['video_hd'] = ''
        i['genres'] = sel.xpath('//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Developer")]/preceding-sibling::a//text()').extract()
        if not i['genres']:
            i['genres'] = sel.xpath('//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Publisher")]/preceding-sibling::a//text()').extract()
        i['developer'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Developer")]/following-sibling::a[1]/text())').extract())
        i['publisher'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Publisher")]/following-sibling::a[1]/text())').extract())
        i['release_date'] = ''.join(sel.xpath('normalize-space(//*[@id="main_content"]//*[@class="block_content_inner"]//*[contains(text(), "Release Date")]/following-sibling::text())').extract())
        i['languages'] = sel.xpath('//*[@id="main_content"]//*[@class="game_language_options"]//tr[position() > 1]/th[1]//text()').extract()
        macSel = sel.xpath('//*[@id="game_area_sys_req"]/h2[contains(., "Mac System")]/..')
        pcSel = sel.xpath('//*[@id="game_area_sys_req"]/h2[contains(., "PC System")]/..')
        if not macSel and not pcSel:
            pcSel = sel.xpath('//*[@id="game_area_sys_req"]/h2[contains(., "System")]/..')
        return i