Yeah, you are right. I pushed a correction for this to the repository, along with the CSV output: https://github.com/jbinfo/scrapy_store.steampowered.com/blob/master/output.csv

Only 476 games are exported, though; the missing games are due to the website itself. For example, the game ranked 257, http://store.steampowered.com/app/202480/, redirects you to the homepage if you open that link!
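If you want to see which other games are lost this way, the redirect chain followed by RedirectMiddleware is available in response.meta['redirect_urls'], so the spider can detect and log them. A rough, untested sketch of such a guard at the top of parse_item:

    def parse_item(self, response):
        # If the request was redirected and we landed on the store
        # homepage, the game page no longer exists -- skip it.
        if response.meta.get('redirect_urls') \
                and response.url.rstrip('/') == 'http://store.steampowered.com':
            self.log('Redirected to homepage, skipping %s'
                     % response.meta['redirect_urls'][0])
            return
        # ... normal extraction of the game fields continues here ...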
Regards.

Thanks.
---------
Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
Email/Gtalk: [email protected] - Skype: baazzilhassan - Twitter: @baazzilhassan <http://twitter.com/baazzilhassan>
Blog: http://blog.jbinfo.io/
Donate - PayPal - <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BR744DG33RAGN>

2014-06-25 2:16 GMT+01:00 Chetan Motamarri <[email protected]>:

> Hi dude,
>
> The code works very well, but only 450 games out of 500 are retrieved. I
> think it is not scraping two of the 20 pages on steamcharts.com. I found
> out that one page it is ignoring is "http://steamcharts.com/top/p.1";
> there are no games from that page in the output csv.
>
> I tried editing the regex, allow=r'(top\/p\.)([1-9]|1[0-9]|20)$', but was
> not able to resolve it. Could you please help me?
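Note that your regex does match /top/p.1; the more likely cause is that page 1 of the ranking is really the bare /top URL (and /top/p.1 may simply redirect there, in which case the duplicate filter drops it because /top was already crawled as the start URL). Making the p.N suffix optional lets the extractor accept the bare page as well. The correction I pushed is roughly along these lines (a sketch, not the exact commit):

    rules = (
        # Paging links: the bare /top page plus /top/p.1 .. /top/p.20.
        Rule(SgmlLinkExtractor(allow=r'top(/p\.([1-9]|1[0-9]|20))?$'),
             follow=True),
        # Individual game pages.
        Rule(SgmlLinkExtractor(allow=r'app/\d+'), callback='parse_item'),
    )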
> On Tuesday, June 24, 2014 4:35:10 AM UTC-7, Lhassan Baazzi wrote:
>
>> Hi,
>>
>> Complex task! Look at the github repository: I pushed the script for
>> the top 500 games; see also the output.csv.
>>
>> There are still some minor bugs to fix.
>>
>> Regards.
>> Lhassan
>>
>> 2014-06-24 8:10 GMT+01:00 Chetan Motamarri <[email protected]>:
>>
>>> Hi,
>>>
>>> The game links which I want to scrape are the top 500 games extracted
>>> from http://steamcharts.com/top.
>>>
>>> Some example urls are "http://store.steampowered.com/app/570/",
>>> "http://store.steampowered.com/app/730/" and
>>> "http://store.steampowered.com/app/440/". These are the top 3 games
>>> on steamcharts.com.
>>>
>>> These top 500 game urls don't have any unique structure; they look
>>> like all the other game urls. So how do I scrape only these urls?
>>>
>>> On Monday, June 23, 2014 11:48:17 PM UTC-7, Lhassan Baazzi wrote:
>>>
>>>> Hi,
>>>>
>>>> I need the structure of the links that you want to scrape. If you
>>>> look at my code, you can see that I limit the links with this:
>>>>
>>>> rules = (
>>>>     Rule(SgmlLinkExtractor(allow=r'genre/'), follow=True),
>>>>     Rule(SgmlLinkExtractor(allow=r'app/\d+'), callback='parse_item'),
>>>> )
>>>>
>>>> Only links matching genre/ are followed and only links matching
>>>> app/\d+ are parsed; anything else is rejected. So, give me an
>>>> example of these specific links.
>>>>
>>>> Regards.
>>>> Lhassan
>>>>
>>>> 2014-06-24 7:40 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>
>>>>> Hi Lhassan,
>>>>>
>>>>> Thanks for your response. Your code is amazing and I got what I was
>>>>> looking for. But I want to crawl only a specific set of urls, i.e.
>>>>> I don't want to crawl all games, so I specified those urls in
>>>>> start_urls[]. But I came to know that we can't use both
>>>>> "def start_requests(self)" and "start_urls[]".
>>>>>
>>>>> So do you have any idea about this? I just want to scrape a
>>>>> specific set of urls (some 500 urls), not all urls.
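start_urls is only consumed by the default start_requests(), so once you override start_requests() you yield the requests for your 500 urls yourself and you don't need the crawl rules at all. A minimal untested sketch (the class name and the url list are placeholders):

    from scrapy.http import Request
    from scrapy.spider import Spider

    class SteamTopGamesSpider(Spider):
        name = 'steam_top_games'

        def start_requests(self):
            # The 500 urls could be read from a file, or scraped from
            # steamcharts.com/top beforehand.
            urls = ['http://store.steampowered.com/app/570/',
                    'http://store.steampowered.com/app/730/',
                    'http://store.steampowered.com/app/440/']
            for url in urls:
                # parse_item is the same extraction callback as in the
                # CrawlSpider version.
                yield Request(url, callback=self.parse_item)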
>>>>> On Friday, June 20, 2014 4:38:15 AM UTC-7, Lhassan Baazzi wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I created a github project containing a scrapy project that
>>>>>> scrapes data from this website; see the repository:
>>>>>> https://github.com/jbinfo/scrapy_store.steampowered.com
>>>>>> Look at it, clone the project locally, and fix the bugs.
>>>>>>
>>>>>> If you like it, you can make a donation; see my email signature.
>>>>>>
>>>>>> Regards.
>>>>>> Lhassan
>>>>>>
>>>>>> 2014-06-19 8:23 GMT+01:00 Chetan Motamarri <[email protected]>:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I am new to scrapy and I have an issue which I don't know how to
>>>>>>> solve.
>>>>>>>
>>>>>>> I need to scrape game info from the url
>>>>>>> "http://store.steampowered.com/agecheck/app/252490/", but it
>>>>>>> requires an age check before it will show the game data, so I
>>>>>>> need to fill this in once. The website stores the info in cookies
>>>>>>> (I guess), as it does not ask for the age check again for
>>>>>>> subsequent games; i.e. we only need to enter the age for the
>>>>>>> first game, then it is stored automatically.
>>>>>>>
>>>>>>> So my problem is: how do I automatically send the drop-down
>>>>>>> values in scrapy, store them as cookies, and use those cookies
>>>>>>> for the subsequent start urls?
>>>>>>>
>>>>>>> Please help me, friends. Thanks in advance.
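One common way to handle the age gate is to send the agecheck page to a dedicated callback that submits the form with FormRequest.from_response; Scrapy's built-in cookie middleware then keeps the resulting cookie for every later request in the same crawl, so the gate only appears once. A minimal untested sketch (the form field names are guesses -- check the actual <form> in the page source):

    from scrapy.http import FormRequest

    def parse_agecheck(self, response):
        # Submit the age form once; the cookie middleware reuses the
        # session cookie for all subsequent requests automatically.
        return FormRequest.from_response(
            response,
            formdata={'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1980'},
            callback=self.parse_item)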
