Simple Scrapy problem but don't know how to resolve :( :( :(

Chetan Motamarri Wed, 11 Feb 2015 00:55:49 -0800

Hello All,

1. I want to count how many times the text "minutes" and "hours" occurs in 
the response from this URL:
http://steamcommunity.com/forum/3703047/trading/render/-1/?start=0&count=50
(In this URL the id is 3703047.)


2. After this, I should increase start to 51 in the above URL, keeping count 
as 50. The new URL is:
http://steamcommunity.com/forum/3703047/trading/render/-1/?start=51&count=50
   Again I have to get the count of those two words in this second URL.

3. Now the problem is that I should add the counts obtained in steps 1 and 2, 
place the total in item["total_count"], and return this item in Scrapy.

4. The process then repeats for another id, 3381077, as shown in start_urls 
below.
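
To make steps 1 and 2 concrete, here is a rough sketch in plain Python (no Scrapy involved). The helper names page_url and count_words are made up for illustration, and it counts the literal words "minutes" and "hours" as described in the steps, which may not match exactly what the real pages need:

```python
import re

# URL pattern taken from the steps above; the helper names are hypothetical.
BASE = ("http://steamcommunity.com/forum/{forum_id}"
        "/trading/render/-1/?start={start}&count={count}")


def page_url(forum_id, start, count=50):
    """Build the URL for one page of a forum, e.g. start=0 or start=51."""
    return BASE.format(forum_id=forum_id, start=start, count=count)


def count_words(topics_html):
    """Count occurrences of "minutes" and "hours" in one page's HTML string."""
    minutes = re.findall(r"minutes", topics_html)
    hours = re.findall(r"hours", topics_html)
    return len(minutes) + len(hours)
```

For example, page_url("3703047", 51) gives the second URL from step 2, and count_words would be called once per page on the topics_html string.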

My ultimate goal is to add up the count values whenever the id in the start 
URLs is the same, and then return that total for the id as item["total_count"].


The program I wrote returns the count individually for each start URL. But I 
need to add the counts when the ids in the start URLs are the same and return 
the sum. Please help me with this.
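
The result I am after can be sketched in plain Python like this. It is only an illustration of the summing, not a Scrapy solution: the function names are hypothetical, the (url, count) pairs are assumed to come from somewhere else, and the id is taken from the URL with the same split("/")[4] that my spider uses:

```python
from collections import defaultdict


def forum_id_from_url(url):
    """Return the forum id, the 5th "/"-separated piece of the URL."""
    return url.split("/")[4]


def sum_counts_by_id(page_counts):
    """Sum per-page counts for pages that share the same forum id.

    page_counts is an iterable of (url, count) pairs, one per fetched page.
    """
    totals = defaultdict(int)
    for url, count in page_counts:
        totals[forum_id_from_url(url)] += count
    return dict(totals)
```

Given one (url, count) pair per page, this returns a dict with one total per id, which is the value I want to end up in item["total_count"].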

import json
import re

class MySpider(BaseSpider):
    name = "TotalCount"
    allowed_domains = ["steamcommunity.com"]
    start_urls = [
        "http://steamcommunity.com/forum/3703047/Trading/render/-1/?start=0&count=50",
        "http://steamcommunity.com/forum/3703047/Trading/render/-1/?start=51&count=50",
        "http://steamcommunity.com/forum/3381077/Trading/render/-1/?start=0&count=50",
        "http://steamcommunity.com/forum/3381077/Trading/render/-1/?start=51&count=50",
    ]

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())

        items = []
        item = TodaydiscussionsItem()
        # Extract the data in the JSON attribute topics_html
        topics_html = jsonresponse["topics_html"]

        # Place the start URL's id in item["Id"]
        AppId = str(response.url)
        item["Id"] = str(AppId.split("/")[4])

        # Find the number of times the 'minutes, hour, hours' text occurred
        # in the JSON data
        count_minutes = re.findall("minutes ago", topics_html)
        count_hour = re.findall("hour", topics_html)

        item["total_count"] = len(count_minutes) + len(count_hour)

        items.append(item)
        return items


I could also change the URL to 
"http://steamcommunity.com/forum/3703047/Trading/render/-1/?start=0&count=100", 
but in that case I do not get the exact count of those two words from the 
JSON. So please don't suggest replacing the two URLs with a single URL like 
that.
