Is it possible to limit the number of links crawled by CrawlSpider?

Chetan Motamarri Mon, 29 Sep 2014 00:06:53 -0700

Hi,

I am new to using CrawlSpider.


My problem is that I need to extract the data for the top 5 items on this page
(http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
This is what I have done so far:

start_urls = ['http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext=']

and specified the rules as:

rules = (
    Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
         callback='parse_items'),
)

Now it crawls every URL that starts with
"http://steamcommunity.com/sharedfiles/filedetails" on the start_url page
<http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems>.

What I want is for it to crawl only the first 5 URLs that start with
"http://steamcommunity.com/sharedfiles/filedetails/" on the start_url page
<http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems>.
Can this be done with a CrawlSpider restriction, or by any other means?

*My code:*

class ScrapePriceSpider(CrawlSpider):

    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']
    start_urls = ['http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext=']

    rules = (
        Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
             callback='parse_items'),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        item = ExtractitemsItem()

        item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        return item
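
One possible direction, offered as a minimal sketch rather than a confirmed
solution: Scrapy's Rule accepts a process_links argument, a callable that
receives the list of links extracted from each response and returns the links
that should actually be followed. Truncating that list to five entries would
limit the crawl to the first five matches per page. The helper name
take_first_links and the constant MAX_LINKS are hypothetical, and the sketch
assumes that start_urls points at the workshop browse page and that the link
extractor returns links in page order.

```python
# Hypothetical helper: keep only the first few links extracted from a page.
MAX_LINKS = 5  # assumed limit, taken from the "top 5 items" requirement


def take_first_links(links, limit=MAX_LINKS):
    """Return at most `limit` links, preserving the order they arrived in."""
    return list(links)[:limit]


# Hypothetical wiring inside the spider (needs Scrapy, so it is left as a
# comment here rather than executed):
#
#     rules = (
#         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
#              callback='parse_items',
#              process_links=take_first_links),
#     )

if __name__ == "__main__":
    # Plain strings stand in for Scrapy Link objects in this demonstration.
    sample = ["link%d" % i for i in range(8)]
    print(take_first_links(sample))
```

Because the rule sets a callback and does not set follow, links on the item
pages themselves are not followed, so the limit applies only to the links found
on the start page.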

