Re: Is it possible to limit the number of links crawled by crawl spider ?

Paul Tremberth Mon, 29 Sep 2014 04:14:44 -0700

Hi,

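You can use the process_links argument of Rule for this. It is called with the
list of links extracted from each response, so slicing that list keeps only
the first five: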



Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
     process_links=lambda l: l[:5],
     callback='parse_items'),
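
If you want the limit to be configurable, the lambda can be replaced by a small
named helper. This is only a rough sketch: keep_first is a hypothetical name of
my own, not something Scrapy provides, and the sample list stands in for the
link objects the extractor would pass in.

```python
from itertools import islice


def keep_first(n):
    """Build a callable for process_links that keeps only the first n links.

    keep_first is a hypothetical helper for illustration, not part of Scrapy.
    """
    def _process(links):
        # islice stops after n items; shorter lists are returned whole.
        return list(islice(links, n))
    return _process


# Assumed usage inside the spider from this thread (not executed here):
# rules = (
#     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
#          process_links=keep_first(5),
#          callback='parse_items'),
# )

# Stand-in for the list of links extracted from one response.
sample = ["link%d" % i for i in range(8)]
print(keep_first(5)(sample))
```

Note that process_links runs once per response, so the cut-off applies to each
page the rule is applied to, not to the crawl as a whole.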




On Monday, September 29, 2014 9:06:31 AM UTC+2, Chetan Motamarri wrote:
>
> Hi,
>
> I am new to using CrawlSpider...
>
> My problem is that *I need to extract data for the top 5 items on this page* (
> http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems). This
> is what I have so far:
>
> start_urls = [
>     'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
> ]
>
> and specified the rules as
> rules = (
>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>          callback='parse_items'),
> )
>
> Now it crawls every URL that starts with
> "http://steamcommunity.com/sharedfiles/filedetails" on the start_url page
> <http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems>.
>
> What I want is for it to crawl only the first 5 URLs that start with
> "http://steamcommunity.com/sharedfiles/filedetails/" on the start_url page
> <http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems>.
> Can this be done with a CrawlSpider restriction or by any other means?
>
> *My code: *
>
> class ScrapePriceSpider(CrawlSpider):
>     
>     name = 'ScrapeItems'     
>     allowed_domains = ['steamcommunity.com']     
>     start_urls = [
>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>     ]
>
>     rules = (
>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>              callback='parse_items'),
>     )
>
>
>     def parse_items(self, response):
>         hxs = HtmlXPathSelector(response)
>
>         item = ExtractitemsItem()
>
>         item["Item Name"] = hxs.select(
>             "//div[@class='workshopItemTitle']/text()").extract()
>         item["Unique Visits"] = hxs.select(
>             "//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>         item["Current Favorites"] = hxs.select(
>             "//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>         return item
>

