Re: Is it possible to limit the number of links crawled by crawl spider ?

lnxpgn Mon, 29 Sep 2014 02:20:57 -0700

I haven't used SgmlLinkExtractor before, but I think you should use 
http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems as the 
start URL, then try the process_links callback in Rule() to filter the 
extracted URLs down to the top 5 items.
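
A minimal sketch of that filtering idea, untested against the real site. The function name first_n_links, the MAX_LINKS constant, and the Link stand-in are made up for illustration; the stand-in only lets the snippet run without Scrapy installed. It assumes the link extractor returns links in page order, so the first five correspond to the top five items.

```python
from collections import namedtuple

# Stand-in for Scrapy's Link objects (hypothetical, only so this runs standalone).
Link = namedtuple("Link", ["url"])

MAX_LINKS = 5  # assumed limit: the "top 5" items from the question


def first_n_links(links, n=MAX_LINKS):
    """process_links-style filter: keep only the first n extracted links."""
    return list(links)[:n]


# In a CrawlSpider this would be a method taking (self, links), wired up
# roughly like this (not executed here):
#   Rule(SgmlLinkExtractor(allow=("/sharedfiles/filedetails/",)),
#        process_links='first_n_links', callback='parse_items')

# Simulate eight links extracted from the listing page.
sample = [
    Link("http://steamcommunity.com/sharedfiles/filedetails/?id=%d" % i)
    for i in range(8)
]
kept = first_n_links(sample)
print([link.url for link in kept])
```

Note that process_links is called once per response, so the limit applies per page. With the listing page as the only start URL and a callback set on the rule (so links are not followed further by default), that should amount to five item pages in total.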


On 2014-9-29, at 3:06 PM, Chetan Motamarri <[email protected]> wrote:

> Hi,
> 
> I am new to using CrawlSpider... 
> 
> My problem is that I need to extract data for the top 5 items on this page 
> (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems). 
> This is what I have done so far:
> 
> start_urls = [
>     'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
> ]
> 
> and specified rules as 
> rules = (
>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>          callback='parse_items'),
> )
> 
> Now it crawls every URL on the start_url page that starts with 
> "http://steamcommunity.com/sharedfiles/filedetails". 
> 
> My problem is that it should crawl only the first 5 URLs on the start_url 
> page that start with "http://steamcommunity.com/sharedfiles/filedetails/". 
> Can we do this with a CrawlSpider restriction or by any other means?
> 
> My code: 
> 
> class ScrapePriceSpider(CrawlSpider):
> 
>     name = 'ScrapeItems'
>     allowed_domains = ['steamcommunity.com']
>     start_urls = [
>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>     ]
> 
>     rules = (
>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>              callback='parse_items'),
>     )
> 
>     def parse_items(self, response):
>         hxs = HtmlXPathSelector(response)
> 
>         item = ExtractitemsItem()
>         item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
>         item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>         item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>         return item
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.

