You can try a LinkExtractor restricted with XPath:

LinkExtractor(restrict_xpaths=(
    '(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',
))
On Monday, September 29, 2014 7:01:58 PM UTC+2, Chetan Motamarri wrote:
>
> Hi Paul,
>
> It worked, thank you very much, but it is not taking the first 5 URLs on
> the start URL page
> "http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems";
> instead it is crawling a random 5 links that start with
> "http://steamcommunity.com/sharedfiles/filedetails/".
>
> Can we restrict the CrawlSpider to *crawl the first 5 links on a page*
> that start with the above URL?
>
> On Monday, September 29, 2014 4:14:13 AM UTC-7, Paul Tremberth wrote:
>>
>> Hi,
>>
>> You can use process_links for this:
>>
>> Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>      process_links=lambda l: l[:5],
>>      callback='parse_items'),
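>>
>> If you prefer a named helper over the lambda, the equivalent would look
>> something like this (take_first_five is just an illustrative name):
>>
>> def take_first_five(links):
>>     # keep only the first five links the extractor found on this page
>>     return links[:5]
>>
>> Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>      process_links=take_first_five,
>>      callback='parse_items'),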
>>
>> On Monday, September 29, 2014 9:06:31 AM UTC+2, Chetan Motamarri wrote:
>>>
>>> Hi,
>>>
>>> I am new to using CrawlSpider...
>>>
>>> My problem is, *I need to extract the top 5 items' data from this link* (
>>> http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>> I have done it like this:
>>>
>>> start_urls = [
>>>     'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>> ]
>>>
>>> and specified the rules as:
>>>
>>> rules = (
>>>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>         callback='parse_items'),
>>> )
>>>
>>> Now it is crawling through all URLs that start with
>>> "http://steamcommunity.com/sharedfiles/filedetails/" on the start_url page
>>> (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>>
>>> My problem is that it should crawl through only the first 5 URLs that
>>> start with "http://steamcommunity.com/sharedfiles/filedetails/" on that
>>> start_url page.
>>>
>>> Can we do this with a CrawlSpider restriction or by any other means?
>>>
>>> *My code:*
>>>
>>> class ScrapePriceSpider(CrawlSpider):
>>>
>>>     name = 'ScrapeItems'
>>>     allowed_domains = ['steamcommunity.com']
>>>     start_urls = [
>>>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>>     ]
>>>
>>>     rules = (
>>>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>             callback='parse_items'),
>>>     )
>>>
>>>     def parse_items(self, response):
>>>         hxs = HtmlXPathSelector(response)
>>>         item = ExtractitemsItem()
>>>         item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
>>>         item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>>>         item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>>>         return item
>>>
>>
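
A side note: HtmlXPathSelector and .select() are deprecated in recent Scrapy
releases in favor of the unified Selector API, so parse_items could also be
written like this (a minimal sketch keeping the original field names):

def parse_items(self, response):
    item = ExtractitemsItem()
    # response.xpath() replaces HtmlXPathSelector(response).select()
    item["Item Name"] = response.xpath("//div[@class='workshopItemTitle']/text()").extract()
    item["Unique Visits"] = response.xpath("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
    item["Current Favorites"] = response.xpath("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
    return item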