Hi Paul,
It worked, thank you very much, but it is not taking the first 5 URLs on the
start URL page
"http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems";
instead it is crawling 5 random links that start with
"http://steamcommunity.com/sharedfiles/filedetails/".
Can we restrict the CrawlSpider to *crawl only the first 5 links on a page* that
start with the above URL?
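For reference, one way to express that restriction is to do the filtering inside process_links itself, rather than just slicing, so only matching links (in extraction order) are kept. This is a minimal standalone sketch, not the exact code from the thread; the Link namedtuple below stands in for scrapy.link.Link, which exposes a .url attribute, and "first 5" here means first in the order the link extractor yields them, which normally follows document order:

```python
from collections import namedtuple

# Stand-in for scrapy.link.Link (only the .url attribute is needed here).
Link = namedtuple("Link", ["url"])

PREFIX = "http://steamcommunity.com/sharedfiles/filedetails/"

def first_five_filedetails(links, prefix=PREFIX, n=5):
    """Keep only the first n links (in extraction order) whose URL
    starts with the given prefix; drop everything else."""
    kept = []
    for link in links:
        if link.url.startswith(prefix):
            kept.append(link)
            if len(kept) == n:
                break
    return kept

# Hooked into the rule roughly as:
# Rule(SgmlLinkExtractor(allow=(PREFIX,)),
#      process_links=first_five_filedetails,
#      callback='parse_items')
```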
On Monday, September 29, 2014 4:14:13 AM UTC-7, Paul Tremberth wrote:
>
> Hi,
>
> You can use process_links for this:
>
>
> Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>      process_links=lambda l: l[:5],
>      callback='parse_items'),
>
> On Monday, September 29, 2014 9:06:31 AM UTC+2, Chetan Motamarri wrote:
>>
>> Hi,
>>
>> I am new to use crawlspider...
>>
>> My problem is, *I need to extract the top 5 items' data from this link* (
>> http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>> I have done this like this:
>>
>> start_urls = [
>>     'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>> ]
>>
>> and specified rules as
>> rules = (
>>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>          callback='parse_items'),
>> )
>>
>> Now it is crawling through all URLs that start with
>> "http://steamcommunity.com/sharedfiles/filedetails/" on the start_url page
>> (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>
>> My problem is that it should crawl through only the first 5 URLs that start
>> with "http://steamcommunity.com/sharedfiles/filedetails/" on that page.
>> Can we do this with a CrawlSpider restriction, or by any other means?
>>
>> *My code: *
>>
>> class ScrapePriceSpider(CrawlSpider):
>>     name = 'ScrapeItems'
>>     allowed_domains = ['steamcommunity.com']
>>     start_urls = [
>>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>     ]
>>
>>     rules = (
>>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>              callback='parse_items'),
>>     )
>>
>>     def parse_items(self, response):
>>         hxs = HtmlXPathSelector(response)
>>         item = ExtractitemsItem()
>>         item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
>>         item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>>         item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>>         return item
>>
>
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.