Hi Paul,

Thanks again, bro. It is retrieving only the first 3 items, but I want the
first 5. I don't know where it went wrong. Could you please help me? Here is
my code:
class ScrapePriceSpider(CrawlSpider):
    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']
    start_urls = [
        'http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems',
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',)),
             process_links=lambda l: l[:5],
             callback='parse_items'),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        item = ExtractitemsItem()
        uniqueVisits = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        CurrentFavorites = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        itemname = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        # str(...)[3:-2] strips the "[u'" / "']" wrapper from the list's repr;
        # itemname[0] would be the more direct way to get the string.
        item["Item"] = str(itemname)[3:-2]
        item["UniqueVisits"] = str(uniqueVisits)[3:-2]
        item["CurrentFavorites"] = str(CurrentFavorites)[3:-2]
        return item
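One plausible explanation for getting 3 items instead of 5 (assuming, without seeing the live markup, that each workshop tile contains two anchors to the same filedetails page, e.g. a thumbnail link plus a title link): LinkExtractor deduplicates URLs by default (unique=True), so [position()<6] matches 5 anchors that collapse into roughly 3 unique links. A minimal sketch of that collapse:

```python
# Sketch (assumed markup): each item tile repeats its link, e.g. a
# thumbnail anchor plus a title anchor pointing at the same URL.
anchors = [
    "http://steamcommunity.com/sharedfiles/filedetails/?id=1",
    "http://steamcommunity.com/sharedfiles/filedetails/?id=1",  # duplicate
    "http://steamcommunity.com/sharedfiles/filedetails/?id=2",
    "http://steamcommunity.com/sharedfiles/filedetails/?id=2",  # duplicate
    "http://steamcommunity.com/sharedfiles/filedetails/?id=3",
]

first_five = anchors[:5]                   # what [position()<6] matches
unique = list(dict.fromkeys(first_five))   # LinkExtractor dedup (unique=True)
print(len(unique))  # 3 unique pages, not 5
```

If that is what is happening, widening the predicate (e.g. [position()<11]) while keeping process_links=lambda l: l[:5], or restricting the XPath to a single anchor per tile, should get back to 5 unique links.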
On Monday, September 29, 2014 10:33:23 AM UTC-7, Paul Tremberth wrote:
>
> You can try with LinkExtractor and XPath:
>
> LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',))
>
>
>
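The outer parentheses in that XPath are what make it take the first 5 anchors document-wide rather than the first 5 within each parent element. A standalone lxml sketch (toy markup, not the real Steam page) of the difference:

```python
from lxml import etree

# Demo of why the outer parentheses matter:
#   (//a)[position()<N]  -> first N-1 anchors in the whole document
#   //a[position()<N]    -> first N-1 anchors *within each parent*
doc = etree.fromstring(
    "<body>"
    "<div><a href='a1'/><a href='a2'/><a href='a3'/></div>"
    "<div><a href='b1'/><a href='b2'/></div>"
    "</body>"
)

global_first = doc.xpath("(//a)[position()<3]")   # document-wide: 2 nodes
per_parent = doc.xpath("//a[position()<3]")       # 2 per div: 4 nodes
print(len(global_first), len(per_parent))
```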
> On Monday, September 29, 2014 7:01:58 PM UTC+2, Chetan Motamarri wrote:
>>
>> Hi Paul,
>>
>> It worked, thank you very much! But it is not taking the first 5 URLs on
>> the start page
>> "http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems";
>> instead it is crawling a random 5 links that start with
>> "http://steamcommunity.com/sharedfiles/filedetails/".
>>
>> Can we restrict the CrawlSpider to *crawl the first 5 links on the page*
>> that start with the above URL?
>>
>> On Monday, September 29, 2014 4:14:13 AM UTC-7, Paul Tremberth wrote:
>>>
>>> Hi,
>>>
>>> You can use process_links for this:
>>>
>>>
>>>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>          process_links=lambda l: l[:5],
>>>          callback='parse_items'),
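process_links receives the full list of Link objects the extractor found on the page and returns the subset the spider should actually follow, so the lambda above could equally be a named function. A minimal sketch of just the truncation logic:

```python
# process_links is handed every link the extractor found on a page
# and returns the ones the spider should actually follow.
def first_five(links):
    """Keep only the first 5 extracted links, in page order."""
    return links[:5]

# Stand-in for a list of extracted Link objects:
print(first_five(list(range(8))))  # [0, 1, 2, 3, 4]
```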
>>>
>>>
>>>
>>>
>>> On Monday, September 29, 2014 9:06:31 AM UTC+2, Chetan Motamarri wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am new to using CrawlSpider...
>>>>
>>>> My problem is: *I need to extract the top 5 items' data from this link* (
>>>> http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>>> I have done it like this:
>>>>
>>>> start_urls = [
>>>>     'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>>> ]
>>>>
>>>> and specified the rules as:
>>>>
>>>> rules = (
>>>>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>>          callback='parse_items'),
>>>> )
>>>>
>>>> Now it is crawling through all URLs that start with
>>>> "http://steamcommunity.com/sharedfiles/filedetails" on the start_url page
>>>> (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>>>
>>>> My problem is that it should crawl only the first 5 URLs starting with
>>>> "http://steamcommunity.com/sharedfiles/filedetails/" on that page.
>>>>
>>>> Can we do this with a CrawlSpider restriction or by any other means?
>>>>
>>>> *My code:*
>>>>
>>>> class ScrapePriceSpider(CrawlSpider):
>>>>     name = 'ScrapeItems'
>>>>     allowed_domains = ['steamcommunity.com']
>>>>     start_urls = [
>>>>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>>>     ]
>>>>
>>>>     rules = (
>>>>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>>              callback='parse_items'),
>>>>     )
>>>>
>>>>     def parse_items(self, response):
>>>>         hxs = HtmlXPathSelector(response)
>>>>         item = ExtractitemsItem()
>>>>         item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
>>>>         item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>>>>         item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>>>>         return item
>>>>
>>>
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.