Hi,
 I am new to web scraping with Scrapy. I need to crawl the headlines on the 
"http://www.thehindu.com/sport/cricket" site. I want to follow those links 
and extract the class="body" content from each linked page.
 The spider returns the links but doesn't follow them. I tried the Scrapy 
tutorial and other programs available online, but I can't get the result. 
Can you help? 

The code is given below.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import *
from th.items import ThItem
from scrapy.http import Request, HtmlResponse
from scrapy.spider import Spider
import urlparse
import urllib2


class TheSpider(CrawlSpider):
    name="th"
    allowed_domains=["http://www.thehindu.com/"]
    start_urls=["http://www.thehindu.com/sport/cricket"]
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'/sport/cricket/(.*?)/(.*?d{7}).ece^']),
             callback='parse_links', follow=True),
        Rule(SgmlLinkExtractor(), follow=True),
    ]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item=ThItem()
        items=[]
        for url in hxs.xpath('//div/h3/a/@href').extract():
            yield Request(url, meta={'item': item}, callback=self.parse_links)
    def parse_links(self,response):
        hxs = HtmlXPathSelector(response)
        item= response.meta['item']
        item['body'] = hxs.select('.//p/text()').extract()
        return item

I also tried yielding the url in parse() instead of a Request. The output of 
that run is attached to this mail.
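
The link pattern and the domain setting used in the rules above can be checked outside Scrapy. What follows is only a minimal sketch using the Python 3 standard library (`urllib.parse` replaces the Python 2 `urlparse` module imported above). It rests on several assumptions: the sample article URL is hypothetical, the "corrected" pattern is a guess at the intended one (`\d` for digits, an escaped dot, and `$` rather than `^` at the end), and the helpers `matches` and `domain_of` are names made up for this illustration. Scrapy's `allowed_domains` expects bare host names rather than full URLs, which is why the host is extracted separately.

```python
import re
from urllib.parse import urlparse

# Pattern exactly as posted in the first Rule.
POSTED_RE = re.compile(r'/sport/cricket/(.*?)/(.*?d{7}).ece^')

# Assumed intent: seven digits (\d), a literal dot, anchored at the end ($).
ASSUMED_RE = re.compile(r'/sport/cricket/(.*?)/(.*?\d{7})\.ece$')

# Hypothetical article URL, used only to exercise the patterns.
SAMPLE_URL = "http://www.thehindu.com/sport/cricket/some-headline/article1234567.ece"


def matches(pattern, url):
    """Return True if the compiled pattern is found anywhere in the url."""
    return pattern.search(url) is not None


def domain_of(url):
    """Return the host part of a URL, the form allowed_domains expects."""
    return urlparse(url).netloc


if __name__ == "__main__":
    print(matches(POSTED_RE, SAMPLE_URL))    # False
    print(matches(ASSUMED_RE, SAMPLE_URL))   # True
    print(domain_of("http://www.thehindu.com/sport/cricket"))  # www.thehindu.com
```

If the posted pattern really never matches, the first Rule would extract no links at all, which may be one reason parse_links is not reached through the rules.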

Thanks and regards,
Deepa

Attachment: it.json
Description: Binary data
