Hi, I am new to web scraping with Scrapy. I need to crawl the headlines on the "http://www.thehindu.com/sport/cricket" page. I want to follow each headline link and extract the content of the element with class "body" from the linked page. My spider returns the links but does not follow them. I have tried the Scrapy tutorial and other programs available online, but I can't get the result I need. Can you help?
The code is given below.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import *
from th.items import ThItem
from scrapy.http import Request, HtmlResponse
from scrapy.spider import Spider
import urlparse
import urllib2

class TheSpider(CrawlSpider):
    name = "th"
    allowed_domains = ["http://www.thehindu.com/"]
    start_urls = ["http://www.thehindu.com/sport/cricket"]
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'/sport/cricket/(.*?)/(.*?d{7}).ece^']),
             callback='parse_links', follow=True),
        Rule(SgmlLinkExtractor(), follow=True),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = ThItem()
        items = []
        for url in hxs.xpath('//div/h3/a/@href').extract():
            yield Request(url, meta={'item': item}, callback=self.parse_links)

    def parse_links(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['body'] = hxs.select('.//p/text()').extract()
        return item
I also tried yielding the URL itself in parse() instead of a Request. The output of that run is attached to this mail.
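For reference, two values in the spider above can be checked outside Scrapy: the allow pattern and the allowed_domains entry. The sketch below uses only the Python 3 standard library. The sample article URL is hypothetical, the "assumed" pattern is only a guess at the intended URL shape (seven digits followed by ".ece"), and the helper names are made up for illustration. It also assumes the link extractor searches for the pattern anywhere in the URL, and that allowed_domains is meant to hold bare domain names rather than full URLs.

```python
import re
from urllib.parse import urlparse

# Hypothetical article URL, assumed to look like
# /sport/cricket/<section>/<name><7 digits>.ece
SAMPLE_URL = "http://www.thehindu.com/sport/cricket/some-section/article1234567.ece"

# Pattern exactly as written in the spider: "d{7}" means seven literal "d"
# characters, and a trailing "^" can only match at the start of the string.
ORIGINAL_PATTERN = r'/sport/cricket/(.*?)/(.*?d{7}).ece^'

# Assumed intent: seven digits, a literal ".ece", then end of string.
ASSUMED_PATTERN = r'/sport/cricket/(.*?)/(.*?\d{7})\.ece$'


def pattern_matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


def is_bare_domain(entry):
    """Return True if the entry looks like a domain name rather than a URL."""
    return "://" not in entry and "/" not in entry


def host_in_domain(url, domain):
    """Return True if the URL's host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    print("original pattern matches:", pattern_matches(ORIGINAL_PATTERN, SAMPLE_URL))
    print("assumed pattern matches: ", pattern_matches(ASSUMED_PATTERN, SAMPLE_URL))
    print("entry is a bare domain:  ", is_bare_domain("http://www.thehindu.com/"))
    print("host is in thehindu.com: ", host_in_domain(SAMPLE_URL, "thehindu.com"))
```

If the real article URLs have a different shape, only ASSUMED_PATTERN and SAMPLE_URL need to change; the checks themselves stay the same.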
Thanks and regards,
Deepa
Attachment: it.json
