allowed_domains = ["http://corteidh.or.cr/"] should be allowed_domains = ["corteidh.or.cr"]
Also, you are overriding the parse method, which CrawlSpider needs for its own link-following logic; move that code into parse_start_url instead:
http://doc.scrapy.org/en/master/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
A sketch with both changes applied is at the bottom of this message.

On Thursday, October 23, 2014 at 07:38:17 UTC-2, [email protected] wrote:
>
> Hi everyone,
>
> I am new to Scrapy, I'm using Scrapy 0.24 and I'd like to scrape this website:
> http://www.corteidh.or.cr/CF/Jurisprudencia/Jurisprudencia_Search_avan.cfm?lang=es
>
> As you will see this pops up a form, so I just pressed "Buscar" and saved the
> results page locally on my hard drive to do some tests without sending
> requests to the website the whole time.
>
> So I created a CrawlSpider to scrape the results page and to follow all the
> links that say "Ver Ficha Técnica del Caso".
>
> My settings for the CrawlSpider are:
>
>     name = "iachr"
>     allowed_domains = ["http://corteidh.or.cr/"]
>     start_urls = [
>         "file:///D:/Jurisprudencia%20Buscador.htm"
>     ]
>
> All the links I want to follow contain "ficha.cfm", so I have the following rule:
>
>     rules = (
>         # parse the technical files (for extracting case metadata)
>         Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)),
>              callback='parse_2'),
>     )
>
> However, when I run the crawler it seems that it cannot follow the links, and
> it just scrapes the start page. When I run
>
>     scrapy parse "file:///D:/Jurisprudencia%20Buscador.htm"
>
> it returns no links. It could be something in the regex, even though I've
> tried quite a few combinations.
>
> My code looks like this:
>
>     class IachrSpider(CrawlSpider):
>         '''
>         classdocs
>         '''
>         name = "iachr"
>         allowed_domains = ["http://corteidh.or.cr/"]
>         start_urls = [
>             "file:///D:/Jurisprudencia%20Buscador.htm"
>         ]
>
>         rules = (
>             # parse the technical files (for extracting case metadata)
>             Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)),
>                  callback='parse_2'),
>         )
>
>         def __init__(self):
>             '''
>             Constructor
>             '''
>
>         def parse(self, response):
>             cases = response.xpath("//td/font/strong")
>
>             for case in cases:
>                 item = IachrItem()
>                 item['full_title'] = case.xpath('text()').extract()
>                 yield item
>
>         def parse_2(self, response):
>             self.log("Parsing technical data at:" + response.url)
>             item = IachrItem()
>             item["techfile_title"] = response.xpath("//h2/title()")
>             self.log("Case title" + item["techfile_title"])
>
>             yield item
>
> Any insight would be appreciated. It might be something trivial that I can't
> see now...
>
> Yannis
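Here is that sketch. It is untested; the IachrItem import path is a guess (adjust it to your project), I kept your rule regex as-is, and I swapped the invalid //h2/title() XPath for //h2/text(), which I assume is what you meant:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

# hypothetical import path: adjust to wherever IachrItem lives in your project
from iachr.items import IachrItem


class IachrSpider(CrawlSpider):
    name = "iachr"
    # bare domain name, no scheme and no trailing slash
    allowed_domains = ["corteidh.or.cr"]
    start_urls = [
        "file:///D:/Jurisprudencia%20Buscador.htm"
    ]

    rules = (
        # parse the technical files (for extracting case metadata)
        Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)),
             callback='parse_2'),
    )

    # No __init__ override here: CrawlSpider's own __init__ compiles the rules,
    # so if you do add one, remember to call super().

    # CrawlSpider calls parse_start_url() for the start page, which leaves the
    # built-in parse() free to drive the rules above.
    def parse_start_url(self, response):
        for case in response.xpath("//td/font/strong"):
            item = IachrItem()
            item['full_title'] = case.xpath('text()').extract()
            yield item

    def parse_2(self, response):
        self.log("Parsing technical data at: " + response.url)
        item = IachrItem()
        # //h2/title() is not valid XPath; //h2/text() is my guess at the intent
        item["techfile_title"] = response.xpath("//h2/text()").extract()
        yield item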
