allowed_domains = ["http://corteidh.or.cr/"]
should be allowed_domains = ["corteidh.or.cr"]. The list takes bare domain
names, not URLs with a scheme or a trailing slash.
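
A rough, stdlib-only sketch of why the URL form fails, assuming the offsite
filter compares each request's hostname against the listed domains. The
helper names host_regex and is_allowed and the example link are made up for
this illustration; this is not Scrapy's actual code.

```python
import re
from urllib.parse import urlparse


def host_regex(allowed_domains):
    """Build a pattern that accepts each allowed domain and its subdomains."""
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_allowed(url, allowed_domains):
    """Return True if the URL's hostname passes the domain filter."""
    host = urlparse(url).hostname or ""
    return bool(host_regex(allowed_domains).search(host))


# Illustrative link only; the real result page may use different URLs.
link = "http://www.corteidh.or.cr/CF/Jurisprudencia/ficha.cfm"

print(is_allowed(link, ["http://corteidh.or.cr/"]))  # False
print(is_allowed(link, ["corteidh.or.cr"]))          # True
```

A hostname such as www.corteidh.or.cr can never equal a string that still
contains "http://" and a slash, so every link is treated as offsite.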

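Also, you are overriding the parse method, which CrawlSpider uses internally
to apply its rules, so the rules never run. To process the start page,
override parse_start_url instead:
<http://doc.scrapy.org/en/master/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url>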
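
To see why overriding parse stops the links from being followed, here is a
toy model of the idea. ToyCrawlSpider, follow_rules, and the dict used as a
"response" are hypothetical stand-ins, not Scrapy classes; the sketch only
assumes that the base class's parse is the place where rules get applied.

```python
class ToyCrawlSpider:
    """Simplified stand-in for CrawlSpider; not Scrapy's real code."""

    def parse(self, response):
        # The base class owns parse(): it handles the start page
        # and then applies the link-following rules.
        results = list(self.parse_start_url(response))
        results.extend(self.follow_rules(response))
        return results

    def parse_start_url(self, response):
        # Hook meant to be overridden by subclasses.
        return []

    def follow_rules(self, response):
        # Pretend every link matched by a rule becomes a request.
        return ["request:" + link for link in response["links"]]


class BrokenSpider(ToyCrawlSpider):
    def parse(self, response):  # replaces the rule handling entirely
        return ["item:" + response["title"]]


class FixedSpider(ToyCrawlSpider):
    def parse_start_url(self, response):  # rule handling stays intact
        return ["item:" + response["title"]]


page = {"title": "results", "links": ["ficha.cfm"]}
print(BrokenSpider().parse(page))  # ['item:results']
print(FixedSpider().parse(page))   # ['item:results', 'request:ficha.cfm']
```

In the real spider the same move applies: rename the overridden parse to
parse_start_url and keep 'parse_2' as the rule callback.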
 


On Thursday, 23 October 2014 07:38:17 UTC-2, [email protected] wrote:
>
> Hi everyone,
>
> I am new to Scrapy. I'm using Scrapy 0.24 and I'd like to scrape this 
> website 
> <http://www.corteidh.or.cr/CF/Jurisprudencia/Jurisprudencia_Search_avan.cfm?lang=es>.
>  
> As you will see, the page shows a search form, so I just pressed "Buscar" and 
> *saved the results page locally on my hard drive* to run some tests without 
> sending requests to the website the whole time.
>
> So I created a CrawlSpider to scrape the results page and to follow all 
> the links that say "Ver Ficha Técnica del Caso"
>
> My settings for the CrawlSpider are:
>
>     name = "iachr"
>     allowed_domains = ["http://corteidh.or.cr/"]
>     start_urls = [     
>         "file:///D:/Jurisprudencia%20Buscador.htm"
>     ]
>
>
> All the links I want to follow contain "ficha.cfm", so I have the 
> following rule:
>
>     rules = ( 
>         # parse the technical files (for extracting case metadata)
>         Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)), callback='parse_2'),
>         )
>
> However, when I run the crawler it seems that it cannot follow the links, 
> and it just scrapes the start page. When I run 
> scrapy parse "file:///D:/Jurisprudencia%20Buscador.htm"
> it returns no links. It may be something in the regex, even though I've 
> tried quite a few combinations.
>
> My code looks like this:
>
> class IachrSpider(CrawlSpider):
>     '''
>     classdocs
>     '''
>     name = "iachr"
>     allowed_domains = ["http://corteidh.or.cr/"]
>     start_urls = [
>         "file:///D:/Jurisprudencia%20Buscador.htm"
>     ]
>     
>     rules = ( 
>         # parse the technical files (for extracting case metadata)
>         Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)), callback='parse_2'),
>         )
>         
>     
>     def __init__(self):
>         '''
>         Constructor
>         '''
>         
>        
>     def parse(self, response):
>         cases = response.xpath("//td/font/strong")
>         
>         for case in cases:
>             
>             item = IachrItem()
>             item['full_title'] = case.xpath('text()').extract()
>             yield item
>
>             
>     def parse_2(self, response):
>         self.log("Parsing technical data at:" + response.url)
>         item = IachrItem()
>         item["techfile_title"] = response.xpath("//h2/title()")
>         self.log("Case title" + item["techfile_title"])    
>         
>         yield item
>
> Any insight would be appreciated. It might be something trivial that I 
> can't see now...
>
> Yannis
>

