My CrawlSpider seems to ignore rules

ypanagis Thu, 23 Oct 2014 02:43:20 -0700

Hi everyone,

I am new to Scrapy. I'm using Scrapy 0.24 and I'd like to scrape this
website:
<http://www.corteidh.or.cr/CF/Jurisprudencia/Jurisprudencia_Search_avan.cfm?lang=es>.

As you will see, the page opens with a search form, so I just pressed "Buscar" and
*saved the results page locally on my hard drive* so that I can run some tests
without sending requests to the website the whole time.


I then created a CrawlSpider to scrape the results page and to follow all the
links that say "Ver Ficha Técnica del Caso".

My settings for the CrawlSpider are:

    name = "iachr"
    allowed_domains = ["http://corteidh.or.cr/"]
    start_urls = [     
        "file:///D:/Jurisprudencia%20Buscador.htm"
    ]
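
One thing that may be worth checking in these settings: as far as I understand the Scrapy documentation, `allowed_domains` is meant to hold bare domain names (for example `corteidh.or.cr`), not full URLs with a scheme and trailing slash. A minimal sketch of deriving the domain from a URL with the standard library follows. The helper name `to_domain` is hypothetical and not part of Scrapy, and the import is the Python 3 one (on Python 2 the module is `urlparse`).

```python
from urllib.parse import urlparse


def to_domain(value):
    """Return the bare host name of a URL, or the value itself if it has no scheme.

    Hypothetical helper for illustration only; it is not a Scrapy API.
    """
    host = urlparse(value).hostname
    # urlparse() only fills in hostname when a scheme such as "http://" is present.
    return host if host else value.strip("/")


# The value from the spider settings above, reduced to a plain domain name.
allowed_domains = [to_domain("http://corteidh.or.cr/")]
print(allowed_domains)
```

If this is indeed the cause, listing the plain domain should also cover the `www.` subdomain that the live site uses.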


All the links I want to follow contain "ficha.cfm", so I have the
following rule:

    rules = (
        # parse the technical files (for extracting case metadata)
        Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)),
             callback='parse_2'),
    )

However, when I run the crawler it does not seem to follow the links;
it just scrapes the start page. When I run

    scrapy parse "file:///D:/Jurisprudencia%20Buscador.htm"

it returns no links. It could be something in the regex, even though I've
tried quite a few combinations.
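
To rule the regex in or out, it can be tested in isolation with Python's `re` module, outside of Scrapy. The sketch below is only an illustration under assumptions: the sample URLs and their query string (`nId_Ficha=1`) are hypothetical and not copied from the saved page, and `relaxed` is just one possible looser pattern. It uses `search`, on the assumption that the link extractor looks for the pattern anywhere in the URL. The point it shows is that the pattern in the rule has no `www.`, while the site itself is served from `www.corteidh.or.cr`.

```python
import re

# The pattern used in the rule, unchanged.
strict = re.compile(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+')
# A looser, host-independent pattern (an assumption, shown only for comparison).
relaxed = re.compile(r'/CF/Jurisprudencia/ficha\.cfm')

# Hypothetical sample links: one with the "www." prefix, one without.
samples = [
    "http://www.corteidh.or.cr/CF/Jurisprudencia/ficha.cfm?nId_Ficha=1",
    "http://corteidh.or.cr/CF/Jurisprudencia/ficha.cfm?nId_Ficha=1",
]

# Map each URL to (matches strict pattern, matches relaxed pattern).
results = {
    url: (bool(strict.search(url)), bool(relaxed.search(url)))
    for url in samples
}

for url, (strict_hit, relaxed_hit) in results.items():
    print(url, "strict:", strict_hit, "relaxed:", relaxed_hit)
```

With these sample URLs the strict pattern only matches the link without `www.`, whereas the relaxed pattern matches both, so comparing the pattern against the actual `href` values in the saved file would be the next thing to check.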

My code looks like this:

class IachrSpider(CrawlSpider):
    '''
    classdocs
    '''
    name = "iachr"
    allowed_domains = ["http://corteidh.or.cr/"]
    start_urls = [
        "file:///D:/Jurisprudencia%20Buscador.htm"
    ]
    
    rules = (
        # parse the technical files (for extracting case metadata)
        Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)),
             callback='parse_2'),
    )
        
    
    def __init__(self):
        '''
        Constructor
        '''
        
       
    def parse(self, response):
        cases = response.xpath("//td/font/strong")
        
        for case in cases:
            
            item = IachrItem()
            item['full_title'] = case.xpath('text()').extract()
            yield item

            
    def parse_2(self, response):
        self.log("Parsing technical data at:" + response.url)
        item = IachrItem()
        item["techfile_title"] = response.xpath("//h2/title()")
        self.log("Case title" + item["techfile_title"])    
        
        yield item

Any insight would be appreciated. It might be something trivial that I
can't see right now.

Yannis
