Hi everyone,
I am new to Scrapy (I'm using Scrapy 0.24) and I'd like to scrape this
website:
<http://www.corteidh.or.cr/CF/Jurisprudencia/Jurisprudencia_Search_avan.cfm?lang=es>.
As you will see, it opens a search form, so I just pressed "Buscar" and
saved the results page locally on my hard drive, so that I can run some
tests without sending requests to the website the whole time.

So I created a CrawlSpider to scrape the results page and to follow all the
links that say "Ver Ficha Técnica del Caso".
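
Since the test runs against a local copy, one thing that may be worth
checking is how links in the saved file resolve. A minimal sketch, assuming
the saved page contains a relative link (the `ficha.cfm?id=1` href below is
made up for illustration; the real saved page may well contain absolute URLs
instead):

```python
# Sketch: how a relative href resolves when the page is loaded from disk.
# The href value is a hypothetical example, not taken from the saved page.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2 (used by Scrapy 0.24)

base_url = "file:///D:/Jurisprudencia%20Buscador.htm"
relative_href = "ficha.cfm?id=1"  # hypothetical link target

# Relative links are resolved against the URL the page was loaded from.
resolved = urljoin(base_url, relative_href)
print(resolved)
```

If the printed URL starts with `file:` rather than `http://`, a pattern
anchored on the site's host name would not match it.
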
My settings for the CrawlSpider are:

    name = "iachr"
    allowed_domains = ["http://corteidh.or.cr/"]
    start_urls = [
        "file:///D:/Jurisprudencia%20Buscador.htm"
    ]

All the links I want to follow contain "ficha.cfm", so I have the
following rule:

    rules = (
        # parse the technical files (for extracting case metadata)
        Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)),
             callback='parse_2'),
    )

However, when I run the crawler it does not seem to follow the links; it
just scrapes the start page. When I run

    scrapy parse "file:///D:/Jurisprudencia%20Buscador.htm"

it returns no links. It could be something in the regex, even though I've
tried quite a few combinations.
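
To rule the regex in or out independently of Scrapy, the pattern can be
tried against a few candidate URLs with the standard `re` module. This is
only a sketch: the sample URLs and their `?id=1` query string are
hypothetical, and it assumes the link extractor applies `allow` patterns as
a regex search over the absolute URL of each link.

```python
import re

# The pattern from the rule above, unchanged.
ALLOW_PATTERN = r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+'

# Hypothetical sample links; the query string is made up for illustration.
SAMPLE_URLS = [
    "http://corteidh.or.cr/CF/Jurisprudencia/ficha.cfm?id=1",      # no "www."
    "http://www.corteidh.or.cr/CF/Jurisprudencia/ficha.cfm?id=1",  # with "www."
    "file:///D:/ficha.cfm?id=1",                                   # local file URL
]


def matches_allow(url, pattern=ALLOW_PATTERN):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


for url in SAMPLE_URLS:
    print("%-5s %s" % (matches_allow(url), url))
```

Under these assumptions, only the first sample matches. The site address
given earlier in this message starts with "www.", which the pattern does not
allow for.
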
My code looks like this:

    class IachrSpider(CrawlSpider):
        '''
        classdocs
        '''
        name = "iachr"
        allowed_domains = ["http://corteidh.or.cr/"]
        start_urls = [
            "file:///D:/Jurisprudencia%20Buscador.htm"
        ]

        rules = (
            # parse the technical files (for extracting case metadata)
            Rule(LxmlLinkExtractor(allow=(r'http://corteidh\.or\.cr/CF/Jurisprudencia/ficha\.cfm.+',)),
                 callback='parse_2'),
        )

        def __init__(self):
            '''
            Constructor
            '''

        def parse(self, response):
            cases = response.xpath("//td/font/strong")
            for case in cases:
                item = IachrItem()
                item['full_title'] = case.xpath('text()').extract()
                yield item

        def parse_2(self, response):
            self.log("Parsing technical data at:" + response.url)
            item = IachrItem()
            item["techfile_title"] = response.xpath("//h2/title()")
            self.log("Case title" + item["techfile_title"])
            yield item

Any insight would be appreciated. It might be something trivial that I
can't see now...
Yannis