I do. I've attached an extremely simple spider that crawls those links.
Hopefully the code will answer your questions; if not, feel free to ask any
more questions you may have.
As for why that particular XPath works on the page but not in the Scrapy
shell, my guess is that the data is loaded with the initial page (so no
AJAX), and then some JavaScript modifies the DOM afterwards. There are a
lot of ads on those pages, so I wouldn't be surprised.
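One quick way to check this is to compare what Scrapy actually downloads
with what the browser renders. A minimal sketch using the Scrapy shell
(the URL here is just the spider's start URL below):

    $ scrapy shell 'http://www.mapadeportugal.net/concelho.asp?c=1401'
    >>> # Raw HTML as Scrapy sees it, before any JavaScript runs:
    >>> hxs.select("//td[@class='txtmedio' and @width='25%']/a/@href").extract()
    >>> # Open the downloaded response in a browser to diff it
    >>> # against the live page:
    >>> view(response)

If the links show up in view(response) but the XPath still returns
nothing, then the markup Scrapy received differs from what the browser's
inspector shows.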
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from urlparse import urljoin

from testing.items import TestingItem


class Portugal_spider(BaseSpider):
    name = 'Port'
    allowed_domains = ['www.mapadeportugal.net']
    base_url = 'http://www.mapadeportugal.net'
    start_urls = ['http://www.mapadeportugal.net/concelho.asp?c=1401']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Follow every link found in the 25%-wide menu cells.
        for menu_link in hxs.select("//td[@class='txtmedio' and @width='25%']/a/@href").extract():
            menu_link = self.get_abs_url(menu_link)
            yield Request(url=menu_link, callback=self.get_stuff)

    def get_stuff(self, response):
        hxs = HtmlXPathSelector(response)
        item = TestingItem()
        # Grab the first page title, collapsing stray whitespace.
        item['name'] = hxs.select("normalize-space(//td[@align='left']/p[@class='titulopagina'][1]/text())").extract()[0]
        return item

    def get_abs_url(self, link_frag):
        # Resolve a relative href against the site's base URL.
        return urljoin(self.base_url, link_frag)
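For completeness: the spider imports TestingItem from a project called
"testing", which isn't shown above. A minimal sketch of what that
items.py would need to contain (only the 'name' field is actually used
by the spider):

    from scrapy.item import Item, Field

    class TestingItem(Item):
        # The spider only populates this one field.
        name = Field()

With that in place, running it is just: scrapy crawl Port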