I do. I've attached an extremely simple spider that crawls those links.
Hopefully the code will answer your questions; if not, feel free to ask any
more questions you may have.
As for why that particular XPath works on the page but not in the Scrapy
shell, my guess is that the data is loaded with the initial page (so no
AJAX), and then some JavaScript modifies the DOM afterwards. There are a
lot of ads on those pages, so I wouldn't be surprised.
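One quick way to check this is to compare what Scrapy actually downloads
with what the browser renders. A minimal sketch using the Scrapy shell
(the URL here is just the spider's start URL below):

    $ scrapy shell 'http://www.mapadeportugal.net/concelho.asp?c=1401'
    >>> # Raw HTML as Scrapy sees it, before any JavaScript runs:
    >>> hxs.select("//td[@class='txtmedio' and @width='25%']/a/@href").extract()
    >>> # Open the downloaded response in a browser to diff it
    >>> # against the live page:
    >>> view(response)

If the links show up in view(response) but the XPath still returns
nothing, then the markup Scrapy received differs from what the browser's
inspector shows.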
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from urlparse import urljoin

from testing.items import TestingItem


class Portugal_spider(BaseSpider):
    name = 'Port'
    allowed_domains = ['www.mapadeportugal.net']
    base_url = 'http://www.mapadeportugal.net'
    start_urls = ['http://www.mapadeportugal.net/concelho.asp?c=1401']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Follow every link found in the 25%-wide menu cells.
        for menu_link in hxs.select("//td[@class='txtmedio' and @width='25%']/a/@href").extract():
            menu_link = self.get_abs_url(menu_link)
            yield Request(url=menu_link, callback=self.get_stuff)

    def get_stuff(self, response):
        hxs = HtmlXPathSelector(response)
        item = TestingItem()
        # Grab the first page title, collapsing stray whitespace.
        item['name'] = hxs.select("normalize-space(//td[@align='left']/p[@class='titulopagina'][1]/text())").extract()[0]
        return item

    def get_abs_url(self, link_frag):
        # Resolve a relative href against the site's base URL.
        return urljoin(self.base_url, link_frag)
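For completeness: the spider imports TestingItem from a project called
"testing", which isn't shown above. A minimal sketch of what that
items.py would need to contain (only the 'name' field is actually used
by the spider):

    from scrapy.item import Item, Field

    class TestingItem(Item):
        # The spider only populates this one field.
        name = Field()

With that in place, running it is just: scrapy crawl Port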