That's helpful, but I'm hung up on getting the spider to follow relative
links. I've tried a lot of things, but I think that I'm really close with
this:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request
import urlparse


class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    allowed_domains = ["northwestern.edu/african-studies"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r)), callback='parse_links',
                  follow=True),)

    def parse_links(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath(
                'div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item
I can see from my logs that it is skipping over the hard-coded links from
other domains (as it should). I thought this bit of code would cause the
spider to recognize my relative links, but it does not.
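For what it's worth, the urljoin part seems to behave the way I expect when I
test it on its own in the interpreter (the hrefs below are made-up examples,
not actual links from the site):

import urlparse

base = "http://www.northwestern.edu/african-studies/about/"
# a page-relative href resolves against the current page's directory
print urlparse.urljoin(base, "people/core-faculty.html")
# -> http://www.northwestern.edu/african-studies/about/people/core-faculty.html
# a root-relative href resolves against the site root
print urlparse.urljoin(base, "/african-studies/events/")
# -> http://www.northwestern.edu/african-studies/events/
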
Hopefully you can lend a hand and tell me what I'm doing wrong.
On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>
> Hi Tina!
>
> Your code looks good, except it's missing logic that would give scrapy
> more pages to crawl. (Scrapy won't grab links and crawl them by default;
> you have to indicate what you want to crawl.)
>
> I use one of two primary mechanisms:
>
> With the CrawlSpider, you can define a class variable called rules that
> tells scrapy which links to follow. Often, I will build these rules around
> a LinkExtractor object, which lets you specify things like callbacks (which
> method parses a particular link), filters (you can modify the URL to remove
> session variables, etc.), and limits on which links to extract (the full
> gamut of CSS and XPath selectors is available). More information is at
> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
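>
> A rough sketch of that first approach might look like this (untested, and
> the spider name, domain, allow pattern, and callback are just placeholders
> to show the shape of it):
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors import LinkExtractor
>
> class ExampleSpider(CrawlSpider):
>     name = "example"
>     allowed_domains = ["example.com"]  # domains only, no path component
>     start_urls = ["http://www.example.com/"]
>
>     # Extract links whose URLs match the allow pattern, hand each
>     # response to parse_item, and keep following links from those pages.
>     # With CrawlSpider, the callback must not be named 'parse'.
>     rules = (
>         Rule(LinkExtractor(allow=(r'/articles/',)),
>              callback='parse_item', follow=True),
>     )
>
>     def parse_item(self, response):
>         self.log("crawled %s" % response.url)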
>
> Sometimes, the rule-based link following just doesn't cut it. (And if
> you're using the plain scrapy.Spider class, rules aren't implemented at
> all, so you have to do it this way.) If you yield a Request object from
> your parse callback, scrapy will add that URL to the queue to be scraped
> and processed.
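>
> Roughly (again untested, and the names are placeholders), that looks like:
>
> import urlparse
>
> import scrapy
>
> class ExampleSpider(scrapy.Spider):
>     name = "example"
>     allowed_domains = ["example.com"]
>     start_urls = ["http://www.example.com/"]
>
>     def parse(self, response):
>         # Resolve each href against the current page URL so relative
>         # links become absolute, then queue them as new Requests.
>         for href in response.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield scrapy.Request(url, callback=self.parse_page)
>
>     def parse_page(self, response):
>         self.log("got %s" % response.url)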
>
> That make sense?
>
> On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
>
>> There has to be something really simple that I'm missing. I'm trying to
>> get it to crawl more than one page, but I'm using a section of the page as
>> a starting point for testing. I can't get it to crawl anything beyond the
>> index page. What am I doing wrong?
>>
>> import scrapy
>> from scrapy.contrib.spiders import CrawlSpider, Rule
>> from africanstudies.items import AfricanstudiesItem
>> from scrapy.contrib.linkextractors import LinkExtractor
>>
>> class DmozSpider(CrawlSpider):
>>     name = "africanstudies"
>>     allowed_domains = ["northwestern.edu"]
>>     start_urls = [
>>         "http://www.northwestern.edu/african-studies/about/"
>>     ]
>>
>>     def parse(self, response):
>>         for sel in response.xpath('//div[2]/div[1]'):
>>             item = AfricanstudiesItem()
>>             item['url'] = response.url
>>             item['title'] = sel.xpath(
>>                 'div[3]/*[@id="green_title"]/text()').extract()
>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>             yield item
>>
>>