Just to update (and to serve as an archive for anyone searching for a 
similar answer): I was really close with the previous code snippets I 
listed. The problem was that the callback I had defined was conflicting 
with my rules. Here's my updated code (I'm only grabbing the URLs at this 
point), and it seems to work.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from africanstudies.items import AfricanstudiesItem

class MySpider(CrawlSpider):
    name = 'africanstudies'
    allowed_domains = ['northwestern.edu']
    start_urls = ['http://www.northwestern.edu/african-studies']

    rules = (
        Rule(LinkExtractor(allow='african-studies'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        item = AfricanstudiesItem()
        item['url'] = response.url
        return item
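
For completeness, AfricanstudiesItem is just a one-field item. Here's a 
minimal sketch of africanstudies/items.py, assuming the url field is all 
you need for now (add more fields to match whatever else you scrape):

import scrapy

class AfricanstudiesItem(scrapy.Item):
    # Only field populated by parse_item above.
    url = scrapy.Field()

Running the spider with "scrapy crawl africanstudies -o urls.json" will 
dump the collected URLs to a JSON file.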
