Are you trying to crawl every link on northwestern.edu that lives under the /african-studies subdirectory? allowed_domains controls the domain name, not the path -- to limit the crawl to /african-studies, you'd put that restriction into the "allow" named parameter of the link extractor object instead.
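If you want to check what a given allow pattern actually matches before wiring it into a Rule, you can call the extractor directly against a response, e.g. inside scrapy shell. A quick sketch (the pattern here is just the one from your use case; extract_links is the documented LinkExtractor method in the 0.24-era docs):

from scrapy.contrib.linkextractors import LinkExtractor

# Inside `scrapy shell http://www.northwestern.edu/african-studies/about/`,
# a `response` object is already defined for you:
le = LinkExtractor(allow=r'african-studies')
for link in le.extract_links(response):
    print link.url   # only URLs whose path matches 'african-studies'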
Assuming that's what you're trying to accomplish, try this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request
import urlparse

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    # Domain only -- the path restriction lives in the allow= regex below.
    allowed_domains = ["northwestern.edu"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r'african-studies')),
                  callback='parse_links', follow=True),)

    def parse_links(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            url = urlparse.urljoin(response.url, link)
            yield Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item

On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
>
> That's helpful, but I'm hung up on getting the spider to follow relative
> links. I've tried a lot of things, but I think that I'm really close with
> this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu/african-studies"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     rules = (Rule(LinkExtractor(allow=(r)), callback='parse_links', follow=True),)
>
>     def parse_links(self, response):
>         sel = scrapy.Selector(response)
>         for href in sel.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield Request(url, callback=self.parse_items)
>
>     def parse_items(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
>
> I can see from my logs that it is skipping over the hard-coded links from
> other domains (as it should). I thought this bit of code would cause the
> spider to recognize my relative links, but it does not.
>
> Hopefully you can lend a hand and tell me what I'm doing wrong.
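(A side note on the relative-link worry above: urlparse.urljoin was already resolving relative hrefs correctly -- the resulting requests were most likely being dropped afterwards by the offsite filter, because allowed_domains contained a path. A quick sanity check of the join itself, in plain Python 2 to match the urlparse import above; the "../people/" href is made up for illustration, not taken from the actual site:

import urlparse

# A relative href is resolved against the URL of the page it appeared on:
print urlparse.urljoin("http://www.northwestern.edu/african-studies/about/",
                       "../people/")
# prints: http://www.northwestern.edu/african-studies/people/
)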
> On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>>
>> Hi Tina!
>>
>> Your code looks good, except it's missing logic that would give scrapy
>> more pages to crawl. (Scrapy won't grab links and crawl them by default;
>> you have to indicate what you want to crawl.)
>>
>> I use one of two primary mechanisms:
>>
>> With the CrawlSpider, you can define a class variable called rules that
>> defines rules for scrapy to consider when following links. Often, I will
>> define these rules based on a LinkExtractor object, which allows you to
>> specify things like callbacks (what method to use in parsing a particular
>> link), filters (you can modify the URL to remove session variables, etc.),
>> and limitations on links to extract (the full gamut of css and xpath
>> selectors is available). More information is at
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>>
>> Sometimes, the rule-based link following just doesn't cut it. (If
>> you're using the scrapy.Spider spider class, the rules option isn't
>> implemented, so you have to do it this way.) If you yield a Request object
>> from your parse method, scrapy will add that URL to the queue to be
>> scraped and processed.
>>
>> That make sense?
>>
>> On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
>>
>>> There has to be something really simple that I'm missing. I'm trying to
>>> get it to crawl more than one page, but I'm using a section of the page as
>>> a starting point for testing. I can't get it to crawl anything beyond the
>>> index page. What am I doing wrong?
>>>
>>> import scrapy
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from africanstudies.items import AfricanstudiesItem
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>>
>>> class DmozSpider(CrawlSpider):
>>>     name = "africanstudies"
>>>     allowed_domains = ["northwestern.edu"]
>>>     start_urls = [
>>>         "http://www.northwestern.edu/african-studies/about/"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>             item = AfricanstudiesItem()
>>>             item['url'] = response.url
>>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>>             yield item
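P.S. A quick sketch of the second mechanism Travis describes above -- yielding Request objects yourself from a plain scrapy.Spider, where rules aren't implemented. The spider name and the '/african-studies/' substring test are made up for illustration; the imports match the scrapy.contrib-era version used elsewhere in this thread:

import scrapy
from scrapy.http import Request
import urlparse

class ManualFollowSpider(scrapy.Spider):
    name = "manualfollow"
    allowed_domains = ["northwestern.edu"]
    start_urls = ["http://www.northwestern.edu/african-studies/about/"]

    def parse(self, response):
        # Extract whatever you need from this page first...
        self.log("visited %s" % response.url)
        # ...then hand scrapy more URLs by yielding Request objects;
        # the scheduler de-duplicates, so re-yielding seen URLs is safe.
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            if '/african-studies/' in url:
                yield Request(url, callback=self.parse)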
