Thanks, that worked perfectly!

On Wednesday, November 19, 2014 4:33:07 PM UTC-6, Travis Leleu wrote:
>
> Are you trying to crawl every link on northwestern.edu that is in the
> subdirectory african-studies? allowed_domains controls the domain name,
> not the path -- to limit to the /african-studies subdir, you'd put that
> information into the "allow" named parameter of the link extractor object.
>
> Assuming that's what you're trying to accomplish, try this:
>
>> import scrapy
>> from scrapy.contrib.spiders import CrawlSpider, Rule
>> from africanstudies.items import AfricanstudiesItem
>> from scrapy.contrib.linkextractors import LinkExtractor
>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>> from scrapy.http import Request
>> import urlparse
>>
>> class AfricanstudiesSpider(CrawlSpider):
>>     name = "africanstudies"
>>     allowed_domains = ["northwestern.edu"]
>>     start_urls = [
>>         "http://www.northwestern.edu/african-studies/about/"
>>     ]
>>
>>     rules = (Rule(LinkExtractor(allow=(r'african-studies')),
>>                   callback='parse_links', follow=True),)
>>
>>     def parse_links(self, response):
>>         links = response.xpath('//a/@href').extract()
>>         for link in links:
>>             url = urlparse.urljoin(response.url, link)
>>             yield Request(url, callback=self.parse_items)
>>
>>     def parse_items(self, response):
>>         self.log('Hi, this is an item page! %s' % response.url)
>>         for sel in response.xpath('//div[2]/div[1]'):
>>             item = AfricanstudiesItem()
>>             item['url'] = response.url
>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>             yield item
>>
>>
>> On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
>>>
>>> That's helpful, but I'm hung up on getting the spider to follow relative
>>> links. I've tried a lot of things, but I think that I'm really close with
>>> this:
>>>
>>> import scrapy
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from africanstudies.items import AfricanstudiesItem
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> import urlparse
>>>
>>> class AfricanstudiesSpider(CrawlSpider):
>>>     name = "africanstudies"
>>>     allowed_domains = ["northwestern.edu/african-studies"]
>>>     start_urls = [
>>>         "http://www.northwestern.edu/african-studies/about/"
>>>     ]
>>>
>>>     rules = (Rule(LinkExtractor(allow=(r)),
>>>                   callback='parse_links', follow=True),)
>>>
>>>     def parse_links(self, response):
>>>         sel = scrapy.Selector(response)
>>>         for href in sel.xpath('//a/@href').extract():
>>>             url = urlparse.urljoin(response.url, href)
>>>             yield Request(url, callback=self.parse_items)
>>>
>>>     def parse_items(self, response):
>>>         self.log('Hi, this is an item page! %s' % response.url)
>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>             item = AfricanstudiesItem()
>>>             item['url'] = response.url
>>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>>             yield item
>>>
>>> I can see from my logs that it is skipping over the hard-coded links from
>>> other domains (as it should). I thought this bit of code would cause the
>>> spider to recognize my relative links, but it does not.
>>>
>>> Hopefully you can lend a hand and tell me what I'm doing wrong.
>>>
>>>
>>> On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>>>>
>>>> Hi Tina!
>>>>
>>>> Your code looks good, except it's missing logic that would give scrapy
>>>> more pages to crawl. (Scrapy won't grab links and crawl them by default;
>>>> you have to indicate what you want to crawl.)
>>>>
>>>> I use one of two primary mechanisms:
>>>>
>>>> With the CrawlSpider, you can define a class variable called rules that
>>>> defines rules for scrapy to consider when following links. Often, I will
>>>> define these rules based on a LinkExtractor object, which lets you
>>>> specify things like callbacks (which method parses a particular link),
>>>> filters (you can modify the URL to remove session variables, etc.), and
>>>> limitations on which links to extract (the full gamut of css and xpath
>>>> selectors is available). More information is at
>>>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>>>>
>>>> Sometimes, the rule-based link following just doesn't cut it. (If
>>>> you're using the scrapy.Spider spider class, the rules option isn't
>>>> implemented, so you have to do it this way.) If you yield a Request
>>>> object from your parsing method, scrapy will add it to the queue to be
>>>> scraped and processed.
>>>>
>>>> That make sense?
>>>>
>>>> On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
>>>>
>>>>> There has to be something really simple that I'm missing. I'm trying
>>>>> to get it to crawl more than one page, but I'm using a section of the
>>>>> page as a starting point for testing. I can't get it to crawl anything
>>>>> beyond the index page. What am I doing wrong?
>>>>>
>>>>> import scrapy
>>>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>>>> from africanstudies.items import AfricanstudiesItem
>>>>> from scrapy.contrib.linkextractors import LinkExtractor
>>>>>
>>>>> class DmozSpider(CrawlSpider):
>>>>>     name = "africanstudies"
>>>>>     allowed_domains = ["northwestern.edu"]
>>>>>     start_urls = [
>>>>>         "http://www.northwestern.edu/african-studies/about/"
>>>>>     ]
>>>>>
>>>>>     def parse(self, response):
>>>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>>>             item = AfricanstudiesItem()
>>>>>             item['url'] = response.url
>>>>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>>>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>>>>             yield item
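For reference, here is a minimal, generic sketch of the two mechanisms Travis describes above: a rules tuple built from a LinkExtractor, and yielding Request objects from a callback. It uses the same scrapy.contrib import paths as the rest of this thread (Scrapy 0.24-era; newer releases moved these to scrapy.spiders and scrapy.linkextractors). The spider name, the example.com domain, and the /docs/ pattern are placeholders, not taken from the original posts:

import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request


class ExampleSpider(CrawlSpider):
    name = "example"                       # placeholder spider name
    allowed_domains = ["example.com"]      # bare domain only -- no path component
    start_urls = ["http://www.example.com/docs/"]

    # Mechanism 1: rule-based following.  `allow` is a regex that limits
    # which extracted URLs get followed, `callback` names the method that
    # parses each matched page, and follow=True keeps extracting links
    # from those pages as well.
    rules = (
        Rule(LinkExtractor(allow=(r'/docs/',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Mechanism 2: yield Request objects yourself.  Every Request
        # yielded from a callback goes back into Scrapy's crawl queue;
        # urlparse.urljoin resolves relative hrefs against the current page.
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse_page)

Dropped into a project, it runs with the usual "scrapy crawl example"; the only point is to show where the rules attribute and the manual yield each fit.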
