Actually, I was wrong -- it's not working. It is still crawling pages outside of the subdirectory, and moreover I'm not able to get anything in the /about/ subdirectory.
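If this is the same spider as in Travis's message below, one likely reason the crawl escapes the subdirectory is that parse_links yields a Request for every href on the page, which skips the Rule's allow filter -- only allowed_domains is applied to those requests, so anything on northwestern.edu gets queued. A minimal sketch that relies on the Rule alone (class, item, and selector names are taken from the thread; the anchored allow pattern is an untested assumption):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from africanstudies.items import AfricanstudiesItem

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    # keep allowed_domains to the bare domain; path filtering happens in the Rule
    allowed_domains = ["northwestern.edu"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    # follow only links whose URL contains /african-studies/ and parse each
    # matched page as an item page; follow=True keeps crawling from those pages
    rules = (
        Rule(LinkExtractor(allow=(r'/african-studies/',)),
             callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item

Note also that CrawlSpider sends the response for each start URL to parse_start_url rather than to the rule's callback, so the /about/ index page itself won't yield an item unless parse_start_url is overridden -- that may be part of why nothing comes back for /about/.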
On Thursday, November 20, 2014 10:23:40 AM UTC-6, Tina C wrote:
> Thanks, that worked perfectly!

On Wednesday, November 19, 2014 4:33:07 PM UTC-6, Travis Leleu wrote:
> Are you trying to crawl every link on northwestern.edu that is in the
> african-studies subdirectory? allowed_domains controls the domain name,
> not the path -- to limit the crawl to the /african-studies subdirectory,
> you'd put that information into the "allow" named parameter of the link
> extractor object.
>
> Assuming that's what you're trying to accomplish, try this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.http import Request
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     rules = (Rule(LinkExtractor(allow=(r'african-studies')),
>                   callback='parse_links', follow=True),)
>
>     def parse_links(self, response):
>         links = response.xpath('//a/@href').extract()
>         for link in links:
>             url = urlparse.urljoin(response.url, link)
>             yield Request(url, callback=self.parse_items)
>
>     def parse_items(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item

On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
> That's helpful, but I'm hung up on getting the spider to follow relative
> links. I've tried a lot of things, but I think that I'm really close with
> this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu/african-studies"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     rules = (Rule(LinkExtractor(allow=(r)),
>                   callback='parse_links', follow=True),)
>
>     def parse_links(self, response):
>         sel = scrapy.Selector(response)
>         for href in sel.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield Request(url, callback=self.parse_items)
>
>     def parse_items(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
>
> I can see from my logs that it is skipping over the hard-coded links from
> other domains (as it should). I thought this bit of code would cause the
> spider to recognize my relative links, but it does not.
>
> Hopefully you can lend a hand and tell me what I'm doing wrong.
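As a quick sanity check, a LinkExtractor can be run by hand against a response to see exactly which URLs a given allow pattern keeps, without starting a full crawl. A standalone sketch -- the sample HTML below is made up, and the import path matches the Scrapy version used in this thread (newer releases expose it as scrapy.linkextractors):

from scrapy.http import HtmlResponse
from scrapy.contrib.linkextractors import LinkExtractor

# a fake page with one in-section link, one sibling section, and one offsite link
body = """<html><body>
<a href="/african-studies/people/">people</a>
<a href="/history/">history</a>
<a href="http://www.example.com/">offsite</a>
</body></html>"""

response = HtmlResponse(url="http://www.northwestern.edu/african-studies/about/",
                        body=body, encoding='utf-8')

extractor = LinkExtractor(allow=(r'/african-studies/',))
for link in extractor.extract_links(response):
    print(link.url)  # only the /african-studies/ URL should survive the allow filter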
On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
> Hi Tina!
>
> Your code looks good, except it's missing the logic that would give scrapy
> more pages to crawl. (Scrapy won't grab links and crawl them by default;
> you have to indicate what you want to crawl.)
>
> I use one of two primary mechanisms:
>
> With the CrawlSpider, you can define a class variable called rules that
> defines rules for scrapy to consider when following links. Often, I will
> define these rules based on a LinkExtractor object, which allows you to
> specify things like callbacks (which method to use to parse a particular
> link), filters (you can modify the URL to remove session variables, etc.),
> and limitations on which links to extract (the full gamut of css and xpath
> selectors is available). More information is at
> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>
> Sometimes the rule-based link following just doesn't cut it. (If you're
> using the scrapy.Spider class, the rules option isn't implemented, so you
> have to do it this way.) If you yield a Request object from your parse
> callback, scrapy will add it to the queue to be scraped and processed.
>
> That make sense?

On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
> There has to be something really simple that I'm missing. I'm trying to
> get it to crawl more than one page, but I'm using a section of the site as
> a starting point for testing. I can't get it to crawl anything beyond the
> index page. What am I doing wrong?
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
>
> class DmozSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     def parse(self, response):
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
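For completeness, a minimal sketch of the second mechanism Travis describes above: a plain scrapy.Spider with no rules, where the parse callback itself yields Request objects for the pages to visit next. The spider name and the '/african-studies/' substring check are illustrative, not from the thread:

import urlparse

import scrapy
from scrapy.http import Request

class FollowLinksSpider(scrapy.Spider):
    name = "followlinks"
    allowed_domains = ["northwestern.edu"]
    start_urls = ["http://www.northwestern.edu/african-studies/about/"]

    def parse(self, response):
        # scrape the current page here (item extraction omitted for brevity),
        # then hand every in-section link back to scrapy's scheduler as a
        # new Request; the built-in duplicate filter avoids re-visiting URLs
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            if '/african-studies/' in url:
                yield Request(url, callback=self.parse)

With a plain Spider, allowed_domains still keeps the crawl on northwestern.edu, but any path restriction has to be applied by hand, as in the substring check above.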
