That's helpful, but I'm hung up on getting the spider to follow relative
links. I've tried a lot of things, but I think that I'm really close with
this:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request
import urlparse


class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    allowed_domains = ["northwestern.edu/african-studies"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r)), callback='parse_links',
                  follow=True),)

    def parse_links(self, response):
        sel = scrapy.Selector(response)
        for href in sel.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath(
                'div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item
I can see from my logs that it is skipping over the hard-coded links from
other domains (as it should). I thought this bit of code would cause the
spider to recognize my relative links, but it does not.
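For what it's worth, the urljoin part seems to behave the way I expect when I
test it on its own in the interpreter (the hrefs below are made-up examples,
not actual links from the site):

import urlparse

base = "http://www.northwestern.edu/african-studies/about/"
# a page-relative href resolves against the current page's directory
print urlparse.urljoin(base, "people/core-faculty.html")
# -> http://www.northwestern.edu/african-studies/about/people/core-faculty.html
# a root-relative href resolves against the site root
print urlparse.urljoin(base, "/african-studies/events/")
# -> http://www.northwestern.edu/african-studies/events/
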
Hopefully you can lend a hand and tell me what I'm doing wrong.
On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>
> Hi Tina!
>
> Your code looks good, except it's missing logic that would give scrapy
> more pages to crawl. (Scrapy won't grab links and crawl them by default;
> you have to indicate what you want to crawl.)
>
> I use one of two primary mechanisms:
>
> With the CrawlSpider, you can define a class variable called rules that
> tells scrapy which links to follow. Often, I will build these rules around
> a LinkExtractor object, which lets you specify things like callbacks (which
> method parses a particular link), filters (you can modify the URL to remove
> session variables, etc.), and limits on which links to extract (the full
> gamut of CSS and XPath selectors is available). More information is at
> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
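>
> A rough sketch of that first approach might look like this (untested, and
> the spider name, domain, allow pattern, and callback are just placeholders
> to show the shape of it):
>
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors import LinkExtractor
>
> class ExampleSpider(CrawlSpider):
>     name = "example"
>     allowed_domains = ["example.com"]  # domains only, no path component
>     start_urls = ["http://www.example.com/"]
>
>     # Extract links whose URLs match the allow pattern, hand each
>     # response to parse_item, and keep following links from those pages.
>     # With CrawlSpider, the callback must not be named 'parse'.
>     rules = (
>         Rule(LinkExtractor(allow=(r'/articles/',)),
>              callback='parse_item', follow=True),
>     )
>
>     def parse_item(self, response):
>         self.log("crawled %s" % response.url)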
>
> Sometimes, the rule-based link following just doesn't cut it. (And if
> you're using the plain scrapy.Spider class, rules aren't implemented at
> all, so you have to do it this way.) If you yield a Request object from
> your parse callback, scrapy will add that URL to the queue to be scraped
> and processed.
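>
> Roughly (again untested, and the names are placeholders), that looks like:
>
> import urlparse
>
> import scrapy
>
> class ExampleSpider(scrapy.Spider):
>     name = "example"
>     allowed_domains = ["example.com"]
>     start_urls = ["http://www.example.com/"]
>
>     def parse(self, response):
>         # Resolve each href against the current page URL so relative
>         # links become absolute, then queue them as new Requests.
>         for href in response.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield scrapy.Request(url, callback=self.parse_page)
>
>     def parse_page(self, response):
>         self.log("got %s" % response.url)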
>
> That make sense?
>
> On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
>
>> There has to be something really simple that I'm missing. I'm trying to
>> get it to crawl more than one page, but I'm using a section of the page as
>> a starting point for testing. I can't get it to crawl anything beyond the
>> index page. What am I doing wrong?
>>
>> import scrapy
>> from scrapy.contrib.spiders import CrawlSpider, Rule
>> from africanstudies.items import AfricanstudiesItem
>> from scrapy.contrib.linkextractors import LinkExtractor
>>
>> class DmozSpider(CrawlSpider):
>>     name = "africanstudies"
>>     allowed_domains = ["northwestern.edu"]
>>     start_urls = [
>>         "http://www.northwestern.edu/african-studies/about/"
>>     ]
>>
>>     def parse(self, response):
>>         for sel in response.xpath('//div[2]/div[1]'):
>>             item = AfricanstudiesItem()
>>             item['url'] = response.url
>>             item['title'] = sel.xpath(
>>                 'div[3]/*[@id="green_title"]/text()').extract()
>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>             yield item
>>
>>