Re: A little help for a new scrapy user?

Tina C Wed, 19 Nov 2014 14:04:28 -0800

So, I have it crawling, but it doesn't crawl the correct area/site. If I 
use 'allowed_domains', it doesn't crawl anything. If I remove it, it crawls 
too many things. Here's the updated code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
import urlparse


class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    allowed_domains = ["northwestern.edu/african-studies"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]
    
    rules = (Rule(LinkExtractor(allow=(r'')), callback='parse_links', follow=True),)
    
    def parse_links(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            url = urlparse.urljoin(response.url, link)
            yield Request(url, callback = self.parse_items,)
       
    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item
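
A note on the `allowed_domains` symptom described above: Scrapy's offsite filtering compares that setting against the hostname of each request, so a value that contains a path, such as "northwestern.edu/african-studies", would not be expected to match any URL. The sketch below only approximates that comparison with the standard library; `host_allowed` and `SECTION` are hypothetical names, not Scrapy's own code. It assumes the intent is to keep the bare domain in `allowed_domains` and restrict the crawl to the section with a path pattern instead (for example, a regex like this one passed as the LinkExtractor's `allow` argument).

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as in the spider above


def host_allowed(url, allowed_domains):
    """Rough approximation of an offsite check (illustration only).

    The URL's hostname must equal an allowed domain or be a subdomain of it.
    """
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical path pattern that limits the crawl to one section of the site.
SECTION = re.compile(r"/african-studies/")

url = "http://www.northwestern.edu/african-studies/about/"

# A value containing a path can never equal, or be a suffix of, a hostname.
print(host_allowed(url, ["northwestern.edu/african-studies"]))  # False

# A bare domain matches www.northwestern.edu.
print(host_allowed(url, ["northwestern.edu"]))  # True

# The path restriction is handled separately by the pattern.
print(bool(SECTION.search(url)))  # True
```

If this reading is right, it would explain both symptoms: with the path in `allowed_domains` every followed link is treated as offsite, and without the setting nothing limits the crawl at all.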

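For reference, this is how the `urljoin` call in `parse_links` resolves different kinds of `href` values against the page URL. It is a small standalone sketch; the `href` values are made-up examples, not links taken from the actual site.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the spider above

base = "http://www.northwestern.edu/african-studies/about/"

# Hypothetical href values of the kinds a page might contain.
hrefs = [
    "staff.html",                # relative to the current directory
    "../events/",                # relative to the parent directory
    "/african-studies/people/",  # relative to the site root
    "http://example.com/other",  # already absolute, returned unchanged
]

resolved = [urljoin(base, href) for href in hrefs]
for href, url in zip(hrefs, resolved):
    print("%-28s -> %s" % (href, url))
```

Because absolute links to other sites pass through unchanged, the requests yielded from `parse_links` still rely on the offsite filtering to stay within the intended domain.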





On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
>
> That's helpful, but I'm hung up on getting the spider to follow relative 
> links. I've tried a lot of things, but I think that I'm really close with 
> this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu/african-studies"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>     
>     rules = (Rule(LinkExtractor(allow=(r)),callback='parse_links',follow=True),)
>     
>     def parse_links(self, response):
>         sel = scrapy.Selector(response)
>         for href in sel.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield Request(url, callback = self.parse_items,)
>             
>        def parse_items(self, response):
>            self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()      
>             yield item
>
> I can see from my logs that it is skipping over the hard-coded links from 
> other domains (as it should). I thought this bit of code would cause the 
> spider to recognize my relative links, but it does not.
>
> Hopefully you can lend a hand and tell me what I'm doing wrong.
>
>
>
>
>
> On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>>
>> Hi Tina!
>>
>> Your code looks good, except it's missing logic that would give scrapy 
>> more pages to crawl.  (Scrapy won't grab links and crawl them by default; 
>> you have to indicate what you want to crawl.)
>>
>> I use one of two primary mechanisms:
>>
>> With the CrawlSpider, you can define a class variable called rules that 
>> defines rules for scrapy to consider when following links.  Often, I will 
>> define these rules based on a LinkExtractor object, which allows you to 
>> specify things like callbacks (what method to use in parsing a particular 
>> link), filters (you can modify the URL to remove session variables, etc.), 
>> limitations on links to extract (full gamut of css and xpath selectors 
>> available).  More information is at 
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>>
>> Sometimes, the rule-based link following just doesn't cut it.  (If you're 
>> using the scrapy.Spider spider class, the rules options aren't implemented, 
>> so you have to do it this way.)  If you yield a Request object from your 
>> parsing class, scrapy will add that to the queue to be scraped and 
>> processed.
>>
>> That make sense?
>>
>> On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
>>
>>> There has to be something really simple that I'm missing. I'm trying to 
>>> get it to crawl more than one page, but I'm using a section of the page as 
>>> a starting point for testing. I can't get it to crawl anything beyond the 
>>> index page. What am I doing wrong?
>>>
>>> import scrapy
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from africanstudies.items import AfricanstudiesItem
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>>
>>> class DmozSpider(CrawlSpider):
>>>     name = "africanstudies"
>>>     allowed_domains = ["northwestern.edu"]
>>>     start_urls = [
>>>         "http://www.northwestern.edu/african-studies/about/"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>             item = AfricanstudiesItem()
>>>             item['url'] = response.url
>>>             item['title'] = sel.xpath(
>>> 'div[3]/*[@id="green_title"]/text()').extract()      
>>>             item['desc'] = sel.xpath('div[4]/*').extract()      
>>>             yield item
>>>
>>>
>>>  -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "scrapy-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>

