Thanks, that worked perfectly!

On Wednesday, November 19, 2014 4:33:07 PM UTC-6, Travis Leleu wrote:
>
> Are you trying to crawl every link on northwestern.edu that is in the
> subdirectory african-studies? allowed_domains controls the domain name,
> not the path -- to limit to the /african-studies subdir, you'd put that
> information into the "allow" named parameter of the link extractor object.
>
> Assuming that's what you're trying to accomplish, try this:
>
>> import scrapy
>> from scrapy.contrib.spiders import CrawlSpider, Rule
>> from africanstudies.items import AfricanstudiesItem
>> from scrapy.contrib.linkextractors import LinkExtractor
>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>> from scrapy.http import Request
>> import urlparse
>>
>> class AfricanstudiesSpider(CrawlSpider):
>>     name = "africanstudies"
>>     allowed_domains = ["northwestern.edu"]
>>     start_urls = [
>>         "http://www.northwestern.edu/african-studies/about/"
>>     ]
>>
>>     rules = (Rule(LinkExtractor(allow=(r'african-studies')),
>>                   callback='parse_links', follow=True),)
>>
>>     def parse_links(self, response):
>>         links = response.xpath('//a/@href').extract()
>>         for link in links:
>>             url = urlparse.urljoin(response.url, link)
>>             yield Request(url, callback=self.parse_items)
>>
>>     def parse_items(self, response):
>>         self.log('Hi, this is an item page! %s' % response.url)
>>         for sel in response.xpath('//div[2]/div[1]'):
>>             item = AfricanstudiesItem()
>>             item['url'] = response.url
>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>             yield item
>>
>>
>> On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
>>>
>>> That's helpful, but I'm hung up on getting the spider to follow relative
>>> links. I've tried a lot of things, but I think that I'm really close with
>>> this:
>>>
>>> import scrapy
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from africanstudies.items import AfricanstudiesItem
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> import urlparse
>>>
>>> class AfricanstudiesSpider(CrawlSpider):
>>>     name = "africanstudies"
>>>     allowed_domains = ["northwestern.edu/african-studies"]
>>>     start_urls = [
>>>         "http://www.northwestern.edu/african-studies/about/"
>>>     ]
>>>
>>>     rules = (Rule(LinkExtractor(allow=(r)),
>>>                   callback='parse_links', follow=True),)
>>>
>>>     def parse_links(self, response):
>>>         sel = scrapy.Selector(response)
>>>         for href in sel.xpath('//a/@href').extract():
>>>             url = urlparse.urljoin(response.url, href)
>>>             yield Request(url, callback=self.parse_items)
>>>
>>>     def parse_items(self, response):
>>>         self.log('Hi, this is an item page! %s' % response.url)
>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>             item = AfricanstudiesItem()
>>>             item['url'] = response.url
>>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>>             yield item
>>>
>>> I can see from my logs that it is skipping over the hard-coded links from
>>> other domains (as it should). I thought this bit of code would cause the
>>> spider to recognize my relative links, but it does not.
>>>
>>> Hopefully you can lend a hand and tell me what I'm doing wrong.
>>>
>>>
>>> On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>>>>
>>>> Hi Tina!
>>>>
>>>> Your code looks good, except it's missing logic that would give scrapy
>>>> more pages to crawl. (Scrapy won't grab links and crawl them by default;
>>>> you have to indicate what you want to crawl.)
>>>>
>>>> I use one of two primary mechanisms:
>>>>
>>>> With the CrawlSpider, you can define a class variable called rules that
>>>> defines rules for scrapy to consider when following links. Often, I will
>>>> define these rules based on a LinkExtractor object, which lets you
>>>> specify things like callbacks (which method parses a particular link),
>>>> filters (you can modify the URL to remove session variables, etc.), and
>>>> limitations on which links to extract (the full gamut of css and xpath
>>>> selectors is available). More information is at
>>>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>>>>
>>>> Sometimes, the rule-based link following just doesn't cut it. (If
>>>> you're using the scrapy.Spider spider class, the rules option isn't
>>>> implemented, so you have to do it this way.) If you yield a Request
>>>> object from your parsing method, scrapy will add it to the queue to be
>>>> scraped and processed.
>>>>
>>>> That make sense?
>>>>
>>>> On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
>>>>
>>>>> There has to be something really simple that I'm missing. I'm trying
>>>>> to get it to crawl more than one page, but I'm using a section of the
>>>>> page as a starting point for testing. I can't get it to crawl anything
>>>>> beyond the index page. What am I doing wrong?
>>>>>
>>>>> import scrapy
>>>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>>>> from africanstudies.items import AfricanstudiesItem
>>>>> from scrapy.contrib.linkextractors import LinkExtractor
>>>>>
>>>>> class DmozSpider(CrawlSpider):
>>>>>     name = "africanstudies"
>>>>>     allowed_domains = ["northwestern.edu"]
>>>>>     start_urls = [
>>>>>         "http://www.northwestern.edu/african-studies/about/"
>>>>>     ]
>>>>>
>>>>>     def parse(self, response):
>>>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>>>             item = AfricanstudiesItem()
>>>>>             item['url'] = response.url
>>>>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>>>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>>>>             yield item
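For reference, here is a minimal, generic sketch of the two mechanisms Travis describes above: a rules tuple built from a LinkExtractor, and yielding Request objects from a callback. It uses the same scrapy.contrib import paths as the rest of this thread (Scrapy 0.24-era; newer releases moved these to scrapy.spiders and scrapy.linkextractors). The spider name, the example.com domain, and the /docs/ pattern are placeholders, not taken from the original posts:

import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request


class ExampleSpider(CrawlSpider):
    name = "example"                       # placeholder spider name
    allowed_domains = ["example.com"]      # bare domain only -- no path component
    start_urls = ["http://www.example.com/docs/"]

    # Mechanism 1: rule-based following.  `allow` is a regex that limits
    # which extracted URLs get followed, `callback` names the method that
    # parses each matched page, and follow=True keeps extracting links
    # from those pages as well.
    rules = (
        Rule(LinkExtractor(allow=(r'/docs/',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Mechanism 2: yield Request objects yourself.  Every Request
        # yielded from a callback goes back into Scrapy's crawl queue;
        # urlparse.urljoin resolves relative hrefs against the current page.
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse_page)

Dropped into a project, it runs with the usual "scrapy crawl example"; the only point is to show where the rules attribute and the manual yield each fit.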
