Actually, I was wrong -- it's not working. It is still crawling pages outside of the subdirectory, and moreover I'm not able to get anything in the /about/ subdirectory.
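If this is the same spider as in Travis's message below, one likely reason the crawl escapes the subdirectory is that parse_links yields a Request for every href on the page, which skips the Rule's allow filter -- only allowed_domains is applied to those requests, so anything on northwestern.edu gets queued. A minimal sketch that relies on the Rule alone (class, item, and selector names are taken from the thread; the anchored allow pattern is an untested assumption):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from africanstudies.items import AfricanstudiesItem

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    # keep allowed_domains to the bare domain; path filtering happens in the Rule
    allowed_domains = ["northwestern.edu"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    # follow only links whose URL contains /african-studies/ and parse each
    # matched page as an item page; follow=True keeps crawling from those pages
    rules = (
        Rule(LinkExtractor(allow=(r'/african-studies/',)),
             callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item

Note also that CrawlSpider sends the response for each start URL to parse_start_url rather than to the rule's callback, so the /about/ index page itself won't yield an item unless parse_start_url is overridden -- that may be part of why nothing comes back for /about/.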
On Thursday, November 20, 2014 10:23:40 AM UTC-6, Tina C wrote:
> Thanks, that worked perfectly!

On Wednesday, November 19, 2014 4:33:07 PM UTC-6, Travis Leleu wrote:
> Are you trying to crawl every link on northwestern.edu that is in the
> african-studies subdirectory? allowed_domains controls the domain name,
> not the path -- to limit the crawl to the /african-studies subdirectory,
> you'd put that information into the "allow" named parameter of the link
> extractor object.
>
> Assuming that's what you're trying to accomplish, try this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.http import Request
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     rules = (Rule(LinkExtractor(allow=(r'african-studies')),
>                   callback='parse_links', follow=True),)
>
>     def parse_links(self, response):
>         links = response.xpath('//a/@href').extract()
>         for link in links:
>             url = urlparse.urljoin(response.url, link)
>             yield Request(url, callback=self.parse_items)
>
>     def parse_items(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item

On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
> That's helpful, but I'm hung up on getting the spider to follow relative
> links. I've tried a lot of things, but I think that I'm really close with
> this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu/african-studies"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     rules = (Rule(LinkExtractor(allow=(r)),
>                   callback='parse_links', follow=True),)
>
>     def parse_links(self, response):
>         sel = scrapy.Selector(response)
>         for href in sel.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield Request(url, callback=self.parse_items)
>
>     def parse_items(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
>
> I can see from my logs that it is skipping over the hard-coded links from
> other domains (as it should). I thought this bit of code would cause the
> spider to recognize my relative links, but it does not.
>
> Hopefully you can lend a hand and tell me what I'm doing wrong.
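As a quick sanity check, a LinkExtractor can be run by hand against a response to see exactly which URLs a given allow pattern keeps, without starting a full crawl. A standalone sketch -- the sample HTML below is made up, and the import path matches the Scrapy version used in this thread (newer releases expose it as scrapy.linkextractors):

from scrapy.http import HtmlResponse
from scrapy.contrib.linkextractors import LinkExtractor

# a fake page with one in-section link, one sibling section, and one offsite link
body = """<html><body>
<a href="/african-studies/people/">people</a>
<a href="/history/">history</a>
<a href="http://www.example.com/">offsite</a>
</body></html>"""

response = HtmlResponse(url="http://www.northwestern.edu/african-studies/about/",
                        body=body, encoding='utf-8')

extractor = LinkExtractor(allow=(r'/african-studies/',))
for link in extractor.extract_links(response):
    print(link.url)  # only the /african-studies/ URL should survive the allow filter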
On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
> Hi Tina!
>
> Your code looks good, except it's missing the logic that would give scrapy
> more pages to crawl. (Scrapy won't grab links and crawl them by default;
> you have to indicate what you want to crawl.)
>
> I use one of two primary mechanisms:
>
> With the CrawlSpider, you can define a class variable called rules that
> defines rules for scrapy to consider when following links. Often, I will
> define these rules based on a LinkExtractor object, which allows you to
> specify things like callbacks (which method to use to parse a particular
> link), filters (you can modify the URL to remove session variables, etc.),
> and limitations on which links to extract (the full gamut of css and xpath
> selectors is available). More information is at
> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>
> Sometimes the rule-based link following just doesn't cut it. (If you're
> using the scrapy.Spider class, the rules option isn't implemented, so you
> have to do it this way.) If you yield a Request object from your parse
> callback, scrapy will add it to the queue to be scraped and processed.
>
> That make sense?

On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
> There has to be something really simple that I'm missing. I'm trying to
> get it to crawl more than one page, but I'm using a section of the site as
> a starting point for testing. I can't get it to crawl anything beyond the
> index page. What am I doing wrong?
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
>
> class DmozSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     def parse(self, response):
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
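For completeness, a minimal sketch of the second mechanism Travis describes above: a plain scrapy.Spider with no rules, where the parse callback itself yields Request objects for the pages to visit next. The spider name and the '/african-studies/' substring check are illustrative, not from the thread:

import urlparse

import scrapy
from scrapy.http import Request

class FollowLinksSpider(scrapy.Spider):
    name = "followlinks"
    allowed_domains = ["northwestern.edu"]
    start_urls = ["http://www.northwestern.edu/african-studies/about/"]

    def parse(self, response):
        # scrape the current page here (item extraction omitted for brevity),
        # then hand every in-section link back to scrapy's scheduler as a
        # new Request; the built-in duplicate filter avoids re-visiting URLs
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            if '/african-studies/' in url:
                yield Request(url, callback=self.parse)

With a plain Spider, allowed_domains still keeps the crawl on northwestern.edu, but any path restriction has to be applied by hand, as in the substring check above.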
