Are you trying to crawl every link on northwestern.edu that lives under the /african-studies subdirectory? allowed_domains controls the domain name, not the path -- to limit the crawl to /african-studies, you'd put that restriction into the "allow" named parameter of the link extractor object instead.
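If you want to check what a given allow pattern actually matches before wiring it into a Rule, you can call the extractor directly against a response, e.g. inside scrapy shell. A quick sketch (the pattern here is just the one from your use case; extract_links is the documented LinkExtractor method in the 0.24-era docs):

from scrapy.contrib.linkextractors import LinkExtractor

# Inside `scrapy shell http://www.northwestern.edu/african-studies/about/`,
# a `response` object is already defined for you:
le = LinkExtractor(allow=r'african-studies')
for link in le.extract_links(response):
    print link.url   # only URLs whose path matches 'african-studies'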
Assuming that's what you're trying to accomplish, try this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request
import urlparse

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    # Domain only -- the path restriction lives in the allow= regex below.
    allowed_domains = ["northwestern.edu"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r'african-studies')),
                  callback='parse_links', follow=True),)

    def parse_links(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            url = urlparse.urljoin(response.url, link)
            yield Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item

On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
>
> That's helpful, but I'm hung up on getting the spider to follow relative
> links. I've tried a lot of things, but I think that I'm really close with
> this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu/african-studies"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     rules = (Rule(LinkExtractor(allow=(r)), callback='parse_links', follow=True),)
>
>     def parse_links(self, response):
>         sel = scrapy.Selector(response)
>         for href in sel.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield Request(url, callback=self.parse_items)
>
>     def parse_items(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
>
> I can see from my logs that it is skipping over the hard-coded links from
> other domains (as it should). I thought this bit of code would cause the
> spider to recognize my relative links, but it does not.
>
> Hopefully you can lend a hand and tell me what I'm doing wrong.
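(A side note on the relative-link worry above: urlparse.urljoin was already resolving relative hrefs correctly -- the resulting requests were most likely being dropped afterwards by the offsite filter, because allowed_domains contained a path. A quick sanity check of the join itself, in plain Python 2 to match the urlparse import above; the "../people/" href is made up for illustration, not taken from the actual site:

import urlparse

# A relative href is resolved against the URL of the page it appeared on:
print urlparse.urljoin("http://www.northwestern.edu/african-studies/about/",
                       "../people/")
# prints: http://www.northwestern.edu/african-studies/people/
)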
> On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>>
>> Hi Tina!
>>
>> Your code looks good, except it's missing logic that would give scrapy
>> more pages to crawl. (Scrapy won't grab links and crawl them by default;
>> you have to indicate what you want to crawl.)
>>
>> I use one of two primary mechanisms:
>>
>> With the CrawlSpider, you can define a class variable called rules that
>> defines rules for scrapy to consider when following links. Often, I will
>> define these rules based on a LinkExtractor object, which allows you to
>> specify things like callbacks (what method to use in parsing a particular
>> link), filters (you can modify the URL to remove session variables, etc.),
>> and limitations on links to extract (the full gamut of css and xpath
>> selectors is available). More information is at
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>>
>> Sometimes, the rule-based link following just doesn't cut it. (If
>> you're using the scrapy.Spider spider class, the rules option isn't
>> implemented, so you have to do it this way.) If you yield a Request object
>> from your parse method, scrapy will add that URL to the queue to be
>> scraped and processed.
>>
>> That make sense?
>>
>> On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:
>>
>>> There has to be something really simple that I'm missing. I'm trying to
>>> get it to crawl more than one page, but I'm using a section of the page as
>>> a starting point for testing. I can't get it to crawl anything beyond the
>>> index page. What am I doing wrong?
>>>
>>> import scrapy
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from africanstudies.items import AfricanstudiesItem
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>>
>>> class DmozSpider(CrawlSpider):
>>>     name = "africanstudies"
>>>     allowed_domains = ["northwestern.edu"]
>>>     start_urls = [
>>>         "http://www.northwestern.edu/african-studies/about/"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>             item = AfricanstudiesItem()
>>>             item['url'] = response.url
>>>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>>             yield item
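P.S. A quick sketch of the second mechanism Travis describes above -- yielding Request objects yourself from a plain scrapy.Spider, where rules aren't implemented. The spider name and the '/african-studies/' substring test are made up for illustration; the imports match the scrapy.contrib-era version used elsewhere in this thread:

import scrapy
from scrapy.http import Request
import urlparse

class ManualFollowSpider(scrapy.Spider):
    name = "manualfollow"
    allowed_domains = ["northwestern.edu"]
    start_urls = ["http://www.northwestern.edu/african-studies/about/"]

    def parse(self, response):
        # Extract whatever you need from this page first...
        self.log("visited %s" % response.url)
        # ...then hand scrapy more URLs by yielding Request objects;
        # the scheduler de-duplicates, so re-yielding seen URLs is safe.
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            if '/african-studies/' in url:
                yield Request(url, callback=self.parse)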
