So, I have it crawling, but it doesn't crawl the correct part of the site.
If I set 'allowed_domains', it doesn't crawl anything. If I remove it, it
crawls too many things. Here's the updated code:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from africanstudies.items import AfricanstudiesItem
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
import urlparse

class AfricanstudiesSpider(CrawlSpider):
    name = "africanstudies"
    allowed_domains = ["northwestern.edu/african-studies"]
    start_urls = [
        "http://www.northwestern.edu/african-studies/about/"
    ]

    rules = (Rule(LinkExtractor(allow=(r'')), callback='parse_links', follow=True),)

    def parse_links(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            url = urlparse.urljoin(response.url, link)
            yield Request(url, callback=self.parse_items)

    def parse_items(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        for sel in response.xpath('//div[2]/div[1]'):
            item = AfricanstudiesItem()
            item['url'] = response.url
            item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
            item['desc'] = sel.xpath('div[4]/*').extract()
            yield item

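
A side note on the allowed_domains behaviour described above. As far as I know, Scrapy's offsite filter compares each allowed_domains entry against the hostname of a request only, so an entry that contains a path ("northwestern.edu/african-studies") can never match and every extracted link is dropped. The sketch below is a toy model of that idea in plain Python 3, not Scrapy's actual code; host_allowed and in_section are hypothetical helper names, and the /admissions/ URL is made up for illustration.

```python
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Toy host-based offsite check: only the hostname is compared."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def in_section(url, prefix="/african-studies/"):
    """Keep only URLs whose path lies under the given section prefix."""
    return urlparse(url).path.startswith(prefix)


url = "http://www.northwestern.edu/african-studies/about/"
other = "http://www.northwestern.edu/admissions/"  # hypothetical page elsewhere on the site

# An entry containing a path can never equal a hostname, so nothing passes.
print(host_allowed(url, ["northwestern.edu/african-studies"]))  # False
# A bare domain passes, but so does every other section of the site.
print(host_allowed(url, ["northwestern.edu"]))    # True
print(host_allowed(other, ["northwestern.edu"]))  # True
# The section restriction has to come from a path test instead.
print(in_section(other))  # False
print(in_section(url))    # True
```

In the spider itself this would presumably translate to allowed_domains = ["northwestern.edu"] plus an allow pattern such as r'/african-studies/' on the LinkExtractor, so the domain list handles the host and the pattern handles the section.
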
On Wednesday, November 19, 2014 3:15:49 PM UTC-6, Tina C wrote:
>
> That's helpful, but I'm hung up on getting the spider to follow relative
> links. I've tried a lot of things, but I think that I'm really close with
> this:
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> import urlparse
>
> class AfricanstudiesSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu/african-studies"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     rules = (Rule(LinkExtractor(allow=(r)), callback='parse_links', follow=True),)
>
>     def parse_links(self, response):
>         sel = scrapy.Selector(response)
>         for href in sel.xpath('//a/@href').extract():
>             url = urlparse.urljoin(response.url, href)
>             yield Request(url, callback=self.parse_items)
>
>     def parse_items(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>             item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
>
> I can see from my logs that it is skipping over the hard-coded links from
> other domains (as it should). I thought this bit of code would cause the
> spider to recognize my relative links, but it does not.
>
> Hopefully you can lend a hand and tell me what I'm doing wrong.
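
On the relative-link question: urljoin already does the right thing, so the joining step is unlikely to be the problem. The small check below shows how it resolves a few kinds of href against the start URL. It is written for Python 3 (the quoted code uses the Python 2 urlparse module, which has the same urljoin function), and the href values are made up for illustration. My understanding is also that LinkExtractor returns absolute URLs on its own, so the manual join may not be needed when a Rule is used.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.northwestern.edu/african-studies/about/"

# Hypothetical href values, chosen only to cover the common cases.
hrefs = [
    "staff.html",                # relative to the current directory
    "../events/",                # relative to the parent directory
    "/african-studies/people/",  # root-relative
    "http://example.com/x",      # already absolute, returned unchanged
]

resolved = [urljoin(base, h) for h in hrefs]
for href, url in zip(hrefs, resolved):
    print(href, "->", url)
```

Since all of these resolve to full URLs, a spider that still skips them is more likely filtering them afterwards, for example through an allowed_domains entry that contains a path, because that setting is matched against hostnames.
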
>
> On Tuesday, November 18, 2014 3:36:46 PM UTC-6, Travis Leleu wrote:
>>
>> Hi Tina!
>>
>> Your code looks good, except it's missing logic that would give scrapy
>> more pages to crawl. (Scrapy won't grab links and crawl them by default;
>> you have to indicate what you want to crawl.)
>>
>> I use one of two primary mechanisms:
>>
>> With the CrawlSpider, you can define a class variable called rules that
>> defines rules for scrapy to consider when following links. Often, I will
>> define these rules based on a LinkExtractor object, which allows you to
>> specify things like callbacks (what method to use in parsing a particular
>> link), filters (you can modify the URL to remove session variables, etc.),
>> limitations on links to extract (full gamut of css and xpath selectors
>> available). More information is at
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
>>
>> Sometimes, the rule-based link following just doesn't cut it. (If you're
>> using the scrapy.Spider spider class, the rules option isn't implemented,
>> so you have to do it this way.) If you yield a Request object from your
>> parsing callback, scrapy will add that to the queue to be scraped and
>> processed.
>>
>> That make sense?
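
To make the second mechanism concrete, here is a toy model in plain Python 3 of what "yield a request and it joins the queue" amounts to. It is only a sketch under loud assumptions: PAGES is a made-up in-memory site standing in for real downloads, and HrefCollector, parse and crawl are hypothetical names. Scrapy's real scheduler, duplicate filter and Request objects are considerably more elaborate.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> HTML body, standing in for real downloads.
PAGES = {
    "http://example.com/a/": '<a href="b.html">b</a> <a href="/a/c/">c</a>',
    "http://example.com/a/b.html": '<a href="/a/">home</a>',
    "http://example.com/a/c/": "no links here",
}


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def parse(url, body):
    """Play the role of a spider callback: yield the URLs to visit next."""
    collector = HrefCollector()
    collector.feed(body)
    for href in collector.hrefs:
        yield urljoin(url, href)


def crawl(start_url):
    """Visit pages breadth-first; each yielded URL joins the queue once."""
    queue, seen, visited = deque([start_url]), {start_url}, []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for next_url in parse(url, PAGES.get(url, "")):
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return visited


print(crawl("http://example.com/a/"))
```

If parse yielded nothing, crawl would return only the start URL, which matches the symptom in the original question: a callback that yields items but no new requests never gets past the index page.
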
>>
>> On Tue, Nov 18, 2014 at 11:54 AM, Tina C wrote:
>>
>>> There has to be something really simple that I'm missing. I'm trying to
>>> get it to crawl more than one page, but I'm using a section of the page as
>>> a starting point for testing. I can't get it to crawl anything beyond the
>>> index page. What am I doing wrong?
>>>
>>> import scrapy
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from africanstudies.items import AfricanstudiesItem
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>>
>>> class DmozSpider(CrawlSpider):
>>>     name = "africanstudies"
>>>     allowed_domains = ["northwestern.edu"]
>>>     start_urls = [
>>>         "http://www.northwestern.edu/african-studies/about/"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         for sel in response.xpath('//div[2]/div[1]'):
>>>             item = AfricanstudiesItem()
>>>             item['url'] = response.url
>>>             item['title'] = sel.xpath(
>>>                 'div[3]/*[@id="green_title"]/text()').extract()
>>>             item['desc'] = sel.xpath('div[4]/*').extract()
>>>             yield item
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "scrapy-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>