Hi Tina!

Your code looks good, except that it's missing the logic that would give
Scrapy more pages to crawl.  (Scrapy won't grab links and crawl them by
default; you have to tell it which links to follow.)

I use one of two primary mechanisms:

With the CrawlSpider, you can define a class variable called rules that
tells Scrapy which links to follow and what to do with them.  I usually
build each Rule around a LinkExtractor object, which lets you specify
things like a callback (the method used to parse pages matched by that
rule), filters (you can rewrite extracted URLs to strip session variables,
etc.), and limits on which links get extracted (the full gamut of CSS and
XPath selectors is available).  One caveat: when you use CrawlSpider,
don't override parse() itself (CrawlSpider uses parse internally to apply
the rules); give your callback another name, such as parse_item.  More
information is at
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
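
For example, here's a minimal sketch of that approach (the spider name,
domain, and the /about/ pattern below are placeholders, not values from
your project):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class AboutSpider(CrawlSpider):
    name = "about"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # Follow every link whose URL matches /about/ and hand each response
    # to parse_item; follow=True means links found on those pages are also
    # run through the rules.
    rules = (
        Rule(LinkExtractor(allow=(r'/about/',)), callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        # Build and yield your items here (e.g. your AfricanstudiesItem);
        # for the sketch, just log which page was crawled.
        self.log("Crawled %s" % response.url)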

Sometimes, the rule-based link following just doesn't cut it.  (And if
you're using the plain scrapy.Spider class rather than CrawlSpider, rules
aren't supported at all, so you have to do it this way.)  If you yield a
Request object from your parse callback, Scrapy adds it to the queue of
pages to fetch, and the response comes back through whatever callback you
gave the Request.
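
A rough sketch of that pattern (again, the domain and the //a/@href XPath
are placeholders; adjust them to what you actually want to follow):

import urlparse

import scrapy

class ManualFollowSpider(scrapy.Spider):
    name = "manual_follow"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # ... build and yield your items from this page as usual ...

        # Then queue more pages: every Request you yield is added to the
        # crawl queue, and its response is passed to the callback you give
        # it (here, this same parse method).
        for href in response.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url, callback=self.parse)

Requests to other domains get dropped automatically because of
allowed_domains, so you can be fairly liberal about which links you queue.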

Does that make sense?

On Tue, Nov 18, 2014 at 11:54 AM, Tina C <[email protected]> wrote:

> There has to be something really simple that I'm missing. I'm trying to
> get it to crawl more than one page, but I'm using a section of the page as
> a starting point for testing. I can't get it to crawl anything beyond the
> index page. What am I doing wrong?
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from africanstudies.items import AfricanstudiesItem
> from scrapy.contrib.linkextractors import LinkExtractor
>
> class DmozSpider(CrawlSpider):
>     name = "africanstudies"
>     allowed_domains = ["northwestern.edu"]
>     start_urls = [
>         "http://www.northwestern.edu/african-studies/about/"
>     ]
>
>     def parse(self, response):
>         for sel in response.xpath('//div[2]/div[1]'):
>             item = AfricanstudiesItem()
>             item['url'] = response.url
>         item['title'] = sel.xpath('div[3]/*[@id="green_title"]/text()').extract()
>             item['desc'] = sel.xpath('div[4]/*').extract()
>             yield item
>
>
