Sorry to necro this / bump, but this thread was incredibly helpful in getting my first CrawlSpider running.
I'm really disappointed in the documentation for scrapy, because there are some serious errors. The documentation states that the allow value of the link extractor takes "a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links." That is simply not what I observe. It seems to search for the pattern anywhere in the URL, so allow='african-studies' matches even though that string is only a fragment of the path and the whole absolute URL certainly doesn't match it. But I'm not sure, since I'm totally new to scrapy and can't trust the documentation. (There's a quick sketch of this at the bottom of this post, below the quote.)

I also still don't know why follow=True needs to be explicitly included (it did not work without follow=True for me). Rereading the docs, they say follow defaults to True only when callback is None; once a callback is set it apparently defaults to False, which would explain what I saw, but that is very easy to miss. I still don't understand the overall behavior: callback=None recursively crawls the matched URLs via the default parse method, callback='mycallback' on its own does not follow links, but follow=True together with callback='mycallback' (seems to) both follow links and invoke the callback on each matched page. IMO this example of a simple recursive CrawlSpider should be in the documentation. (Second sketch below summarizes the three combinations.)

Finally, the only thing I have to add is that unless your allow pattern is written as a raw string, r'regex', Python will consume the backslashes as escape characters before the regex engine ever sees them. (Third sketch below.)

On Friday, November 21, 2014 4:52:58 PM UTC-5, Tina C wrote:
>
> Just to update (and to serve as an archive for anyone searching for a
> similar answer), I was really close with the previous code snippets I
> listed. The problem was that the information contained in my callback was
> canceling out my rules. Here's my updated code (I'm only grabbing the URLs
> at this point) and it seems to work.
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors import LinkExtractor
> from africanstudies.items import AfricanstudiesItem
>
> class MySpider(CrawlSpider):
>     name = 'africanstudies'
>     allowed_domains = ['northwestern.edu']
>     start_urls = ['http://www.northwestern.edu/african-studies']
>
>     rules = (
>         Rule(LinkExtractor(allow='african-studies'), follow=True, callback='parse_item'),
>     )
>
>     def parse_item(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         item = AfricanstudiesItem()
>         item['url'] = response.url
>         return item
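Sketch 1: my best guess, from poking at it, is that allow is applied with Python's re.search against the absolute URL, i.e. a "match anywhere" search rather than a full match. This is plain re, not scrapy, and the exact URL is made up, but it shows why 'african-studies' matched for Tina even though it only appears mid-URL:

import re

url = 'http://www.northwestern.edu/african-studies/events.html'

# re.match anchors at the start of the string, so this fails --
# roughly what the docs' "urls must match" wording led me to expect.
print(re.match('african-studies', url))    # None

# re.search finds the pattern anywhere in the string, so this succeeds --
# which lines up with the behavior we both actually saw.
print(re.search('african-studies', url))   # match object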
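Sketch 2: here's how the three follow/callback combinations behaved for me, written up as comments on a throwaway spider. The names are made up and this is only what I observed on my own crawls, so treat it as a sketch, not documentation:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class RuleDemoSpider(CrawlSpider):
    # throwaway names, just to make the rules concrete
    name = 'ruledemo'
    allowed_domains = ['northwestern.edu']
    start_urls = ['http://www.northwestern.edu/african-studies']

    rules = (
        # 1) callback omitted: follow defaults to True, matched links are
        #    crawled recursively, but no custom callback ever runs.
        #Rule(LinkExtractor(allow='african-studies')),

        # 2) callback only: parse_item runs on matched pages, but links on
        #    those pages were NOT followed any further for me.
        #Rule(LinkExtractor(allow='african-studies'), callback='parse_item'),

        # 3) both: links are followed recursively AND parse_item runs on
        #    every matched page. This is the combination that worked.
        Rule(LinkExtractor(allow='african-studies'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('matched: %s' % response.url)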

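Sketch 3: and a tiny demo of the raw-string thing, because it cost me an evening. Without the r prefix, Python itself consumes the backslashes before re (and therefore scrapy) ever sees the pattern; the '\b' word-boundary pattern below is made up for illustration, not from Tina's spider:

import re

print(len('\t'))     # 1 -- Python turned '\t' into a single tab character
print(len(r'\t'))    # 2 -- a literal backslash followed by 't'

# So a pattern like '\barticle\b' silently becomes backspace characters:
print(re.search('\barticle\b', 'an article here'))    # None
print(re.search(r'\barticle\b', 'an article here'))   # matches 'article'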