Sorry to necro this / bump, but this thread was incredibly helpful in getting my first CrawlSpider running.
I'm really disappointed in the documentation for scrapy, because there are some serious errors. The documentation states that the allow value of the link extractor takes "a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links." That is simply not what I observe. It seems to search for the pattern anywhere in the URL, so allow='african-studies' matches even though that string is only a fragment of the path and the whole absolute URL certainly doesn't match it. But I'm not sure, since I'm totally new to scrapy and can't trust the documentation. (There's a quick sketch of this at the bottom of this post, below the quote.)

I also still don't know why follow=True needs to be explicitly included (it did not work without follow=True for me). Rereading the docs, they say follow defaults to True only when callback is None; once a callback is set it apparently defaults to False, which would explain what I saw, but that is very easy to miss. I still don't understand the overall behavior: callback=None recursively crawls the matched URLs via the default parse method, callback='mycallback' on its own does not follow links, but follow=True together with callback='mycallback' (seems to) both follow links and invoke the callback on each matched page. IMO this example of a simple recursive CrawlSpider should be in the documentation. (Second sketch below summarizes the three combinations.)

Finally, the only thing I have to add is that unless your allow pattern is written as a raw string, r'regex', Python will consume the backslashes as escape characters before the regex engine ever sees them. (Third sketch below.)

On Friday, November 21, 2014 4:52:58 PM UTC-5, Tina C wrote:
>
> Just to update (and to serve as an archive for anyone searching for a
> similar answer), I was really close with the previous code snippets I
> listed. The problem was that the information contained in my callback was
> canceling out my rules. Here's my updated code (I'm only grabbing the URLs
> at this point) and it seems to work.
>
> import scrapy
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors import LinkExtractor
> from africanstudies.items import AfricanstudiesItem
>
> class MySpider(CrawlSpider):
>     name = 'africanstudies'
>     allowed_domains = ['northwestern.edu']
>     start_urls = ['http://www.northwestern.edu/african-studies']
>
>     rules = (
>         Rule(LinkExtractor(allow='african-studies'), follow=True, callback='parse_item'),
>     )
>
>     def parse_item(self, response):
>         self.log('Hi, this is an item page! %s' % response.url)
>         item = AfricanstudiesItem()
>         item['url'] = response.url
>         return item
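Sketch 1: my best guess, from poking at it, is that allow is applied with Python's re.search against the absolute URL, i.e. a "match anywhere" search rather than a full match. This is plain re, not scrapy, and the exact URL is made up, but it shows why 'african-studies' matched for Tina even though it only appears mid-URL:

import re

url = 'http://www.northwestern.edu/african-studies/events.html'

# re.match anchors at the start of the string, so this fails --
# roughly what the docs' "urls must match" wording led me to expect.
print(re.match('african-studies', url))    # None

# re.search finds the pattern anywhere in the string, so this succeeds --
# which lines up with the behavior we both actually saw.
print(re.search('african-studies', url))   # match object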
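Sketch 2: here's how the three follow/callback combinations behaved for me, written up as comments on a throwaway spider. The names are made up and this is only what I observed on my own crawls, so treat it as a sketch, not documentation:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class RuleDemoSpider(CrawlSpider):
    # throwaway names, just to make the rules concrete
    name = 'ruledemo'
    allowed_domains = ['northwestern.edu']
    start_urls = ['http://www.northwestern.edu/african-studies']

    rules = (
        # 1) callback omitted: follow defaults to True, matched links are
        #    crawled recursively, but no custom callback ever runs.
        #Rule(LinkExtractor(allow='african-studies')),

        # 2) callback only: parse_item runs on matched pages, but links on
        #    those pages were NOT followed any further for me.
        #Rule(LinkExtractor(allow='african-studies'), callback='parse_item'),

        # 3) both: links are followed recursively AND parse_item runs on
        #    every matched page. This is the combination that worked.
        Rule(LinkExtractor(allow='african-studies'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('matched: %s' % response.url)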

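Sketch 3: and a tiny demo of the raw-string thing, because it cost me an evening. Without the r prefix, Python itself consumes the backslashes before re (and therefore scrapy) ever sees the pattern; the '\b' word-boundary pattern below is made up for illustration, not from Tina's spider:

import re

print(len('\t'))     # 1 -- Python turned '\t' into a single tab character
print(len(r'\t'))    # 2 -- a literal backslash followed by 't'

# So a pattern like '\barticle\b' silently becomes backspace characters:
print(re.search('\barticle\b', 'an article here'))    # None
print(re.search(r'\barticle\b', 'an article here'))   # matches 'article'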