EDIT---
For CrawlSpider to follow links, the following is needed inside
parse_item:

    def parse_item(self, response):
        # Scrape here
        ...
        # CrawlSpider's parse() applies the rules to this response,
        # yielding requests for the links to follow.
        yield from self.parse(response)
 
Source: http://stackoverflow.com/a/28904922/424438
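
For reference, here's a minimal sketch of the whole spider with that change
applied (untested, and assuming the usual Scrapy imports plus the constructor
arguments from your post):

    from scrapy import Request
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule, SitemapSpider

    class SeoScraperSpider(SitemapSpider, CrawlSpider):
        name = "megascraper"
        rules = (Rule(LinkExtractor(allow=('',)), callback='parse_item', follow=True),)
        sitemap_rules = [('', 'parse_item')]

        def __init__(self, domains=None, urls=None, sitemaps=None, *args, **kwargs):
            super(SeoScraperSpider, self).__init__(*args, **kwargs)
            self.allowed_domains = [domains]
            self.sitemap_urls = [sitemaps]
            with open(urls) as csv_file:
                self.start_urls = [url.strip() for url in csv_file.readlines()]

        def start_requests(self):
            # SitemapSpider.start_requests() only yields the sitemap requests,
            # so add the plain start_urls on top of them.
            requests = list(super(SeoScraperSpider, self).start_requests())
            requests += [Request(url) for url in self.start_urls]
            return requests

        def parse_item(self, response):
            # Scrape here
            ...
            # Hand the response back to CrawlSpider.parse() so the rules
            # are applied and further links get followed.
            yield from self.parse(response)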

On Tuesday, May 24, 2016 at 8:44:50 PM UTC+2, Antoine Brunel wrote:
>
> Hello,
>
> I'm willing to implement a scraper that derives both from SitemapSpider 
> & CrawlSpider to find all the possible urls of a website.
> The following code seems to work perfectly, I'm just asking for an 
> external opinion, just in case I'm missing something:
>
>  
>
>> class SeoScraperSpider(SitemapSpider, CrawlSpider):
>>     name = "megascraper"
>>     rules = ( Rule(LinkExtractor(allow=('', )), callback='parse_item', follow=True), )
>>     sitemap_rules = [ ('', 'parse_item'), ]
>>
>  
>
>>     def __init__(self, domains=None, urls=None, sitemaps=None, *args, **kwargs):
>>         super(SeoScraperSpider, self).__init__(*args, **kwargs)
>>         self.allowed_domains = [domains]
>>         self.sitemap_urls = [sitemaps]
>>         with open(urls) as csv_file:
>>             self.start_urls = [url.strip() for url in csv_file.readlines()]
>>
>  
>
>>     def start_requests(self):
>>         # Required for SitemapSpider
>>         requests = list(super(SeoScraperSpider, self).start_requests())
>>         requests += [Request(url) for url in self.start_urls]        
>>         return requests
>>
>  
>
>>     def parse_item(self, response):
>>         # Scrape here
>>         ...
>
>
> Thanks!
>
