Hello,
I'd like to implement a scraper that derives from both SitemapSpider
and CrawlSpider to find all the possible URLs of a website.
The following code seems to work fine; I'm just asking for an external
opinion, in case I'm missing something:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule, SitemapSpider


class SeoScraperSpider(SitemapSpider, CrawlSpider):
    name = "megascraper"
    # Follow every link and hand each crawled page to parse_item
    rules = (
        Rule(LinkExtractor(allow=('',)), callback='parse_item', follow=True),
    )
    # Every URL found in the sitemaps also goes to parse_item
    sitemap_rules = [('', 'parse_item')]

    def __init__(self, domains=None, urls=None, sitemaps=None, *args, **kwargs):
        super(SeoScraperSpider, self).__init__(*args, **kwargs)
        self.allowed_domains = [domains]
        self.sitemap_urls = [sitemaps]
        # Start URLs are read from a CSV file, one URL per line
        with open(urls) as csv_file:
            self.start_urls = [url.strip() for url in csv_file.readlines()]

    def start_requests(self):
        # SitemapSpider.start_requests() builds the requests for sitemap_urls
        requests = list(super(SeoScraperSpider, self).start_requests())
        # ...and the plain start URLs are appended on top of them
        requests += [Request(url) for url in self.start_urls]
        return requests

    def parse_item(self, response):
        # Scrape here
        ...
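
For reference, the three arguments are passed with -a when launching the crawl; the domain, sitemap URL and CSV file below are just placeholder values:

scrapy crawl megascraper -a domains=example.com -a sitemaps=https://example.com/sitemap.xml -a urls=start_urls.csv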
Thanks!