Ok, so in the end I used just a normal Spider.

For anyone wondering, this is my parse function now:

    def parse(self, response):
        pages_done = self.crawler.stats.get_value('downloader/response_count')
        pages_todo = (self.crawler.stats.get_value('scheduler/enqueued')
                      - self.crawler.stats.get_value('downloader/response_count'))
        log.msg("URL: %s (%s) Crawled %d pages. To Crawl: %d"
                % (self.start_urls[0], self.url_id, pages_done, pages_todo),
                spider=self)
        soup = BeautifulSoup(response.body, "html5lib")
        links = []
        for tag in self.tags:
            for a in soup.find_all(tag):
                for attr in self.attrs:
                    if attr in a.attrs:
                        href = a.attrs[attr]
                        if href.startswith("http"):
                            links.append(href)
                        href = urlparse.urljoin(response.url, href)
                        href_parts = urlparse.urlparse(
                            href.replace('\t', '').replace('\r', '')
                                .replace('\n', '').replace(' ', '+'))
                        if re.match(self.allow, href_parts.path):
                            yield Request(href)
        # drop script/style content so get_text() only returns visible text
        for script in soup(["script", "style"]):
            script.extract()
        item = DomainItem()
        item["url"] = response.url
        #item["text"] = re.sub(r'\s{2,}', ' ', remove_tags(' '.join(response.xpath('//body//text()').extract()))).strip()
        item["text"] = soup.get_text()
        item["links"] = links
        self.crawler.stats.inc_value('pages_crawled')
        yield item


I created this extension of Spider by passing it an extra "allow" parameter,
which is used to check whether a link's path satisfies my constraint. I do
not check the domain, since that is handled automatically by Scrapy's
standard "allowed_domains" check. I also pass "tags" and "attrs", which are
used in the BeautifulSoup loop to make sure I capture every tag and
attribute that might contain a link to follow. In this way the Spider
behaves very much like a CrawlSpider.
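
For reference, this is roughly how the spider is set up; the class name,
the defaults and the constructor signature below are illustrative rather
than my exact code:

    import urlparse

    from scrapy import Spider


    class DomainSpider(Spider):
        name = "domain"

        def __init__(self, url=None, url_id=None, allow=".*",
                     tags=("a", "area", "link"), attrs=("href", "src"),
                     *args, **kwargs):
            super(DomainSpider, self).__init__(*args, **kwargs)
            self.start_urls = [url]
            self.url_id = url_id
            self.allow = allow   # regex matched against the path in parse()
            self.tags = tags     # tags scanned for outgoing links
            self.attrs = attrs   # attributes that may hold a URL
            # allowed_domains is what lets the OffsiteMiddleware drop
            # requests to other domains, so parse() never has to check them
            self.allowed_domains = [urlparse.urlparse(url).netloc]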

One open issue is that it apparently also downloads, and tries to return as
items, URLs whose MIME type is application/pdf. I did not change
DEFAULT_REQUEST_HEADERS, so I am a bit puzzled as to why this is happening.
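
If there is no cleaner way to prevent it, one workaround would be an early
bail-out in parse() for non-HTML responses (untested, and it assumes the
servers send a correct Content-Type header):

    def parse(self, response):
        # DEFAULT_REQUEST_HEADERS only sets the Accept header; servers can
        # ignore it and return a PDF anyway, so the response still reaches
        # parse(). Checking the response's Content-Type lets us skip
        # anything that is not HTML before handing it to BeautifulSoup.
        content_type = response.headers.get('Content-Type', '')
        if 'text/html' not in content_type:
            return
        # ... rest of parse() as above ...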
