OK, so in the end I just used a normal Spider.
For anyone wondering, this is my parse function now:
import re
import urlparse

from bs4 import BeautifulSoup
from scrapy import log
from scrapy.http import Request

def parse(self, response):
    pages_done = self.crawler.stats.get_value('downloader/response_count')
    pages_todo = (self.crawler.stats.get_value('scheduler/enqueued') -
                  self.crawler.stats.get_value('downloader/response_count'))
    log.msg("URL: %s (%s) Crawled %d pages. To Crawl: %d" %
            (self.start_urls[0], self.url_id, pages_done, pages_todo),
            spider=self)
    soup = BeautifulSoup(response.body, "html5lib")
    links = []
    # Collect candidate links from every (tag, attr) pair I was given.
    for tag in self.tags:
        for a in soup.find_all(tag):
            for attr in self.attrs:
                if attr in a.attrs:
                    href = a.attrs[attr]
                    if href.startswith("http"):
                        links.append(href)
                    href = urlparse.urljoin(response.url, href)
                    href_parts = urlparse.urlparse(
                        href.replace('\t', '').replace('\r', '')
                            .replace('\n', '').replace(' ', '+'))
                    if re.match(self.allow, href_parts.path):
                        yield Request(href)
    # Strip script/style elements so they don't pollute the extracted text.
    for script in soup(["script", "style"]):
        script.extract()
    item = DomainItem()
    item["url"] = response.url
    # Previously: re.sub(r'\s{2,}', ' ', remove_tags(' '.join(
    #     response.xpath('//body//text()').extract()))).strip()
    item["text"] = soup.get_text()
    item["links"] = links
    self.crawler.stats.inc_value('pages_crawled')
    yield item
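For completeness, DomainItem isn't shown above; a minimal definition with just
the fields assigned in parse() would be something like this (the real item may
have more fields):

from scrapy.item import Item, Field

class DomainItem(Item):
    # Only the fields assigned in parse() above.
    url = Field()
    text = Field()
    links = Field()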
I created this extension of Spider by passing it an extra "allow" parameter
that is used to check whether the URL path satisfies my constraint. I do not
check the domain, since that is handled automatically by Scrapy's standard
"allowed_domains" check. I also pass "tags" and "attrs", which are used in the
BeautifulSoup loop to make sure I capture every tag and attribute that might
contain a link worth following. In this way the Spider behaves very much like
a CrawlSpider.
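For context, the surrounding class looks roughly like this; the class name,
domain and defaults here are placeholders, not my actual values (the tag/attr
defaults mirror what CrawlSpider's link extractor uses):

from scrapy.spider import Spider

class DomainSpider(Spider):  # placeholder name
    name = "domain_spider"
    allowed_domains = ["example.com"]     # placeholder
    start_urls = ["http://example.com/"]  # placeholder

    def __init__(self, allow='.*', tags=('a', 'area'), attrs=('href',),
                 url_id=None, *args, **kwargs):
        super(DomainSpider, self).__init__(*args, **kwargs)
        self.allow = allow    # regex matched against the URL path in parse()
        self.tags = tags      # tags to scan for links
        self.attrs = attrs    # attributes that may hold a link
        self.url_id = url_id  # only used for logging in parse()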
One open issue is that it is apparently also downloading, and trying to return
as items, URLs whose MIME type is application/pdf. I did not change
DEFAULT_REQUEST_HEADERS, so I am a bit puzzled as to why this is happening.
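Until that's sorted out, one workaround would be to bail out on non-HTML
responses at the top of parse(); this is an untested sketch, not something
from the code above:

    content_type = response.headers.get('Content-Type', '')
    if 'text/html' not in content_type:
        # e.g. application/pdf: skip it rather than feed it to BeautifulSoup
        return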