Hi guys,
I have a spider that crawls thousands of posts. The requirement is that each
post must have a contact email. If the spider detects no valid email within
the post, it should discard the page and move on to the next queued page.
Here is the code:
    # this is the individual ad page
    def parse_an_ad(self, response):
        reply = re.search(r"/reply/.+/\d+", response.body)
        try:
            link = urlparse.urljoin(response.url, reply.group())
        except AttributeError:  # reply is None when no match was found
            # what to do here to tell the spider to discard the current
            # page and move on to the next queued page?
            pass
        hxs = Selector(response)
        post_title = hxs.xpath('//h2/text()').extract()[1].strip()
        description_list = hxs.xpath('//section[@id="postingbody"]/text()').extract()
        description = ''.join(description_list).strip()
        yield Request(link, callback=self.parse_reply_page,
                      meta={'post_title': post_title, 'link': response.url,
                            'description': description})
I tried:
* raise CloseSpider("no contact info is found") - this kills the whole
  spider, which I don't want
* raise IgnoreRequest() - this gives me "Spider error processing" because
  IgnoreRequest is meant to be raised from the scheduler or a downloader
  middleware, not from a spider callback
What do I do?
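Is a plain return from the callback enough? Here is a minimal sketch of that
idea (plain Python, with a generator standing in for the Scrapy callback; the
function signature and return values are illustrative, not real Scrapy API).
Since the callback is a generator, returning early simply yields no requests
or items, which should make the engine move on to the next queued page:

    import re

    # Stand-in for the Scrapy callback: a generator that yields follow-up
    # "requests". Returning early yields nothing, i.e. the page is dropped.
    def parse_an_ad(body, url):
        reply = re.search(r"/reply/.+/\d+", body)
        if reply is None:
            # No reply link found -> no contact email; discard this page.
            # In a real spider, a bare return here is enough: the engine
            # just proceeds to the next queued request.
            return
        yield (url, reply.group())

A page with a reply link then produces one request, and a page without one
produces an empty result instead of raising an exception.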
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.