Hi guys,
I have a spider that crawls thousands of posts. The requirement is that each
post must have a contact email. If the spider detects no valid email within
the post, it should discard the page and move on to the next queued page.
Here is the code:
    # this is the individual ad page
    def parse_an_ad(self, response):
        reply = re.search(r"/reply/.+/\d+", response.body)
        try:
            link = urlparse.urljoin(response.url, reply.group())
        except AttributeError:  # reply is None when no match was found
            # what to do here to tell the spider to discard the current
            # page and move on to the next queued page?
            pass
        hxs = Selector(response)
        post_title = hxs.xpath('//h2/text()').extract()[1].strip()
        description_list = hxs.xpath('//section[@id="postingbody"]/text()').extract()
        description = ''.join(description_list).strip()
        yield Request(link, callback=self.parse_reply_page,
                      meta={'post_title': post_title, 'link': response.url,
                            'description': description})
I tried:
* raise CloseSpider("no contact info is found") - this kills the whole
  spider, which I don't want
* raise IgnoreRequest() - this gives me "Spider error processing" because
  IgnoreRequest is meant to be raised from the scheduler or a downloader
  middleware, not from a spider callback
What do I do?
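Is a plain return from the callback enough? Here is a minimal sketch of that
idea (plain Python, with a generator standing in for the Scrapy callback; the
function signature and return values are illustrative, not real Scrapy API).
Since the callback is a generator, returning early simply yields no requests
or items, which should make the engine move on to the next queued page:

    import re

    # Stand-in for the Scrapy callback: a generator that yields follow-up
    # "requests". Returning early yields nothing, i.e. the page is dropped.
    def parse_an_ad(body, url):
        reply = re.search(r"/reply/.+/\d+", body)
        if reply is None:
            # No reply link found -> no contact email; discard this page.
            # In a real spider, a bare return here is enough: the engine
            # just proceeds to the next queued request.
            return
        yield (url, reply.group())

A page with a reply link then produces one request, and a page without one
produces an empty result instead of raising an exception.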
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.