Hello,
I'd like the ability to cancel spiders before they are finished, and
obviously there are many ways to accomplish this. For example, I can send a
SIGINT or SIGTERM to the spider process, and I see that the default signal
handler for those causes a "graceful shutdown" on the first signal received
and a more "forceful shutdown" on the second. Of course, I could use
scrapyd, but scrapyd seems to simply send a SIGTERM, so my following
question does not apply to scrapyd, I think.
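For concreteness, this is roughly how I deliver the signals while testing
(the PID here is a placeholder; in practice I look it up with ps):

import os
import signal

pid = 12345  # PID of the running "scrapy crawl" process (placeholder)

os.kill(pid, signal.SIGINT)   # first signal: starts the graceful shutdown
# Sending a second signal before shutdown completes forces the harder stop:
# os.kill(pid, signal.SIGINT)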
When the spider is cancelled with a "graceful shutdown", the behavior seems
to be as follows: whatever Request objects remain in the queue will be
completed (and their associated callbacks called), and only then will the
spider be closed and any registered handlers for the signals.spider_closed
event called. What I'm really looking for, however, is a faster "graceful
shutdown" whereby the queue is first emptied, no more Request callbacks are
executed, and the spider is closed "immediately." How can that be achieved?
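To illustrate the kind of thing I have in mind, here is a sketch (the names
are mine, and I don't know whether raising CloseSpider actually drops the
already-queued requests, which is exactly my question): install my own
SIGINT handler that just sets a flag, and have each callback bail out as
soon as the flag is set:

import signal
from scrapy.spider import Spider
from scrapy.exceptions import CloseSpider

class ShutdownAwareSpider(Spider):
    name = "shutdown_aware"
    start_urls = ["http://www.abc.com/"]

    def __init__(self, *args, **kwargs):
        super(ShutdownAwareSpider, self).__init__(*args, **kwargs)
        self.shutdown_requested = False
        # Replace Scrapy's default SIGINT handler with a flag-setter
        # (assuming Scrapy doesn't re-install its own handler later).
        signal.signal(signal.SIGINT, self._request_shutdown)

    def _request_shutdown(self, signum, frame):
        self.shutdown_requested = True

    def parse(self, response):
        if self.shutdown_requested:
            # Ask the engine to close the spider early; unclear to me
            # whether pending requests still get their callbacks run.
            raise CloseSpider("shutdown requested")
        # ... normal parsing would go here ...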
For example, note in the attached example that if a SIGINT is received
during the first parse() call (with 3 sleeps inserted so there's time to do
so in testing), the spider will be closed when that single parse() call
completes, as start_urls only contained 1 URL. However, at the end of the
first parse() call, I add 4 Request objects to the queue (either via the
"yield" technique or the "return list" technique), so if a SIGINT is
received after that first parse() completes, the spider will not be closed
until 4 more parse() calls complete, one for each Request added. Is there
any way to avoid this behavior, so the spider can be closed immediately,
without worrying about those 4 pending requests?
Thanks a bunch,
Drew
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
import time
import signal


class Spidey(Spider):
    name = "spidey"
    allowed_domains = ["abc.com"]
    start_urls = [
        "http://www.abc.com/"
    ]
    I = 0
    X = ["http://www.abc.com/shows/" + str(x)
         for x in ["black-box", "castle", "the-chew", "nashville"]]

    def __init__(self, *args, **kwargs):
        super(Spidey, self).__init__(*args, **kwargs)
        # Experiments with replacing Scrapy's default handlers:
        #signal.signal(signal.SIGTERM, self.yo)
        #signal.signal(signal.SIGINT, self.yo)
        dispatcher.connect(self.close, signals.spider_closed)

    def yo(self, signum, _):
        self.log("yoyo!")

    def close(self, spider, reason):
        self.log("close! spider[%s] reason[%s]" % (str(spider), str(reason)))

    def parse(self, response):
        # Sleep so there's time to send SIGINT while this callback runs.
        for i in range(3):
            self.log("hi there!")
            time.sleep(1)
        self.log("more requests please!!!")
        # On the first call only, queue 4 more requests.
        if self.I == 0:
            self.I = 1
            # The "yield" technique:
            for x in self.X:
                yield Request(x, callback=self.parse)
            # The "return list" technique:
            #return [Request(x) for x in self.X]
        #else:
        #    return []