Hello,
I'd like the ability to cancel spiders before they are finished, and
obviously there are many ways to accomplish this. For example, I can send a
SIGINT or SIGTERM to the spider process, and I see that the default signal
handler for those causes a "graceful shutdown" on the first signal received
and a more "forceful shutdown" on the second. Of course, I could use
scrapyd, but scrapyd seems to simply send a SIGTERM, so my following
question does not apply to scrapyd, I think.
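For concreteness, this is roughly how I deliver the signals while testing
(the PID here is a placeholder; in practice I look it up with ps):

import os
import signal

pid = 12345  # PID of the running "scrapy crawl" process (placeholder)

os.kill(pid, signal.SIGINT)   # first signal: starts the graceful shutdown
# Sending a second signal before shutdown completes forces the harder stop:
# os.kill(pid, signal.SIGINT)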
When the spider is cancelled with a "graceful shutdown", the behavior seems
to be as follows: whatever Request objects remain in the queue will be
completed (and their associated callbacks called), and only then will the
spider be closed and any registered handlers for the signals.spider_closed
event called. What I'm really looking for, however, is a faster "graceful
shutdown" whereby the queue is first emptied, no more Request callbacks are
executed, and the spider is closed "immediately." How can that be achieved?
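To illustrate the kind of thing I have in mind, here is a sketch (the names
are mine, and I don't know whether raising CloseSpider actually drops the
already-queued requests, which is exactly my question): install my own
SIGINT handler that just sets a flag, and have each callback bail out as
soon as the flag is set:

import signal
from scrapy.spider import Spider
from scrapy.exceptions import CloseSpider

class ShutdownAwareSpider(Spider):
    name = "shutdown_aware"
    start_urls = ["http://www.abc.com/"]

    def __init__(self, *args, **kwargs):
        super(ShutdownAwareSpider, self).__init__(*args, **kwargs)
        self.shutdown_requested = False
        # Replace Scrapy's default SIGINT handler with a flag-setter
        # (assuming Scrapy doesn't re-install its own handler later).
        signal.signal(signal.SIGINT, self._request_shutdown)

    def _request_shutdown(self, signum, frame):
        self.shutdown_requested = True

    def parse(self, response):
        if self.shutdown_requested:
            # Ask the engine to close the spider early; unclear to me
            # whether pending requests still get their callbacks run.
            raise CloseSpider("shutdown requested")
        # ... normal parsing would go here ...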
For example, note in the attached example that if a SIGINT is received
during the first parse() call (with 3 sleeps inserted so there's time to do
so in testing), the spider will be closed when that single parse() call
completes, as start_urls only contained 1 URL. However, at the end of the
first parse() call, I add 4 Request objects to the queue (either via the
"yield" technique or the "return list" technique), so if a SIGINT is
received after that first parse() completes, the spider will not be closed
until 4 more parse() calls complete, one for each Request added. Is there
any way to avoid this behavior, so the spider can be closed immediately,
without worrying about those 4 pending requests?
Thanks a bunch,
Drew
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
import time
import signal


class Spidey(Spider):
    name = "spidey"
    allowed_domains = ["abc.com"]
    start_urls = [
        "http://www.abc.com/"
    ]
    I = 0
    X = ["http://www.abc.com/shows/" + str(x)
         for x in ["black-box", "castle", "the-chew", "nashville"]]

    def __init__(self, *args, **kwargs):
        super(Spidey, self).__init__(*args, **kwargs)
        # Experiments with replacing Scrapy's default handlers:
        #signal.signal(signal.SIGTERM, self.yo)
        #signal.signal(signal.SIGINT, self.yo)
        dispatcher.connect(self.close, signals.spider_closed)

    def yo(self, signum, _):
        self.log("yoyo!")

    def close(self, spider, reason):
        self.log("close! spider[%s] reason[%s]" % (str(spider), str(reason)))

    def parse(self, response):
        # Sleep so there's time to send SIGINT while this callback runs.
        for i in range(3):
            self.log("hi there!")
            time.sleep(1)
        self.log("more requests please!!!")
        # On the first call only, queue 4 more requests.
        if self.I == 0:
            self.I = 1
            # The "yield" technique:
            for x in self.X:
                yield Request(x, callback=self.parse)
            # The "return list" technique:
            #return [Request(x) for x in self.X]
        #else:
        #    return []