OK, I realized the behavior I was looking for can be accomplished by using a
DownloaderMiddleware and overriding the default SIGINT and SIGTERM handlers
(see the attached code). My only remaining questions are:
1) Is there a better way to do this that doesn't override the POSIX signal
handlers?
2) Is this (overriding the POSIX signal handlers) definitely safe?
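
For reference, the crux of the attached code is saving the previously
installed handler and chaining to it after setting a flag. A minimal
standalone sketch (the callable() guard is my addition, to cover the case
where the saved handler is SIG_DFL or SIG_IGN rather than a function):

    import signal

    class GracefulFlag(object):
        def __init__(self):
            self.closing = 0
            # signal.signal() returns whatever handler was installed
            # before us (normally Scrapy/Twisted's shutdown handler)
            self._orig = signal.signal(signal.SIGINT, self._on_sigint)

        def _on_sigint(self, signum, frame):
            self.closing = 1  # checked by the middleware
            # chain to the saved handler so the normal graceful
            # shutdown still kicks in
            if callable(self._orig):
                self._orig(signum, frame)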
On Monday, May 12, 2014 2:10:52 PM UTC-7, drew wrote:
>
> Hello,
>
> I'd like the ability to cancel spiders before they are finished, and
> obviously there are many ways to accomplish this. For example, I can send
> a SIGINT or SIGTERM to the spider process, and the default handler for
> those signals performs a "graceful shutdown" on the first signal received
> and a more "forceful shutdown" on the second. Of course, I could use
> scrapyd instead, but scrapyd seems to simply send a SIGTERM, so I think my
> question below does not apply to scrapyd.
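>
> (For testing, I send the signals from a second Python process with a small
> helper along these lines; the pid comes from ps, and this is just plain
> os.kill, nothing Scrapy-specific:)
>
>     import os
>     import signal
>     import time
>
>     def cancel_spider(pid, force=False):
>         os.kill(pid, signal.SIGINT)        # 1st signal: graceful shutdown
>         if force:
>             time.sleep(1)
>             os.kill(pid, signal.SIGINT)    # 2nd signal: forceful shutdown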
>
> When the spider is cancelled with a "graceful shutdown", the behavior
> seems to be as follows: whatever Request objects remain in the queue are
> completed (and their callbacks called), and only then is the spider closed
> and any handlers registered for the signals.spider_closed event called.
> What I'm really looking for, however, is a faster "graceful shutdown" in
> which the queue is emptied first, no more Request callbacks are executed,
> and the spider is closed "immediately". How can that be achieved?
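>
> (For what it's worth, raising scrapy.exceptions.CloseSpider from a
> callback does close the spider, but as far as I can tell it goes through
> the same graceful path, so the already-queued requests still run, e.g.:)
>
>     from scrapy.exceptions import CloseSpider
>
>     def parse(self, response):
>         if self.closing:  # flag set elsewhere, e.g. by a signal handler
>             raise CloseSpider('cancelled')
>         # ... normal parsing ...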
>
> For example, note how in the attached example, if a SIGINT is received
> during the first parse() call (with 3 sleeps inserted so there's time to
> do so in testing), the spider is closed when that single parse() call
> completes, since start_urls contains only 1 URL. However, at the end of
> the first parse() call I add 4 Request objects to the queue (either via
> the "yield technique" or the "return list technique"), so if a SIGINT is
> received after that first parse() completes, the spider is not closed
> until 4 more parse() calls complete, one for each Request added. Is there
> any way to avoid this behavior, so the spider can be closed immediately,
> without waiting on those 4 pending requests?
>
> Thanks a bunch,
> Drew
>
>
>
from scrapy.spider import Spider
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
import time
import signal


class Spidey(Spider):
    name = "spidey"
    allowed_domains = ["abc.go.com"]
    start_urls = [
        "http://abc.go.com/"
    ]
    I = 0
    X = ["http://abc.go.com/shows/" + x
         for x in ["black-box", "castle", "the-chew", "nashville"]]
    closing = 0

    def __init__(self, *args, **kwargs):
        super(Spidey, self).__init__(*args, **kwargs)
        # signal.signal() returns the previously installed handler; save
        # it under a *different* name than the method, otherwise the
        # instance attribute shadows the bound method.
        self._orig_sigterm = signal.signal(signal.SIGTERM, self.on_sigterm)
        self._orig_sigint = signal.signal(signal.SIGINT, self.on_sigint)
        dispatcher.connect(self.close, signals.spider_closed)

    def on_sigint(self, signum, frame):
        self.log('got SIGINT !!!')
        self.closing = 1
        # Chain to the saved handler so Scrapy's normal graceful shutdown
        # still runs; guard against SIG_DFL/SIG_IGN, which are not callable.
        if callable(self._orig_sigint):
            self._orig_sigint(signum, frame)

    def on_sigterm(self, signum, frame):
        self.log('got SIGTERM !!!')
        self.closing = 1
        if callable(self._orig_sigterm):
            self._orig_sigterm(signum, frame)

    def close(self, spider, reason):
        self.log("close! spider[%s] reason[%s]" % (spider, reason))

    def parse(self, response):
        # Sleep a few seconds so there is time to send a signal while the
        # first response is being parsed.
        for i in range(3):
            self.log("hi there!")
            time.sleep(1)
        self.log("more requests please!!!")
        if self.I == 0:
            self.I = 1
            for x in self.X:
                yield Request(x, callback=self.parse)
            # equivalently: return [Request(x) for x in self.X]


from scrapy.exceptions import IgnoreRequest


class CancelMiddleware(object):
    """Downloader middleware that drops all remaining traffic once the
    spider's 'closing' flag has been set by a signal handler."""

    def process_request(self, request, spider):
        spider.log("Middleware.process_request(): closing[%d]" % spider.closing)
        if spider.closing:
            raise IgnoreRequest()

    def process_response(self, request, response, spider):
        spider.log("Middleware.process_response(): closing[%d]" % spider.closing)
        if spider.closing:
            raise IgnoreRequest()
        return response
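
To enable the middleware, it has to be listed in the project settings; the
module path below is a placeholder for wherever CancelMiddleware actually
lives:

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CancelMiddleware': 543,
    }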