I wrote a pipeline to simulate using Twisted adbapi with a single
connection in the connection pool and a ridiculously slow db that just
blocks. runInteraction just calls a blocking function that prints a countdown,
ticking down between sleeps (with the amount of delay depending on which item
we're at):
import logging
import time

from twisted.enterprise import adbapi

log = logging.getLogger(__name__)


class MySQL_adbapi_delay(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool
        self.icnt = 0

    def close_spider(self, spider):
        """Cleanup function, called after crawling has finished to close
        open objects. Close the ConnectionPool."""
        spider.log('Closing connection pool...')
        self.dbpool.close()

    # See scrapy/middleware.py. If the pipeline class has from_crawler or
    # from_settings, construction via one of these is preferred.
    @classmethod
    def from_settings(cls, scrapy_settings):
        dbargs = dict(
            host=scrapy_settings['DB_HOST'],
            db=scrapy_settings['DB_NAME'],
            user=scrapy_settings['DB_USER'],
            passwd=scrapy_settings['DB_PWD'],
            charset='utf8',
            use_unicode=True,
        )
        # cp_max controls the number of connections in the pool
        dbpool = adbapi.ConnectionPool('MySQLdb', cp_max=1, **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Run the db query in the thread pool
        d = self.dbpool.runInteraction(self._do_blocking, item, spider)
        d.addErrback(self._handle_error, item, spider)
        # At the end return the item, in case of success or failure
        d.addBoth(lambda _: item)
        # Return the deferred instead of the item. This makes the engine
        # process the next item (according to the CONCURRENT_ITEMS setting)
        # only after this operation (deferred) has finished.
        return d

    def _do_blocking(self, curs, item, spider):
        log.error('Begin block with delay... ')
        if self.icnt == 0:
            # Big blocking delay on the first item
            BOOM = 25
        else:
            # Small delay for subsequent items
            BOOM = 1
        while BOOM > 0:
            time.sleep(2)
            log.error('BOOM:%i' % BOOM)
            BOOM -= 1
        self.icnt += 1
        log.error('Finished delay... ')

    def _handle_error(self, failure, item, spider):
        """Handle an error that occurred during the db interaction."""
        # Do nothing, just log it
        log.error(failure)
        log.error('The item that failed to be written was item: %s' % item)
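For completeness, the pipeline is wired up in settings.py roughly like this
(the module path and db credentials here are placeholders for my real ones):

ITEM_PIPELINES = {'myproject.pipelines.MySQL_adbapi_delay': 300}

DB_HOST = 'localhost'
DB_NAME = 'scraped'
DB_USER = 'scrapy'
DB_PWD = 'secret'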
The way I understand `runInteraction` is that it will do blocking database
operations in separate threads, which trigger callbacks in the originating
thread when they complete. In the meantime, the original thread can
continue doing normal work, like servicing other requests. So despite this
blocking code within the runInteraction, Scrapy should be able to continue
grabbing requests, feeding their responses to spider callbacks which then
load items.
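To sanity-check that mental model outside Scrapy, here is a minimal
standalone sketch of what I'm assuming (it uses sqlite3 in place of MySQLdb
purely so it runs without a db server): the interaction sleeps in one of the
pool's worker threads while a LoopingCall keeps firing in the reactor thread.

import threading
import time

from twisted.enterprise import adbapi
from twisted.internet import reactor, task


def interaction(cursor):
    # Runs in one of the ConnectionPool's worker threads; sleeping here
    # should only block that worker thread, not the reactor thread.
    print('interaction running in: %s' % threading.current_thread().name)
    time.sleep(5)


def tick():
    # Runs in the reactor (main) thread; should keep printing while the
    # interaction sleeps in its worker thread.
    print('reactor still ticking in: %s' % threading.current_thread().name)


pool = adbapi.ConnectionPool('sqlite3', ':memory:', check_same_thread=False,
                             cp_min=1, cp_max=1)
d = pool.runInteraction(interaction)
d.addBoth(lambda _: reactor.stop())
task.LoopingCall(tick).start(1.0)
reactor.run()

That kind of interleaving (the ticks carrying on while the interaction
sleeps) is what I expected from the pipeline above.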
However, I tested this with two different spiders. In the first, what I
expect to happen happens: Scrapy requests are interleaved between printouts
of the BOOM countdown throughout the crawl, as Twisted flips between the
asynchronous tasks when it sees one or the other isn't blocking. When the
output eventually pauses, telnetting in shows the same number of Requests
as Items and HtmlResponses... Scrapy has done all the other tasks it can
do, and now must wait for the blocking db thread. Great.
In the second, when the output of the crawler stops, there are 200 requests
(correct) but only 28 of them have been converted into responses and items,
and the spider is clearly hanging. The artificial timer draws all the
focus, and the countdown runs down sequentially ("boom: 20, boom: 19, ...,
boom: 1"). Only after the pipeline thread completes its task does Scrapy
go out and fetch any other requests.
Why does this happen? Why does Twisted lock onto the blocking thread when
there are requests still to be processed that it could get on with? What
could be the reason this happens in one spider and not another?
(I find exactly the same thing happens with my real slow db, btw.)