I wrote a pipeline to simulate using Twisted adbapi with a single connection 
in the connection pool and a ridiculously slow db that just blocks. 
runInteraction just calls a blocking function that prints a countdown between 
sleeps (with the length of the delay depending on which item we're at):

import time
import logging as log  # assuming `log` is a stdlib logger here; use your own logging setup

from twisted.enterprise import adbapi


class MySQL_adbapi_delay(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool
        self.icnt = 0

    def close_spider(self, spider):
        """Cleanup function, called after crawling has finished to close
        open objects: close the ConnectionPool."""
        spider.log('Closing connection pool...')
        self.dbpool.close()

    # See scrapy/middleware.py: if a pipeline class defines from_crawler
    # or from_settings, construction through those classmethods is preferred.
    @classmethod
    def from_settings(cls, scrapy_settings):
        dbargs = dict(
            host=scrapy_settings['DB_HOST'],
            db=scrapy_settings['DB_NAME'],
            user=scrapy_settings['DB_USER'],
            passwd=scrapy_settings['DB_PWD'],
            charset='utf8',
            use_unicode=True,
        )
        # cp_max caps the number of connections in the pool
        dbpool = adbapi.ConnectionPool('MySQLdb', cp_max=1, **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Run db query in the thread pool
        d = self.dbpool.runInteraction(self._do_blocking, item, spider)
        d.addErrback(self._handle_error, item, spider)
        # At the end return the item, whether we succeeded or failed
        d.addBoth(lambda _: item)
        # Return the deferred instead of the item. This makes the engine
        # process the next item (according to the CONCURRENT_ITEMS setting)
        # only after this operation (deferred) has finished.
        return d

    def _do_blocking(self, curs, item, spider):
        log.error('Begin block with delay...')
        if self.icnt == 0:
            # Big delay on the first item
            BOOM = 25
        else:
            # Small delay for subsequent items
            BOOM = 1
        while BOOM > 0:
            time.sleep(2)
            log.error('BOOM:%i' % BOOM)
            BOOM -= 1
        self.icnt += 1
        log.error('Finished delay...')

    def _handle_error(self, failure, item, spider):
        """Handle an error that occurred during the db interaction."""
        # do nothing, just log
        log.error(failure)
        log.error('The item that failed to be written was item: %s' % item)
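
For reference, this is roughly how it's wired up in settings.py (the DB_* 
names are just what from_settings above reads, the values are placeholders, 
and the pipeline path is whatever your project uses; note that ITEM_PIPELINES 
is a dict in recent Scrapy versions and a list in older ones):

# settings.py (placeholder values)
DB_HOST = 'localhost'
DB_NAME = 'mydb'
DB_USER = 'myuser'
DB_PWD = 'mypassword'

ITEM_PIPELINES = {'myproject.pipelines.MySQL_adbapi_delay': 300}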



The way I understand `runInteraction` is that it will do blocking database 
operations in separate threads, which trigger callbacks in the originating 
thread when they complete. In the meantime, the original thread can carry on 
doing normal work, like servicing other requests. So despite the blocking 
code inside the runInteraction call, Scrapy should be able to keep grabbing 
requests and feeding their responses to spider callbacks, which then load 
items.
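
To check that mental model outside Scrapy, here's a minimal standalone sketch 
(sqlite3 instead of MySQLdb so it runs without a server; the table, timings 
and messages are made up for the demo). A LoopingCall fires in the reactor 
thread while the interaction sleeps in a pool thread:

import time

from twisted.enterprise import adbapi
from twisted.internet import reactor, task

# sqlite3 needs check_same_thread=False because the pool uses worker threads;
# cp_max=1 mirrors the single-connection pool above.
dbpool = adbapi.ConnectionPool('sqlite3', ':memory:',
                               check_same_thread=False, cp_max=1)

def slow_interaction(curs):
    # Blocking work: runs in a pool thread, not in the reactor thread.
    curs.execute('CREATE TABLE demo (n INTEGER)')
    time.sleep(5)
    curs.execute('INSERT INTO demo VALUES (1)')

def tick():
    # Fires every second from the reactor thread; if my mental model is
    # right, this keeps printing while slow_interaction() sleeps.
    print('reactor still responsive')

def done(_):
    ticker.stop()
    dbpool.close()
    reactor.stop()

ticker = task.LoopingCall(tick)
ticker.start(1.0)

d = dbpool.runInteraction(slow_interaction)
d.addBoth(done)

reactor.run()

That interleaving (ticks printing during the sleep) is exactly what I see 
with the first spider but not the second.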

I tested this with 2 different spiders, however. In the first, what I expect 
to happen happens: scrapy requests are interleaved with printouts of the BOOM 
countdown throughout the crawl, as twisted flips between the asynchronous 
tasks whenever it sees that one or the other isn't blocking. When the output 
eventually pauses, telnetting in shows the same number of Requests as Items 
and HtmlResponses... Scrapy has done all the other tasks it can do, and now 
must wait for the blocking db thread. Great.

In the second, when the output of the crawler stops, there are 200 requests 
(correct) but only 28 of them have been converted into responses and items, 
and the spider is clearly hanging. The artificial timer draws all the focus, 
and the countdown runs down sequentially ("BOOM: 20, BOOM: 19, ..., BOOM: 1"). 
Only after the pipeline thread completes its task does scrapy go out and 
fetch any other requests.

Why does this happen? Why does twisted lock in to the blocking thread when 
there are requests still to be processed that it could get on with? What 
could be the reason this happens in one spider and not another? 

(I find this happens exactly the same with my real slow db btw)

 
