There is actually a simpler way to simulate this kind of blocking:
import logging
import time

from twisted.internet import threads

logger = logging.getLogger(__name__)


class ThreadBlocker(object):
    """
    runInteraction starts a new thread for each interaction, so you
    can simulate a limited number of connection pools using this
    class together with REACTOR_THREADPOOL_MAXSIZE = 1, say.
    """

    def __init__(self):
        self.icnt = 0

    def blocker(self, item, delay):
        logger.error('begin blocking for: %s' % delay)
        if self.icnt == 0:
            # Big initial delay for the first item
            BOOM = delay
        else:
            # Small delay for subsequent items
            BOOM = 2
        # Sleep in 2-second steps so the countdown shows up in the log
        while BOOM > 0:
            time.sleep(2)
            logger.error('BOOM:%i' % BOOM)
            BOOM -= 1
        self.icnt += 1
        return True

    def finished(self, result):
        logger.error('success finished: %s' % result)

    def process_item(self, item, spider):
        # Run the blocking call in the reactor thread pool; the
        # deferred fires when it completes.
        d1 = threads.deferToThread(self.blocker, item, 50)
        d1.addCallback(self.finished)
        # Always hand the item back to Scrapy, success or failure.
        d1.addBoth(lambda _: item)
        return d1
Use this as the pipeline and set REACTOR_THREADPOOL_MAXSIZE = 1 (or
whatever db connection pool size you normally have). I see the same
results regardless of which spider I run, as long as each one has
enough items to scrape.
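For concreteness, here is a minimal settings.py sketch wiring this up;
the 'myproject.pipelines' module path is a placeholder for wherever
ThreadBlocker actually lives in your project:

    # settings.py (sketch; adjust the module path to your project)
    ITEM_PIPELINES = {
        'myproject.pipelines.ThreadBlocker': 300,
    }
    # Cap the reactor thread pool so only one blocker() call runs at
    # a time, mimicking a single db connection.
    REACTOR_THREADPOOL_MAXSIZE = 1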
Scrapy seems to behave in the following way: the work of processing
new requests to get new responses and items is interleaved with the
blocking pipeline, but only up to a point. Once there are around 30 or
40 items/HTML responses pending, it stops processing new requests and
attention goes entirely to the blocking pipeline. This seems to
continue until those items are processed, at which point Twisted
appears to go back to dividing time between processing new requests
for new responses/items and the blocking pipelines.
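If you want to probe where that threshold comes from, these are the
stock Scrapy concurrency settings I would experiment with; the values
shown are just the documented defaults, not a claim about what
triggers the stall:

    # settings.py (defaults shown for reference)
    CONCURRENT_REQUESTS = 16   # max requests the downloader handles at once
    CONCURRENT_ITEMS = 100     # max items processed concurrently per response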
Does this seem correct?