I'm hoping to write a program that reads any number of URLs from stdin (one per line), downloads them, and processes them. So far my script (below) works well for small numbers of URLs. However, it does not scale to more than about 200 URLs, because it issues HTTP requests for all of the URLs simultaneously, and terminates after 25 seconds. Ideally, I'd like this script to download at most 50 pages in parallel, and to time out only when an individual HTTP request is not answered within 3 seconds. What changes do I need to make?
Is Twisted the best library for me to be using? I do like Twisted, but it seems more suited to batch-mode operations. Is there some way that I could continue registering URL requests while the reactor is running? Is there a way to specify a timeout per page request, rather than for a batch of page requests? Thanks!

#-------------------------------------------------
from twisted.internet import reactor
from twisted.web import client
import re, urllib, sys, time

def extract(html):
    # do some processing on html, writing to stdout
    pass

def printError(failure):
    print >> sys.stderr, "Error:", failure.getErrorMessage()

def stopReactor():
    print "Now stopping reactor..."
    reactor.stop()

for url in sys.stdin:
    url = url.rstrip()
    client.getPage(url).addCallback(extract).addErrback(printError)

reactor.callLater(25, stopReactor)
reactor.run()
--
http://mail.python.org/mailman/listinfo/python-list
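For comparison, here is a minimal sketch of the desired behavior (at most 50 pages in flight, a 3-second timeout per individual request) using only the Python 3 standard library rather than Twisted. The names `fetch` and `download_all` are made up for this sketch, not part of the original script:

```python
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL = 50   # at most 50 requests in flight at once
TIMEOUT = 3         # seconds allowed per individual HTTP request

def fetch(url):
    # The timeout applies to this single request, not to the whole batch.
    with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
        return resp.read()

def download_all(urls, process, fetch=fetch, max_parallel=MAX_PARALLEL):
    """Download urls with bounded parallelism; call process(page) per page."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # The pool never runs more than max_parallel downloads at once;
        # remaining URLs simply wait for a free worker.
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                process(fut.result())
            except Exception as e:
                # A slow page times out on its own, without killing the rest.
                print("Error:", url, e, file=sys.stderr)

# usage with the original script's input, e.g.:
#   urls = [line.rstrip() for line in sys.stdin if line.strip()]
#   download_all(urls, process=extract)
```

Because `fetch` is passed in as a parameter, the download step can be swapped out (or stubbed for testing) without touching the concurrency logic.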