> On Aug 6, 2016, at 03:48, Randomcoder <randomcod...@gmail.com> wrote:
> 
> Hello,
> 
> I've been working on a small Twisted program.

Cool, thanks for using Twisted.

> The program makes HTTP requests to a large number of feeds.
> Twisted is used to speed up the entire process.
> After the feeds are fetched, they're parsed. Finally they should be
> written to a database (to simplify the code, that part is left out).

Thanks for including examples, so we know exactly what you're talking about! :)

> Feeds are fetched in parallel using gatherResults, and a batch is
> built. Then all batches are again gathered into a set of batches,
> a DeferredList is built out of those. One semaphore controls the
> batch-level list of Deferreds, and another controls the Deferred for
> the entire list of batches.
> 
> Currently, the program works ok on 100-150 feeds, and BATCH_SIZE between
> 5 and 20.

This all seems pretty reasonable and in line with best practices.

> However, I notice the program starts to hang for a long time, when the
> number of feeds goes over 150-200.

Two key questions: what do you mean by "hang" and what is "a long time"?  Do 
you mean it's totally unresponsive, or do you mean it's just failing to make 
progress on downloading more feeds?

> 
> To be more precise, at the end of running the program, messages
> like these are printed, but the program seems to not be very active:
> 
>    Stopping factory <twisted.web.client._HTTP11ClientFactory instance at 0x7f0b7d5f3908>
> 
> It seems like this is the cleanup phase.

This just means that it is finished making connections.  We need to clean up 
these log messages so they're more useful, sorry :-\.

> I've read what I could find on the topic. I wasn't able to make progress
> on it, so I'm posting to the mailing list to ask if someone has encountered
> this before. Maybe it's a common pitfall or issue that other people have also
> bumped into.

Right now, my guess is this: some of the sites you're contacting have very slow 
proxies, or for some other reason let you connect to them but then hang when 
sent requests.  If you're simultaneously requesting stuff from a very large 
number of different sites, this is bound to happen eventually, whether because 
of network problems or issues with the sites themselves.  I suspect you thought 
that the connectTimeout argument to Agent would save you from this, but that 
timeout only covers making the initial underlying TCP connection, not receiving 
a full response.  What you actually want to do is cancel the Deferred returned 
by Agent.request if it takes too long.

Luckily, treq (https://treq.readthedocs.io/en/latest/) already implements this 
high-level timeout functionality for you, in the form of the 'timeout=' 
argument it accepts.  If you give that a try, do you see more connections 
timing out as it runs, rather than "hanging" the process for long periods of 
time?

As long as I'm looking at your code, as a way of thanking you for providing 
such a nice specific runnable example, I have a few other random thoughts which 
may be useful to you:

- I see you're importing psycopg.  Do you know about txpostgres 
(https://txpostgres.readthedocs.io/en/latest/)?  You can talk to postgres 
asynchronously with Twisted.
- "d.addCallback(lambda out: out).addCallback(lambda resp: 
client.readBody(resp))" can be much more briefly spelled 
"d.addCallback(client.readBody)".  "d.addErrback(lambda err: err)" does nothing 
and can just be removed.
- BrowserLikePolicyForHTTPS() is the default, so you don't need to pass that.
- clean_up_and_exit will only be called if batchesDef doesn't fail, and if it 
does fail, the exception message will be swallowed.  Rather than manually 
calling `reactor.stop`, you probably want to use react(): 
https://twistedmatrix.com/documents/16.3.0/api/twisted.internet.task.html#react
  This way your function is an API that anyone who wants to use it can call - 
it just returns a Deferred that fires when it's done - but your __main__ block 
calls react(), which will both start and stop the reactor, and will report 
errors if there's a problem while still shutting down.

Hope some of that code review is helpful - let us know if the treq timeout 
solves the problem or if the issue is somewhere else!

-glyph
_______________________________________________
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
