On 4/27/2012 6:25 PM, Adam Skutt wrote:
> On Apr 27, 2:54 pm, John Nagle <na...@animats.com> wrote:
>>      I have a multi-threaded CPython program, which has up to four
>> threads.  One thread is simply a wait loop monitoring the other
>> three and waiting for them to finish, so it can give them more
>> work to do.  When the work threads, which read web pages and
>> then parse them, are compute-bound, I've had the monitoring thread
>> starved of CPU time for as long as 120 seconds.
>
> How exactly are you determining that this is the case?

   Found the problem.  The threads, after doing their compute-intensive
work of examining pages, stored some URLs they'd found.
The code that stored them looked them up with "getaddrinfo()", and
did this while holding a lock.  On CentOS, "getaddrinfo()" at the
glibc level doesn't always cache locally (ref
https://bugzilla.redhat.com/show_bug.cgi?id=576801).  Python
doesn't cache either.  So huge numbers of DNS requests were being
made.  For some pages being scanned, many of the domains required
accessing a rather slow DNS server.  The combination of thousands
of instances of the same domain, a slow DNS server, and no caching
slowed the crawler down severely.

   Added a local cache in the program to prevent this.
Performance much improved.
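
   For illustration, a minimal sketch of what such a local cache
could look like.  This is not the crawler's actual code: the class
name DNSCache, the ttl default, and the injectable resolver
parameter are all assumptions made for the example.  The lock
guards only the dictionary; the slow "getaddrinfo()" call runs
with the lock released, so one slow DNS server cannot stall the
other threads.

```python
import socket
import threading
import time


class DNSCache:
    """Thread-safe cache of getaddrinfo() results with a simple TTL.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, resolver=socket.getaddrinfo, ttl=300.0):
        self._resolver = resolver      # injectable so it can be faked in tests
        self._ttl = ttl                # seconds an entry stays valid (assumed value)
        self._lock = threading.Lock()  # protects _entries only, never held during DNS
        self._entries = {}             # (host, port) -> (expiry_time, result)

    def lookup(self, host, port=80):
        key = (host.lower(), port)     # host names are case-insensitive
        now = time.monotonic()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]        # fresh cached answer
        # Slow path: resolve with the lock released.  Two threads may
        # occasionally resolve the same name at once; that is harmless.
        result = self._resolver(host, port)
        with self._lock:
            self._entries[key] = (now + self._ttl, result)
        return result
```

Usage would be along the lines of creating one shared
cache = DNSCache() and calling cache.lookup("example.com", 80)
from the worker threads.  Failed lookups raise as usual and are not
cached here; whether to cache failures as well is a separate choice.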

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
