On 4/27/2012 6:25 PM, Adam Skutt wrote:
> On Apr 27, 2:54 pm, John Nagle <[email protected]> wrote:
>> I have a multi-threaded CPython program, which has up to four threads.
>> One thread is simply a wait loop monitoring the other three and waiting
>> for them to finish, so it can give them more work to do. When the work
>> threads, which read web pages and then parse them, are compute-bound,
>> I've had the monitoring thread starved of CPU time for as long as 120
>> seconds.
>
> How exactly are you determining that this is the case?
Found the problem. The threads, after doing their compute-intensive work of examining pages, stored some URLs they'd found. The code that stored them looked them up with "getaddrinfo()", and did this while a lock was held. On CentOS, "getaddrinfo()" at the glibc level doesn't always cache locally (ref https://bugzilla.redhat.com/show_bug.cgi?id=576801). Python doesn't cache either. So huge numbers of DNS requests were being made.

For some pages being scanned, many of the domains required accessing a rather slow DNS server. The combination of thousands of lookups of the same domain, a slow DNS server, and no caching slowed the crawler down severely.

Added a local cache in the program to prevent this. Performance much improved.

John Nagle

--
http://mail.python.org/mailman/listinfo/python-list
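The fix described above can be sketched roughly as follows. This is a hypothetical illustration, not John's actual code: a small thread-safe cache in front of socket.getaddrinfo(), written so that the lock is never held across the blocking DNS query itself (the combination that caused the starvation):

```python
import socket
import threading

class DNSCache:
    """Thread-safe local cache for getaddrinfo() results (illustrative sketch)."""

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def getaddrinfo(self, host, port):
        key = (host, port)
        # Check the cache under the lock, but release the lock before
        # the (potentially slow) DNS query so other threads aren't blocked.
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        result = socket.getaddrinfo(host, port)  # blocking call, lock NOT held
        with self._lock:
            self._cache[key] = result
        return result
```

Two threads resolving the same uncached host may race and both query DNS, but for a cache that is harmless: one result simply overwrites the other. The important property is that thousands of lookups of the same domain collapse into one slow query plus cheap dictionary hits.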
