Re: CPython thread starvation
In article <7xipgj8vxh@ruckus.brouhaha.com>, Paul Rubin wrote:
> Roy Smith writes:
> > I agree that application-level name cacheing is "wrong", but sometimes
> > doing it the wrong way just makes sense. I could whip up a simple
> > cacheing wrapper around getaddrinfo() in 5 minutes. Depending on the
> > environment (both technology and bureaucracy), getting a cacheing
> > nameserver installed might take anywhere from 5 minutes to a few days to ...
>
> IMHO this really isn't one of those times. The in-app wrapper would
> only be usable to just that process, and we already know that the OP has
> multiple processes running the same app on the same machine. They would
> benefit from being able to share the cache, so now your wrapper gets
> more complicated.

So, use memcache. Trivial to set up, easy Python integration, and it
has the expiration mechanism built in. Not to mention it has a really
cute web site (http://memcached.org/).

> Also, since this is a production application, doing something in 5
> minutes is less important than making it solid and configurable.

Maybe. On the other hand, the time you save with a 5 minute solution
can be spent solving other, harder, problems.
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On 4/28/2012 1:04 PM, Paul Rubin wrote:
> Roy Smith writes:
>> I agree that application-level name cacheing is "wrong", but sometimes
>> doing it the wrong way just makes sense. I could whip up a simple
>> cacheing wrapper around getaddrinfo() in 5 minutes. Depending on the
>> environment (both technology and bureaucracy), getting a cacheing
>> nameserver installed might take anywhere from 5 minutes to a few days to ...
>
> IMHO this really isn't one of those times. The in-app wrapper would
> only be usable to just that process, and we already know that the OP has
> multiple processes running the same app on the same machine. They would
> benefit from being able to share the cache, so now your wrapper gets
> more complicated. If it's not a nameserver then it's something that
> fills in for one.
>
> And then, since the application appears to be a large scale web
> spider, it probably wants to run on a cluster, and the cache should be
> shared across all the machines. So you really probably want an
> industrial strength nameserver with a big persistent cache, and maybe
> a smaller local cache because of high locality when crawling specific
> sites, etc.

Each process is analyzing one web site, and has its own cache. Once
the site is analyzed, which usually takes about a minute, the cache
disappears. Multiple threads are reading multiple pages from the web
site during that time. A local cache is enough to fix the huge
overhead problem of doing a DNS lookup for every link found. One site
with a vast number of links took over 10 hours to analyze before this
fix; now it takes about four minutes. That solved the problem.

We can probably get an additional minor performance boost with a real
local DNS daemon, and will probably configure one. We recently changed
servers from Red Hat to CentOS, and management from CPanel to Webmin.
Before the change, we had a local DNS daemon with cacheing, so we
didn't have this problem. Webmin's defaults tend to be on the minimal
side.
The DNS information is used mostly to help decide whether two URLs
actually point to the same IP address, as part of deciding whether a
link is on-site or off-site. Most of those links will never be read.
We're not crawling the entire site, just looking at likely pages to
find the name and address of the business behind the site. (It's part
of our "Know who you're dealing with" system, SiteTruth.)

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
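[Editor's note: Nagle's same-IP check can be sketched roughly as below. This is a minimal illustration in modern Python; the function names `resolved_ips` and `same_site` are mine, not from SiteTruth's actual code.]

```python
import socket
from urllib.parse import urlparse

def resolved_ips(url):
    """Return the set of IP addresses a URL's hostname resolves to."""
    host = urlparse(url).hostname
    infos = socket.getaddrinfo(host, None)
    # Each getaddrinfo() entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return {info[4][0] for info in infos}

def same_site(url_a, url_b):
    """Treat two URLs as 'same site' if their hosts share any resolved IP."""
    return bool(resolved_ips(url_a) & resolved_ips(url_b))
```

Without a cache, every call here hits the resolver, which is exactly the overhead the thread is about.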
Re: CPython thread starvation
Roy Smith writes:
> I agree that application-level name cacheing is "wrong", but sometimes
> doing it the wrong way just makes sense. I could whip up a simple
> cacheing wrapper around getaddrinfo() in 5 minutes. Depending on the
> environment (both technology and bureaucracy), getting a cacheing
> nameserver installed might take anywhere from 5 minutes to a few days to ...

IMHO this really isn't one of those times. The in-app wrapper would
only be usable to just that process, and we already know that the OP
has multiple processes running the same app on the same machine. They
would benefit from being able to share the cache, so now your wrapper
gets more complicated. If it's not a nameserver then it's something
that fills in for one.

And then, since the application appears to be a large scale web spider,
it probably wants to run on a cluster, and the cache should be shared
across all the machines. So you really probably want an industrial
strength nameserver with a big persistent cache, and maybe a smaller
local cache because of high locality when crawling specific sites, etc.

Also, since this is a production application, doing something in 5
minutes is less important than making it solid and configurable.
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On Sun, Apr 29, 2012 at 12:27 AM, Danyel Lawson wrote:
> I'm glad I thought of it. ;) But the trick is to use port 5353 and set
> a really short timeout on responses in the config for the DNS cache.

I don't think false timeouts are any better than true ones, if you
actually know the true ones. But sure, whatever you need.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
I'm glad I thought of it. ;) But the trick is to use port 5353 and set
a really short timeout on responses in the config for the DNS cache.

On Sat, Apr 28, 2012 at 10:15 AM, Chris Angelico wrote:
> On Sat, Apr 28, 2012 at 11:46 PM, Danyel Lawson wrote:
>> The DNS lookup is one of those things that may make sense to run as a
>> separate daemon process that listens on a socket.
>
> Yeah, it does. One that listens on port 53, TCP and UDP, perhaps. :)
>
> You've just recommended installing a separate caching resolver.
>
> ChrisA
> --
> http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On Sat, Apr 28, 2012 at 11:46 PM, Danyel Lawson wrote:
> The DNS lookup is one of those things that may make sense to run as a
> separate daemon process that listens on a socket.

Yeah, it does. One that listens on port 53, TCP and UDP, perhaps. :)

You've just recommended installing a separate caching resolver.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
Sprinkle time.sleep(0) liberally throughout your code where you think
natural processing breaks should be, even in while loops. It's lame,
but it is the only way to make Python multithreading task-switch
fairly. Your compute-intensive tasks need a time.sleep(0) in their
loops. This prevents starvation and makes overall processing and
responsiveness seem properly multithreaded. This is a hand
optimization, so you have to play with the location and number of
time.sleep(0)s. You'll know when you've found a problematic spot: the
queues stop growing/overflowing.

Put the DNS lookups on a separate thread pool with its own growing
queue, with lots of time.sleep(0)s sprinkled in. The DNS lookups don't
have to be real time, and you can easily cache them with a timestamp
attached. This is the thread pool where more is better and threads
should be aggressively terminated for having a long running process
time. This also requires lots of hand tuning for dynamically managing
the number of threads needed to process the queue in a reasonable time,
if you find it hard to aggressively kill threads. I think there is a
way to launch threads that gives them only a maximum lifetime. One
problem you may hit while tuning is needing to allocate more file
handles for all the hung sockets.

The DNS lookup is one of those things that may make sense to run as a
separate daemon process that listens on a socket. You make one
connection that feeds in the IP addresses. The daemon process feeds
back IP address/host name combinations, out of order. Your main
process/connection thread builds a serialized-access dict with
timestamps. The main process's threads make their requests
asynchronously and sleep while waiting for the response to appear in
the dict. They terminate after a certain time if they don't see their
response. This requires hand/algorithmic tweaking to work correctly
across different machines.
On Fri, Apr 27, 2012 at 2:54 PM, John Nagle wrote:
> I have a multi-threaded CPython program, which has up to four
> threads. One thread is simply a wait loop monitoring the other
> three and waiting for them to finish, so it can give them more
> work to do. When the work threads, which read web pages and
> then parse them, are compute-bound, I've had the monitoring thread
> starved of CPU time for as long as 120 seconds.
> It's sleeping for 0.5 seconds, then checking on the other threads
> and for new work to do, so the work thread isn't using much
> compute time.
>
> I know that the CPython thread dispatcher sucks, but I didn't
> realize it sucked that bad. Is there a preference for running
> threads at the head of the list (like UNIX, circa 1979) or
> something like that?
>
> (And yes, I know about "multiprocessing". These threads are already
> in one of several service processes. I don't want to launch even more
> copies of the Python interpreter. The threads are usually I/O bound,
> but when they hit unusually long web pages, they go compute-bound
> during parsing.)
>
> Setting "sys.setcheckinterval" from the default to 1 seems
> to have little effect. This is on Windows 7.
>
>                                John Nagle
> --
> http://mail.python.org/mailman/listinfo/python-list
--
http://mail.python.org/mailman/listinfo/python-list
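[Editor's note: Danyel's sleep(0) advice, reduced to a runnable sketch. The loop body, yield interval, and monitor poll time below are illustrative choices, not from the post.]

```python
import threading
import time

results = []
monitor_ticks = []

def compute_bound(n):
    """CPU-heavy loop that periodically yields via time.sleep(0)."""
    total = 0
    for i in range(n):
        total += i * i
        if i % 10000 == 0:
            time.sleep(0)   # hand the GIL to other runnable threads
    results.append(total)

def monitor(worker):
    """Low-duty-cycle watcher; the sleep(0) yields above let it run promptly."""
    while worker.is_alive():
        monitor_ticks.append(time.time())
        time.sleep(0.01)

w = threading.Thread(target=compute_bound, args=(200000,))
m = threading.Thread(target=monitor, args=(w,))
w.start(); m.start()
w.join(); m.join()
```

Whether sleep(0) actually improves fairness depends on the OS scheduler and the CPython version; on the new GIL (3.2+) it matters much less than it did in 2012.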
Re: CPython thread starvation
In article <7xy5pgqwto@ruckus.brouhaha.com>, Paul Rubin wrote:
> John Nagle writes:
> > I may do that to prevent the stall. But the real problem was all
> > those DNS requests. Parallelizing them wouldn't help much when it
> > took hours to grind through them all.
>
> True dat. But building a DNS cache into the application seems like a
> kludge. Unless the number of requests is insane, running a caching
> nameserver on the local box seems cleaner.

I agree that application-level name cacheing is "wrong", but sometimes
doing it the wrong way just makes sense. I could whip up a simple
cacheing wrapper around getaddrinfo() in 5 minutes. Depending on the
environment (both technology and bureaucracy), getting a cacheing
nameserver installed might take anywhere from 5 minutes to a few days
to kicking a dead whale down the beach (if you need to involve your
corporate IT department) to it just ain't happening (if you need to
involve your corporate IT department).

Doing DNS cacheing correctly is non-trivial. In fact, if you're
building it on top of getaddrinfo(), it may be impossible, since I
don't think getaddrinfo() exposes all the data you need (i.e. TTL
values). But doing a half-assed job of cache expiration is better than
not expiring your cache at all. I would suggest (from experience) that
if you build a getaddrinfo() wrapper, you have cache entries time out
after a fairly short time. From the problem description, it sounds
like using a 1-minute timeout would get 99% of the benefit and might
keep you from doing some bizarre things.

PS -- I've also learned by experience that nscd can mess up. If DNS
starts doing stuff that doesn't make sense, my first line of attack is
usually killing and restarting the local nscd. Often enough, that
solves the problem, and it rarely causes any problems that anybody
would notice.
--
http://mail.python.org/mailman/listinfo/python-list
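[Editor's note: Roy's "5 minute" wrapper might look something like the sketch below. The fixed 60-second TTL follows his suggestion, since getaddrinfo() exposes no real TTL; the names, cache key, and lock strategy are mine, not his actual code.]

```python
import socket
import threading
import time

_cache = {}            # key -> (expiry_time, result)
_cache_lock = threading.Lock()
CACHE_TTL = 60.0       # seconds; an arbitrary short timeout, per Roy's advice

def cached_getaddrinfo(host, port, *args, **kwargs):
    """socket.getaddrinfo() with a crude fixed-TTL in-process cache."""
    key = (host, port, args, tuple(sorted(kwargs.items())))
    now = time.time()
    with _cache_lock:
        hit = _cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
    # Do the (possibly slow) lookup OUTSIDE the lock, so other
    # threads aren't stalled behind a slow DNS server.
    result = socket.getaddrinfo(host, port, *args, **kwargs)
    with _cache_lock:
        _cache[key] = (now + CACHE_TTL, result)
    return result
```

Note this deliberately does not cache failures; negative caching would need its own (shorter) timeout, and as Roy says, doing it correctly on top of getaddrinfo() is somewhere between hard and impossible.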
Re: CPython thread starvation
On 4/27/2012 9:55 PM, Paul Rubin wrote:
> John Nagle writes:
>> I may do that to prevent the stall. But the real problem was all
>> those DNS requests. Parallelizing them wouldn't help much when it
>> took hours to grind through them all.
>
> True dat. But building a DNS cache into the application seems like a
> kludge. Unless the number of requests is insane, running a caching
> nameserver on the local box seems cleaner.

I know. When I have a bit more time, I'll figure out why CentOS 5 and
Webmin didn't set up a caching DNS resolver by default.

Sometimes the number of requests IS insane. When the system hits a
page with a thousand links, it has to resolve all of them. (Beyond a
thousand links, we classify it as link spam and stop. The record so
far is a page with over 10,000 links.)

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
John Nagle writes:
> I may do that to prevent the stall. But the real problem was all
> those DNS requests. Parallelizing them wouldn't help much when it
> took hours to grind through them all.

True dat. But building a DNS cache into the application seems like a
kludge. Unless the number of requests is insane, running a caching
nameserver on the local box seems cleaner.
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On 4/27/2012 9:20 PM, Paul Rubin wrote:
> John Nagle writes:
>> The code that stored them looked them up with "getaddrinfo()", and
>> did this while a lock was set. Don't do that!!
>> Added a local cache in the program to prevent this.
>> Performance much improved.
>
> Better to release the lock while the getaddrinfo is running, if you
> can.

I may do that to prevent the stall. But the real problem was all those
DNS requests. Parallelizing them wouldn't help much when it took hours
to grind through them all.

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
John Nagle writes:
> The code that stored them looked them up with "getaddrinfo()", and
> did this while a lock was set. Don't do that!!
> Added a local cache in the program to prevent this.
> Performance much improved.

Better to release the lock while the getaddrinfo is running, if you can.
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On Sat, Apr 28, 2012 at 1:35 PM, John Nagle wrote:
> On CentOS, "getaddrinfo()" at the
> glibc level doesn't always cache locally (ref
> https://bugzilla.redhat.com/show_bug.cgi?id=576801). Python
> doesn't cache either.

How do you manage your local cache? The Python getaddrinfo function
doesn't return a positive TTL (much less a negative one). Do you pick
an arbitrary TTL, or cache indefinitely?

I had the same issue in a PHP server (yeah I know, but I was
maintaining a project that someone else started) - fortunately there is
a PHP function that gives a TTL on all successful lookups, though it
still won't for failures. I couldn't find anything on cache control
anywhere in the Python socket module docs.

Perhaps the simplest option is to throw down a local BIND to manage the
caching for you, but that does seem a little like overkill.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On 4/27/2012 6:25 PM, Adam Skutt wrote:
> On Apr 27, 2:54 pm, John Nagle wrote:
>> I have a multi-threaded CPython program, which has up to four
>> threads. One thread is simply a wait loop monitoring the other
>> three and waiting for them to finish, so it can give them more
>> work to do. When the work threads, which read web pages and
>> then parse them, are compute-bound, I've had the monitoring thread
>> starved of CPU time for as long as 120 seconds.
>
> How exactly are you determining that this is the case?

Found the problem. The threads, after doing their compute-intensive
work of examining pages, stored some URLs they'd found. The code that
stored them looked them up with "getaddrinfo()", and did this while a
lock was set. On CentOS, "getaddrinfo()" at the glibc level doesn't
always cache locally (ref
https://bugzilla.redhat.com/show_bug.cgi?id=576801). Python doesn't
cache either. So huge numbers of DNS requests were being made. For
some pages being scanned, many of the domains required accessing a
rather slow DNS server. The combination of thousands of instances of
the same domain, a slow DNS server, and no caching slowed the crawler
down severely.

Added a local cache in the program to prevent this. Performance much
improved.

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On Apr 27, 2:54 pm, John Nagle wrote:
> I have a multi-threaded CPython program, which has up to four
> threads. One thread is simply a wait loop monitoring the other
> three and waiting for them to finish, so it can give them more
> work to do. When the work threads, which read web pages and
> then parse them, are compute-bound, I've had the monitoring thread
> starved of CPU time for as long as 120 seconds.

How exactly are you determining that this is the case?

> I know that the CPython thread dispatcher sucks, but I didn't
> realize it sucked that bad. Is there a preference for running
> threads at the head of the list (like UNIX, circa 1979) or
> something like that?

Not in CPython, which is at the mercy of what the operating system
does. Under the covers, CPython uses a semaphore on Windows, and
semaphores do not have FIFO ordering, as per
http://msdn.microsoft.com/en-us/library/windows/desktop/ms685129(v=vs.85).aspx.
As a result, I think your thread is succumbing to the same issues that
impact signal delivery, as described on slides 22-24 and 35-41 of
http://www.dabeaz.com/python/GIL.pdf.

I'm not sure there's any easy or reliable way to "fix" that from your
code. I am not a WinAPI programmer, though, and I'd suggest finding
one to help you out. It doesn't appear possible to change the
scheduling policy for semaphores programmatically, and I don't know
how closely they pay attention to thread priority. That's just a
guess though, and finding out for sure would take some low-level
debugging. However, it seems to be the most probable situation
assuming your code is correct.

> (And yes, I know about "multiprocessing". These threads are already
> in one of several service processes. I don't want to launch even more
> copies of the Python interpreter.

Why? There's little harm in launching more instances. Processes have
some additional startup and memory overhead compared to threads, but I
can't imagine it would be an issue. Given what you're trying to do,
I'd expect to run out of other resources long before I ran out of
memory because I created too many processes or threads.

> The threads are usually I/O bound,
> but when they hit unusually long web pages, they go compute-bound
> during parsing.)

If your concern is being CPU-oversubscribed by using lots of
processes, I suspect it's probably misplaced. A whole mess of
CPU-bound tasks is pretty much the easiest case for a scheduler to
handle.

Adam
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
On 27/04/2012 23:30, Dennis Lee Bieber wrote:
> Oh, continuation thought... If the workers are calling into
> C-language operations, unless those operations release the GIL, it
> doesn't matter what the OS or Python thread switch timings are. The
> OS may interrupt the thread (running C-language code), pass control to
> the Python interpreter which finds the GIL is locked, and just blocks
> -- control passes back to the interrupted thread. Any long-running
> C-language function should release the GIL while doing things with
> local data -- and reacquire the GIL when it needs to manipulate Python
> data structures or returning...

The OP mentioned parsing webpages. If that involves the re module at
some point, it doesn't release the GIL while it's looking for matches.
--
http://mail.python.org/mailman/listinfo/python-list
Re: CPython thread starvation
John Nagle writes:
> I know that the CPython thread dispatcher sucks, but I didn't
> realize it sucked that bad. Is there a preference for running
> threads at the head of the list (like UNIX, circa 1979) or
> something like that?

I think it's left up to the OS thread scheduler, Windows in your case.
See http://www.dabeaz.com/python/NewGIL.pdf starting around slide 18.

One idea that comes to mind is putting a periodic interrupt and signal
handler into your main thread, to make sure the GIL gets released
every so often.
--
http://mail.python.org/mailman/listinfo/python-list
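[Editor's note: Paul's periodic-interrupt idea could be sketched as below. The 50 ms interval and handler are illustrative choices; note that signal.setitimer/SIGALRM are Unix-only, so this sketch would not run on the OP's Windows 7 box, and whether it actually improves fairness depends on the interpreter version.]

```python
import signal
import time

ticks = []

def _heartbeat(signum, frame):
    """Runs in the main thread on each timer tick. Pending-signal
    handling forces the interpreter back into the main thread."""
    ticks.append(time.time())

# Fire SIGALRM every 50 ms, delivered to the main thread.
signal.signal(signal.SIGALRM, _heartbeat)
signal.setitimer(signal.ITIMER_REAL, 0.05, 0.05)

# Simulate a busy main loop for a short while; the handler
# still fires between bytecode instructions.
end = time.time() + 0.3
while time.time() < end:
    pass

signal.setitimer(signal.ITIMER_REAL, 0)   # cancel the timer
```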
Re: CPython thread starvation
On 4/27/2012 20:54, John Nagle wrote:
> I have a multi-threaded CPython program, which has up to four
> threads. One thread is simply a wait loop monitoring the other
> three and waiting for them to finish, so it can give them more
> work to do. When the work threads, which read web pages and
> then parse them, are compute-bound, I've had the monitoring thread
> starved of CPU time for as long as 120 seconds.
> It's sleeping for 0.5 seconds, then checking on the other threads
> and for new work to do, so the work thread isn't using much
> compute time.

How exactly are these waiting and checking performed?

Kiuhnm
--
http://mail.python.org/mailman/listinfo/python-list
CPython thread starvation
I have a multi-threaded CPython program, which has up to four
threads. One thread is simply a wait loop monitoring the other
three and waiting for them to finish, so it can give them more
work to do. When the work threads, which read web pages and
then parse them, are compute-bound, I've had the monitoring thread
starved of CPU time for as long as 120 seconds.

It's sleeping for 0.5 seconds, then checking on the other threads
and for new work to do, so the monitoring thread isn't using much
compute time.

I know that the CPython thread dispatcher sucks, but I didn't
realize it sucked that bad. Is there a preference for running
threads at the head of the list (like UNIX, circa 1979) or
something like that?

(And yes, I know about "multiprocessing". These threads are already
in one of several service processes. I don't want to launch even more
copies of the Python interpreter. The threads are usually I/O bound,
but when they hit unusually long web pages, they go compute-bound
during parsing.)

Setting "sys.setcheckinterval" from the default to 1 seems
to have little effect. This is on Windows 7.

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
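[Editor's note: the monitor-plus-workers arrangement the OP describes, reduced to a runnable sketch. The worker body is a stand-in for fetching/parsing, and the poll interval is shortened from 0.5 s so the example finishes quickly.]

```python
import threading
import time

def worker(pages, results):
    """Stand-in for a page-fetching/parsing worker thread."""
    for page in pages:
        results.append(len(page))   # pretend 'parsing'

def monitor(workers, poll=0.5):
    """The OP's monitoring loop: sleep, then check whether the
    workers have finished so they can be handed more work."""
    while any(w.is_alive() for w in workers):
        time.sleep(poll)
    # All workers done; this is where new work would be dispatched.

results = []
workers = [threading.Thread(target=worker,
                            args=([b"<html>a</html>"], results))
           for _ in range(3)]
for w in workers:
    w.start()
mon = threading.Thread(target=monitor, args=(workers, 0.05))
mon.start()
mon.join()
```

Under the old GIL (pre-3.2 CPython, as in this 2012 thread), a monitor structured this way could indeed be starved while the workers were compute-bound; the new GIL's timed handoff largely fixed that.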