Re: CPython thread starvation

2012-04-29 Thread Roy Smith
In article <7xipgj8vxh@ruckus.brouhaha.com>,
 Paul Rubin  wrote:

> Roy Smith  writes:
> > I agree that application-level name caching is "wrong", but sometimes 
> > doing it the wrong way just makes sense.  I could whip up a simple 
> > caching wrapper around getaddrinfo() in 5 minutes.  Depending on the 
> > environment (both technology and bureaucracy), getting a caching 
> > nameserver installed might take anywhere from 5 minutes to a few days to ...
> 
> IMHO this really isn't one of those times.  The in-app wrapper would
> only be usable to just that process, and we already know that the OP has
> multiple processes running the same app on the same machine.  They would
> benefit from being able to share the cache, so now your wrapper gets
> more complicated.

So, use memcache.  Trivial to set up, easy Python integration, and it 
has the expiration mechanism built in.  Not to mention it has a really 
cute web site (http://memcached.org/).
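
Just to make the suggestion concrete, here is a minimal sketch of what that
could look like, using the python-memcached client and assuming a memcached
instance on the default local port; the key scheme and the 60-second expiry
are only placeholders:

import socket
import memcache

mc = memcache.Client(['127.0.0.1:11211'])   # assumed local memcached instance
TTL = 60                                    # let memcached expire entries itself

def cached_getaddrinfo(host, port=80):
    key = 'dns:%s:%s' % (host, port)
    result = mc.get(key)
    if result is None:
        result = socket.getaddrinfo(host, port)
        mc.set(key, result, time=TTL)       # cache is shared by every process on the box
    return result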

> Also, since this is a production application, doing something in 5
> minutes is less important than making it solid and configurable.

Maybe.  On the other hand, the time you save with a 5-minute solution 
can be spent solving other, harder problems.


Re: CPython thread starvation

2012-04-29 Thread John Nagle

On 4/28/2012 1:04 PM, Paul Rubin wrote:

> Roy Smith  writes:
> > I agree that application-level name caching is "wrong", but sometimes
> > doing it the wrong way just makes sense.  I could whip up a simple
> > caching wrapper around getaddrinfo() in 5 minutes.  Depending on the
> > environment (both technology and bureaucracy), getting a caching
> > nameserver installed might take anywhere from 5 minutes to a few days to ...
>
> IMHO this really isn't one of those times.  The in-app wrapper would
> only be usable to just that process, and we already know that the OP has
> multiple processes running the same app on the same machine.  They would
> benefit from being able to share the cache, so now your wrapper gets
> more complicated.  If it's not a nameserver then it's something that
> fills in for one.  And then, since the application appears to be a large
> scale web spider, it probably wants to run on a cluster, and the cache
> should be shared across all the machines.  So you really probably want
> an industrial strength nameserver with a big persistent cache, and maybe
> a smaller local cache because of high locality when crawling specific
> sites, etc.


Each process is analyzing one web site, and has its own cache.
Once the site is analyzed, which usually takes about a minute,
the cache disappears.  Multiple threads are reading multiple pages
from the web site during that time.

A local cache is enough to fix the huge overhead problem of
doing a DNS lookup for every link found.  One site with a vast
number of links took over 10 hours to analyze before this fix;
now it takes about four minutes.  That solved the problem.
We can probably get an additional minor performance boost with a real
local DNS daemon, and will probably configure one.

We recently changed servers from Red Hat to CentOS, and management
from CPanel to Webmin.  Before the change, we had a local DNS daemon
with cacheing, so we didn't have this problem.  Webmin's defaults
tend to be on the minimal side.

The DNS information is used mostly to help decide whether two URLs
actually point to the same IP address, as part of deciding whether a
link is on-site or off-site.  Most of those links will never be read.
We're not crawling the entire site, just looking at likely pages to
find the name and address of the business behind the site.  (It's
part of our "Know who you're dealing with" system, SiteTruth.)

John Nagle



Re: CPython thread starvation

2012-04-28 Thread Paul Rubin
Roy Smith  writes:
> I agree that application-level name caching is "wrong", but sometimes 
> doing it the wrong way just makes sense.  I could whip up a simple 
> caching wrapper around getaddrinfo() in 5 minutes.  Depending on the 
> environment (both technology and bureaucracy), getting a caching 
> nameserver installed might take anywhere from 5 minutes to a few days to ...

IMHO this really isn't one of those times.  The in-app wrapper would
only be usable to just that process, and we already know that the OP has
multiple processes running the same app on the same machine.  They would
benefit from being able to share the cache, so now your wrapper gets
more complicated.  If it's not a nameserver then it's something that
fills in for one.  And then, since the application appears to be a large
scale web spider, it probably wants to run on a cluster, and the cache
should be shared across all the machines.  So you really probably want
an industrial strength nameserver with a big persistent cache, and maybe
a smaller local cache because of high locality when crawling specific
sites, etc.

Also, since this is a production application, doing something in 5
minutes is less important than making it solid and configurable.


Re: CPython thread starvation

2012-04-28 Thread Chris Angelico
On Sun, Apr 29, 2012 at 12:27 AM, Danyel Lawson  wrote:
> I'm glad I thought of it. ;) But the trick is to use port 5353 and set
> a really short timeout on responses in the config for the DNS cache.

I don't think false timeouts are any better than true ones, if you
actually know the true ones. But sure, whatever you need.

ChrisA


Re: CPython thread starvation

2012-04-28 Thread Danyel Lawson
I'm glad I thought of it. ;) But the trick is to use port 5353 and set
a really short timeout on responses in the config for the DNS cache.

On Sat, Apr 28, 2012 at 10:15 AM, Chris Angelico  wrote:
> On Sat, Apr 28, 2012 at 11:46 PM, Danyel Lawson  
> wrote:
>> The DNS lookup is one of those things that may make sense to run as a
>> separate daemon process that listens on a socket.
>
> Yeah, it does. One that listens on port 53, TCP and UDP, perhaps. :)
>
> You've just recommended installing a separate caching resolver.
>
> ChrisA


Re: CPython thread starvation

2012-04-28 Thread Chris Angelico
On Sat, Apr 28, 2012 at 11:46 PM, Danyel Lawson  wrote:
> The DNS lookup is one of those things that may make sense to run as a
> separate daemon process that listens on a socket.

Yeah, it does. One that listens on port 53, TCP and UDP, perhaps. :)

You've just recommended installing a separate caching resolver.

ChrisA


Re: CPython thread starvation

2012-04-28 Thread Danyel Lawson
Sprinkle time.sleep(0) liberally throughout your code where you think
natural processing breaks should be, even in while loops.  It's lame,
but it's the only way to make Python multithreading task-switch fairly.
Your compute-intensive tasks need a time.sleep(0) in their loops.  This
prevents starvation and makes overall processing and responsiveness
seem properly multithreaded.  This is a hand optimization, so you have
to play with the location and number of time.sleep(0) calls.  You'll know
you've found a problematic spot when the queues stop
growing/overflowing.
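
A toy illustration of what I mean (the every-100-items yield is arbitrary
and is exactly the kind of knob that needs hand tuning):

import time

def crunch(items):
    # stand-in for a compute-bound parsing loop
    results = []
    for n, item in enumerate(items):
        results.append(hash(item))     # pretend this is the expensive parse step
        if n % 100 == 0:
            time.sleep(0)              # yield so other threads (e.g. a monitor) get a turn
    return results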

Put the DNS lookups in a separate thread pool with its own growing
queue and with lots of time.sleep(0)s sprinkled in.  The DNS lookups don't
have to be real-time, and you can easily cache them with a timestamp
attached.  This is the thread pool where more is better, and threads
should be aggressively terminated for having a long-running process
time.  This also requires lots of hand tuning for dynamically managing
the number of threads needed to process the queue in a reasonable time
if you find it hard to aggressively kill threads.  I think there is a
way to launch threads that gives them only a maximum lifetime.  The
problem you will hit while tuning may require allocating more file
handles for all the hung sockets.
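
A rough, untested sketch of that pool (the pool size, the port, and the
Python 2 Queue spelling are all placeholders you'd tune or adjust):

import socket
import threading
import time
import Queue                    # 'queue' on Python 3

dns_queue = Queue.Queue()
dns_cache = {}                  # host -> (timestamp, result or None)
cache_lock = threading.Lock()

def dns_worker():
    while True:
        host = dns_queue.get()
        try:
            result = socket.getaddrinfo(host, 80)
        except socket.gaierror:
            result = None       # remember failures too, with a timestamp
        with cache_lock:
            dns_cache[host] = (time.time(), result)
        dns_queue.task_done()
        time.sleep(0)           # yield between lookups

for _ in range(10):             # pool size needs hand tuning
    t = threading.Thread(target=dns_worker)
    t.daemon = True             # don't let stuck lookups keep the process alive
    t.start()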

The DNS lookup is one of those things that may make sense to run as a
separate daemon process that listens on a socket. You make one
connection that feeds in the ip addresses. The daemon process feeds
back ip address/host name combinations out of order. Your main
process/connection thread builds a serialized access dict with
timestamps. The main processes threads make their requests
asynchronously and sleep while waiting for the response to appear in
the dict. They terminate after a certain time if they don't see their
response. Requires hand/algorithmic tweaking for this to work
correctly across different machines.

On Fri, Apr 27, 2012 at 2:54 PM, John Nagle  wrote:
>    I have a multi-threaded CPython program, which has up to four
> threads.  One thread is simply a wait loop monitoring the other
> three and waiting for them to finish, so it can give them more
> work to do.  When the work threads, which read web pages and
> then parse them, are compute-bound, I've had the monitoring thread
> starved of CPU time for as long as 120 seconds.
> It's sleeping for 0.5 seconds, then checking on the other threads
> and for new work to do, so the work thread isn't using much
> compute time.
>
>   I know that the CPython thread dispatcher sucks, but I didn't
> realize it sucked that bad.  Is there a preference for running
> threads at the head of the list (like UNIX, circa 1979) or
> something like that?
>
>   (And yes, I know about "multiprocessing".  These threads are already
> in one of several service processes.  I don't want to launch even more
> copies of the Python interpreter.  The threads are usually I/O bound,
> but when they hit unusually long web pages, they go compute-bound
> during parsing.)
>
>   Setting "sys.setcheckinterval" from the default to 1 seems
> to have little effect.  This is on Windows 7.
>
>                                John Nagle


Re: CPython thread starvation

2012-04-28 Thread Roy Smith
In article <7xy5pgqwto@ruckus.brouhaha.com>,
 Paul Rubin  wrote:

> John Nagle  writes:
> >I may do that to prevent the stall.  But the real problem was all
> > those DNS requests.  Parallelizing them wouldn't help much when it took
> > hours to grind through them all.
> 
> True dat.  But building a DNS cache into the application seems like a
> kludge.  Unless the number of requests is insane, running a caching
> nameserver on the local box seems cleaner.

I agree that application-level name caching is "wrong", but sometimes 
doing it the wrong way just makes sense.  I could whip up a simple 
caching wrapper around getaddrinfo() in 5 minutes.  Depending on the 
environment (both technology and bureaucracy), getting a caching 
nameserver installed might take anywhere from 5 minutes to a few days to 
kicking a dead whale down the beach (if you need to involve your 
corporate IT department) to it just ain't happening (if you need to 
involve your corporate IT department).

Doing DNS caching correctly is non-trivial.  In fact, if you're 
building it on top of getaddrinfo(), it may be impossible, since I don't 
think getaddrinfo() exposes all the data you need (i.e. TTL values).  
But, doing a half-assed job of cache expiration is better than not 
expiring your cache at all.  I would suggest (from experience) that if 
you build a getaddrinfo() wrapper, you have cache entries time out after 
a fairly short time.  From the problem description, it sounds like using 
a 1-minute timeout would get 99% of the benefit and might keep you from 
doing some bizarre things.
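
For what it's worth, the wrapper really is about five minutes of code.  A
sketch of the half-assed version (one-minute expiry, no negative caching,
nothing configurable -- treat it as an illustration, not production code):

import socket
import threading
import time

_cache = {}                     # (host, port, extra args) -> (expiry_time, result)
_cache_lock = threading.Lock()
_TTL = 60.0                     # the "fairly short" timeout discussed above

def cached_getaddrinfo(host, port, *args):
    key = (host, port) + args
    now = time.time()
    with _cache_lock:
        hit = _cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
    result = socket.getaddrinfo(host, port, *args)   # do the lookup with no lock held
    with _cache_lock:
        _cache[key] = (now + _TTL, result)
    return result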

PS -- I've also learned by experience that nscd can mess up.  If DNS 
starts doing stuff that doesn't make sense, my first line of attack is 
usually killing and restarting the local nscd.  Often enough, that 
solves the problem, and it rarely causes any problems that anybody would 
notice.


Re: CPython thread starvation

2012-04-27 Thread John Nagle

On 4/27/2012 9:55 PM, Paul Rubin wrote:

> John Nagle  writes:
> > I may do that to prevent the stall.  But the real problem was all
> > those DNS requests.  Parallelizing them wouldn't help much when it took
> > hours to grind through them all.
>
> True dat.  But building a DNS cache into the application seems like a
> kludge.  Unless the number of requests is insane, running a caching
> nameserver on the local box seems cleaner.


   I know.  When I have a bit more time, I'll figure out why
CentOS 5 and Webmin didn't set up a caching DNS resolver by
default.

   Sometimes the number of requests IS insane.  When the
system hits a page with a thousand links, it has to resolve
all of them.  (Beyond a thousand links, we classify it as
link spam and stop.  The record so far is a page with over
10,000 links.)

John Nagle



Re: CPython thread starvation

2012-04-27 Thread Paul Rubin
John Nagle  writes:
>I may do that to prevent the stall.  But the real problem was all
> those DNS requests.  Parallelizing them wouldn't help much when it took
> hours to grind through them all.

True dat.  But building a DNS cache into the application seems like a
kludge.  Unless the number of requests is insane, running a caching
nameserver on the local box seems cleaner.


Re: CPython thread starvation

2012-04-27 Thread John Nagle

On 4/27/2012 9:20 PM, Paul Rubin wrote:

> John Nagle  writes:
> > The code that stored them looked them up with "getaddrinfo()", and
> > did this while a lock was set.
>
> Don't do that!!
>
> > Added a local cache in the program to prevent this.
> > Performance much improved.
>
> Better to release the lock while the getaddrinfo is running, if you can.


   I may do that to prevent the stall.  But the real problem was all
those DNS requests.  Parallelizing them wouldn't help much when it took
hours to grind through them all.

John Nagle



Re: CPython thread starvation

2012-04-27 Thread Paul Rubin
John Nagle  writes:

> The code that stored them looked them up with "getaddrinfo()", and
> did this while a lock was set.

Don't do that!!

>Added a local cache in the program to prevent this.
> Performance much improved.

Better to release the lock while the getaddrinfo is running, if you can.
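
i.e. something along these lines (the names and surrounding structure are
invented for illustration; the point is just that the lock only wraps the
fast dictionary update, not the lookup):

import socket

def store_url(url, host, port, table, table_lock):
    # do the slow DNS lookup with no lock held...
    addrinfo = socket.getaddrinfo(host, port)
    # ...and only take the lock for the quick shared-table update
    with table_lock:
        table[url] = addrinfo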


Re: CPython thread starvation

2012-04-27 Thread Chris Angelico
On Sat, Apr 28, 2012 at 1:35 PM, John Nagle  wrote:
> On CentOS, "getaddrinfo()" at the
> glibc level doesn't always cache locally (ref
> https://bugzilla.redhat.com/show_bug.cgi?id=576801).  Python
> doesn't cache either.

How do you manage your local cache? The Python getaddrinfo function
doesn't return a positive TTL (much less a negative one). Do you pick
an arbitrary TTL, or cache indefinitely?

I had the same issue in a PHP server (yeah I know, but I was
maintaining a project that someone else started) - fortunately there
is a PHP function that gives a TTL on all successful lookups, though
it still won't for failures. I couldn't find anything on cache control
anywhere in the Python socket module docs.

Perhaps the simplest option is to throw down a local BIND to manage
the caching for you, but that does seem a little like overkill.

ChrisA


Re: CPython thread starvation

2012-04-27 Thread John Nagle

On 4/27/2012 6:25 PM, Adam Skutt wrote:

> On Apr 27, 2:54 pm, John Nagle  wrote:
> > I have a multi-threaded CPython program, which has up to four
> > threads.  One thread is simply a wait loop monitoring the other
> > three and waiting for them to finish, so it can give them more
> > work to do.  When the work threads, which read web pages and
> > then parse them, are compute-bound, I've had the monitoring thread
> > starved of CPU time for as long as 120 seconds.
>
> How exactly are you determining that this is the case?


   Found the problem.  The threads, after doing their compute
intensive work of examining pages, stored some URLs they'd found.
The code that stored them looked them up with "getaddrinfo()", and
did this while a lock was set.  On CentOS, "getaddrinfo()" at the
glibc level doesn't always cache locally (ref
https://bugzilla.redhat.com/show_bug.cgi?id=576801).  Python
doesn't cache either.  So huge numbers of DNS requests were being
made.  For some pages being scanned, many of the domains required
accessing a rather slow  DNS server.  The combination of thousands
of instances of the same domain, a slow DNS server, and no caching
slowed the crawler down severely.

   Added a local cache in the program to prevent this.
Performance much improved.

John Nagle


Re: CPython thread starvation

2012-04-27 Thread Adam Skutt
On Apr 27, 2:54 pm, John Nagle  wrote:
>      I have a multi-threaded CPython program, which has up to four
> threads.  One thread is simply a wait loop monitoring the other
> three and waiting for them to finish, so it can give them more
> work to do.  When the work threads, which read web pages and
> then parse them, are compute-bound, I've had the monitoring thread
> starved of CPU time for as long as 120 seconds.

How exactly are you determining that this is the case?

>     I know that the CPython thread dispatcher sucks, but I didn't
> realize it sucked that bad.  Is there a preference for running
> threads at the head of the list (like UNIX, circa 1979) or
> something like that?

Not in CPython, which is at the mercy of what the operating system
does.  Under the covers, CPython uses a semaphore on Windows, and Windows
semaphores do not have FIFO ordering, as per
http://msdn.microsoft.com/en-us/library/windows/desktop/ms685129(v=vs.85).aspx.
As a result, I think your thread is succumbing to the same issues that
impact signal delivery as described on slides 22-24 and 35-41 of
http://www.dabeaz.com/python/GIL.pdf.

I'm not sure there's any easy or reliable way to "fix" that from your
code.  I am not a WinAPI programmer though, and I'd suggest finding
one to help you out.  It doesn't appear possible to change the
scheduling policy for semaphores programmatically, and I don't know how
closely they pay attention to thread priority.

That's just a guess though, and finding out for sure would take some
low-level debugging.  However, it seems to be the most probable
situation assuming your code is correct.

>
>     (And yes, I know about "multiprocessing".  These threads are already
> in one of several service processes.  I don't want to launch even more
> copies of the Python interpreter.

Why? There's little harm in launching more instances.  Processes have
some additional startup and memory overhead compared to threads, but I
can't imagine it would be an issue.  Given what you're trying to do,
I'd expect to run out of other resources long before I ran out of
memory because I created too many processes or threads.

> The threads are usually I/O bound,
> but when they hit unusually long web pages, they go compute-bound
> during parsing.)

If your concern is being CPU oversubscribed by using lots of
processes, I suspect it's probably misplaced.  A whole mess of CPU-
bound tasks is pretty much the easiest case for a scheduler to
handle.

Adam


Re: CPython thread starvation

2012-04-27 Thread MRAB

On 27/04/2012 23:30, Dennis Lee Bieber wrote:


> Oh, continuation thought...
>
> If the workers are calling into C-language operations, unless those
> operations release the GIL, it doesn't matter what the OS or Python
> thread switch timings are. The OS may interrupt the thread (running
> C-language code), pass control to the Python interpreter which finds the
> GIL is locked, and just blocks -- control passes back to the interrupted
> thread.
>
> Any long-running C-language function should release the GIL while
> doing things with local data -- and reacquire the GIL when it needs to
> manipulate Python data structures or return...


The OP mentioned parsing webpages. If that involves the re module at
some point, it doesn't release the GIL while it's looking for matches.


Re: CPython thread starvation

2012-04-27 Thread Paul Rubin
John Nagle  writes:
>I know that the CPython thread dispatcher sucks, but I didn't
> realize it sucked that bad.  Is there a preference for running
> threads at the head of the list (like UNIX, circa 1979) or
> something like that?

I think it's left up to the OS thread scheduler, Windows in your case.
See  http://www.dabeaz.com/python/NewGIL.pdf  starting around slide 18.

One idea that comes to mind is putting a periodic interrupt and signal
handler into your main thread, to make sure the GIL gets released every
so often.
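
Something like this, on a POSIX box (setitimer/SIGALRM won't exist on the
OP's Windows 7 machine, so take it only as a sketch of the idea):

import signal

def _wakeup(signum, frame):
    pass    # the handler body doesn't matter; the periodic interrupt is the point

signal.signal(signal.SIGALRM, _wakeup)
signal.setitimer(signal.ITIMER_REAL, 0.05, 0.05)   # fire every 50 ms in the main thread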


Re: CPython thread starvation

2012-04-27 Thread Kiuhnm

On 4/27/2012 20:54, John Nagle wrote:

> I have a multi-threaded CPython program, which has up to four
> threads. One thread is simply a wait loop monitoring the other
> three and waiting for them to finish, so it can give them more
> work to do. When the work threads, which read web pages and
> then parse them, are compute-bound, I've had the monitoring thread
> starved of CPU time for as long as 120 seconds.
> It's sleeping for 0.5 seconds, then checking on the other threads
> and for new work to do, so the work thread isn't using much
> compute time.


How exactly are these waiting and checking performed?

Kiuhnm


CPython thread starvation

2012-04-27 Thread John Nagle

I have a multi-threaded CPython program, which has up to four
threads.  One thread is simply a wait loop monitoring the other
three and waiting for them to finish, so it can give them more
work to do.  When the work threads, which read web pages and
then parse them, are compute-bound, I've had the monitoring thread
starved of CPU time for as long as 120 seconds.
It's sleeping for 0.5 seconds, then checking on the other threads
and for new work to do, so the work thread isn't using much
compute time.
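
(For reference, the structure is roughly the following -- a simplified
sketch with invented names, not the real code:)

import threading
import time
import Queue                         # 'queue' on Python 3

def worker(work_queue, handle_job):
    while True:
        job = work_queue.get()
        handle_job(job)              # fetch and parse one page; sometimes compute-bound
        work_queue.task_done()

def monitor(work_queue, refill):
    while True:
        time.sleep(0.5)              # this is the thread that ends up starved
        if work_queue.empty():
            for job in refill():     # hand the workers more work
                work_queue.put(job)

def run(handle_job, refill, nworkers=3):
    work_queue = Queue.Queue()
    for _ in range(nworkers):
        t = threading.Thread(target=worker, args=(work_queue, handle_job))
        t.daemon = True
        t.start()
    monitor(work_queue, refill)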

   I know that the CPython thread dispatcher sucks, but I didn't
realize it sucked that bad.  Is there a preference for running
threads at the head of the list (like UNIX, circa 1979) or
something like that?

   (And yes, I know about "multiprocessing".  These threads are already
in one of several service processes.  I don't want to launch even more
copies of the Python interpreter.  The threads are usually I/O bound,
but when they hit unusually long web pages, they go compute-bound
during parsing.)

   Setting "sys.setcheckinterval" from the default to 1 seems
to have little effect.  This is on Windows 7.

John Nagle