On 09/03/2010 10:05 PM, Jeff Squyres wrote:
> On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote:
>> Backing off the polling rate requires more application-specific logic like that
>> offered below, so it is a little difficult for us to implement at the MPI
>> library level. Not saying we eventually won't - just not sure anyone quite
>> knows how to do so in a generalized form.
> FWIW, we've *talked* about this kind of stuff among the developers -- it's at least
> somewhat similar to the "backoff to blocking communications instead of polling
> communications" issues. That work in particular has been discussed for a long time
> but never implemented.
> Are your jobs hanging because of deadlock (i.e., application error), or
> infrastructure error? If they're hanging because of deadlock, there are some
> PMPI-based tools that might be able to help.
These are application deadlocks (like the well-known VASP calling MPI_Finalize
when it should be calling MPI_Abort!). But I'm asking as a system manager with
dozens of apps run by dozens of users hanging and not being noticed for a day
or two because users are not attentive and, from outside the job, everything
looks OK. So the problem is detection. Are you suggesting there are PMPI
approaches we could apply to every production job on the system?
I now have a hack to opal_progress that seems to do what we want without any
impact on performance in the "good" case. It basically involves keeping count
of the number of consecutive calls to opal_progress with no events completed.
When that hits a large number (e.g. 10^9), sleeping (maybe up to a second) on
every, say, 10^3-10^4 passes through opal_progress seems to do "the right
thing". (Obviously, any event completion resets everything to spinning.) There
are a few magic numbers there that need to be overridable by users. Please let
me know if this idea is blatantly flawed.
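
A minimal sketch of the counting logic described above, written as a standalone
helper rather than as a patch to opal_progress. The names idle_backoff_t and
idle_backoff_step, and all field names, are hypothetical and not part of Open
MPI; the thresholds stand in for the magic numbers mentioned above and would
need to be tunable.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical state for the idle backoff; not an Open MPI structure. */
typedef struct {
    uint64_t idle_threshold;  /* consecutive empty passes before backing off, e.g. 10^9 */
    uint64_t sleep_interval;  /* once backed off, sleep every N-th pass, e.g. 10^3-10^4 */
    uint64_t sleep_ns;        /* length of each sleep in nanoseconds, e.g. up to 10^9 */
    uint64_t idle_count;      /* current run of passes that completed no events */
} idle_backoff_t;

/*
 * Call once per progress pass with the number of events that pass completed.
 * Returns 1 if this call slept, 0 if the caller should keep spinning.
 */
int idle_backoff_step(idle_backoff_t *b, int events_completed)
{
    assert(b->sleep_interval > 0);  /* guards the modulo below */

    if (events_completed > 0) {
        b->idle_count = 0;          /* any completion resets to pure spinning */
        return 0;
    }

    b->idle_count++;
    if (b->idle_count < b->idle_threshold) {
        return 0;                   /* not idle for long enough yet */
    }

    /* Backed off: sleep on the pass that reaches the threshold and on
     * every sleep_interval-th idle pass after it. */
    if ((b->idle_count - b->idle_threshold) % b->sleep_interval != 0) {
        return 0;
    }

    struct timespec ts;
    ts.tv_sec  = (time_t)(b->sleep_ns / 1000000000ULL);
    ts.tv_nsec = (long)(b->sleep_ns % 1000000000ULL);
    nanosleep(&ts, NULL);
    return 1;
}
```

A progress loop would call idle_backoff_step(&b, n) once per pass, where n is
the number of events that pass completed. In this sketch the spinning path
costs only a few integer operations, and a pass that completes an event never
sleeps.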
Thanks,
David