Looks like it works.
Aurelien
On 6 March 08, at 10:36, Ralph Castain wrote:
I believe I have at least helped reduce this with r17761. I added the
ability for procs to detect that their "lifeline" connection (either the HNP
for unity routed, or their local daemon for tree) has been lost and
gracefully abort.
Let me know if that helps
Ralph
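
For readers following the thread, the "lifeline" idea above can be sketched roughly as follows. This is not the r17761 code and not ORTE's API: it is a minimal illustration in Python, assuming the lifeline is an ordinary stream socket and that a lost lifeline shows up as end-of-file on that socket. The name lifeline_lost is hypothetical.

```python
import select
import socket


def lifeline_lost(sock: socket.socket, timeout: float = 0.0) -> bool:
    """Return True if the peer end of `sock` appears to have gone away.

    Hypothetical helper: `sock` stands in for the connection to the HNP
    or the local daemon. It is not an ORTE data structure.
    """
    # Wait up to `timeout` seconds for the socket to become readable.
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        # Nothing to read: the connection is still presumed alive.
        return False
    try:
        # Peek so that any real pending message is left in the buffer.
        data = sock.recv(1, socket.MSG_PEEK)
    except ConnectionError:
        # A reset or aborted connection also counts as a lost lifeline.
        return True
    # An empty read on a readable stream socket means the peer closed it.
    return data == b""
```

A process would poll this from its event loop and, once it returns True, run its own cleanup and exit instead of lingering on the node. Note that unread data still pending on the socket is reported as "alive" until it has been consumed.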
On 3/4/08 9:37 PM, "Aurélien B
Wow, this took 4.5 hours to get through our Lab's email filter! You must
have been very bad recently. ;-)) Probably because you are being mean to my
poor little orteds...
We still don't have a reliable way for mpirun to detect that orteds have
crashed. I am working on some methods right now that
Scenario 2 is definitely one of those we have been experiencing (we are
making some changes to orte, and this led some orteds to crash). I will
try to find a way to easily reproduce the other one, where aborted MPI
processes are left behind (but no orted).
Thanks,
Aurelien
On 5 March 08, at 08
Awesome. I haven't been seeing this behavior, but I won't swear that it is
anywhere near fully tested.
A couple of possibilities come to mind:
1. Are you building threaded? If so, then all bets are off. The new release
of orte depends heavily on libevent. As George pointed out on the Tues
telecon
I noticed that the new release of orte is not as good as it used to be
at cleaning up the mess left by crashed/aborted MPI processes. Recently we
have been experiencing a lot of zombie or livelocked processes
running on the cluster nodes and disturbing subsequent experiments. I
didn't really ha