Looks like it works.
Aurelien
On 6 March 08 at 10:36, Ralph Castain wrote:
I believe I have at least helped reduce this with r17761. I added the
ability for procs to detect that their "lifeline" connection (either
the HNP for unity routed, or their local daemon for tree) has been
lost and to abort gracefully.
Let me know if that helps.
Ralph
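The lifeline idea described above can be illustrated with a small
sketch. This is not the ORTE code from r17761; it is a minimal Python
analogy that assumes the lifeline is an ordinary stream socket and that
a closed peer shows up as a zero-byte read. The names lifeline_lost and
run_until_lifeline_lost are hypothetical helpers invented for this
example.

```python
import selectors
import socket


def lifeline_lost(conn: socket.socket, timeout: float = 0.1) -> bool:
    """Return True if the peer on `conn` has closed the connection.

    Hypothetical helper, not ORTE code: waits up to `timeout` seconds
    for the socket to become readable, then peeks at it. A readable
    socket that yields zero bytes means the peer closed its end.
    """
    with selectors.DefaultSelector() as sel:
        sel.register(conn, selectors.EVENT_READ)
        if not sel.select(timeout):
            return False  # nothing to read: the lifeline is still up
    try:
        # MSG_PEEK leaves any real data in the buffer for the caller.
        return conn.recv(1, socket.MSG_PEEK) == b""
    except ConnectionError:
        return True  # a reset also counts as a lost lifeline


def run_until_lifeline_lost(conn, work, max_steps=1000):
    """Call `work()` repeatedly; stop and clean up once the lifeline drops.

    Returns the number of times `work()` ran.
    """
    steps = 0
    while steps < max_steps and not lifeline_lost(conn, timeout=0.0):
        work()
        steps += 1
    conn.close()  # graceful abort: release resources instead of lingering
    return steps
```

In this sketch a worker would hold one end of a socket.socketpair() (or
a TCP connection) whose other end belongs to its parent, and would check
the lifeline between units of work so that it exits instead of lingering
after the parent dies.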
On 3/4/08 9:37 PM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
I noticed that the new release of ORTE is not as good as it used to be
at cleaning up the mess left by crashed/aborted MPI processes. Recently
we have been experiencing a lot of zombie or livelocked processes
running on the cluster nodes and disturbing subsequent experiments. I
haven't really had time to investigate the issue; maybe Ralph can file
a ticket if he is able to reproduce this.
Aurelien
--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel