Re: [OMPI devel] Orte cleanup

2008-03-07 Thread Aurélien Bouteiller
Looks like it works. Aurelien On 6 March 08 at 10:36, Ralph Castain wrote: I believe I have at least helped reduce this with r17761. I added the ability for procs to detect that their "lifeline" connection (either the HNP for unity routed, or their local daemon for tree) has been lost and

Re: [OMPI devel] Orte cleanup

2008-03-06 Thread Ralph Castain
I believe I have at least helped reduce this with r17761. I added the ability for procs to detect that their "lifeline" connection (either the HNP for unity routed, or their local daemon for tree) has been lost and gracefully abort. Let me know if that helps Ralph On 3/4/08 9:37 PM, "Aurélien B
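The "lifeline" idea in the message above can be illustrated with a minimal sketch. This is not ORTE's code: it only shows the generic technique of treating end-of-file on a connection as the signal that the peer (here standing in for the HNP or local daemon) has gone away. The function name `lifeline_lost` and the use of a local socket pair are assumptions made purely for illustration.

```python
import socket


def lifeline_lost(conn: socket.socket) -> bool:
    """Hypothetical helper: True if the peer has closed the connection.

    Peeks at the socket without consuming data. An empty read means EOF,
    i.e. the other end of the lifeline is gone.
    """
    try:
        data = conn.recv(1, socket.MSG_PEEK)
    except BlockingIOError:
        return False  # no data pending, but the connection is still open
    except ConnectionError:
        return True   # reset or otherwise broken connection
    return data == b""


# Stand-ins for the daemon side ("parent") and the process side ("child").
parent, child = socket.socketpair()
child.setblocking(False)

alive_before = not lifeline_lost(child)  # lifeline intact
parent.close()                           # simulate the daemon crashing
lost_after = lifeline_lost(child)        # process notices the loss

child.close()
```

A real process would run a check like this from its event loop and start an orderly abort once the lifeline is reported lost, rather than lingering on the node.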

Re: [OMPI devel] Orte cleanup

2008-03-05 Thread Ralph H Castain
Wow, this took 4.5 hours to get through our Lab's email filter! You must have been very bad recently. ;-)) Probably because you are being mean to my poor little orteds... We still don't have a reliable way for mpirun to detect that orteds have crashed. I am working on some methods right now that

Re: [OMPI devel] Orte cleanup

2008-03-05 Thread Aurélien Bouteiller
Scenario 2 is definitely one of those we have been experiencing (we are making some changes to orte, and this led some orteds to crash). I will try to find a way to easily reproduce the other one, where aborted MPI processes are left behind (but no orted). Thanks, Aurelien On 5 March 08 at 08

Re: [OMPI devel] Orte cleanup

2008-03-05 Thread Ralph H Castain
Awesome. I haven't been seeing this behavior, but I won't swear that it is anywhere near fully tested. A couple of possibilities come to mind: 1. are you building threaded? If so, then all bets are off. The new release of orte depends heavily on libevent. As George pointed out on the Tues telecon

[OMPI devel] Orte cleanup

2008-03-04 Thread Aurélien Bouteiller
I noticed that the new release of orte is not as good as it used to be at cleaning up the mess left by crashed/aborted MPI processes. Recently we have been experiencing a lot of zombie or livelocked processes running on the cluster nodes and disturbing subsequent experiments. I didn't really ha