Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Ralph, the application still hangs; I attached new logs. On slurm0, if I run /sbin/ifconfig eth0:1 down, then the application does not hang any more. Cheers, Gilles On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain wrote: > I appear to have this fixed now - please give the current trunk (r31949 or >

[OMPI devel] OMPI timing fix

2014-06-04 Thread Artem Polyakov
Here is a quick fix for the OMPI timing facility. Currently the first measurement is bogus because OMPI_PROC_MY_NAME is not initialized at the time of the first ompistart setup: *time from start to completion of rte_init 1348381643658244 usec* time from completion of rte_init to modex 17585 usec time to execute

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Ralph Castain
Ah crud - I see what's going on. This is an issue of a message coming in on one interface that needs to get transferred to another one for relay. Looks like that mechanism is broken, which is causing us to issue another show_help, which gets caught in the same loop again. I'll work on it - may

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Thanks Ralph, for the time being I just found a workaround: --mca oob_tcp_if_include eth0 Generally speaking, is Open MPI doing the wisest thing? Here is what I mean: on the cluster I work on (4k+ nodes), each node has two IP interfaces: * eth0 (gigabit ethernet): because of the cluster size, sever

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Ralph Castain
Well, the problem is that we can't simply decide that anything called "ib.." is an IB port and should be ignored. There is no naming rule regarding IP interfaces that I've ever heard about that would allow us to make such an assumption, though I admit most people let the system create default na