Ralph,
the application still hangs; I attached new logs.
On slurm0, if I run /sbin/ifconfig eth0:1 down,
then the application does not hang any more.
Cheers,
Gilles
On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain wrote:
> I appear to have this fixed now - please give the current trunk (r31949 or
>
Here is a quick fix for the OMPI timing facility. Currently the first measurement
is bogus because OMPI_PROC_MY_NAME is not initialized at the time of the first
ompistart setup:
*time from start to completion of rte_init 1348381643658244 usec*
time from completion of rte_init to modex 17585 usec
time to execute
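
A quick sanity check on the bogus first measurement above, as a Python sketch. It rests on an assumption that is not stated in the message: that the start timestamp was still zero when the difference was taken, so the reported value is really the raw end time in microseconds since the Unix epoch. The variable names are illustrative only.

```python
from datetime import datetime, timezone

# Value reported in the message for "start to completion of rte_init".
reported_usec = 1348381643658244

# Assumption: treat the value as microseconds since the Unix epoch
# rather than as an elapsed duration.
seconds, micros = divmod(reported_usec, 1_000_000)
as_wall_clock = datetime.fromtimestamp(seconds, tz=timezone.utc)

# Read as a duration instead, the same value is several decades long.
as_years = reported_usec / 1_000_000 / (365.25 * 24 * 3600)

print("as wall-clock time:", as_wall_clock.isoformat())
print("as a duration in years:", round(as_years, 1))
```

If the assumption holds, the value decodes to a wall-clock time in September 2012 rather than a plausible duration, which would be consistent with a start time that was never recorded.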
Ah crud - I see what's going on. This is an issue of a message coming in on one
interface that needs to get transferred to another one for relay. Looks like
that mechanism is broken, which is causing us to issue another show_help, which
gets caught in the same loop again.
I'll work on it - may
Thanks Ralph,
for the time being, I just found a workaround:
--mca oob_tcp_if_include eth0
Generally speaking, is Open MPI doing the wiser thing?
Here is what I mean:
on the cluster I work on (4k+ nodes), each node has two IP interfaces:
* eth0 (gigabit ethernet): because of the cluster size, sever
Well, the problem is that we can't simply decide that anything called "ib.." is
an IB port and should be ignored. There is no naming rule regarding IP
interfaces that I've ever heard about that would allow us to make such an
assumption, though I admit most people let the system create default na