Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Ralph Castain
On Jun 6, 2014, at 7:11 AM, Jeff Squyres (jsquyres) wrote:
> Looks like Ralph's simpler solution fit the bill.
Yeah, but I still am unhappy with it. It's about the stupidest connection model you can imagine. What happens is this:
* a process constructs its URI - this is done by creating a str

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Jeff Squyres (jsquyres)
On Jun 5, 2014, at 9:16 PM, Gilles Gouaillardet wrote:
> i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> network (reasonable oversubscription + switches
> from a reputable vendor you are familiar with ;-) )
> my experience is that IPoIB can be very slow at establishing a
> co

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Ralph Castain
Kewl - thanks!
On Jun 5, 2014, at 9:28 PM, Gilles Gouaillardet wrote:
> Ralph,
>
> sorry for my poor understanding ...
>
> i tried r31956 and it solved both issues :
> - MPI_Abort does not hang any more if nodes are on different eth0 subnets
> - MPI_Init does not hang any more if hosts have d

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Gilles Gouaillardet
Ralph,
sorry for my poor understanding ...
i tried r31956 and it solved both issues :
- MPI_Abort does not hang any more if nodes are on different eth0 subnets
- MPI_Init does not hang any more if hosts have a different number of IB ports
this likely explains why you are having trouble replicating

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
I keep explaining that we don't "discard" anything, but there really isn't any point to continuing trying to explain the system. With the announced intention of completing the move of the BTLs to OPAL, I no longer need the multi-module complexity in the OOB/TCP. So I have removed it and gone bac

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Gilles Gouaillardet
Jeff,
as pointed out by Ralph, i do wish to use eth0 for oob messages.
i work on a 4k+ nodes cluster with a very decent gigabit ethernet network (reasonable oversubscription + switches from a reputable vendor you are familiar with ;-) )
my experience is that IPoIB can be very slow at establishing a co

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
On Jun 5, 2014, at 7:09 AM, Ralph Castain wrote:
> Okay, before you go chasing this, let me explain that we already try to
> address this issue in the TCP oob. When we need to connect to someone, we do
> the following:
>
> 1. if we have a direct connection available, we hand the message to th

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Okay, before you go chasing this, let me explain that we already try to address this issue in the TCP oob. When we need to connect to someone, we do the following:
1. if we have a direct connection available, we hand the message to the software module assigned to that NIC
2. if none of the ava

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Because Gilles wants to avoid using IB for TCP messages, and using eth0 also solves the problem (the messages just route)
On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) wrote:
> Another random thought for Gilles' situation: why not oob-TCP-if-include ib0?
> (And not eth0)
>
> That should

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
Another random thought for Gilles' situation: why not oob-TCP-if-include ib0? (And not eth0)
That should solve his problem, but not the larger issue I raised in my previous email.
Sent from my phone. No type good.
On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" mailto:gilles.gouaillar...@gm

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
That raises a larger issue -- what about Ethernet-only clusters that span multiple IP/L3 subnets? This is a scenario that Cisco definitely wants to enable/support. The usnic BTL, for example, can handle this scenario. We hadn't previously considered the TCP oob component effects in this scena

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Ralph Castain
Well, the problem is that we can't simply decide that anything called "ib.." is an IB port and should be ignored. There is no naming rule regarding IP interfaces that I've ever heard about that would allow us to make such an assumption, though I admit most people let the system create default na

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Thanks Ralph,
for the time being, i just found a workaround
--mca oob_tcp_if_include eth0
Generally speaking, is openmpi doing the wiser thing ?
here is what i mean :
on the cluster i work on (4k+ nodes), each node has two ip interfaces :
* eth0 (gigabit ethernet) : because of the cluster size, sever
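
The workaround above can be sketched as a small launcher helper. This is only an illustration, not anything from the Open MPI tree: the helper name `mpirun_command` and the program name `./a.out` are hypothetical, while the `oob_tcp_if_include` MCA parameter and the `-host node01,node02` form are the ones quoted in this thread.

```python
import shlex


def mpirun_command(program, hosts, oob_interface="eth0"):
    """Build an mpirun argv that restricts OOB/TCP traffic to one interface.

    `program` and `hosts` are placeholders supplied by the caller; only the
    `--mca oob_tcp_if_include <ifname>` option comes from the thread.
    """
    return [
        "mpirun",
        "--mca", "oob_tcp_if_include", oob_interface,
        "-host", ",".join(hosts),
        program,
    ]


# Hypothetical program name; node names as used elsewhere in the thread.
argv = mpirun_command("./a.out", ["node01", "node02"])
print(shlex.join(argv))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run(argv)` without shell quoting concerns.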

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Ralph Castain
Ah crud - I see what's going on. This is an issue of a message coming in on one interface that needs to get transferred to another one for relay. Looks like that mechanism is broken, which is causing us to issue another show_help, which gets caught in the same loop again. I'll work on it - may

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Ralph,
the application still hangs, i attached new logs.
on slurm0, if i /sbin/ifconfig eth0:1 down then the application does not hang any more
Cheers,
Gilles
On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain wrote:
> I appear to have this fixed now - please give the current trunk (r31949 or
>

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Ralph Castain
I appear to have this fixed now - please give the current trunk (r31949 or above) a spin to see if I got it for you too.
On Jun 3, 2014, at 6:06 AM, Ralph Castain wrote:
> You can leave it running - I just needed to know. If mpirun sees slurm (i.e.,
> you were running inside a slurm allocatio

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Ralph Castain
You can leave it running - I just needed to know. If mpirun sees slurm (i.e., you were running inside a slurm allocation), it will use it.
On Jun 3, 2014, at 5:43 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> slurm is installed and running on both nodes.
>
> that being said, there is no runni

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
Ralph, slurm is installed and running on both nodes. that being said, there is no running job on any node so unless mpirun automagically detects slurm is up and running, i assume i am running under rsh. i can run the test again after i stop slurm if needed, but that will not happen before tomorr

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Ralph Castain
On Jun 3, 2014, at 3:06 AM, Gilles Gouaillardet wrote:
> Ralph,
>
> i get no more complaints about rtc :-)
>
> but MPI_Abort still hangs :-(
>
> i reviewed my configuration and the hang is not related to one node having
> one IB port and the other node having two IB ports.
>
> the two nodes

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
Ralph,
i get no more complaints about rtc :-)
but MPI_Abort still hangs :-(
i reviewed my configuration and the hang is not related to one node having one IB port and the other node having two IB ports.
the two nodes can establish TCP connections via :
- eth0 (but they are *not* on the same subn
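
To make the "same subnet" condition concrete, here is a minimal sketch using Python's standard `ipaddress` module. It is not Open MPI's actual peer-selection logic, and the addresses are made up for illustration. As noted elsewhere in the thread, two hosts on different subnets can still reach each other when IP routing is in place, so a subnet match is only a test for direct reachability.

```python
import ipaddress


def same_subnet(a, b):
    """Return True if two 'address/prefix' strings fall in the same IP network."""
    return ipaddress.ip_interface(a).network == ipaddress.ip_interface(b).network


# Hypothetical eth0 addresses for two nodes on different /24 subnets.
print(same_subnet("192.168.1.10/24", "192.168.2.10/24"))

# Hypothetical addresses for two nodes that share one /24 subnet.
print(same_subnet("192.168.1.10/24", "192.168.1.20/24"))
```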

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Ralph,
i will try this tomorrow
Cheers,
Gilles
On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain wrote:
> I think I have this fixed with r31928, but have no way to test it on my
> machine. Please see if it works for you.
>
>
> On Jun 2, 2014, at 7:09 AM, Ralph Castain wrote:
>
> This i

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Ralph Castain
I think I have this fixed with r31928, but have no way to test it on my machine. Please see if it works for you.
On Jun 2, 2014, at 7:09 AM, Ralph Castain wrote:
> This is indeed the problem - we are trying to send a message and don't know
> how to get it somewhere. I'll break the loop, and t

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Jeff,
from the FAQ, openmpi should work on nodes that have a different number of IB ports (at least since v1.2)
about IB ports on the same subnet, all i was able to find is an explanation of why i get this warning :
WARNING: There are more than one active ports on host '%s', but the default

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Ralph Castain
This is indeed the problem - we are trying to send a message and don't know how to get it somewhere. I'll break the loop, and then ask that you run this again with -mca oob_base_verbose 10 so we can see the intended recipient.
On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet wrote:
> #7 0x000

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Jeff Squyres (jsquyres)
I'm AFK but let me reply about the IB thing: double ports/multi rail is a good thing. It's not a good thing if they're on the same subnet. Check the FAQ - http://www.open-mpi.org/faq/?category=openfabrics - I can't see it well enough on the small screen of my phone, but I think there's a q on t

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Jeff,
On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) wrote:
> On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > i faced a slightly different problem, but that is 100% reproducible :
> > - i launch mpirun (no batch manager) from a node with one I

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Jeff Squyres (jsquyres)
On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet wrote:
> i faced a slightly different problem, but that is 100% reproducible :
> - i launch mpirun (no batch manager) from a node with one IB port
> - i use -host node01,node02 where node01 and node02 both have two IB ports on the same subnet
FW

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Rolf,
i faced a slightly different problem, but that is 100% reproducible :
- i launch mpirun (no batch manager) from a node with one IB port
- i use -host node01,node02 where node01 and node02 both have two IB ports on the same subnet
by default, this will hang.
if this is a "feature" (e.g. openmpi