Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
I keep explaining that we don't "discard" anything, but there really isn't any point in continuing to try to explain the system. With the announced intention of completing the move of the BTLs to OPAL, I no longer need the multi-module complexity in the OOB/TCP. So I have removed it and gone bac

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Gilles Gouaillardet
Jeff, as pointed out by Ralph, I do wish to use eth0 for oob messages. I work on a 4k+ node cluster with a very decent gigabit Ethernet network (reasonable oversubscription + switches from a reputable vendor you are familiar with ;-) ). My experience is that IPoIB can be very slow at establishing a co

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
On Jun 5, 2014, at 7:09 AM, Ralph Castain wrote:
> Okay, before you go chasing this, let me explain that we already try to
> address this issue in the TCP oob. When we need to connect to someone, we do
> the following:
>
> 1. if we have a direct connection available, we hand the message to th

Re: [OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-05 Thread Hjelm, Nathan T
Coll/ml does disqualify itself if processes are not bound. The problem here is there is an inconsistency between the two sides of the intercommunicator. I can write a quick fix for 1.8.2.
-Nathan
From: devel [devel-boun...@open-mpi.org] on behalf of Gille

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Okay, before you go chasing this, let me explain that we already try to address this issue in the TCP oob. When we need to connect to someone, we do the following:
1. if we have a direct connection available, we hand the message to the software module assigned to that NIC
2. if none of the ava

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Because Gilles wants to avoid using IB for TCP messages, and using eth0 also solves the problem (the messages just route)
On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) wrote:
> Another random thought for Gilles situation: why not oob-TCP-if-include ib0?
> (And not eth0)
>
> That should

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
Another random thought for Gilles situation: why not oob-TCP-if-include ib0? (And not eth0)
That should solve his problem, but not the larger issue I raised in my previous email.
Sent from my phone. No type good.
On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" mailto:gilles.gouaillar...@gm

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
That raises a larger issue -- what about Ethernet-only clusters that span multiple IP/L3 subnets? This is a scenario that Cisco definitely wants to enable/support. The usnic BTL, for example, can handle this scenario. We hadn't previously considered the TCP oob component effects in this scena

[OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-05 Thread Gilles Gouaillardet
Folks, on my single socket four cores VM (no batch manager), i am running the intercomm_create test from the ibm test suite.
mpirun -np 1 ./intercomm_create => OK
mpirun -np 2 ./intercomm_create => HANG :-(
mpirun -np 2 --mca coll ^ml ./intercomm_create => OK
basically, this first two tasks w

[OMPI devel] RFC: Move the Open MPI communication infrastructure in OPAL

2014-06-05 Thread George Bosilca
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
WHY: All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/ins