[OMPI devel] RFC: Move the Open MPI communication infrastructure in OPAL

2014-06-05 Thread George Bosilca
WHAT: Open up our low-level communication infrastructure by moving all
necessary components (btl/rcache/allocator/mpool) down into OPAL.

WHY: All the components required for inter-process communication are
currently deeply integrated into the OMPI layer. Several
groups/institutions have expressed interest in a more generic
communication infrastructure, without all of the OMPI layer
dependencies. This communication layer should be made available at a
different software level, accessible to all layers in the Open MPI
software stack. As an example, our ORTE layer could replace the current
OOB and instead use the BTLs directly, gaining access to more reactive
network interfaces than TCP. Similarly, external software libraries
could take advantage of our highly optimized AM (active message)
communication layer for their own purposes.

  UTK, with support from Sandia, developed a version of Open MPI where
the entire communication infrastructure (btl/rcache/allocator/mpool)
has been moved down to OPAL. Most of the moved components have been
updated to match the new scheme, with a few exceptions (mainly BTLs
that I have no way of compiling/testing). Thus, the completion of this
RFC is tied to being able to complete this move for all BTLs. For this
we need help from the rest of the Open MPI community, especially from
those supporting some of the BTLs. A non-exhaustive list of BTLs that
qualify here is: mx, portals4, scif, udapl, ugni, usnic.

WHERE: bitbucket.org/bosilca/ompi-btl (updated today with respect to trunk r31952)

TIMEOUT: After all the BTLs have been amended to match the new location
and usage. We will discuss the last bits regarding this RFC at the Open
MPI developers meeting in Chicago, June 24-26. The RFC will become
final only after the meeting.


[OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-05 Thread Gilles Gouaillardet
Folks,

On my single-socket, four-core VM (no batch manager), I am running the
intercomm_create test from the ibm test suite.

mpirun -np 1 ./intercomm_create
=> OK

mpirun -np 2 ./intercomm_create
=> HANG :-(

mpirun -np 2 --mca coll ^ml  ./intercomm_create
=> OK

Basically, the first two tasks will call MPI_Comm_spawn(2 tasks) twice,
followed by MPI_Intercomm_merge, and the 4 spawned tasks will call
MPI_Intercomm_merge followed by MPI_Intercomm_create.
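
For context, here is a minimal sketch of a single spawn/merge round. This
is not the actual ibm test source; the process count and the use of
argv[0] as the spawned command are illustrative assumptions only:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, inter, merged;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (MPI_COMM_NULL == parent) {
            /* original tasks: spawn 2 more copies of this binary ... */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
            /* ... and merge the resulting inter-communicator */
            MPI_Intercomm_merge(inter, 0, &merged);
        } else {
            /* spawned tasks: merge with the parents */
            MPI_Intercomm_merge(parent, 1, &merged);
        }

        MPI_Comm_free(&merged);
        MPI_Finalize();
        return 0;
    }

In the actual test the parents then spawn a second time and the spawned
tasks go on to call MPI_Intercomm_create, as described above.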

I dug a bit into this and found two distinct issues:

1) binding:
tasks [0-1] (launched with mpirun) are bound on cores [0-1] => OK
tasks [2-3] (first spawn) are bound on cores [0-1] => ODD, I would have expected [2-3]
tasks [4-5] (second spawn) are not bound at all => ODD again, this could have made sense if tasks [2-3] were bound on cores [2-3]
I observe the same behaviour with the --oversubscribe mpirun parameter

2) coll/ml:
coll/ml hangs with -np 2 (6 tasks in total, including 2 unbound tasks).
I suspect coll/ml is unable to handle unbound tasks.
If I am correct, should coll/ml detect this and simply disqualify itself
automatically?

Cheers,

Gilles


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
That raises a larger issue -- what about Ethernet-only clusters that span 
multiple IP/L3 subnets?  This is a scenario that Cisco definitely wants to 
enable/support.

The usnic BTL, for example, can handle this scenario.  We hadn't previously 
considered the TCP oob component effects in this scenario -- oops.

Hmm.

The usnic BTL both does lazy connections (so to speak...) and uses a 
connectivity checker to ensure that it can actually communicate with each peer. 
 In this way, OMPI has a way of knowing whether process A can communicate with 
process B, even if A and B have effectively unrelated IP addresses (i.e., 
they're not on the same IP subnet).

I don't think the TCP oob will be able to use this same kind of strategy.

As a simple solution, there could be a TCP oob MCA param that says "regardless 
of peer IP address, I can connect to them" (i.e., assume IP routing will make 
everything work out ok).

That doesn't seem like a good overall solution, however -- it doesn't 
necessarily fit in the "it just works out of the box" philosophy that we like 
to have in OMPI.

Let me take this back to some IP experts here and see if someone can come up 
with a better idea.



On Jun 4, 2014, at 10:09 PM, Ralph Castain  wrote:

> Well, the problem is that we can't simply decide that anything called "ib.." 
> is an IB port and should be ignored. There is no naming rule regarding IP 
> interfaces that I've ever heard about that would allow us to make such an 
> assumption, though I admit most people let the system create default names 
> and thus would get something like an "ib..".
> 
> So we leave it up to the sys admin to configure the system based on their 
> knowledge of what they want to use. On the big clusters at the labs, we 
> commonly put MCA params in the default param file for this purpose as we 
> *don't* want OOB traffic going over the IB fabric.
> 
> But that's the sys admin's choice, not a requirement. I've seen organizations 
> that do it the other way because their Ethernet is really slow.
> 
> In this case, the problem is really in the OOB itself. The local proc is 
> connecting to its local daemon via eth0, which is fine. When it sends a 
> message to mpirun on a different proc, that message goes from the app to the 
> daemon via eth0. The daemon looks for mpirun in its contact list, and sees 
> that it has a direct link to mpirun via this nifty "ib0" interface - and so 
> it uses that one to relay the message along.
> 
> This is where we are hitting the problem - the OOB isn't correctly doing the 
> transfer between those two interfaces like it should. So it is a bug that we 
> need to fix, regardless of any other actions (e.g., if it was an eth1 that 
> was the direct connection, we would still want to transfer the message to the 
> other interface).
> 
> HTH
> Ralph
> 
> On Jun 4, 2014, at 7:32 PM, Gilles Gouaillardet 
>  wrote:
> 
>> Thanks Ralf,
>> 
>> for the time being, i just found a workaround
>> --mca oob_tcp_if_include eth0
>> 
>> Generally speaking, is openmpi doing the wiser thing ?
>> here is what i mean :
>> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>>  * eth0 (gigabit ethernet) : because of the cluster size, several subnets 
>> are used.
>>  * ib0 (IP over IB) : only one subnet
>> i can easily understand such a large cluster is not so common, but on the 
>> other hand i do not believe the IP configuration (subnetted gigE and single 
>> subnet IPoIB) can be called exotic.
>> 
>> if nodes from different eth0 subnets are used, and if i understand correctly 
>> your previous replies, orte will "discard" eth0 because nodes cannot contact 
>> each other "directly".
>> directly means the nodes are not on the same subnet. that being said, they 
>> can communicate via IP thanks to IP routing (i mean IP routing, i do *not* 
>> mean orte routing).
>> that means orte communications will use IPoIB which might not be the best 
>> thing to do since establishing an IPoIB connection can be long (especially 
>> at scale *and* if the arp table is not populated)
>> 
>> is my understanding correct so far ?
>> 
>> bottom line, i would have expected openmpi uses eth0 regardless IP routing 
>> is required, and ib0 is simply not used (or eventually used as a fallback 
>> option)
>> 
>> this leads to my next question : is the current default ok ? if not should 
>> we change it and how ?
>> /*
>> imho :
>>  - IP routing is not always a bad/slow thing
>>  - gigE can sometimes be better than IPoIB)
>> */
>> 
>> i am fine if at the end :
>> - this issue is fixed
>> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 
>> the default if this is really thought to be best for the cluster. (and i can 
>> try to draft a faq if needed)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain  wrote:
>> 
>> I'll work on it - may take a day or two to really fix. Only impacts systems 
>> with mismatched interfaces, which is why we aren't generally seeing it.

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
Another random thought for Gilles' situation: why not oob_tcp_if_include ib0?
(And not eth0)

That should solve his problem, but not the larger issue I raised in my previous 
email.

Sent from my phone. No type good.

On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" 
<gilles.gouaillar...@gmail.com> wrote:

Thanks Ralph,

For the time being, I just found a workaround:
--mca oob_tcp_if_include eth0

Generally speaking, is openmpi doing the wisest thing?
Here is what I mean: on the cluster I work on (4k+ nodes), each node has two IP 
interfaces:
 * eth0 (gigabit ethernet): because of the cluster size, several subnets are 
used.
 * ib0 (IP over IB): only one subnet.
I can easily understand that such a large cluster is not so common, but on the 
other hand I do not believe the IP configuration (subnetted gigE and 
single-subnet IPoIB) can be called exotic.

If nodes from different eth0 subnets are used, and if I understand your previous 
replies correctly, orte will "discard" eth0 because the nodes cannot contact 
each other "directly", i.e. they are not on the same subnet. That being said, 
they can communicate via IP thanks to IP routing (I mean IP routing, I do *not* 
mean orte routing). That means orte communications will use IPoIB, which might 
not be the best thing to do, since establishing an IPoIB connection can take a 
long time (especially at scale *and* if the ARP table is not populated).

Is my understanding correct so far?

Bottom line, I would have expected openmpi to use eth0 regardless of whether IP 
routing is required, with ib0 simply not used (or possibly used as a fallback 
option).

This leads to my next question: is the current default ok? If not, should we 
change it, and how?
/*
imho:
 - IP routing is not always a bad/slow thing
 - gigE can sometimes be better than IPoIB
*/

I am fine if, in the end:
- this issue is fixed
- we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 the 
default, if this is really thought to be best for the cluster (and I can try to 
draft a faq if needed).

Cheers,

Gilles

On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain <r...@open-mpi.org> wrote:

I'll work on it - may take a day or two to really fix. Only impacts systems 
with mismatched interfaces, which is why we aren't generally seeing it.



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
solves the problem (the messages just route)

On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  wrote:

> Another random thought for Gilles situation: why not oob-TCP-if-include ib0?  
> (And not eth0)
> 
> That should solve his problem, but not the larger issue I raised in my 
> previous email. 
> 
> Sent from my phone. No type good. 
> 
> On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" 
>  wrote:
> 
>> Thanks Ralf,
>> 
>> for the time being, i just found a workaround
>> --mca oob_tcp_if_include eth0
>> 
>> Generally speaking, is openmpi doing the wiser thing ?
>> here is what i mean :
>> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>>  * eth0 (gigabit ethernet) : because of the cluster size, several subnets 
>> are used.
>>  * ib0 (IP over IB) : only one subnet
>> i can easily understand such a large cluster is not so common, but on the 
>> other hand i do not believe the IP configuration (subnetted gigE and single 
>> subnet IPoIB) can be called exotic.
>> 
>> if nodes from different eth0 subnets are used, and if i understand correctly 
>> your previous replies, orte will "discard" eth0 because nodes cannot contact 
>> each other "directly".
>> directly means the nodes are not on the same subnet. that being said, they 
>> can communicate via IP thanks to IP routing (i mean IP routing, i do *not* 
>> mean orte routing).
>> that means orte communications will use IPoIB which might not be the best 
>> thing to do since establishing an IPoIB connection can be long (especially 
>> at scale *and* if the arp table is not populated)
>> 
>> is my understanding correct so far ?
>> 
>> bottom line, i would have expected openmpi uses eth0 regardless IP routing 
>> is required, and ib0 is simply not used (or eventually used as a fallback 
>> option)
>> 
>> this leads to my next question : is the current default ok ? if not should 
>> we change it and how ?
>> /*
>> imho :
>>  - IP routing is not always a bad/slow thing
>>  - gigE can sometimes be better than IPoIB)
>> */
>> 
>> i am fine if at the end :
>> - this issue is fixed
>> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 
>> the default if this is really thought to be best for the cluster. (and i can 
>> try to draft a faq if needed)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain  wrote:
>> 
>> I'll work on it - may take a day or two to really fix. Only impacts systems 
>> with mismatched interfaces, which is why we aren't generally seeing it.
>> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Okay, before you go chasing this, let me explain that we already try to address 
this issue in the TCP oob. When we need to connect to someone, we do the 
following:

1. if we have a direct connection available, we hand the message to the 
software module assigned to that NIC

2. if none of the available NICs match the target's subnet, then we assign the 
message to the software module for the first NIC in the system - i.e., the one 
with the lowest kernel index - and let it try to send the message. We expect 
the OS to know how to route the connection.

3. if that fails for some reason, then we'll try to assign it to the software 
module for the next NIC in the system, continuing down this path until every 
module has had a chance to try.

4. if no TCP module can send it, we bump it back up to the OOB framework to see 
if another component can send it. At the moment, we don't have one, but that 
will shortly change.

My intention is to be a little more intelligent on step #2. At the very least, 
I'd like to see us find the closest subnet match - just check tuples to see who 
has the most matches. So if the target is on 10.1.2.3 and I have two NICs 
10.2.3.x and 192.168.2.y, then I should pick the first one since it at least 
matches something.
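
For illustration, here is a minimal standalone sketch of that tuple-matching
idea. It is not the oob/tcp implementation; plain IPv4 dotted quads are
assumed:

    #include <stdio.h>

    /* count how many leading dotted-quad tuples two IPv4 addresses share */
    static int matching_tuples(const char *a, const char *b)
    {
        unsigned int x[4] = {0}, y[4] = {0};
        int i, n = 0;

        sscanf(a, "%u.%u.%u.%u", &x[0], &x[1], &x[2], &x[3]);
        sscanf(b, "%u.%u.%u.%u", &y[0], &y[1], &y[2], &y[3]);
        for (i = 0; i < 4 && x[i] == y[i]; i++) {
            n++;
        }
        return n;
    }

    int main(void)
    {
        /* target 10.1.2.3: the 10.2.3.x NIC shares one tuple, 192.168.2.y none */
        printf("10.2.3.4    -> %d\n", matching_tuples("10.1.2.3", "10.2.3.4"));
        printf("192.168.2.5 -> %d\n", matching_tuples("10.1.2.3", "192.168.2.5"));
        return 0;
    }

In step #2, the NIC whose address yields the highest count would then be tried
first.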

If your IP experts have a better solution, please pass it along! What is 
causing the problem here is that the message comes in on one NIC that doesn't 
have a direct connection to the target, and the "hop" mechanism isn't working 
correctly (kicks into an infinite loop).



On Jun 5, 2014, at 4:27 AM, Jeff Squyres (jsquyres)  wrote:

> That raises a larger issue -- what about Ethernet-only clusters that span 
> multiple IP/L3 subnets?  This is a scenario that Cisco definitely wants to 
> enable/support.
> 
> The usnic BTL, for example, can handle this scenario.  We hadn't previously 
> considered the TCP oob component effects in this scenario -- oops.
> 
> Hmm.
> 
> The usnic BTL both does lazy connections (so to speak...) and uses a 
> connectivity checker to ensure that it can actually communicate with each 
> peer.  In this way, OMPI has a way of knowing whether process A can 
> communicate with process B, even if A and B have effectively unrelated IP 
> addresses (i.e., they're not on the same IP subnet).
> 
> I don't think the TCP oob will be able to use this same kind of strategy.
> 
> As a simple solution, there could be an TCP oob MCA param that says 
> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
> routing will make everything work out ok).
> 
> That doesn't seem like a good overall solution, however -- it doesn't 
> necessarily fit in the "it just works out of the box" philosophy that we like 
> to have in OMPI.
> 
> Let me take this back to some IP experts here and see if someone can come up 
> with a better idea.
> 
> 
> 
> On Jun 4, 2014, at 10:09 PM, Ralph Castain  wrote:
> 
>> Well, the problem is that we can't simply decide that anything called "ib.." 
>> is an IB port and should be ignored. There is no naming rule regarding IP 
>> interfaces that I've ever heard about that would allow us to make such an 
>> assumption, though I admit most people let the system create default names 
>> and thus would get something like an "ib..".
>> 
>> So we leave it up to the sys admin to configure the system based on their 
>> knowledge of what they want to use. On the big clusters at the labs, we 
>> commonly put MCA params in the default param file for this purpose as we 
>> *don't* want OOB traffic going over the IB fabric.
>> 
>> But that's the sys admin's choice, not a requirement. I've seen 
>> organizations that do it the other way because their Ethernet is really slow.
>> 
>> In this case, the problem is really in the OOB itself. The local proc is 
>> connecting to its local daemon via eth0, which is fine. When it sends a 
>> message to mpirun on a different proc, that message goes from the app to the 
>> daemon via eth0. The daemon looks for mpirun in its contact list, and sees 
>> that it has a direct link to mpirun via this nifty "ib0" interface - and so 
>> it uses that one to relay the message along.
>> 
>> This is where we are hitting the problem - the OOB isn't correctly doing the 
>> transfer between those two interfaces like it should. So it is a bug that we 
>> need to fix, regardless of any other actions (e.g., if it was an eth1 that 
>> was the direct connection, we would still want to transfer the message to 
>> the other interface).
>> 
>> HTH
>> Ralph
>> 
>> On Jun 4, 2014, at 7:32 PM, Gilles Gouaillardet 
>>  wrote:
>> 
>>> Thanks Ralf,
>>> 
>>> for the time being, i just found a workaround
>>> --mca oob_tcp_if_include eth0
>>> 
>>> Generally speaking, is openmpi doing the wiser thing ?
>>> here is what i mean :
>>> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>>> * eth0 (gigabit ethernet) : because of the cluster size, several subnets 
>>> are used.
>>> * ib0 (IP over IB) : only one subnet

Re: [OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-05 Thread Hjelm, Nathan T
Coll/ml does disqualify itself if processes are not bound. The problem here is 
that there is an inconsistency between the two sides of the intercommunicator. 
I can write a quick fix for 1.8.2.

-Nathan

From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet 
[gilles.gouaillar...@gmail.com]
Sent: Thursday, June 05, 2014 1:20 AM
To: Open MPI Developers
Subject: [OMPI devel] MPI_Comm_spawn affinity and coll/ml

Folks,

On my single-socket, four-core VM (no batch manager), I am running the 
intercomm_create test from the ibm test suite.

mpirun -np 1 ./intercomm_create
=> OK

mpirun -np 2 ./intercomm_create
=> HANG :-(

mpirun -np 2 --mca coll ^ml  ./intercomm_create
=> OK

Basically, the first two tasks will call MPI_Comm_spawn(2 tasks) twice, followed 
by MPI_Intercomm_merge, and the 4 spawned tasks will call MPI_Intercomm_merge 
followed by MPI_Intercomm_create.

I dug a bit into this and found two distinct issues:

1) binding:
tasks [0-1] (launched with mpirun) are bound on cores [0-1] => OK
tasks [2-3] (first spawn) are bound on cores [0-1] => ODD, I would have expected 
[2-3]
tasks [4-5] (second spawn) are not bound at all => ODD again, this could have 
made sense if tasks [2-3] were bound on cores [2-3]
I observe the same behaviour with the --oversubscribe mpirun parameter

2) coll/ml:
coll/ml hangs with -np 2 (6 tasks in total, including 2 unbound tasks).
I suspect coll/ml is unable to handle unbound tasks.
If I am correct, should coll/ml detect this and simply disqualify itself 
automatically?

Cheers,

Gilles



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain

On Jun 5, 2014, at 7:09 AM, Ralph Castain  wrote:

> Okay, before you go chasing this, let me explain that we already try to 
> address this issue in the TCP oob. When we need to connect to someone, we do 
> the following:
> 
> 1. if we have a direct connection available, we hand the message to the 
> software module assigned to that NIC
> 
> 2. if none of the available NICs match the target's subnet, then we assign 
> the message to the software module for the first NIC in the system - i.e., 
> the one with the lowest kernel index - and let it try to send the message. We 
> expect the OS to know how to route the connection.
> 
> 3. if that fails for some reason, then we'll try assign it to the software 
> module for the next NIC in the system, continuing down this path until every 
> module has had a chance to try.

Actually, this isn't quite correct. The NIC we assigned it to will cycle across 
all of the known connection addresses for the intended target, trying each in 
turn. If *none* of those successfully connect, then the module declares that it 
is unable to make the connection.

At that point, we let the next software module try. This has always bothered me 
a bit as I don't see how it can succeed if the first one failed - the OS is 
going to decide which NIC to send the connection request across anyway. All we 
are doing is assigning the thread that will make the connection request. So 
long as that thread tries all the connection addresses, it shouldn't matter 
which thread makes the attempt.

Point being: we can probably just let the one thread make the attempt and give 
up if it fails on all known addresses for the target. We can then bounce it up 
to the OOB framework and let someone else try with a different transport, 
should one be available for that target. This would simplify the logic.
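
In pseudo-code, the simplified flow described above might look something like
this (a standalone sketch with made-up names, not the actual orte/oob code):

    #include <stdio.h>
    #include <stdbool.h>

    /* All names below are invented for illustration; this is not the oob API. */
    #define MAX_ADDRS 4

    typedef struct {
        const char *addrs[MAX_ADDRS];
        int num_addrs;
    } peer_t;

    /* stand-in for an actual TCP connect + send attempt */
    static bool tcp_connect_and_send(const char *addr, const char *msg)
    {
        printf("trying to send '%s' via %s ...\n", msg, addr);
        return false;   /* pretend every address fails */
    }

    /* stand-in for handing the message back to the OOB framework */
    static bool oob_framework_try_next_component(const char *msg)
    {
        printf("no TCP address worked, bouncing '%s' up to the OOB framework\n", msg);
        return false;
    }

    static bool try_send_tcp(const peer_t *peer, const char *msg)
    {
        /* a single module walks every known address for the target ... */
        for (int i = 0; i < peer->num_addrs; i++) {
            if (tcp_connect_and_send(peer->addrs[i], msg)) {
                return true;
            }
        }
        /* ... and only if all of them fail do we give up on TCP entirely */
        return oob_framework_try_next_component(msg);
    }

    int main(void)
    {
        peer_t mpirun = { { "10.1.2.3", "192.168.2.1" }, 2 };
        try_send_tcp(&mpirun, "hello");
        return 0;
    }

The real code obviously has to deal with events, retries and message queues;
the sketch only shows the control flow.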


> 
> 4. if no TCP module can send it, we bump it back up to the OOB framework to 
> see if another component can send it. At the moment, we don't have one, but 
> that will shortly change.
> 
> My intention is to be a little more intelligent on step #2. At the very 
> least, I'd like to see us find the closest subnet match - just check tuples 
> to see who has the most matches. So if the target is on 10.1.2.3 and I have 
> two NICs 10.2.3.x and 192.168.2.y, then I should pick the first one since it 
> at least matches something.
> 
> If your IP experts have a better solution, please pass it along! What is 
> causing the problem here is that the message comes in on one NIC that doesn't 
> have a direct connection to the target, and the "hop" mechanism isn't working 
> correctly (kicks into an infinite loop).
> 
> 
> 
> On Jun 5, 2014, at 4:27 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> That raises a larger issue -- what about Ethernet-only clusters that span 
>> multiple IP/L3 subnets?  This is a scenario that Cisco definitely wants to 
>> enable/support.
>> 
>> The usnic BTL, for example, can handle this scenario.  We hadn't previously 
>> considered the TCP oob component effects in this scenario -- oops.
>> 
>> Hmm.
>> 
>> The usnic BTL both does lazy connections (so to speak...) and uses a 
>> connectivity checker to ensure that it can actually communicate with each 
>> peer.  In this way, OMPI has a way of knowing whether process A can 
>> communicate with process B, even if A and B have effectively unrelated IP 
>> addresses (i.e., they're not on the same IP subnet).
>> 
>> I don't think the TCP oob will be able to use this same kind of strategy.
>> 
>> As a simple solution, there could be an TCP oob MCA param that says 
>> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
>> routing will make everything work out ok).
>> 
>> That doesn't seem like a good overall solution, however -- it doesn't 
>> necessarily fit in the "it just works out of the box" philosophy that we 
>> like to have in OMPI.
>> 
>> Let me take this back to some IP experts here and see if someone can come up 
>> with a better idea.
>> 
>> 
>> 
>> On Jun 4, 2014, at 10:09 PM, Ralph Castain  wrote:
>> 
>>> Well, the problem is that we can't simply decide that anything called 
>>> "ib.." is an IB port and should be ignored. There is no naming rule 
>>> regarding IP interfaces that I've ever heard about that would allow us to 
>>> make such an assumption, though I admit most people let the system create 
>>> default names and thus would get something like an "ib..".
>>> 
>>> So we leave it up to the sys admin to configure the system based on their 
>>> knowledge of what they want to use. On the big clusters at the labs, we 
>>> commonly put MCA params in the default param file for this purpose as we 
>>> *don't* want OOB traffic going over the IB fabric.
>>> 
>>> But that's the sys admin's choice, not a requirement. I've seen 
>>> organizations that do it the other way because their Ethernet is really 
>>> slow.
>>> 
>>> In this case, the problem is really in the OOB itself. The local proc is 
>>> connecting to its local daemon via eth0, which is fine. [...]

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Gilles Gouaillardet
Jeff,

As Ralph pointed out, I do wish to use eth0 for oob messages.

I work on a 4k+ node cluster with a very decent gigabit ethernet
network (reasonable oversubscription + switches
from a reputable vendor you are familiar with ;-) ).
My experience is that IPoIB can be very slow at establishing a
connection, especially if the ARP table is not populated
(as far as I understand, this involves the subnet manager, and
performance can be very erratic, especially if all nodes issue
ARP requests at the same time).
On the other hand, performance is much more stable when using the
subnetted IP network.

As Ralph also pointed out, I can imagine some architects neglect their
ethernet network (e.g. highly oversubscribed + low-end switches),
and in that case ib0 is a better fit for oob messages.

> As a simple solution, there could be an TCP oob MCA param that says 
> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
> routing will make everything work out ok).
+1, and/or an option to tell the oob mca "do not discard the interface simply
because the peer IP is not in the same subnet".
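
For what it is worth, the workaround can also be made a site-wide default via
the MCA parameter file (a sketch only, assuming a standard installation
prefix):

    # $prefix/etc/openmpi-mca-params.conf
    # keep oob traffic on the gigabit ethernet interfaces
    oob_tcp_if_include = eth0

A per-job --mca oob_tcp_if_include on the command line would still override
the file default.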

Cheers,

Gilles

On 2014/06/05 23:01, Ralph Castain wrote:
> Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
> solves the problem (the messages just route)
>
> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  
> wrote:
>
>> Another random thought for Gilles situation: why not oob-TCP-if-include ib0? 
>>  (And not eth0)
>>



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
I keep explaining that we don't "discard" anything, but there really isn't any 
point in continuing to try to explain the system. With the announced intention 
of completing the move of the BTLs to OPAL, I no longer need the multi-module 
complexity in the OOB/TCP. So I have removed it and gone back to a single 
module that connects to everything.

Try r31956 - hopefully it will resolve your connectivity issues.

Still looking at the MPI_Abort hang as I'm having trouble replicating it.


On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet  
wrote:

> Jeff,
> 
> as pointed by Ralph, i do wish using eth0 for oob messages.
> 
> i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> network (reasonable oversubscription + switches
> from a reputable vendor you are familiar with ;-) )
> my experience is that IPoIB can be very slow at establishing a
> connection, especially if the arp table is not populated
> (as far as i understand, this involves the subnet manager and
> performance can be very random especially if all nodes issue
> arp requests at the same time)
> on the other hand, performance is much more stable when using the
> subnetted IP network.
> 
> as Ralf also pointed, i can imagine some architects neglect their
> ethernet network (e.g. highly oversubscribed + low end switches)
> and in this case ib0 is a best fit for oob messages.
> 
>> As a simple solution, there could be an TCP oob MCA param that says 
>> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
>> routing will make everything work out ok).
> +1 and/or an option to tell oob mca "do not discard the interface simply
> because the peer IP is not in the same subnet"
> 
> Cheers,
> 
> Gilles
> 
> On 2014/06/05 23:01, Ralph Castain wrote:
>> Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
>> solves the problem (the messages just route)
>> 
>> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>>> Another random thought for Gilles situation: why not oob-TCP-if-include 
>>> ib0?  (And not eth0)
>>> 
> 