Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Ralph Castain

On Jun 6, 2014, at 7:11 AM, Jeff Squyres (jsquyres)  wrote:

> Looks like Ralph's simpler solution fit the bill.

Yeah, but I still am unhappy with it. It's about the stupidest connection model 
you can imagine. What happens is this:

* a process constructs its URI - this is done by creating a string with the 
IP:PORT for each subnet the proc is listening on. The URI is constructed in 
alphabetical order (well, actually in kernel index order - but that tends to 
follow the alphabetical order of the interface names). This then gets passed to 
the other process

* the sender breaks the URI into its component parts and creates a list of 
addresses for the target. This list gets created in the order of the components 
- i.e., we take the first IP:PORT out of the URI, and that is our first address.

* when the sender initiates a connection, it takes the first address in the 
list (which means the alphabetically first name in the target's list of 
interfaces) and initiates the connection on that subnet. If it succeeds, then 
that is the subnet we use for all subsequent messages.

So if the first subnet can reach the target, even if it means bouncing all over 
the Internet, we will use it - even though the second subnet in the URI might 
have provided a direct connection!

It solves Gilles' problem because "ib" comes after "eth", and it matches what 
was done in the original OOB (before my rewrite) - but it sure sounds to me 
like a bad, inefficient solution for general use.
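
(Illustration only, not the actual oob/tcp code: a minimal sketch of the connection model 
described above, assuming a made-up "ip:port;ip:port" contact string. The addresses are tried 
strictly in the order they appear, and the first one that answers is kept, with no notion of 
which subnet would have been the shorter path.)

/* Minimal sketch (NOT the real oob/tcp code) of the ordering behaviour:
 * the contact string format "ip:port;ip:port" is made up for illustration. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int connect_first(const char *uri)
{
    char *copy = strdup(uri), *save = NULL;
    int fd = -1;

    /* walk the addresses in the order they appear in the URI */
    for (char *tok = strtok_r(copy, ";", &save); tok != NULL;
         tok = strtok_r(NULL, ";", &save)) {
        char ip[64];
        int port;
        if (sscanf(tok, "%63[^:]:%d", ip, &port) != 2) continue;

        struct sockaddr_in sa = { 0 };
        sa.sin_family = AF_INET;
        sa.sin_port = htons((uint16_t) port);
        if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1) continue;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) continue;
        if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) == 0) {
            break;   /* first subnet that answers is kept for everything */
        }
        close(fd);
        fd = -1;
    }
    free(copy);
    return fd;       /* -1 if no address in the URI could be reached */
}

int main(void)
{
    /* e.g. an eth0 address listed before an ib0 address: eth0 wins even
     * when the ib0 path would have been the direct one */
    int fd = connect_first("10.1.2.3:1024;192.168.0.3:1024");
    if (fd >= 0) close(fd);
    return 0;
}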





Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Jeff Squyres (jsquyres)
On Jun 5, 2014, at 9:16 PM, Gilles Gouaillardet  
wrote:

> i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> network (reasonable oversubscription + switches
> from a reputable vendor you are familiar with ;-) )
> my experience is that IPoIB can be very slow at establishing a
> connection, especially if the arp table is not populated
> (as far as i understand, this involves the subnet manager and
> performance can be very random especially if all nodes issue
> arp requests at the same time)
> on the other hand, performance is much more stable when using the
> subnetted IP network.

Got it.

>> As a simple solution, there could be a TCP oob MCA param that says 
>> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
>> routing will make everything work out ok).
> +1 and/or an option to tell oob mca "do not discard the interface simply
> because the peer IP is not in the same subnet"

Looks like Ralph's simpler solution fit the bill.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Ralph Castain
Kewl - thanks!

On Jun 5, 2014, at 9:28 PM, Gilles Gouaillardet  
wrote:

> Ralph,
> 
> sorry for my poor understanding ...
> 
> i tried r31956 and it solved both issues :
> - MPI_Abort does not hang any more if nodes are on different eth0 subnets
> - MPI_Init does not hang any more if hosts have different number of IB ports
> 
> this likely explains why you are having trouble replicating it ;-)
> 
> Thanks a lot !
> 
> Gilles
> 
> 
> On Fri, Jun 6, 2014 at 11:45 AM, Ralph Castain  wrote:
> I keep explaining that we don't "discard" anything, but there really isn't 
> any point to continuing trying to explain the system. With the announced 
> intention of completing the move of the BTLs to OPAL, I no longer need the 
> multi-module complexity in the OOB/TCP. So I have removed it and gone back to 
> the single module that connects to everything.
> 
> Try r31956 - hopefully will resolve your connectivity issues.
> 
> Still looking at the MPI_Abort hang as I'm having trouble replicating it.
> 
> 
> On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet 
>  wrote:
> 
> > Jeff,
> >
> > as pointed by Ralph, i do wish using eth0 for oob messages.
> >
> > i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> > network (reasonable oversubscription + switches
> > from a reputable vendor you are familiar with ;-) )
> > my experience is that IPoIB can be very slow at establishing a
> > connection, especially if the arp table is not populated
> > (as far as i understand, this involves the subnet manager and
> > performance can be very random especially if all nodes issue
> > arp requests at the same time)
> > on the other hand, performance is much more stable when using the
> > subnetted IP network.
> >
> > as Ralph also pointed, i can imagine some architects neglect their
> > ethernet network (e.g. highly oversubscribed + low end switches)
> > and in this case ib0 is a best fit for oob messages.
> >
> >> As a simple solution, there could be a TCP oob MCA param that says 
> >> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
> >> routing will make everything work out ok).
> > +1 and/or an option to tell oob mca "do not discard the interface simply
> > because the peer IP is not in the same subnet"
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/06/05 23:01, Ralph Castain wrote:
> >> Because Gilles wants to avoid using IB for TCP messages, and using eth0 
> >> also solves the problem (the messages just route)
> >>
> >> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  
> >> wrote:
> >>
> >>> Another random thought for Gilles situation: why not oob-TCP-if-include 
> >>> ib0?  (And not eth0)
> >>>
> >



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-06 Thread Gilles Gouaillardet
Ralph,

sorry for my poor understanding ...

i tried r31956 and it solved both issues :
- MPI_Abort does not hang any more if nodes are on different eth0 subnets
- MPI_Init does not hang any more if hosts have different number of IB ports

this likely explains why you are having trouble replicating it ;-)

Thanks a lot !

Gilles


On Fri, Jun 6, 2014 at 11:45 AM, Ralph Castain  wrote:

> I keep explaining that we don't "discard" anything, but there really isn't
> any point to continuing trying to explain the system. With the announced
> intention of completing the move of the BTLs to OPAL, I no longer need the
> multi-module complexity in the OOB/TCP. So I have removed it and gone back
> to the single module that connects to everything.
>
> Try r31956 - hopefully will resolve your connectivity issues.
>
> Still looking at the MPI_Abort hang as I'm having trouble replicating it.
>
>
> On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Jeff,
> >
> > as pointed by Ralph, i do wish using eth0 for oob messages.
> >
> > i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> > network (reasonable oversubscription + switches
> > from a reputable vendor you are familiar with ;-) )
> > my experience is that IPoIB can be very slow at establishing a
> > connection, especially if the arp table is not populated
> > (as far as i understand, this involves the subnet manager and
> > performance can be very random especially if all nodes issue
> > arp requests at the same time)
> > on the other hand, performance is much more stable when using the
> > subnetted IP network.
> >
> > as Ralph also pointed, i can imagine some architects neglect their
> > ethernet network (e.g. highly oversubscribed + low end switches)
> > and in this case ib0 is a best fit for oob messages.
> >
> >> As a simple solution, there could be a TCP oob MCA param that says
> "regardless of peer IP address, I can connect to them" (i.e., assume IP
> routing will make everything work out ok).
> > +1 and/or an option to tell oob mca "do not discard the interface simply
> > because the peer IP is not in the same subnet"
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/06/05 23:01, Ralph Castain wrote:
> >> Because Gilles wants to avoid using IB for TCP messages, and using eth0
> also solves the problem (the messages just route)
> >>
> >> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) 
> wrote:
> >>
> >>> Another random thought for Gilles situation: why not
> oob-TCP-if-include ib0?  (And not eth0)
> >>>
> >


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
I keep explaining that we don't "discard" anything, but there really isn't any 
point to continuing trying to explain the system. With the announced intention 
of completing the move of the BTLs to OPAL, I no longer need the multi-module 
complexity in the OOB/TCP. So I have removed it and gone back to the single 
module that connects to everything.

Try r31956 - hopefully will resolve your connectivity issues.

Still looking at the MPI_Abort hang as I'm having trouble replicating it.


On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet  
wrote:

> Jeff,
> 
> as pointed by Ralph, i do wish using eth0 for oob messages.
> 
> i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> network (reasonable oversubscription + switches
> from a reputable vendor you are familiar with ;-) )
> my experience is that IPoIB can be very slow at establishing a
> connection, especially if the arp table is not populated
> (as far as i understand, this involves the subnet manager and
> performance can be very random especially if all nodes issue
> arp requests at the same time)
> on the other hand, performance is much more stable when using the
> subnetted IP network.
> 
> as Ralph also pointed, i can imagine some architects neglect their
> ethernet network (e.g. highly oversubscribed + low end switches)
> and in this case ib0 is a best fit for oob messages.
> 
>> As a simple solution, there could be a TCP oob MCA param that says 
>> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
>> routing will make everything work out ok).
> +1 and/or an option to tell oob mca "do not discard the interface simply
> because the peer IP is not in the same subnet"
> 
> Cheers,
> 
> Gilles
> 
> On 2014/06/05 23:01, Ralph Castain wrote:
>> Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
>> solves the problem (the messages just route)
>> 
>> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>>> Another random thought for Gilles situation: why not oob-TCP-if-include 
>>> ib0?  (And not eth0)
>>> 
> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Gilles Gouaillardet
Jeff,

as pointed out by Ralph, i do wish to use eth0 for oob messages.

i work on a 4k+ node cluster with a very decent gigabit ethernet
network (reasonable oversubscription + switches
from a reputable vendor you are familiar with ;-) )
my experience is that IPoIB can be very slow at establishing a
connection, especially if the ARP table is not populated
(as far as i understand, this involves the subnet manager and
performance can be very random, especially if all nodes issue
ARP requests at the same time)
on the other hand, performance is much more stable when using the
subnetted IP network.

as Ralph also pointed, i can imagine some architects neglect their
ethernet network (e.g. highly oversubscribed + low end switches)
and in this case ib0 is a best fit for oob messages.

> As a simple solution, there could be a TCP oob MCA param that says 
> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
> routing will make everything work out ok).
+1 and/or an option to tell oob mca "do not discard the interface simply
because the peer IP is not in the same subnet"

Cheers,

Gilles

On 2014/06/05 23:01, Ralph Castain wrote:
> Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
> solves the problem (the messages just route)
>
> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  
> wrote:
>
>> Another random thought for Gilles situation: why not oob-TCP-if-include ib0? 
>>  (And not eth0)
>>



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain

On Jun 5, 2014, at 7:09 AM, Ralph Castain  wrote:

> Okay, before you go chasing this, let me explain that we already try to 
> address this issue in the TCP oob. When we need to connect to someone, we do 
> the following:
> 
> 1. if we have a direct connection available, we hand the message to the 
> software module assigned to that NIC
> 
> 2. if none of the available NICs match the target's subnet, then we assign 
> the message to the software module for the first NIC in the system - i.e., 
> the one with the lowest kernel index - and let it try to send the message. We 
> expect the OS to know how to route the connection.
> 
> 3. if that fails for some reason, then we'll try to assign it to the software 
> module for the next NIC in the system, continuing down this path until every 
> module has had a chance to try.

Actually, this isn't quite correct. The NIC we assigned it to will cycle across 
all of the known connection addresses for the intended target, trying each in 
turn. If *none* of those successfully connect, then the module declares that it 
is unable to make the connection.

At that point, we let the next software module try. This has always bothered me 
a bit as I don't see how it can succeed if the first one failed - the OS is 
going to decide which NIC to send the connection request across anyway. All we 
are doing is assigning the thread that will make the connection request. So 
long as that thread tries all the connection addresses, it shouldn't matter 
which thread makes the attempt.

Point being: we can probably just let the one thread make the attempt and give 
up if it fails on all known addresses for the target. We can then bounce it up 
to the OOB framework and let someone else try with a different transport, 
should one be available for that target. This would simplify the logic.


> 
> 4. if no TCP module can send it, we bump it back up to the OOB framework to 
> see if another component can send it. At the moment, we don't have one, but 
> that will shortly change.
> 
> My intention is to be a little more intelligent on step #2. At the very 
> least, I'd like to see us find the closest subnet match - just check tuples 
> to see who has the most matches. So if the target is on 10.1.2.3 and I have 
> two NICs 10.2.3.x and 192.168.2.y, then I should pick the first one since it 
> at least matches something.
> 
> If your IP experts have a better solution, please pass it along! What is 
> causing the problem here is that the message comes in on one NIC that doesn't 
> have a direct connection to the target, and the "hop" mechanism isn't working 
> correctly (kicks into an infinite loop).
> 
> 
> 
> On Jun 5, 2014, at 4:27 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> That raises a larger issue -- what about Ethernet-only clusters that span 
>> multiple IP/L3 subnets?  This is a scenario that Cisco definitely wants to 
>> enable/support.
>> 
>> The usnic BTL, for example, can handle this scenario.  We hadn't previously 
>> considered the TCP oob component effects in this scenario -- oops.
>> 
>> Hmm.
>> 
>> The usnic BTL both does lazy connections (so to speak...) and uses a 
>> connectivity checker to ensure that it can actually communicate with each 
>> peer.  In this way, OMPI has a way of knowing whether process A can 
>> communicate with process B, even if A and B have effectively unrelated IP 
>> addresses (i.e., they're not on the same IP subnet).
>> 
>> I don't think the TCP oob will be able to use this same kind of strategy.
>> 
>> As a simple solution, there could be a TCP oob MCA param that says 
>> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
>> routing will make everything work out ok).
>> 
>> That doesn't seem like a good overall solution, however -- it doesn't 
>> necessarily fit in the "it just works out of the box" philosophy that we 
>> like to have in OMPI.
>> 
>> Let me take this back to some IP experts here and see if someone can come up 
>> with a better idea.
>> 
>> 
>> 
>> On Jun 4, 2014, at 10:09 PM, Ralph Castain  wrote:
>> 
>>> Well, the problem is that we can't simply decide that anything called 
>>> "ib.." is an IB port and should be ignored. There is no naming rule 
>>> regarding IP interfaces that I've ever heard about that would allow us to 
>>> make such an assumption, though I admit most people let the system create 
>>> default names and thus would get something like an "ib..".
>>> 
>>> So we leave it up to the sys admin to configure the system based on their 
>>> knowledge of what they want to use. On the big clusters at the labs, we 
>>> commonly put MCA params in the default param file for this purpose as we 
>>> *don't* want OOB traffic going over the IB fabric.
>>> 
>>> But that's the sys admin's choice, not a requirement. I've seen 
>>> organizations that do it the other way because their Ethernet is really 
>>> slow.
>>> 
>>> In this case, the problem is 

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Okay, before you go chasing this, let me explain that we already try to address 
this issue in the TCP oob. When we need to connect to someone, we do the 
following:

1. if we have a direct connection available, we hand the message to the 
software module assigned to that NIC

2. if none of the available NICs match the target's subnet, then we assign the 
message to the software module for the first NIC in the system - i.e., the one 
with the lowest kernel index - and let it try to send the message. We expect 
the OS to know how to route the connection.

3. if that fails for some reason, then we'll try to assign it to the software 
module for the next NIC in the system, continuing down this path until every 
module has had a chance to try.

4. if no TCP module can send it, we bump it back up to the OOB framework to see 
if another component can send it. At the moment, we don't have one, but that 
will shortly change.

My intention is to be a little more intelligent on step #2. At the very least, 
I'd like to see us find the closest subnet match - just check tuples to see who 
has the most matches. So if the target is on 10.1.2.3 and I have two NICs 
10.2.3.x and 192.168.2.y, then I should pick the first one since it at least 
matches something.
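
(For illustration only - this is not the oob/tcp implementation, and it counts matching 
leading bits rather than whole address tuples, which is just one way to read the idea 
above - a hypothetical "closest subnet" helper could look like this:)

/* Hypothetical "closest subnet" helper - NOT the oob/tcp implementation.
 * It scores each local interface by how many leading bits it shares with
 * the target address and returns the index of the best one. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>

static int matching_bits(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    int bits = 0;
    while (bits < 32 && !(diff & 0x80000000u)) {
        diff <<= 1;
        bits++;
    }
    return bits;
}

static int closest_nic(const char *target, const char *nics[], int n)
{
    struct in_addr t, nic;
    int best = -1, best_bits = -1;

    if (inet_pton(AF_INET, target, &t) != 1) return -1;
    for (int i = 0; i < n; i++) {
        if (inet_pton(AF_INET, nics[i], &nic) != 1) continue;
        int bits = matching_bits(ntohl(t.s_addr), ntohl(nic.s_addr));
        if (bits > best_bits) {
            best_bits = bits;
            best = i;
        }
    }
    return best;
}

int main(void)
{
    const char *nics[] = { "10.2.3.4", "192.168.2.5" };
    /* target 10.1.2.3: 10.2.3.4 shares more leading bits, so index 0 wins */
    printf("best NIC index: %d\n", closest_nic("10.1.2.3", nics, 2));
    return 0;
}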

If your IP experts have a better solution, please pass it along! What is 
causing the problem here is that the message comes in on one NIC that doesn't 
have a direct connection to the target, and the "hop" mechanism isn't working 
correctly (kicks into an infinite loop).



On Jun 5, 2014, at 4:27 AM, Jeff Squyres (jsquyres)  wrote:

> That raises a larger issue -- what about Ethernet-only clusters that span 
> multiple IP/L3 subnets?  This is a scenario that Cisco definitely wants to 
> enable/support.
> 
> The usnic BTL, for example, can handle this scenario.  We hadn't previously 
> considered the TCP oob component effects in this scenario -- oops.
> 
> Hmm.
> 
> The usnic BTL both does lazy connections (so to speak...) and uses a 
> connectivity checker to ensure that it can actually communicate with each 
> peer.  In this way, OMPI has a way of knowing whether process A can 
> communicate with process B, even if A and B have effectively unrelated IP 
> addresses (i.e., they're not on the same IP subnet).
> 
> I don't think the TCP oob will be able to use this same kind of strategy.
> 
> As a simple solution, there could be a TCP oob MCA param that says 
> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
> routing will make everything work out ok).
> 
> That doesn't seem like a good overall solution, however -- it doesn't 
> necessarily fit in the "it just works out of the box" philosophy that we like 
> to have in OMPI.
> 
> Let me take this back to some IP experts here and see if someone can come up 
> with a better idea.
> 
> 
> 
> On Jun 4, 2014, at 10:09 PM, Ralph Castain  wrote:
> 
>> Well, the problem is that we can't simply decide that anything called "ib.." 
>> is an IB port and should be ignored. There is no naming rule regarding IP 
>> interfaces that I've ever heard about that would allow us to make such an 
>> assumption, though I admit most people let the system create default names 
>> and thus would get something like an "ib..".
>> 
>> So we leave it up to the sys admin to configure the system based on their 
>> knowledge of what they want to use. On the big clusters at the labs, we 
>> commonly put MCA params in the default param file for this purpose as we 
>> *don't* want OOB traffic going over the IB fabric.
>> 
>> But that's the sys admin's choice, not a requirement. I've seen 
>> organizations that do it the other way because their Ethernet is really slow.
>> 
>> In this case, the problem is really in the OOB itself. The local proc is 
>> connecting to its local daemon via eth0, which is fine. When it sends a 
>> message to mpirun on a different proc, that message goes from the app to the 
>> daemon via eth0. The daemon looks for mpirun in its contact list, and sees 
>> that it has a direct link to mpirun via this nifty "ib0" interface - and so 
>> it uses that one to relay the message along.
>> 
>> This is where we are hitting the problem - the OOB isn't correctly doing the 
>> transfer between those two interfaces like it should. So it is a bug that we 
>> need to fix, regardless of any other actions (e.g., if it was an eth1 that 
>> was the direct connection, we would still want to transfer the message to 
>> the other interface).
>> 
>> HTH
>> Ralph
>> 
>> On Jun 4, 2014, at 7:32 PM, Gilles Gouaillardet 
>>  wrote:
>> 
>>> Thanks Ralph,
>>> 
>>> for the time being, i just found a workaround
>>> --mca oob_tcp_if_include eth0
>>> 
>>> Generally speaking, is openmpi doing the wiser thing ?
>>> here is what i mean :
>>> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>>> * eth0 (gigabit ethernet) : because of the cluster size, 

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Because Gilles wants to avoid using IB for TCP messages, and using eth0 also 
solves the problem (the messages just route)

On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres)  wrote:

> Another random thought for Gilles situation: why not oob-TCP-if-include ib0?  
> (And not eth0)
> 
> That should solve his problem, but not the larger issue I raised in my 
> previous email. 
> 
> Sent from my phone. No type good. 
> 
> On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" 
>  wrote:
> 
>> Thanks Ralph,
>> 
>> for the time being, i just found a workaround
>> --mca oob_tcp_if_include eth0
>> 
>> Generally speaking, is openmpi doing the wiser thing ?
>> here is what i mean :
>> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>>  * eth0 (gigabit ethernet) : because of the cluster size, several subnets 
>> are used.
>>  * ib0 (IP over IB) : only one subnet
>> i can easily understand such a large cluster is not so common, but on the 
>> other hand i do not believe the IP configuration (subnetted gigE and single 
>> subnet IPoIB) can be called exotic.
>> 
>> if nodes from different eth0 subnets are used, and if i understand correctly 
>> your previous replies, orte will "discard" eth0 because nodes cannot contact 
>> each other "directly".
>> directly means the nodes are not on the same subnet. that being said, they 
>> can communicate via IP thanks to IP routing (i mean IP routing, i do *not* 
>> mean orte routing).
>> that means orte communications will use IPoIB which might not be the best 
>> thing to do since establishing an IPoIB connection can be long (especially 
>> at scale *and* if the arp table is not populated)
>> 
>> is my understanding correct so far ?
>> 
>> bottom line, i would have expected openmpi uses eth0 regardless IP routing 
>> is required, and ib0 is simply not used (or eventually used as a fallback 
>> option)
>> 
>> this leads to my next question : is the current default ok ? if not should 
>> we change it and how ?
>> /*
>> imho :
>>  - IP routing is not always a bad/slow thing
>>  - gigE can sometimes be better than IPoIB)
>> */
>> 
>> i am fine if at the end :
>> - this issue is fixed
>> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 
>> the default if this is really thought to be best for the cluster. (and i can 
>> try to draft a faq if needed)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain  wrote:
>> 
>> I'll work on it - may take a day or two to really fix. Only impacts systems 
>> with mismatched interfaces, which is why we aren't generally seeing it.
>> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
Another random thought for Gilles' situation: why not oob-TCP-if-include ib0?  
(And not eth0)

That should solve his problem, but not the larger issue I raised in my previous 
email.

Sent from my phone. No type good.

On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" 
> wrote:

Thanks Ralph,

for the time being, i just found a workaround
--mca oob_tcp_if_include eth0

Generally speaking, is openmpi doing the wiser thing ?
here is what i mean :
the cluster i work on (4k+ nodes) each node has two ip interfaces :
 * eth0 (gigabit ethernet) : because of the cluster size, several subnets are 
used.
 * ib0 (IP over IB) : only one subnet
i can easily understand such a large cluster is not so common, but on the other 
hand i do not believe the IP configuration (subnetted gigE and single subnet 
IPoIB) can be called exotic.

if nodes from different eth0 subnets are used, and if i understand correctly 
your previous replies, orte will "discard" eth0 because nodes cannot contact 
each other "directly".
directly means the nodes are not on the same subnet. that being said, they can 
communicate via IP thanks to IP routing (i mean IP routing, i do *not* mean 
orte routing).
that means orte communications will use IPoIB which might not be the best thing 
to do since establishing an IPoIB connection can be long (especially at scale 
*and* if the arp table is not populated)

is my understanding correct so far ?

bottom line, i would have expected openmpi uses eth0 regardless IP routing is 
required, and ib0 is simply not used (or eventually used as a fallback option)

this leads to my next question : is the current default ok ? if not should we 
change it and how ?
/*
imho :
 - IP routing is not always a bad/slow thing
 - gigE can sometimes be better than IPoIB
*/

i am fine if at the end :
- this issue is fixed
- we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 the 
default if this is really thought to be best for the cluster. (and i can try to 
draft a faq if needed)

Cheers,

Gilles

On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain 
> wrote:

I'll work on it - may take a day or two to really fix. Only impacts systems 
with mismatched interfaces, which is why we aren't generally seeing it.



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Jeff Squyres (jsquyres)
That raises a larger issue -- what about Ethernet-only clusters that span 
multiple IP/L3 subnets?  This is a scenario that Cisco definitely wants to 
enable/support.

The usnic BTL, for example, can handle this scenario.  We hadn't previously 
considered the TCP oob component effects in this scenario -- oops.

Hmm.

The usnic BTL both does lazy connections (so to speak...) and uses a 
connectivity checker to ensure that it can actually communicate with each peer. 
 In this way, OMPI has a way of knowing whether process A can communicate with 
process B, even if A and B have effectively unrelated IP addresses (i.e., 
they're not on the same IP subnet).

I don't think the TCP oob will be able to use this same kind of strategy.

As a simple solution, there could be a TCP oob MCA param that says "regardless 
of peer IP address, I can connect to them" (i.e., assume IP routing will make 
everything work out ok).

That doesn't seem like a good overall solution, however -- it doesn't 
necessarily fit in the "it just works out of the box" philosophy that we like 
to have in OMPI.

Let me take this back to some IP experts here and see if someone can come up 
with a better idea.



On Jun 4, 2014, at 10:09 PM, Ralph Castain  wrote:

> Well, the problem is that we can't simply decide that anything called "ib.." 
> is an IB port and should be ignored. There is no naming rule regarding IP 
> interfaces that I've ever heard about that would allow us to make such an 
> assumption, though I admit most people let the system create default names 
> and thus would get something like an "ib..".
> 
> So we leave it up to the sys admin to configure the system based on their 
> knowledge of what they want to use. On the big clusters at the labs, we 
> commonly put MCA params in the default param file for this purpose as we 
> *don't* want OOB traffic going over the IB fabric.
> 
> But that's the sys admin's choice, not a requirement. I've seen organizations 
> that do it the other way because their Ethernet is really slow.
> 
> In this case, the problem is really in the OOB itself. The local proc is 
> connecting to its local daemon via eth0, which is fine. When it sends a 
> message to mpirun on a different proc, that message goes from the app to the 
> daemon via eth0. The daemon looks for mpirun in its contact list, and sees 
> that it has a direct link to mpirun via this nifty "ib0" interface - and so 
> it uses that one to relay the message along.
> 
> This is where we are hitting the problem - the OOB isn't correctly doing the 
> transfer between those two interfaces like it should. So it is a bug that we 
> need to fix, regardless of any other actions (e.g., if it was an eth1 that 
> was the direct connection, we would still want to transfer the message to the 
> other interface).
> 
> HTH
> Ralph
> 
> On Jun 4, 2014, at 7:32 PM, Gilles Gouaillardet 
>  wrote:
> 
>> Thanks Ralph,
>> 
>> for the time being, i just found a workaround
>> --mca oob_tcp_if_include eth0
>> 
>> Generally speaking, is openmpi doing the wiser thing ?
>> here is what i mean :
>> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>>  * eth0 (gigabit ethernet) : because of the cluster size, several subnets 
>> are used.
>>  * ib0 (IP over IB) : only one subnet
>> i can easily understand such a large cluster is not so common, but on the 
>> other hand i do not believe the IP configuration (subnetted gigE and single 
>> subnet IPoIB) can be called exotic.
>> 
>> if nodes from different eth0 subnets are used, and if i understand correctly 
>> your previous replies, orte will "discard" eth0 because nodes cannot contact 
>> each other "directly".
>> directly means the nodes are not on the same subnet. that being said, they 
>> can communicate via IP thanks to IP routing (i mean IP routing, i do *not* 
>> mean orte routing).
>> that means orte communications will use IPoIB which might not be the best 
>> thing to do since establishing an IPoIB connection can be long (especially 
>> at scale *and* if the arp table is not populated)
>> 
>> is my understanding correct so far ?
>> 
>> bottom line, i would have expected openmpi uses eth0 regardless IP routing 
>> is required, and ib0 is simply not used (or eventually used as a fallback 
>> option)
>> 
>> this leads to my next question : is the current default ok ? if not should 
>> we change it and how ?
>> /*
>> imho :
>>  - IP routing is not always a bad/slow thing
>>  - gigE can sometimes be better than IPoIB)
>> */
>> 
>> i am fine if at the end :
>> - this issue is fixed
>> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 
>> the default if this is really thought to be best for the cluster. (and i can 
>> try to draft a faq if needed)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain  wrote:
>> 
>> I'll work on it - may take a day or two to really fix. Only 

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-05 Thread Ralph Castain
Well, the problem is that we can't simply decide that anything called "ib.." is 
an IB port and should be ignored. There is no naming rule regarding IP 
interfaces that I've ever heard about that would allow us to make such an 
assumption, though I admit most people let the system create default names and 
thus would get something like an "ib..".

So we leave it up to the sys admin to configure the system based on their 
knowledge of what they want to use. On the big clusters at the labs, we 
commonly put MCA params in the default param file for this purpose as we 
*don't* want OOB traffic going over the IB fabric.
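
(For anyone who hasn't done this: a typical entry in the default param file looks something 
like the following, assuming the usual $prefix/etc/openmpi-mca-params.conf location; the 
interface names are of course site-specific.)

# $prefix/etc/openmpi-mca-params.conf  (or $HOME/.openmpi/mca-params.conf)
# keep OOB TCP traffic off the IB fabric on this cluster
oob_tcp_if_include = eth0
# or, if the exclude form is preferred at the site:
# oob_tcp_if_exclude = ib0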

But that's the sys admin's choice, not a requirement. I've seen organizations 
that do it the other way because their Ethernet is really slow.

In this case, the problem is really in the OOB itself. The local proc is 
connecting to its local daemon via eth0, which is fine. When it sends a message 
to mpirun on a different proc, that message goes from the app to the daemon via 
eth0. The daemon looks for mpirun in its contact list, and sees that it has a 
direct link to mpirun via this nifty "ib0" interface - and so it uses that one 
to relay the message along.

This is where we are hitting the problem - the OOB isn't correctly doing the 
transfer between those two interfaces like it should. So it is a bug that we 
need to fix, regardless of any other actions (e.g., if it was an eth1 that was 
the direct connection, we would still want to transfer the message to the other 
interface).

HTH
Ralph

On Jun 4, 2014, at 7:32 PM, Gilles Gouaillardet  
wrote:

> Thanks Ralph,
> 
> for the time being, i just found a workaround
> --mca oob_tcp_if_include eth0
> 
> Generally speaking, is openmpi doing the wiser thing ?
> here is what i mean :
> the cluster i work on (4k+ nodes) each node has two ip interfaces :
>  * eth0 (gigabit ethernet) : because of the cluster size, several subnets are 
> used.
>  * ib0 (IP over IB) : only one subnet
> i can easily understand such a large cluster is not so common, but on the 
> other hand i do not believe the IP configuration (subnetted gigE and single 
> subnet IPoIB) can be called exotic.
> 
> if nodes from different eth0 subnets are used, and if i understand correctly 
> your previous replies, orte will "discard" eth0 because nodes cannot contact 
> each other "directly".
> directly means the nodes are not on the same subnet. that being said, they 
> can communicate via IP thanks to IP routing (i mean IP routing, i do *not* 
> mean orte routing).
> that means orte communications will use IPoIB which might not be the best 
> thing to do since establishing an IPoIB connection can be long (especially at 
> scale *and* if the arp table is not populated)
> 
> is my understanding correct so far ?
> 
> bottom line, i would have expected openmpi uses eth0 regardless IP routing is 
> required, and ib0 is simply not used (or eventually used as a fallback option)
> 
> this leads to my next question : is the current default ok ? if not should we 
> change it and how ?
> /*
> imho :
>  - IP routing is not always a bad/slow thing
>  - gigE can sometimes be better than IPoIB)
> */
> 
> i am fine if at the end :
> - this issue is fixed
> - we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 
> the default if this is really thought to be best for the cluster. (and i can 
> try to draft a faq if needed)
> 
> Cheers,
> 
> Gilles
> 
> On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain  wrote:
> 
> I'll work on it - may take a day or two to really fix. Only impacts systems 
> with mismatched interfaces, which is why we aren't generally seeing it.
> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Gilles Gouaillardet
Ralph,

the application still hangs, i attached new logs.

on slurm0, if i /sbin/ifconfig eth0:1 down
then the application does not hang any more

Cheers,

Gilles


On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain  wrote:

> I appear to have this fixed now - please give the current trunk (r31949 or
> above) a spin to see if I got it for you too.
>
>
>


abort.oob.2.log.gz
Description: GNU Zip compressed data


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-04 Thread Ralph Castain
I appear to have this fixed now - please give the current trunk (r31949 or 
above) a spin to see if I got it for you too.


On Jun 3, 2014, at 6:06 AM, Ralph Castain  wrote:

> You can leave it running - I just needed to know. If mpirun sees slurm (i.e., 
> you were running inside a slurm allocation), it will use it.
> 
> 
> On Jun 3, 2014, at 5:43 AM, Gilles Gouaillardet 
>  wrote:
> 
>> Ralph,
>> 
>> slurm is installed and running on both nodes.
>> 
>> that being said, there is no running job on any node so unless
>> mpirun automagically detects slurm is up and running, i assume
>> i am running under rsh.
>> 
>> i can run the test again after i stop slurm if needed, but that will not 
>> happen before tomorrow.
>> 
>> Cheers,
>> 
>> Gilles
>>> from slurm0, i launch :
>>> 
>>> mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10 
>>> ./abort
>> 
>> Is this running under slurm? Or are you running under rsh?
>> 
> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Ralph Castain
You can leave it running - I just needed to know. If mpirun sees slurm (i.e., 
you were running inside a slurm allocation), it will use it.


On Jun 3, 2014, at 5:43 AM, Gilles Gouaillardet  
wrote:

> Ralph,
> 
> slurm is installed and running on both nodes.
> 
> that being said, there is no running job on any node so unless
> mpirun automagically detects slurm is up and running, i assume
> i am running under rsh.
> 
> i can run the test again after i stop slurm if needed, but that will not 
> happen before tomorrow.
> 
> Cheers,
> 
> Gilles
>> from slurm0, i launch :
>> 
>> mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10 
>> ./abort
> 
> Is this running under slurm? Or are you running under rsh?
> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
Ralph,

slurm is installed and running on both nodes.

that being said, there is no running job on any node so unless
mpirun automagically detects slurm is up and running, i assume
i am running under rsh.

i can run the test again after i stop slurm if needed, but that will not
happen before tomorrow.

Cheers,

Gilles

> from slurm0, i launch :
>
> mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10
> ./abort
>
>
> Is this running under slurm? Or are you running under rsh?
>
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Ralph Castain

On Jun 3, 2014, at 3:06 AM, Gilles Gouaillardet  
wrote:

> Ralph,
> 
> i get no more complaints about rtc :-)
> 
> but MPI_Abort still hangs :-(
> 
> i reviewed my configuration and the hang is not related to one node having 
> one IB port and the other node having two IB ports.
> 
> the two nodes can establish TCP connections via :
> - eth0 (but they are *not* on the same subnet)
> - ib0 (and they *are* on the same subnet)
> 
> from the logs, it seems eth0 is "discarded" and only ib0 is used.

That would be correct - we don't really "discard" eth0, but default to using 
the interfaces on the common subnet to avoid routing

> when the task aborts, it hangs ...
> 
> 
> 
> i attached the logs i took on two VM with a "simpler" config :
> - slurm0 has one eth port (eth0)
>   * eth0 is on 192.168.122.100/24 (network 0)
>   * eth0:1 is on 10.0.0.1/24 (network 0)
> - slurm3 has two eth ports (eth0 and eth1)
>   * eth0 is on 192.168.222.0/24 (network 1)
>   * eth1 is on 10.0.0.2/24 (network 0)
> 
> network0 and network1 are connected to a router.
> 
> 
> from slurm0, i launch :
> 
> mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10 ./abort

Is this running under slurm? Or are you running under rsh?

> 
> the oob logs are attached
> 
> Cheers,
> 
> Gilles
> 
> On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet 
>  wrote:
> Thanks Ralph,
> 
> i will try this tomorrow
> 
> Cheers,
> 
> Gilles
> 
> 
> 
> On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain  wrote:
> I think I have this fixed with r31928, but have no way to test it on my 
> machine. Please see if it works for you.
> 
> 
> On Jun 2, 2014, at 7:09 AM, Ralph Castain  wrote:
> 
>> This is indeed the problem - we are trying to send a message and don't know 
>> how to get it somewhere. I'll break the loop, and then ask that you run this 
>> again with -mca oob_base_verbose 10 so we can see the intended recipient.
>> 
>> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet 
>>  wrote:
>> 
>>> #7  0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from 
>>> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>> 
> 
> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-03 Thread Gilles Gouaillardet
Ralph,

i get no more complaints about rtc :-)

but MPI_Abort still hangs :-(

i reviewed my configuration and the hang is not related to one node having
one IB port and the other node having two IB ports.

the two nodes can establish TCP connections via :
- eth0 (but they are *not* on the same subnet)
- ib0 (and they *are* on the same subnet)

from the logs, it seems eth0 is "discarded" and only ib0 is used.
when the task aborts, it hangs ...



i attached the logs i took on two VM with a "simpler" config :
- slurm0 has one eth port (eth0)
  * eth0 is on 192.168.122.100/24 (network 0)
  * eth0:1 is on 10.0.0.1/24 (network 0)
- slurm3 has two eth ports (eth0 and eth1)
  * eth0 is on 192.168.222.0/24 (network 1)
  * eth1 is on 10.0.0.2/24 (network 0)

network0 and network1 are connected to a router.


from slurm0, i launch :

mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10
./abort

the oob logs are attached

Cheers,

Gilles

On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Thanks Ralph,
>
> i will try this tomorrow
>
> Cheers,
>
> Gilles
>
>
>
> On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain  wrote:
>
>> I think I have this fixed with r31928, but have no way to test it on my
>> machine. Please see if it works for you.
>>
>>
>> On Jun 2, 2014, at 7:09 AM, Ralph Castain  wrote:
>>
>> This is indeed the problem - we are trying to send a message and don't
>> know how to get it somewhere. I'll break the loop, and then ask that you
>> run this again with -mca oob_base_verbose 10 so we can see the intended
>> recipient.
>>
>> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>> #7  0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
>> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>>
>>
>>
>>
>>
>
>


abort.oob.log.gz
Description: GNU Zip compressed data


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Ralph,

i will try this tomorrow

Cheers,

Gilles



On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain  wrote:

> I think I have this fixed with r31928, but have no way to test it on my
> machine. Please see if it works for you.
>
>
> On Jun 2, 2014, at 7:09 AM, Ralph Castain  wrote:
>
> This is indeed the problem - we are trying to send a message and don't
> know how to get it somewhere. I'll break the loop, and then ask that you
> run this again with -mca oob_base_verbose 10 so we can see the intended
> recipient.
>
> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> #7  0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>
>
>
>
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Ralph Castain
I think I have this fixed with r31928, but have no way to test it on my 
machine. Please see if it works for you.


On Jun 2, 2014, at 7:09 AM, Ralph Castain  wrote:

> This is indeed the problem - we are trying to send a message and don't know 
> how to get it somewhere. I'll break the loop, and then ask that you run this 
> again with -mca oob_base_verbose 10 so we can see the intended recipient.
> 
> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet 
>  wrote:
> 
>> #7  0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from 
>> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
> 



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Thanks Jeff,

from the FAQ, openmpi should work on nodes that have different numbers of IB
ports (at least since v1.2)

about IB ports on the same subnet, all i was able to find is explanation
about why i get this warning :

WARNING: There are more than one active ports on host '%s', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical OFA
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate OFA subnet that is
used between connected MPI processes must have different subnet ID
values.


i really had to read between the lines (and thanks to your email) in order
to figure out that putting IB ports on the same subnet is not optimal.

the following sentence is even more confusing :

"All this being said, note that there are valid network configurations
where multiple ports on the same host can share the same subnet ID value.
For example, two ports from a single host can be connected to the
*same* network
as a bandwidth multiplier or a high-availability configuration."


from a pragmatic approach, and this is not OpenMPI specific, the two IB
ports of the servers are physically connected to the same IB switch.

/* i would guess the NVIDIA Ivy cluster is similar in that sense */

a few years ago (e.g. last time i checked), using different subnets was
possible by partitioning the switch via OpenSM. IMHO this was not an
easy-to-maintain solution (e.g. if a switch is replaced, the opensm config had
to be changed as well).

is there a simple and free way today to put ports physically connected to
the same switch in different subnets ?

/* such as tagged vlan in Ethernet => simple switch configuration, and the
host can decide by itself in which vlan a port must be */

Cheers,

Gilles

On Mon, Jun 2, 2014 at 8:50 PM, Jeff Squyres (jsquyres) 
wrote:

>  I'm AFK but let me reply about the IB thing: double ports/multi rail is
> a good thing. It's not a good thing if they're on the same subnet.
>
>  Check the FAQ - http://www.open-mpi.org/faq/?category=openfabrics - I
> can't see it well enough on the small screen of my phone, but I think
> there's a q on there about how multi rail destinations are chosen.
>
>  Spoiler: put your ports in different subnets so that OMPI makes
> deterministic choices.
>
> Sent from my phone. No type good.
>


Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Ralph Castain
This is indeed the problem - we are trying to send a message and don't know how 
to get it somewhere. I'll break the loop, and then ask that you run this again 
with -mca oob_base_verbose 10 so we can see the intended recipient.

On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet  
wrote:

> #7  0x7fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from 
> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so



Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Jeff Squyres (jsquyres)
I'm AFK but let me reply about the IB thing: double ports/multi rail is a good 
thing. It's not a good thing if they're on the same subnet.

Check the FAQ - http://www.open-mpi.org/faq/?category=openfabrics - I can't see 
it well enough on the small screen of my phone, but I think there's a q on 
there about how multi rail destinations are chosen.

Spoiler: put your ports in different subnets so that OMPI makes deterministic 
choices.

Sent from my phone. No type good.

On Jun 2, 2014, at 6:55 AM, "Gilles Gouaillardet" 
> wrote:

Jeff,

On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) 
> wrote:
On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet 
> wrote:

> i faced a bit different problem, but that is 100% reproducible :
> - i launch mpirun (no batch manager) from a node with one IB port
> - i use -host node01,node02 where node01 and node02 both have two IB port on 
> the
>   same subnet

FWIW: 2 IB ports on the same subnet?  That's not a good idea.

could you please elaborate a bit ?
from what i saw, this basically doubles the bandwidth (imb PingPong benchmark) 
between two nodes (!) which is not a bad thing.
i can only guess this might not scale (e.g. if 16 tasks are running on each 
host, the overhead associated with the use of two ports might void the extra 
bandwidth)

> by default, this will hang.

...but it still shouldn't hang.  I wonder if it's somehow related to 
https://svn.open-mpi.org/trac/ompi/ticket/4442...?

 i doubt it ...

here is my command line (from node0)
`which mpirun` -np 2 -host node1,node2 --mca rtc_freq_priority 0 --mca btl 
openib,self --mca btl_openib_if_include mlx4_0 ./abort
on top of that, the usnic btl is not built (nor installed)


> if this is a "feature" (e.g. openmpi does not support this kind of 
> configuration) i am fine with it.
>
> when i run mpirun --mca btl_openib_if_exclude mlx4_1, then if the application 
> is a success, then it works just fine.
>
> if the application calls MPI_Abort() /* and even if all tasks call 
> MPI_Abort() */ then it will hang 100% of the time.
> i do not see that as a feature but as a bug.

Yes, OMPI should never hang upon a call to MPI_Abort.

Can you get some stack traces to show where the hung process(es) are stuck?  
That would help Ralph pin down where things aren't working down in ORTE.

on node0 :

  \_ -bash
  \_ /.../local/ompi-trunk/bin/mpirun -np 2 -host node1,node2 --mca 
rtc_freq_priority 0 --mc
  \_ /usr/bin/ssh -x node1 PATH=/.../local/ompi-trunk/bin:$PATH ; 
export PATH ; LD_LIBRAR
  \_ /usr/bin/ssh -x node2 PATH=/.../local/ompi-trunk/bin:$PATH ; 
export PATH ; LD_LIBRAR


pstack (mpirun) :
$ pstack 10913
Thread 2 (Thread 0x7f0ecad35700 (LWP 10914)):
#0  0x003ba66e15e3 in select () from /lib64/libc.so.6
#1  0x7f0ecad4391e in listen_thread () from 
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#2  0x003ba72079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x003ba66e8b6d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f0ecc601700 (LWP 10913)):
#0  0x003ba66df343 in poll () from /lib64/libc.so.6
#1  0x7f0ecc6b1a05 in poll_dispatch () from 
/.../local/ompi-trunk/lib/libopen-pal.so.0
#2  0x7f0ecc6a641c in opal_libevent2021_event_base_loop () from 
/.../local/ompi-trunk/lib/libopen-pal.so.0
#3  0x004056a1 in orterun ()
#4  0x004039f4 in main ()


on node 1 :

 sshd: gouaillardet@notty
  \_ bash -c PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ; 
LD_LIBRARY_PATH=/...
  \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid 
3459448832 -mca orte_ess_vpid
  \_ [abort] 

$ pstack (orted)
#0  0x7fe0ba6a0566 in vfprintf () from /lib64/libc.so.6
#1  0x7fe0ba6c9a52 in vsnprintf () from /lib64/libc.so.6
#2  0x7fe0ba6a9523 in snprintf () from /lib64/libc.so.6
#3  0x7fe0bbc019b6 in orte_util_print_jobids () from 
/.../local/ompi-trunk/lib/libopen-rte.so.0
#4  0x7fe0bbc01791 in orte_util_print_name_args () from 
/.../local/ompi-trunk/lib/libopen-rte.so.0
#5  0x7fe0b8e16a8b in mca_oob_tcp_component_hop_unknown () from 
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#6  0x7fe0bb94ab7a in event_process_active_single_queue () from 
/.../local/ompi-trunk/lib/libopen-pal.so.0
#7  0x7fe0bb94adf2 in event_process_active () from 
/.../local/ompi-trunk/lib/libopen-pal.so.0
#8  0x7fe0bb94b470 in opal_libevent2021_event_base_loop () from 
/.../local/ompi-trunk/lib/libopen-pal.so.0
#9  0x7fe0bbc1fa7b in orte_daemon () from 
/.../local/ompi-trunk/lib/libopen-rte.so.0
#10 0x0040093a in main ()


on node 2 :

 sshd: gouaillardet@notty
  \_ bash -c PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ; 
LD_LIBRARY_PATH=/...
  \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid 

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Gilles Gouaillardet
Jeff,

On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres) 
wrote:

> On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > i faced a bit different problem, but that is 100% reproducible :
> > - i launch mpirun (no batch manager) from a node with one IB port
> > - i use -host node01,node02 where node01 and node02 both have two IB
> port on the
> >   same subnet
>
> FWIW: 2 IB ports on the same subnet?  That's not a good idea.
>
> could you please elaborate a bit ?
from what i saw, this basically doubles the bandwidth (imb PingPong
benchmark) between two nodes (!) which is not a bad thing.
i can only guess this might not scale (e.g. if 16 tasks are running on each
host, the overhead associated with the use of two ports might void the
extra bandwidth)


> > by default, this will hang.
>
> ...but it still shouldn't hang.  I wonder if it's somehow related to
> https://svn.open-mpi.org/trac/ompi/ticket/4442...?
>
>  i doubt it ...

here is my command line (from node0)
`which mpirun` -np 2 -host node1,node2 --mca rtc_freq_priority 0 --mca btl
openib,self --mca btl_openib_if_include mlx4_0 ./abort
on top of that, the usnic btl is not built (nor installed)


> if this is a "feature" (e.g. openmpi does not support this kind of
> configuration) i am fine with it.
> >
> > when i run mpirun --mca btl_openib_if_exclude mlx4_1, then if the
> application is a success, then it works just fine.
> >
> > if the application calls MPI_Abort() /* and even if all tasks call
> MPI_Abort() */ then it will hang 100% of the time.
> > i do not see that as a feature but as a bug.
>
> Yes, OMPI should never hang upon a call to MPI_Abort.
>
> Can you get some stack traces to show where the hung process(es) are
> stuck?  That would help Ralph pin down where things aren't working down in
> ORTE.
>

on node0 :

  \_ -bash
  \_ /.../local/ompi-trunk/bin/mpirun -np 2 -host node1,node2 --mca
rtc_freq_priority 0 --mc
  \_ /usr/bin/ssh -x node1 PATH=/.../local/ompi-trunk/bin:$PATH
; export PATH ; LD_LIBRAR
  \_ /usr/bin/ssh -x node2 PATH=/.../local/ompi-trunk/bin:$PATH
; export PATH ; LD_LIBRAR


pstack (mpirun) :
$ pstack 10913
Thread 2 (Thread 0x7f0ecad35700 (LWP 10914)):
#0  0x003ba66e15e3 in select () from /lib64/libc.so.6
#1  0x7f0ecad4391e in listen_thread () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#2  0x003ba72079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x003ba66e8b6d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f0ecc601700 (LWP 10913)):
#0  0x003ba66df343 in poll () from /lib64/libc.so.6
#1  0x7f0ecc6b1a05 in poll_dispatch () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#2  0x7f0ecc6a641c in opal_libevent2021_event_base_loop () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#3  0x004056a1 in orterun ()
#4  0x004039f4 in main ()


on node 1 :

 sshd: gouaillardet@notty
  \_ bash -c PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/...
  \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid
3459448832 -mca orte_ess_vpid
  \_ [abort] 

$ pstack (orted)
#0  0x7fe0ba6a0566 in vfprintf () from /lib64/libc.so.6
#1  0x7fe0ba6c9a52 in vsnprintf () from /lib64/libc.so.6
#2  0x7fe0ba6a9523 in snprintf () from /lib64/libc.so.6
#3  0x7fe0bbc019b6 in orte_util_print_jobids () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#4  0x7fe0bbc01791 in orte_util_print_name_args () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#5  0x7fe0b8e16a8b in mca_oob_tcp_component_hop_unknown () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#6  0x7fe0bb94ab7a in event_process_active_single_queue () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#7  0x7fe0bb94adf2 in event_process_active () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#8  0x7fe0bb94b470 in opal_libevent2021_event_base_loop () from
/.../local/ompi-trunk/lib/libopen-pal.so.0
#9  0x7fe0bbc1fa7b in orte_daemon () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#10 0x0040093a in main ()


on node 2 :

 sshd: gouaillardet@notty
  \_ bash -c PATH=/.../local/ompi-trunk/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/...
  \_ /.../local/ompi-trunk/bin/orted -mca ess env -mca orte_ess_jobid
3459448832 -mca orte_ess_vpid
  \_ [abort] 

$ pstack (orted)
#0  0x7fe8fc435e39 in strchrnul () from /lib64/libc.so.6
#1  0x7fe8fc3ef8f5 in vfprintf () from /lib64/libc.so.6
#2  0x7fe8fc41aa52 in vsnprintf () from /lib64/libc.so.6
#3  0x7fe8fc3fa523 in snprintf () from /lib64/libc.so.6
#4  0x7fe8fd9529b6 in orte_util_print_jobids () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#5  0x7fe8fd952791 in orte_util_print_name_args () from
/.../local/ompi-trunk/lib/libopen-rte.so.0
#6  0x7fe8fab6c1b5 in resend () from
/.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
#7  0x7fe8fab67ce3 in 

Re: [OMPI devel] Intermittent hangs when exiting with error

2014-06-02 Thread Jeff Squyres (jsquyres)
On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet  
wrote:

> i faced a bit different problem, but that is 100% reproducible :
> - i launch mpirun (no batch manager) from a node with one IB port
> - i use -host node01,node02 where node01 and node02 both have two IB port on 
> the
>   same subnet

FWIW: 2 IB ports on the same subnet?  That's not a good idea.

> by default, this will hang.

...but it still shouldn't hang.  I wonder if it's somehow related to 
https://svn.open-mpi.org/trac/ompi/ticket/4442...?

> if this is a "feature" (e.g. openmpi does not support this kind of 
> configuration) i am fine with it.
> 
> when i run mpirun --mca btl_openib_if_exclude mlx4_1, then if the application 
> is a success, then it works just fine.
> 
> if the application calls MPI_Abort() /* and even if all tasks call 
> MPI_Abort() */ then it will hang 100% of the time.
> i do not see that as a feature but as a bug.

Yes, OMPI should never hang upon a call to MPI_Abort.

Can you get some stack traces to show where the hung process(es) are stuck?  
That would help Ralph pin down where things aren't working down in ORTE.

> in another thread, Jeff mentioned that the usnic btl is doing stuff even if 
> there is no usnic hardware (this will be fixed shortly).
> Do you still see intermittent hangs without listing usnic as a btl ?

Yeah, there's a definite race in the usnic BTL ATM.  If you care, here's what's 
happening:

- the usnic BTL fires off its connectivity checker, even if there is no usnic 
hardware present
- during the connectivity checker init:
- local rank 0 on each server will establish a named socket
- non-local-rank-0 will wait for that named socket to exist

The race is that the local rank 0 may establish the socket (which completes its 
connectivity checker setup), and then realize that there is no usnic hardware, 
so it exits/closes the usnic BTL -- which destroys the named socket.  Hence, if 
the non-local-rank-0's are late to the party, they never see the named socket 
get created and wait forever for it.  Result: hang.

Patch coming today that fixes both of these things:

1. connectivity checker won't be launched unless there is usnic hardware present
2. non-local-rank-0's won't wait indefinitely for the named socket
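
(A rough sketch of what #2 could look like - the socket path and timeout below are invented 
for illustration, and this is not the actual usnic BTL code - poll for the named socket with 
a bounded timeout instead of blocking forever:)

/* Rough sketch of fix #2: wait for local rank 0's named socket with a
 * bounded timeout instead of forever.  The path and timeout are invented
 * for illustration; this is NOT the actual usnic BTL code. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static bool wait_for_socket(const char *path, int timeout_ms)
{
    struct timespec delay = { 0, 50 * 1000 * 1000 };   /* poll every 50 ms */
    struct stat st;

    for (int waited = 0; waited < timeout_ms; waited += 50) {
        if (stat(path, &st) == 0 && S_ISSOCK(st.st_mode)) {
            return true;    /* rank 0 created the socket: run the check */
        }
        nanosleep(&delay, NULL);
    }
    return false;           /* give up instead of hanging */
}

int main(void)
{
    if (!wait_for_socket("/tmp/usnic_check.sock", 2000)) {
        fprintf(stderr, "connectivity-check socket never appeared; "
                        "skipping the check instead of hanging\n");
    }
    return 0;
}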

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/