On Jun 6, 2014, at 7:11 AM, Jeff Squyres (jsquyres) wrote:
> Looks like Ralph's simpler solution fit the bill.
Yeah, but I still am unhappy with it. It's about the stupidest connection model
you can imagine. What happens is this:
* a process constructs its URI - this is done by creating a string
On Jun 5, 2014, at 9:16 PM, Gilles Gouaillardet
wrote:
> i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> network (reasonable oversubscription + switches
> from a reputable vendor you are familiar with ;-) )
> my experience is that IPoIB can be very slow at establishing a
> connection
Kewl - thanks!
On Jun 5, 2014, at 9:28 PM, Gilles Gouaillardet
wrote:
> Ralph,
>
> sorry for my poor understanding ...
>
> i tried r31956 and it solved both issues:
> - MPI_Abort does not hang any more if nodes are on different eth0 subnets
> - MPI_Init does not hang any more if hosts have different numbers of IB ports
Ralph,
sorry for my poor understanding ...
i tried r31956 and it solved both issues:
- MPI_Abort does not hang any more if nodes are on different eth0 subnets
- MPI_Init does not hang any more if hosts have different numbers of IB ports
this likely explains why you are having trouble replicating
I keep explaining that we don't "discard" anything, but there really isn't any
point to continuing trying to explain the system. With the announced intention
of completing the move of the BTLs to OPAL, I no longer need the multi-module
complexity in the OOB/TCP. So I have removed it and gone back to a single module.
Jeff,
as pointed out by Ralph, i do wish to use eth0 for oob messages.
i work on a 4k+ nodes cluster with a very decent gigabit ethernet
network (reasonable oversubscription + switches
from a reputable vendor you are familiar with ;-) )
my experience is that IPoIB can be very slow at establishing a
connection
On Jun 5, 2014, at 7:09 AM, Ralph Castain wrote:
> Okay, before you go chasing this, let me explain that we already try to
> address this issue in the TCP oob. When we need to connect to someone, we do
> the following:
>
> 1. if we have a direct connection available, we hand the message to the
> software module assigned to that NIC
Okay, before you go chasing this, let me explain that we already try to address
this issue in the TCP oob. When we need to connect to someone, we do the
following:
1. if we have a direct connection available, we hand the message to the
software module assigned to that NIC
2. if none of the ava
Because Gilles wants to avoid using IB for TCP messages, and using eth0 also
solves the problem (the messages just route)
On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) wrote:
> Another random thought for Gilles situation: why not oob-TCP-if-include ib0?
> (And not eth0)
>
> That should solve his problem, but not the larger issue I raised in my previous
> email.
Another random thought for Gilles situation: why not oob-TCP-if-include ib0?
(And not eth0)
That should solve his problem, but not the larger issue I raised in my previous
email.
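For example, something along these lines (the executable name below is only a
placeholder, not taken from Gilles' setup):

   mpirun --mca oob_tcp_if_include ib0 -np 2 -host node01,node02 ./a.out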
Sent from my phone. No type good.
On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet"
wrote:
That raises a larger issue -- what about Ethernet-only clusters that span
multiple IP/L3 subnets? This is a scenario that Cisco definitely wants to
enable/support.
The usnic BTL, for example, can handle this scenario. We hadn't previously
considered the TCP oob component effects in this scenario.
Well, the problem is that we can't simply decide that anything called "ib.." is
an IB port and should be ignored. There is no naming rule regarding IP
interfaces that I've ever heard about that would allow us to make such an
assumption, though I admit most people let the system create default names.
Thanks Ralph,
for the time being, i just found a workaround
--mca oob_tcp_if_include eth0
Generally speaking, is openmpi doing the wisest thing?
here is what i mean:
on the cluster i work on (4k+ nodes), each node has two ip interfaces:
* eth0 (gigabit ethernet): because of the cluster size, sever
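As an illustration only (btl_tcp_if_include is not mentioned in this thread; it
is added here just to show how MPI-level TCP traffic could also be kept on eth0,
and the executable name is a placeholder):

   mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 \
          -np 2 -host node01,node02 ./a.out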
Ah crud - I see what's going on. This is an issue of a message coming in on one
interface that needs to get transferred to another one for relay. Looks like
that mechanism is broken, which is causing us to issue another show_help, which
gets caught in the same loop again.
I'll work on it - may
Ralph,
the application still hangs, i attached new logs.
on slurm0, if i /sbin/ifconfig eth0:1 down
then the application does not hang any more
Cheers,
Gilles
On Wed, Jun 4, 2014 at 12:43 PM, Ralph Castain wrote:
> I appear to have this fixed now - please give the current trunk (r31949 or
> above) a spin to see if I got it for you too.
I appear to have this fixed now - please give the current trunk (r31949 or
above) a spin to see if I got it for you too.
On Jun 3, 2014, at 6:06 AM, Ralph Castain wrote:
> You can leave it running - I just needed to know. If mpirun sees slurm (i.e.,
> you were running inside a slurm allocation), it will use it.
You can leave it running - I just needed to know. If mpirun sees slurm (i.e.,
you were running inside a slurm allocation), it will use it.
On Jun 3, 2014, at 5:43 AM, Gilles Gouaillardet
wrote:
> Ralph,
>
> slurm is installed and running on both nodes.
>
> that being said, there is no running job on any node so unless
> mpirun automagically detects slurm is up and running, i assume
> i am running under rsh.
Ralph,
slurm is installed and running on both nodes.
that being said, there is no running job on any node so unless
mpirun automagically detects slurm is up and running, i assume
i am running under rsh.
i can run the test again after i stop slurm if needed, but that will not
happen before tomorrow.
On Jun 3, 2014, at 3:06 AM, Gilles Gouaillardet
wrote:
> Ralph,
>
> i get no more complaints about rtc :-)
>
> but MPI_Abort still hangs :-(
>
> i reviewed my configuration and the hang is not related to one node having
> one IB port and the other node having two IB ports.
>
> the two nodes can establish TCP connections via:
> - eth0 (but they are *not* on the same subnet)
Ralph,
i get no more complaints about rtc :-)
but MPI_Abort still hangs :-(
i reviewed my configuration and the hang is not related to one node having
one IB port and the other node having two IB ports.
the two nodes can establish TCP connections via:
- eth0 (but they are *not* on the same subnet)
Thanks Ralph,
i will try this tomorrow
Cheers,
Gilles
On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain wrote:
> I think I have this fixed with r31928, but have no way to test it on my
> machine. Please see if it works for you.
>
>
> On Jun 2, 2014, at 7:09 AM, Ralph Castain wrote:
>
> This is indeed the problem - we are trying to send a message and don't know
> how to get it somewhere.
I think I have this fixed with r31928, but have no way to test it on my
machine. Please see if it works for you.
On Jun 2, 2014, at 7:09 AM, Ralph Castain wrote:
> This is indeed the problem - we are trying to send a message and don't know
> how to get it somewhere. I'll break the loop, and then ask that you run this
> again with -mca oob_base_verbose 10 so we can see the intended recipient.
Thanks Jeff,
from the FAQ, openmpi should work on nodes that have different numbers of IB
ports (at least since v1.2)
about IB ports on the same subnet, all i was able to find is an explanation of
why i get this warning:
WARNING: There are more than one active ports on host '%s', but the
default subnet GID prefix was detected on more than one of these ports.
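(For reference, one way to avoid that particular warning is to restrict the
openib BTL to a single port, e.g.

   mpirun --mca btl_openib_if_include mlx4_0:1 -np 2 -host node01,node02 ./a.out

where the HCA/port name is only an example and depends on the actual hardware.)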
This is indeed the problem - we are trying to send a message and don't know how
to get it somewhere. I'll break the loop, and then ask that you run this again
with -mca oob_base_verbose 10 so we can see the intended recipient.
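Something like the following should do (the host list and binary are
placeholders):

   mpirun -mca oob_base_verbose 10 -np 2 -host node01,node02 ./a.out 2>&1 | tee oob.log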
On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet
wrote:
> #7 0x000
I'm AFK but let me reply about the IB thing: double ports/multi rail is a good
thing. It's not a good thing if they're on the same subnet.
Check the FAQ - http://www.open-mpi.org/faq/?category=openfabrics - I can't see
it well enough on the small screen of my phone, but I think there's a q on
t
Jeff,
On Mon, Jun 2, 2014 at 7:26 PM, Jeff Squyres (jsquyres)
wrote:
> On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> > i faced a bit different problem, but that is 100% reproducible:
> > - i launch mpirun (no batch manager) from a node with one IB port
On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet
wrote:
> i faced a bit different problem, but that is 100% reproducible:
> - i launch mpirun (no batch manager) from a node with one IB port
> - i use -host node01,node02 where node01 and node02 both have two IB ports on
> the same subnet
FW
Rolf,
i faced a bit different problem, but that is 100% reproducible:
- i launch mpirun (no batch manager) from a node with one IB port
- i use -host node01,node02 where node01 and node02 both have two IB ports on
the same subnet
by default, this will hang.
if this is a "feature" (e.g. openmpi
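For reference, a minimal reproducer along those lines would be (the binary name
is a placeholder; node01 and node02 are the two dual-port nodes):

   mpirun -np 2 -host node01,node02 ./a.out

which hangs by default in that configuration.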