Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours based 32-core node

2013-12-21 Thread tmishima


Ralph, thanks. I'll try it on Tuesday.

Let me confirm one thing: I don't pass "-with-libevent" when I build
Open MPI. Is there any possibility that it gets built with an external
libevent automatically?
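
(For reference, one way to check this - assuming a Linux node with ldd
available and the Open MPI install first in the PATH - is to see whether
the runtime links a separate libevent library; the embedded copy is
compiled into libopen-pal, so no match below usually means the internal
libevent is in use. The /path/to/openmpi is only a placeholder.)

  ldd $(which mpirun) | grep -i libevent
  ldd /path/to/openmpi/lib/libopen-pal.so | grep -i libevent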

Tetsuya Mishima


> Not entirely sure - add "-mca rmaps_base_verbose 10 --display-map" to
your cmd line and let's see if it finishes the mapping.
>
> Unless you specifically built with an external libevent (which I doubt),
there is no conflict. The connection issue is unlikely to be a factor here
as it works when not using the lama mapper.
>
>
> On Dec 21, 2013, at 3:43 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Thank you, Ralph.
> >
> > Then, this problem must depend on our environment.
> > But, at least, the inversion problem is not the cause, because
> > node05 has the normal hierarchy order.
> >
> > I cannot connect to our cluster now. On Tuesday, when I am back
> > in my office, I'll send you a further report.
> >
> > Before that, please let me know your configuration. I will
> > follow your configuration as much as possible. Ours is very
> > simple, only -with-tm -with-ibverbs -disable-ipv6
> > (on CentOS 5.8).
> >
> > The 1.7 series is still a little bit unstable on our cluster.
> >
> > Similar freezing (a hang-up) was observed with 1.7.3. At that
> > time, lama worked well, but adding "-rank-by something" caused
> > the same freeze (curiously, rank-by works with 1.7.4rc1).
> > I checked where it stopped using gdb and found that it was
> > waiting for an event in a libevent function (I cannot recall
> > the name).
> >
> > Is this related to your "connection issue in the OOB
> > subsystem"? Or a libevent version conflict? I guess these two
> > problems are related to each other. My guess is that they stop
> > at a very early stage, before reaching the mapping function,
> > because no message appears before the freeze.
> >
> > Could you give me any hint or comment?
> >
> > Regards,
> > Tetsuya Mishima
> >
> >
> >> It seems to be working fine for me:
> >>
> >> [rhc@bend001 tcp]$ mpirun -np 2 -host bend001 -report-bindings -mca
> > rmaps_lama_bind 1c -mca rmaps lama hostname
> >> bend001
> >> [bend001:17005] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]:
> > [../BB/../../../..][../../../../../..]
> >> [bend001:17005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
> > [BB/../../../../..][../../../../../..]
> >> bend001
> >> [rhc@bend001 tcp]$
> >>
> >> (I also checked the internals using "-mca rmaps_base_verbose 10") so it
> >> could be your hier inversion causing problems again. Or it could be that
> >> you are hitting a connection issue we are seeing in some scenarios in
> >> the OOB subsystem - though if you are able to run using a non-lama
> >> mapper, that would seem unlikely.
> >>
> >>
> >> On Dec 20, 2013, at 8:09 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>
> >>
> >> Hi Ralph,
> >>
> >> Thank you very much. I tried many things such as:
> >>
> >> mpirun -np 2 -host node05 -report-bindings -mca rmaps lama -mca
> >> rmaps_lama_bind 1c myprog
> >>
> >> But every try failed. At least they were accepted by openmpi-1.7.3,
> >> as far as I remember.
> >> Anyway, please check it when you have time, because my use of lama
> >> comes only from curiosity.
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >>
> >> I'll try to take a look at it - my expectation is that lama might get
> >> stuck because you didn't tell it a pattern to map, and I doubt that
> >> code path has seen much testing.
> >>
> >>
> >> On Dec 20, 2013, at 5:52 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>
> >>
> >> Hi Ralph, I'm glad to hear that, thanks.
> >>
> >> By the way, yesterday I tried to check how lama in 1.7.4rc treats
> >> NUMA nodes.
> >>
> >> Then, even with this simple command line, it froze without any
> >> message:
> >>
> >> mpirun -np 2 -host node05 -mca rmaps lama myprog
> >>
> >> Could you check what happened?
> >>
> >> Is it better to open a new thread or continue this one?
> >>
> >> Regards,
> >> Tetsuya Mishima
> >>
> >>
> >> I'll make it work so that NUMA can be either above or below socket
> >>
> >> On Dec 20, 2013, at 2:57 AM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>
> >>
> >> Hi Brice,
> >>
> >> Thank you for your comment. I understand what you mean.
> >>
> >> My opinion came just from considering an easy way to adjust the code
> >> for the inversion of hierarchy in the object tree.
> >>
> >> Tetsuya Mishima
> >>
> >>
> >> I don't think there's any such difference.
> >> Also, all these NUMA architectures are reported the same way by hwloc,
> >> and therefore are used the same way in Open MPI.
> >>
> >> And yes, L3 and NUMA are topologically-identical on AMD Magny-Cours
> >> (and
> >> most recent AMD and Intel platforms).
> >>
> >> Brice
> >>
> >>
> >>
> >> On 20/12/2013 11:33, tmish...@jcity.maeda.co.jp wrote:
> >>
> >> Hi Ralph,
> >>
> >> The numa-node in AMD Magny-Cours/Interlagos is so-called cc (cache
> >> coherent) NUMA, which seems to be a little bit different from the
> >> traditional NUMA defined in openmpi.
> >>
> >> I notice that the ccNUMA object is almost the same as the L3cache object.

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours based 32-core node

2013-12-21 Thread Ralph Castain
Not entirely sure - add "-mca rmaps_base_verbose 10 --display-map" to your cmd 
line and let's see if it finishes the mapping.

Unless you specifically built with an external libevent (which I doubt), there 
is no conflict. The connection issue is unlikely to be a factor here as it 
works when not using the lama mapper.
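
(For concreteness, applied to the lama command used earlier in this
thread, the suggested debug flags would look something like the line
below; node05 and myprog are just the names used in the thread. If the
mapper hangs, the verbose output should show how far the mapping got.)

  mpirun -np 2 -host node05 -report-bindings \
      -mca rmaps lama -mca rmaps_lama_bind 1c \
      -mca rmaps_base_verbose 10 --display-map myprog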


On Dec 21, 2013, at 3:43 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Thank you, Ralph.
> 
> Then, this problem must depend on our environment.
> But, at least, the inversion problem is not the cause, because
> node05 has the normal hierarchy order.
>
> I cannot connect to our cluster now. On Tuesday, when I am back
> in my office, I'll send you a further report.
>
> Before that, please let me know your configuration. I will
> follow your configuration as much as possible. Ours is very
> simple, only -with-tm -with-ibverbs -disable-ipv6
> (on CentOS 5.8).
>
> The 1.7 series is still a little bit unstable on our cluster.
>
> Similar freezing (a hang-up) was observed with 1.7.3. At that
> time, lama worked well, but adding "-rank-by something" caused
> the same freeze (curiously, rank-by works with 1.7.4rc1).
> I checked where it stopped using gdb and found that it was
> waiting for an event in a libevent function (I cannot recall
> the name).
>
> Is this related to your "connection issue in the OOB
> subsystem"? Or a libevent version conflict? I guess these two
> problems are related to each other. My guess is that they stop
> at a very early stage, before reaching the mapping function,
> because no message appears before the freeze.
> 
> Could you give me any hint or comment?
> 
> Regards,
> Tetsuya Mishima
> 
> 
>> It seems to be working fine for me:
>> 
>> [rhc@bend001 tcp]$ mpirun -np 2 -host bend001 -report-bindings -mca
> rmaps_lama_bind 1c -mca rmaps lama hostname
>> bend001
>> [bend001:17005] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]:
> [../BB/../../../..][../../../../../..]
>> [bend001:17005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
> [BB/../../../../..][../../../../../..]
>> bend001
>> [rhc@bend001 tcp]$
>> 
>> (I also checked the internals using "-mca rmaps_base_verbose 10") so it
> could be your hier inversion causing problems again. Or it could be that
> you are hitting a connection issue we are seeing in
>> some scenarios in the OOB subsystem - though if you are able to run using
> a non-lama mapper, that would seem unlikely.
>> 
>> 
>> On Dec 20, 2013, at 8:09 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>> 
>> 
>> Hi Ralph,
>> 
>> Thank you very much. I tried many things such as:
>> 
>> mpirun -np 2 -host node05 -report-bindings -mca rmaps lama -mca
>> rmaps_lama_bind 1c myprog
>> 
>> But every try failed. At least they were accepted by openmpi-1.7.3 as far
>> as I remember.
>> Anyway, please check it when you have time, because my use of lama
>> comes only from curiosity.
>> 
>> Regards,
>> Tetsuya Mishima
>> 
>> 
>> I'll try to take a look at it - my expectation is that lama might get
>> stuck because you didn't tell it a pattern to map, and I doubt that code
>> path has seen much testing.
>> 
>> 
>> On Dec 20, 2013, at 5:52 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>> 
>> 
>> Hi Ralph, I'm glad to hear that, thanks.
>> 
>> By the way, yesterday I tried to check how lama in 1.7.4rc treats
>> NUMA nodes.
>>
>> Then, even with this simple command line, it froze without any
>> message:
>> 
>> mpirun -np 2 -host node05 -mca rmaps lama myprog
>> 
>> Could you check what happened?
>> 
>> Is it better to open a new thread or continue this one?
>> 
>> Regards,
>> Tetsuya Mishima
>> 
>> 
>> I'll make it work so that NUMA can be either above or below socket
>> 
>> On Dec 20, 2013, at 2:57 AM, tmish...@jcity.maeda.co.jp wrote:
>> 
>> 
>> 
>> Hi Brice,
>> 
>> Thank you for your comment. I understand what you mean.
>> 
>> My opinion came just from considering an easy way to adjust the code
>> for the inversion of hierarchy in the object tree.
>> 
>> Tetsuya Mishima
>> 
>> 
>> I don't think there's any such difference.
>> Also, all these NUMA architectures are reported the same way by hwloc,
>> and therefore are used the same way in Open MPI.
>> 
>> And yes, L3 and NUMA are topologically-identical on AMD Magny-Cours
>> (and
>> most recent AMD and Intel platforms).
>> 
>> Brice
>> 
>> 
>> 
>> On 20/12/2013 11:33, tmish...@jcity.maeda.co.jp wrote:
>> 
>> Hi Ralph,
>> 
>> The numa-node in AMD Magny-Cours/Interlagos is so-called cc (cache
>> coherent) NUMA, which seems to be a little bit different from the
>> traditional NUMA defined in openmpi.
>> 
>> I notice that the ccNUMA object is almost the same as the L3cache
>> object, so "-bind-to l3cache" or "-map-by l3cache" is valid for what
>> I want to do. Therefore, "do not touch it" is one of the solutions,
>> I think ...
>> 
>> Anyway, mixing up these two types of numa is the problem.
>> 
>> Regards,
>> Tetsuya Mishima
>> 
>> I can wait until it is fixed in 1.7.5 or later, because putting
>> "-bind-to numa" and "-map-by numa" at the same time works as a
>> workaround.

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours based 32-core node

2013-12-21 Thread tmishima


Thank you, Ralph.

Then, this problem must depend on our environment.
But, at least, the inversion problem is not the cause, because
node05 has the normal hierarchy order.

I cannot connect to our cluster now. On Tuesday, when I am back
in my office, I'll send you a further report.

Before that, please let me know your configuration. I will
follow your configuration as much as possible. Ours is very
simple, only -with-tm -with-ibverbs -disable-ipv6
(on CentOS 5.8).
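
(A simple way to capture the exact build configuration for comparison,
assuming access to the build tree or to an installed ompi_info; the
path below is only an example.)

  # The configure invocation is recorded near the top of config.log:
  grep '\$ ./configure' /path/to/openmpi-1.7.4rc1/config.log
  # ompi_info also reports how an installed copy was configured and built:
  ompi_info | grep -i configure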

The 1.7 series is still a little bit unstable on our cluster.

Similar freezing (a hang-up) was observed with 1.7.3. At that
time, lama worked well, but adding "-rank-by something" caused
the same freeze (curiously, rank-by works with 1.7.4rc1).
I checked where it stopped using gdb and found that it was
waiting for an event in a libevent function (I cannot
recall the name).
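
(For the record, a typical way to capture that backtrace again - assuming
gdb is installed and the hang is reproduced on the node where mpirun was
launched - is to attach to the hung process and dump every thread; the
pgrep expression is only illustrative.)

  gdb -p $(pgrep -u $USER -n mpirun) -batch -ex 'thread apply all bt'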

Is this related to your "connection issue in the OOB
subsystem"? Or a libevent version conflict? I guess these two
problems are related to each other. My guess is that they stop
at a very early stage, before reaching the mapping function,
because no message appears before the freeze.

Could you give me any hint or comment?

Regards,
Tetsuya Mishima


> It seems to be working fine for me:
>
> [rhc@bend001 tcp]$ mpirun -np 2 -host bend001 -report-bindings -mca
rmaps_lama_bind 1c -mca rmaps lama hostname
> bend001
> [bend001:17005] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]:
[../BB/../../../..][../../../../../..]
> [bend001:17005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
[BB/../../../../..][../../../../../..]
> bend001
> [rhc@bend001 tcp]$
>
> (I also checked the internals using "-mca rmaps_base_verbose 10") so it
> could be your hier inversion causing problems again. Or it could be that
> you are hitting a connection issue we are seeing in some scenarios in the
> OOB subsystem - though if you are able to run using a non-lama mapper,
> that would seem unlikely.
>
>
> On Dec 20, 2013, at 8:09 PM, tmish...@jcity.maeda.co.jp wrote:
>
>
>
> Hi Ralph,
>
> Thank you very much. I tried many things such as:
>
> mpirun -np 2 -host node05 -report-bindings -mca rmaps lama -mca
> rmaps_lama_bind 1c myprog
>
> But every try failed. At least they were accepted by openmpi-1.7.3,
> as far as I remember.
> Anyway, please check it when you have time, because my use of lama
> comes only from curiosity.
>
> Regards,
> Tetsuya Mishima
>
>
> I'll try to take a look at it - my expectation is that lama might get
> stuck because you didn't tell it a pattern to map, and I doubt that code
> path has seen much testing.
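
(For anyone reading along: a mapping pattern is normally given to lama
through its rmaps_lama_map parameter, alongside rmaps_lama_bind. The
exact value below, "csbnh", is quoted from memory and should be treated
as an assumption - check the lama documentation for the letters that
match your topology.)

  mpirun -np 2 -host node05 -report-bindings \
      -mca rmaps lama -mca rmaps_lama_map csbnh -mca rmaps_lama_bind 1c myprog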
>
>
> On Dec 20, 2013, at 5:52 PM, tmish...@jcity.maeda.co.jp wrote:
>
>
>
> Hi Ralph, I'm glad to hear that, thanks.
>
> By the way, yesterday I tried to check how lama in 1.7.4rc treats
> NUMA nodes.
>
> Then, even with this simple command line, it froze without any
> message:
>
> mpirun -np 2 -host node05 -mca rmaps lama myprog
>
> Could you check what happened?
>
> Is it better to open a new thread or continue this one?
>
> Regards,
> Tetsuya Mishima
>
>
> I'll make it work so that NUMA can be either above or below socket
>
> On Dec 20, 2013, at 2:57 AM, tmish...@jcity.maeda.co.jp wrote:
>
>
>
> Hi Brice,
>
> Thank you for your comment. I understand what you mean.
>
> My opinion came just from considering an easy way to adjust the code
> for the inversion of hierarchy in the object tree.
>
> Tetsuya Mishima
>
>
> I don't think there's any such difference.
> Also, all these NUMA architectures are reported the same way by hwloc,
> and therefore are used the same way in Open MPI.
>
> And yes, L3 and NUMA are topologically-identical on AMD Magny-Cours
> (and
> most recent AMD and Intel platforms).
>
> Brice
>
>
>
> On 20/12/2013 11:33, tmish...@jcity.maeda.co.jp wrote:
>
> Hi Ralph,
>
> The numa-node in AMD Magny-Cours/Interlagos is so-called cc (cache
> coherent) NUMA, which seems to be a little bit different from the
> traditional NUMA defined in openmpi.
>
> I notice that the ccNUMA object is almost the same as the L3cache
> object, so "-bind-to l3cache" or "-map-by l3cache" is valid for what
> I want to do. Therefore, "do not touch it" is one of the solutions,
> I think ...
>
> Anyway, mixing up these two types of numa is the problem.
>
> Regards,
> Tetsuya Mishima
>
> I can wait until it is fixed in 1.7.5 or later, because putting
> "-bind-to numa" and "-map-by numa" at the same time works as a
> workaround.
>
> Thanks,
> Tetsuya Mishima
>
> Yeah, it will impact everything that uses hwloc topology maps, I
> fear.
>
> One side note: you'll need to add --hetero-nodes to your cmd
> line. If we don't see that, we assume that all the node topologies
> are identical - which clearly isn't true here.
> I'll try to resolve the hier inversion over the holiday - won't
> be for 1.7.4, but hopefully for 1.7.5.
> Thanks
> Ralph
>
> On Dec 18, 2013, at 9:44 PM, tmish...@jcity.maeda.co.jp wrote:
>
>
> I think it's normal for AMD Opterons having 8/16 cores, such as
> Magny-Cours or Interlagos. Because it usually has 2 numa nodes
> in a cpu (socket), a numa-node cannot include a socket.

[OMPI users] Clarification about dual-rail capabilities (sharing)

2013-12-21 Thread Filippo Spiga
Dear Open MPI users,

in my institution a cluster with dual-rail IB has recently been deployed. Each
compute node has two physical single-port Mellanox Connect-IB MT27600 cards
(mlx5_0, mlx5_1). By running bandwidth tests (OSU 4.2 benchmark) using
MVAPICH2, I can achieve from one node to another (1 MPI process per node) up
to 12 GB/s using rail sharing, i.e. distributing small messages across both
HCA devices. This is good.

Then I switched to Open MPI (1.7.3 and 1.7.4rc1). I tried to use both HCAs
together, but it seems to me that only one is used (because there is only one
process per node?). In Open MPI it seems more complicated to set up such a
test. This is what I did...

mpirun --mca coll_fca_enable 0 --mca btl_openib_verbose 1 -host HOST1,HOST2 
--mca btl_openib_if_include mlx5_0,mlx5_1  -np 1 
./osu-bin/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw : -np 1 --mca 
coll_fca_enable 0 --mca btl_openib_verbose 1 --mca btl_openib_if_include 
mlx5_0,mlx5_1 ./osu-bin/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
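
(One note on the invocation rather than the result: since both ranks use
identical MCA settings, the same test can be written without the two app
contexts; -npernode 1 keeps one rank per host. This is only a sketch of
an equivalent command, reusing exactly the flags above.)

  mpirun -np 2 -npernode 1 -host HOST1,HOST2 \
      --mca coll_fca_enable 0 --mca btl_openib_verbose 1 \
      --mca btl_openib_if_include mlx5_0,mlx5_1 \
      ./osu-bin/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

(It can also help to confirm beforehand that both HCAs report an active
port, e.g. with ibv_devinfo from the OFED tools:
ibv_devinfo | grep -E 'hca_id|state')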

The max measured bandwidth is around 6.5 GB/s, basically the same as with a
single HCA.

What am I doing wrong? Is this the correct way to exploit a multi-rail system?

Many thanks in advance,
Regards

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert



Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 doesn't work for our Magny-Cours based 32-core node

2013-12-21 Thread Ralph Castain
It seems to be working fine for me:

[rhc@bend001 tcp]$ mpirun -np 2 -host bend001 -report-bindings -mca 
rmaps_lama_bind 1c -mca rmaps lama hostname
bend001
[bend001:17005] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../../../..][../../../../../..]
[bend001:17005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../..][../../../../../..]
bend001
[rhc@bend001 tcp]$ 

(I also checked the internals using "-mca rmaps_base_verbose 10") so it could 
be your hier inversion causing problems again. Or it could be that you are 
hitting a connection issue we are seeing in some scenarios in the OOB subsystem 
- though if you are able to run using a non-lama mapper, that would seem 
unlikely.


On Dec 20, 2013, at 8:09 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> Thank you very much. I tried many things such as:
> 
> mpirun -np 2 -host node05 -report-bindings -mca rmaps lama -mca
> rmaps_lama_bind 1c myprog
> 
> But every try failed. At least they were accepted by openmpi-1.7.3 as far
> as I remember.
> Anyway, please check it when you have time, because my use of lama comes
> only from curiosity.
> 
> Regards,
> Tetsuya Mishima
> 
> 
>> I'll try to take a look at it - my expectation is that lama might get
> stuck because you didn't tell it a pattern to map, and I doubt that code
> path has seen much testing.
>> 
>> 
>> On Dec 20, 2013, at 5:52 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Hi Ralph, I'm glad to hear that, thanks.
>>> 
>>> By the way, yesterday I tried to check how lama in 1.7.4rc treats
>>> NUMA nodes.
>>>
>>> Then, even with this simple command line, it froze without any
>>> message:
>>> 
>>> mpirun -np 2 -host node05 -mca rmaps lama myprog
>>> 
>>> Could you check what happened?
>>> 
>>> Is it better to open a new thread or continue this one?
>>> 
>>> Regards,
>>> Tetsuya Mishima
>>> 
>>> 
 I'll make it work so that NUMA can be either above or below socket
 
 On Dec 20, 2013, at 2:57 AM, tmish...@jcity.maeda.co.jp wrote:
 
> 
> 
> Hi Brice,
> 
> Thank you for your comment. I understand what you mean.
> 
> My opinion came just from considering an easy way to adjust the code
> for the inversion of hierarchy in the object tree.
> 
> Tetsuya Mishima
> 
> 
>> I don't think there's any such difference.
>> Also, all these NUMA architectures are reported the same way by hwloc,
>> and therefore are used the same way in Open MPI.
>> 
>> And yes, L3 and NUMA are topologically-identical on AMD Magny-Cours
>>> (and
>> most recent AMD and Intel platforms).
>> 
>> Brice
>> 
>> 
>> 
>> On 20/12/2013 11:33, tmish...@jcity.maeda.co.jp wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> The numa-node in AMD Magny-Cours/Interlagos is so-called cc (cache
>>> coherent) NUMA, which seems to be a little bit different from the
>>> traditional NUMA defined in openmpi.
>>> 
>>> I notice that the ccNUMA object is almost the same as the L3cache
>>> object, so "-bind-to l3cache" or "-map-by l3cache" is valid for what
>>> I want to do. Therefore, "do not touch it" is one of the solutions,
>>> I think ...
>>> 
>>> Anyway, mixing up these two types of numa is the problem.
>>> 
>>> Regards,
>>> Tetsuya Mishima
>>> 
 I can wait until it is fixed in 1.7.5 or later, because putting
 "-bind-to numa" and "-map-by numa" at the same time works as a
 workaround.
 
 Thanks,
 Tetsuya Mishima
 
> Yeah, it will impact everything that uses hwloc topology maps, I
> fear.
>
> One side note: you'll need to add --hetero-nodes to your cmd
> line. If we don't see that, we assume that all the node topologies
> are identical - which clearly isn't true here.
> I'll try to resolve the hier inversion over the holiday - won't
> be for 1.7.4, but hopefully for 1.7.5.
> Thanks
> Ralph
> 
> On Dec 18, 2013, at 9:44 PM, tmish...@jcity.maeda.co.jp wrote:
> 
>> 
>> I think it's normal for AMD Opterons having 8/16 cores, such as
>> Magny-Cours or Interlagos. Because it usually has 2 numa nodes
>> in a cpu (socket), a numa-node cannot include a socket. This type
>> of hierarchy would be natural.
>> 
>> (node03 is a Dell PowerEdge R815 and probably quite common, I guess)
>>
>> By the way, I think this inversion should affect the rmaps_lama
>> mapping.
>> 
>> Tetsuya Mishima
>> 
>>> Ick - yeah, that would be a problem. I haven't seen that type of
>>> hierarchical inversion before - is node03 a different type of chip?
>>> Might take a while for me to adjust the code to handle hier
>>> inversion... :-(
>>> On Dec 18, 2013, at 9:05 PM, tmish...@jcity.maeda.co.jp wrote: