Re: [OMPI users] Linkage problem

2018-04-05 Thread Quentin Faure
I solved my problem. I uninstalled all the MPI software that was on the 
computer and reinstalled Open MPI. It was still not working, so I uninstalled it 
again and reinstalled it, and now it is working. Apparently there was a 
problem with the installation.

Thanks for the help.

Quentin 

> On 4 Apr 2018, at 11:30, Jeff Squyres (jsquyres)  wrote:
> 
>> On Apr 4, 2018, at 12:58 PM, Quentin Faure  wrote:
>> 
>> Sorry, I did not see my autocorrect changed some word.
>> 
>> I added the -l and it did not change anything. Also the mpicxx —showme does 
>> not work. It says that the option —showme does not exist
> 
> If 'mpicxx --showme' (with 2 dashes) does not work, then you are not using 
> Open MPI's mpicxx.  You should check to make sure you are testing what you 
> think you are testing.
> 
> Note, too, that Nathan was pointing out a missing capital "I" (as in 
> "include"), not a missing lowercase "l" (as in "link").  Depending on the font 
> in your mail client, it can be difficult to tell the two apart.
> 
> He is correct that what you showed was not an *error* -- it was a warning 
> from the C++ compiler telling you that it ignored an argument on the 
> command line.  Specifically, you did:
> 
> mpicxx -g -O3  -DLAMMPS_GZIP -DLMP_USER_INTEL -DLMP_MPIIO  
> /usr/lib/openmpi/include -pthread -DFFT_FFTW3 -DFFT_SINGLE   
> -I../../lib/molfile   -c ../create_atoms.cpp
> 
> But you missed the -I (capital I, as in indigo, not lowercase l, as in llama).  
> It should have been:
> 
> mpicxx -g -O3  -DLAMMPS_GZIP -DLMP_USER_INTEL -DLMP_MPIIO -I 
> /usr/lib/openmpi/include -pthread -DFFT_FFTW3 -DFFT_SINGLE   
> -I../../lib/molfile   -c ../create_atoms.cpp
> 
> That being said, you shouldn't need to mention /usr/lib/openmpi/include at 
> all (even with the -I), because mpicxx will automatically insert that for 
> you.  Specifically: mpicxx is not a compiler itself -- it's just a "wrapper" 
> around the underlying C++ compiler.  All mpicxx does is add some additional 
> command line arguments and then invoke the underlying C++ compiler.  When you 
> run "mpicxx --showme" (with Open MPI's mpicxx command), it will show you the 
> underlying C++ compiler command that it would have invoked.
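> 
> As a quick sanity check (the exact output depends on your install prefix and 
> underlying compiler, so treat the lines below as illustrative), Open MPI's 
> wrapper supports:
> 
>   which mpicxx
>   mpicxx --showme           # full underlying C++ compiler command line
>   mpicxx --showme:compile   # just the compile flags (the -I... part)
>   mpicxx --showme:link      # just the link flags (-L... -lmpi ...)
> 
> If those --showme variants are rejected, the mpicxx you are invoking is not 
> Open MPI's.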
> 
> Similarly, the "ompi_info: error while loading shared libraries: 
> libmpi.so.40: cannot open shared object file: No such file or directory" 
> error means that you do not have Open MPI's libmpi at the front of your 
> searchable library path.  See 
> https://www.open-mpi.org/faq/?category=running#adding-ompi-to-path.
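> 
> For example (the prefix below is a placeholder -- substitute wherever your 
> Open MPI is actually installed), in your shell startup or job script:
> 
>   export PATH=/opt/openmpi/bin:$PATH
>   export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
> 
> so that libmpi.so.40 from the intended install is found first.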
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] disabling libraries?

2018-04-05 Thread Gilles Gouaillardet
Michael,

in this case, you can
mpirun --mca oob ^ud ...
in order to blacklist the oob/ud component.

an alternative is to add
oob = ^ud
in /.../etc/openmpi-mca-params.conf

If Open MPI is installed on a local filesystem, then this setting can
be node specific.


That being said, the error suggests that mca_oob_ud.so is a module from a
previous install, that Open MPI was not built on the system it is running
on, or that libibverbs.so.1 was removed after Open MPI was built.
So I do encourage you to take a step back and think about whether you can
find a better solution for your site.
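
The mapping from warning to parameter is mechanical: a plugin is named
mca_<framework>_<component>.so, so mca_oob_ud.so belongs to framework "oob",
component "ud", and is excluded with "oob = ^ud". A warning about, say,
mca_btl_openib.so (just an illustration here, not a recommendation) would
likewise be silenced with "btl = ^openib". In the params file that looks like:

  # <prefix>/etc/openmpi-mca-params.conf
  oob = ^ud
  btl = ^openib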


Cheers,

Gilles

On Fri, Apr 6, 2018 at 3:37 AM, Michael Di Domenico
 wrote:
> i'm trying to compile openmpi to support all of our interconnects,
> psm/openib/mxm/etc
>
> this works fine, openmpi finds all the libs, compiles and runs on each
> of the respective machines
>
> however, we don't install the libraries for everything everywhere
>
> so when i run things like ompi_info and mpirun i get
>
> mca_base_component_repository_open: unable to open mca_oob_ud:
> libibverbs.so.1: cannot open shared object file: no such file or
> directory (ignored)
>
> and so on, for a bunch of other libs.
>
> i understand how the lib linking works so this isn't unexpected and
> doesn't stop the mpi programs from running.
>
> here's the part i don't understand: how can i trace the above warning
> and others like it back to the required --mca parameters i need to add
> to the configuration to make the warnings go away?
>
> as an aside, i believe i can set most of them via environment
> variables as well as the command line, but what i'd really like to do is set
> them from a file.  i know i can create a default param file, but is
> there a way to feed a param file at invocation depending on where mpirun
> is being run?
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Gilles Gouaillardet
Noam,

you might also want to try

mpirun --mca btl tcp,self ...

to rule out btl (shared memory and/or infiniband) related issues.


Once you rebuild Open MPI with --enable-debug, I recommend you first
check the arguments of the MPI_Send() and MPI_Recv() functions and
make sure
 - same communicator is used (in C, check comm->c_contextid)
 - same tag
 - double check the MPI tasks do wait for each other (in C, check
comm->c_my_rank, source and dest)
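
To make that concrete, here is a minimal sketch of the kind of matched pair to
check (the communicator name, tag value and buffer size are made up for
illustration; the point is that comm, tag and ranks must agree on both sides):

  #include <mpi.h>
  #define N 1024
  /* row_comm: the 4-process row communicator from the Cartesian grid.
     Both sides must use the same communicator and the same tag. */
  void exchange(MPI_Comm row_comm)
  {
      double buf[N] = {0};
      int row_rank, tag = 42;
      MPI_Comm_rank(row_comm, &row_rank);
      if (row_rank == 0) {
          for (int src = 1; src <= 3; src++)   /* rank 0 receives from 1, 2, 3 */
              MPI_Recv(buf, N, MPI_DOUBLE, src, tag, row_comm, MPI_STATUS_IGNORE);
      } else {
          MPI_Send(buf, N, MPI_DOUBLE, 0, tag, row_comm);   /* 1, 2, 3 send to 0 */
      }
  }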


Cheers,

Gilles

On Fri, Apr 6, 2018 at 5:31 AM, George Bosilca  wrote:
> Yes, you can do this by adding --enable-debug to OMPI configure (and make
> sure you don't have the configure flag --with-platform=optimize).
>
>   George.
>
>
> On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein 
> wrote:
>>
>>
>> On Apr 5, 2018, at 4:11 PM, George Bosilca  wrote:
>>
>> I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm,
>> 1)". This allows the debugger to call into our function and output
>> internal information about the library's status.
>>
>>
>> Great.  But I guess I need to recompile ompi in debug mode?  Is that just
>> a flag to configure?
>>
>> thanks,
>> Noam
>>
>>
>> 
>>
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://www.nrl.navy.mil
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Ben Menadue
Hi Nathan, Howard,

Thanks for the feedback. Yes, we do already have UCX compiled into our OpenMPI 
installations, but it’s disabled by default on our system because some users 
were reporting problems with it previously. I’m not sure what the status of 
those problems is with OpenMPI 3.0, though; something for me to follow up on with them.

Cheers,
Ben



> On 6 Apr 2018, at 2:48 am, Nathan Hjelm  wrote:
> 
> 
> Honestly, this is a configuration issue with the openib btl. There is no 
> reason to keep eager RDMA enabled, nor is there a reason to pipeline RDMA. I 
> haven't found an app where either of these "features" helps you with 
> InfiniBand. You have the right idea with the parameter changes, but Howard is 
> correct: for Mellanox the future is UCX, not verbs. I would try it and see if 
> it works for you, but if it doesn't, I would set those two parameters in your 
> /etc/openmpi-mca-params.conf and run like that.
> 
> -Nathan
> 
> On Apr 05, 2018, at 01:18 AM, Ben Menadue  wrote:
> 
>> Hi,
>> 
>> Another interesting point. I noticed that the last two message sizes tested 
>> (2MB and 4MB) are lower than expected for both osu_bw and osu_bibw. 
>> Increasing the minimum size to use the RDMA pipeline to above these sizes 
>> brings those two data-points up to scratch for both benchmarks:
>> 
>> 3.0.0, osu_bw, no rdma for large messages
>> 
>> > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by ppr:1:node 
>> > -np 2 -H r6,r7 ./osu_bw -m 2097152:4194304
>> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
>> # Size  Bandwidth (MB/s)
>> 2097152  6133.22
>> 4194304  6054.06
>> 
>> 3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages
>> 
>> > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca 
>> > btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw 
>> > -m 2097152:4194304
>> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
>> # Size  Bandwidth (MB/s)
>> 2097152 11397.85
>> 4194304 11389.64
>> 
>> This makes me think something odd is going on in the RDMA pipeline.
>> 
>> Cheers,
>> Ben
>> 
>> 
>> 
>>> On 5 Apr 2018, at 5:03 pm, Ben Menadue wrote:
>>> Hi,
>>> 
>>> We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed 
>>> that osu_bibw gives nowhere near the bandwidth I’d expect (this is on FDR 
>>> IB). However, osu_bw is fine.
>>> 
>>> If I disable eager RDMA, then osu_bibw gives the expected numbers. 
>>> Similarly, if I increase the number of eager RDMA buffers, it gives the 
>>> expected results.
>>> 
>>> OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings, 
>>> but they’re not as good as 3.0.0 (when tuned) for large buffers. The same 
>>> option changes produce no difference in performance for 1.10.7.
>>> 
>>> I was wondering if anyone else has noticed anything similar, and if this is 
>>> unexpected, if anyone has a suggestion on how to investigate further?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> Here are the numbers:
>>> 
>>> 3.0.0, osu_bw, default settings
>>> 
>>> > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw
>>> # OSU MPI Bandwidth Test v5.4.0
>>> # Size  Bandwidth (MB/s)
>>> 1   1.13
>>> 2   2.29
>>> 4   4.63
>>> 8   9.21
>>> 16 18.18
>>> 32 36.46
>>> 64 69.95
>>> 128   128.55
>>> 256   250.74
>>> 512   451.54
>>> 1024  829.44
>>> 2048 1475.87
>>> 4096 2119.99
>>> 8192 3452.37
>>> 16384    2866.51
>>> 32768    4048.17
>>> 65536    5030.54
>>> 131072   5573.81
>>> 262144   5861.61
>>> 524288   6015.15
>>> 1048576  6099.46
>>> 2097152   989.82
>>> 4194304   989.81
>>> 
>>> 3.0.0, osu_bibw, default settings
>>> 
>>> > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
>>> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
>>> # Size  Bandwidth (MB/s)
>>> 1   0.00
>>> 2   0.01
>>> 4   0.01
>>> 8   0.02
>>> 16  0.04
>>> 32  0.09
>>> 64  0.16
>>> 128   135.30
>>> 256   265.35
>>> 512   499.92
>>> 1024  949.22
>>> 2048 1440.27
>>> 4096 1960.09
>>> 8192 3166.97
>>> 16384 127.62
>>> 32768 165.12
>>> 65536 312.80
>>> 131072   1120.03
>>> 262144   4724.01
>>> 524288   4545.93
>>> 1048576  5186.51
>>> 2097152   989.84
>>> 4194304   989.88
>>> 
>>> 

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
Yes, you can do this by adding --enable-debug to OMPI configure (and make
sure you don't have the configure flag --with-platform=optimize).

  George.


On Thu, Apr 5, 2018 at 4:20 PM, Noam Bernstein 
wrote:

>
> On Apr 5, 2018, at 4:11 PM, George Bosilca  wrote:
>
> I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm,
> 1)". This allows the debugger to call into our function and output
> internal information about the library's status.
>
>
> Great.  But I guess I need to recompile ompi in debug mode?  Is that just
> a flag to configure?
>
> thanks,
> Noam
>
>
> 
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein

> On Apr 5, 2018, at 4:11 PM, George Bosilca  wrote:
> 
> I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm, 1)". 
> This allows the debugger to call into our function and output internal 
> information about the library's status.

Great.  But I guess I need to recompile ompi in debug mode?  Is that just a 
flag to configure?

thanks,
Noam




Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
I attach gdb to the processes and do a "call mca_pml_ob1_dump(comm,
1)". This allows the debugger to call into our function and output
internal information about the library's status.

  George.



On Thu, Apr 5, 2018 at 4:03 PM, Noam Bernstein 
wrote:

> On Apr 5, 2018, at 3:55 PM, George Bosilca  wrote:
>
> Noam,
>
> The OB1 PML provides a mechanism to dump all pending communications in a
> particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
> 1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
> idea how you can find the pointer to the communicator from your code, but
> if you compile OMPI in debug mode you will see it as an argument to the
> mca_pml_ob1_send and mca_pml_ob1_recv functions.
>
> This information will give us a better idea of what happened to the
> message, where it has been sent (or not), and what source and tag were
> used for the matching.
>
>
> Interesting.  How would you do this in a hung program?  Call it before you
> call the things that you expect will hang?  And any ideas how to get a
> communicator pointer from fortran?
>
> Noam
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 3:55 PM, George Bosilca  wrote:
> 
> Noam,
> 
> The OB1 PML provides a mechanism to dump all pending communications in a 
> particular communicator. To do this I usually call mca_pml_ob1_dump(comm, 1), 
> with comm being the MPI_Comm and 1 being the verbose mode. I have no idea how 
> you can find the pointer to the communicator from your code, but if you 
> compile OMPI in debug mode you will see it as an argument to the 
> mca_pml_ob1_send and mca_pml_ob1_recv functions.
> 
> This information will give us a better idea of what happened to the message, 
> where it has been sent (or not), and what source and tag were used for 
> the matching.

Interesting.  How would you do this in a hung program?  Call it before you call 
the things that you expect will hang?  And any ideas how to get a communicator 
pointer from fortran?

Noam


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread George Bosilca
Noam,

The OB1 PML provides a mechanism to dump all pending communications in a
particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
1), with comm being the MPI_Comm and 1 being the verbose mode. I have no
idea how you can find the pointer to the communicator from your code, but
if you compile OMPI in debug mode you will see it as an argument to
the mca_pml_ob1_send and mca_pml_ob1_recv functions.

This information will give us a better idea of what happened to the
message, where it has been sent (or not), and what source and tag were
used for the matching.
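
For reference, the gdb session this implies looks roughly like the sketch below
(the PID and frame number are placeholders), and MPI_Comm_f2c is the standard
way to turn a Fortran INTEGER communicator into a C handle if you want to grab
it from a small C helper instead:

  gdb -p <pid-of-a-hung-rank>
  (gdb) bt                                # confirm it is inside mca_pml_ob1_send/recv
  (gdb) frame <n>                         # select the mca_pml_ob1_send or _recv frame
  (gdb) call mca_pml_ob1_dump(comm, 1)    # comm is visible as an argument in a debug build

  /* from Fortran, via a tiny C helper (the helper name is illustrative): */
  /* #include <mpi.h> */
  MPI_Comm comm_from_fortran(MPI_Fint f_comm) { return MPI_Comm_f2c(f_comm); }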

  George.



On Thu, Apr 5, 2018 at 12:01 PM, Edgar Gabriel 
wrote:

> is the file I/O that you mentioned using MPI I/O? If yes, what
> file system are you writing to?
>
> Edgar
>
>
>
> On 4/5/2018 10:15 AM, Noam Bernstein wrote:
>
>> On Apr 5, 2018, at 11:03 AM, Reuti  wrote:
>>>
>>> Hi,
>>>
>>> On 05.04.2018 at 16:16, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:

 Hi all - I have a code that uses MPI (vasp), and it’s hanging in a
 strange way.  Basically, there’s a Cartesian communicator, 4x16 (64
 processes total), and despite the fact that the communication pattern is
 rather regular, one particular send/recv pair hangs consistently.
 Basically, across each row of 4, task 0 receives from 1,2,3, and tasks
 1,2,3 send to 0.  On most of the 16 such sets all those send/recv pairs
 complete.  However, on 2 of them, it hangs (both the send and recv).  I
 have stack traces (with gdb -p on the running processes) from what I
 believe are corresponding send/recv pairs.

 

 This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older
 versions), Intel compilers (17.2.174). It seems to be independent of which
 nodes, always happens on this pair of calls and happens after the code has
 been running for a while, and the same code for the other 14 sets of 4 work
 fine, suggesting that it’s an MPI issue, rather than an obvious bug in this
 code or a hardware problem.  Does anyone have any ideas, either about
 possible causes or how to debug things further?

>>> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL
>>> with the Intel compilers for VASP and found that using a self-compiled
>>> scaLAPACK in addition works fine in combination with Open MPI. Using
>>> Intel scaLAPACK and Intel MPI is also working fine. What I never got
>>> working was the combination of Intel scaLAPACK and Open MPI: at one point one
>>> process got a message from a wrong rank, IIRC. I tried both the Intel-supplied
>>> Open MPI version of scaLAPACK and compiling the necessary
>>> interface on my own for Open MPI in $MKLROOT/interfaces/mklmpi, with
>>> identical results.
>>>
>> MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I
>> set LSCALAPACK=.FALSE. I suppose I could try compiling without it just to
>> test.  In any case, this is when it’s writing out the wavefunctions, which
>> I would assume to be unrelated to scalapack operations (unless they’re
>> corrupting some low-level MPI thing, I guess).
>>
>>
>>   Noam
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] disabling libraries?

2018-04-05 Thread Michael Di Domenico
i'm trying to compile openmpi to support all of our interconnects,
psm/openib/mxm/etc

this works fine, openmpi finds all the libs, compiles and runs on each
of the respective machines

however, we don't install the libraries for everything everywhere

so when i run things like ompi_info and mpirun i get

mca_base_component_repository_open: unable to open mca_oob_ud:
libibverbs.so.1: cannot open shared object file: no such file or
directory (ignored)

and so on, for a bunch of other libs.

i understand how the lib linking works so this isn't unexpected and
doesn't stop the mpi programs from running.

here's the part i don't understand: how can i trace the above warning
and others like it back to the required --mca parameters i need to add
to the configuration to make the warnings go away?

as an aside, i believe i can set most of them via environment
variables as well as the command line, but what i'd really like to do is set
them from a file.  i know i can create a default param file, but is
there a way to feed a param file at invocation depending on where mpirun
is being run?
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Nathan Hjelm


Honestly, this is a configuration issue with the openib btl. There is no reason to keep 
eager RDMA enabled, nor is there a reason to pipeline RDMA. I haven't found an app where 
either of these "features" helps you with InfiniBand. You have the right idea 
with the parameter changes, but Howard is correct: for Mellanox the future is UCX, not 
verbs. I would try it and see if it works for you, but if it doesn't, I would set those two 
parameters in your /etc/openmpi-mca-params.conf and run like that.
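
For example (one possible reading of the advice above; the parameter names and
the pipeline cutoff value are the ones from Ben's mpirun lines quoted below):

  # /etc/openmpi-mca-params.conf
  btl_openib_use_eager_rdma = 0
  btl_openib_min_rdma_pipeline_size = 4194304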

-Nathan

On Apr 05, 2018, at 01:18 AM, Ben Menadue  wrote:

Hi,

Another interesting point. I noticed that the last two message sizes tested 
(2MB and 4MB) are lower than expected for both osu_bw and osu_bibw. Increasing 
the minimum size to use the RDMA pipeline to above these sizes brings those two 
data-points up to scratch for both benchmarks:

3.0.0, osu_bw, no rdma for large messages


mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by ppr:1:node -np 2 
-H r6,r7 ./osu_bw -m 2097152:4194304

# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
2097152              6133.22
4194304              6054.06

3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages


mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca 
btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m 
2097152:4194304

# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
2097152             11397.85
4194304             11389.64

This makes me think something odd is going on in the RDMA pipeline.

Cheers,
Ben



On 5 Apr 2018, at 5:03 pm, Ben Menadue  wrote:
Hi,

We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed that 
osu_bibw gives nowhere near the bandwidth I’d expect (this is on FDR IB). 
However, osu_bw is fine.

If I disable eager RDMA, then osu_bibw gives the expected numbers. Similarly, 
if I increase the number of eager RDMA buffers, it gives the expected results.

OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings, but 
they’re not as good as 3.0.0 (when tuned) for large buffers. The same option 
changes produce no difference in performance for 1.10.7.

I was wondering if anyone else has noticed anything similar, and if this is 
unexpected, if anyone has a suggestion on how to investigate further?

Thanks,
Ben


Here are the numbers:

3.0.0, osu_bw, default settings


mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw

# OSU MPI Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       1.13
2                       2.29
4                       4.63
8                       9.21
16                     18.18
32                     36.46
64                     69.95
128                   128.55
256                   250.74
512                   451.54
1024                  829.44
2048                 1475.87
4096                 2119.99
8192                 3452.37
16384                2866.51
32768                4048.17
65536                5030.54
131072               5573.81
262144               5861.61
524288               6015.15
1048576              6099.46
2097152               989.82
4194304               989.81

3.0.0, osu_bibw, default settings


mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw

# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       0.00
2                       0.01
4                       0.01
8                       0.02
16                      0.04
32                      0.09
64                      0.16
128                   135.30
256                   265.35
512                   499.92
1024                  949.22
2048                 1440.27
4096                 1960.09
8192                 3166.97
16384                 127.62
32768                 165.12
65536                 312.80
131072               1120.03
262144               4724.01
524288               4545.93
1048576              5186.51
2097152               989.84
4194304               989.88

3.0.0, osu_bibw, eager RDMA disabled


mpirun -mca btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 
./osu_bibw

# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size      Bandwidth (MB/s)
1                       1.49
2                       2.97
4                       5.96
8                      11.98
16                     23.95
32                     47.39
64                     93.57
128                   153.82
256                   304.69
512                   572.30
1024                 1003.52
2048                 1083.89
4096                 1879.32
8192                 2785.18
16384                3535.77
32768                5614.72
65536                8113.69
131072               9666.74
262144              10738.97
524288              11247.02
1048576             11416.50
2097152               989.88
4194304               989.88

3.0.0, osu_bibw, increased eager RDMA 

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Edgar Gabriel
is the file I/O that you mentioned using MPI I/O? If yes, what 
file system are you writing to?


Edgar


On 4/5/2018 10:15 AM, Noam Bernstein wrote:

On Apr 5, 2018, at 11:03 AM, Reuti  wrote:

Hi,


On 05.04.2018 at 16:16, Noam Bernstein wrote:

Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. 
 Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and 
despite the fact that the communication pattern is rather regular, one 
particular send/recv pair hangs consistently.  Basically, across each row of 4, 
task 0 receives from 1,2,3, and tasks 1,2,3 send to 0.  On most of the 16 such 
sets all those send/recv pairs complete.  However, on 2 of them, it hangs (both 
the send and recv).  I have stack traces (with gdb -p on the running processes) 
from what I believe are corresponding send/recv pairs.



This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), 
Intel compilers (17.2.174). It seems to be independent of which nodes, always 
happens on this pair of calls and happens after the code has been running for a 
while, and the same code for the other 14 sets of 4 work fine, suggesting that 
it’s an MPI issue, rather than an obvious bug in this code or a hardware 
problem.  Does anyone have any ideas, either about possible causes or how to 
debug things further?

Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the 
Intel compilers for VASP and found that using a self-compiled scaLAPACK in 
addition works fine in combination with Open MPI. Using Intel scaLAPACK 
and Intel MPI is also working fine. What I never got working was the 
combination of Intel scaLAPACK and Open MPI: at one point one process got a 
message from a wrong rank, IIRC. I tried both the Intel-supplied Open MPI 
version of scaLAPACK and compiling the necessary interface on my own for 
Open MPI in $MKLROOT/interfaces/mklmpi, with identical results.

MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I set 
LSCALAPACK=.FALSE. I suppose I could try compiling without it just to test.  In 
any case, this is when it’s writing out the wavefunctions, which I would assume 
to be unrelated to scalapack operations (unless they’re corrupting some low-level 
MPI thing, I guess).


Noam

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Howard Pritchard
Hello Ben,

Thanks for the info.   You would probably be better off installing UCX on
your cluster and rebuilding your Open MPI with the
--with-ucx
configure option.
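
For example (the install prefixes below are placeholders; --with-ucx is the
only Open MPI-specific piece here):

  ./configure --prefix=/opt/openmpi-3.0.1 --with-ucx=/opt/ucx-1.2.2 ...
  make -j && make install

  # then select the UCX PML at run time, either on the command line
  mpirun --mca pml ucx -map-by ppr:1:node -np 2 ./osu_bibw
  # or via the environment, as below
  export OMPI_MCA_pml=ucx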

Here's what I'm seeing with Open MPI 3.0.1 on a ConnectX5 based cluster
using ob1/openib BTL:

mpirun -map-by ppr:1:node -np 2 ./osu_bibw

# OSU MPI Bi-Directional Bandwidth Test v5.1

# Size  Bandwidth (MB/s)

1   0.00

2   0.00

4   0.01

8   0.02

16  0.04

32  0.07

64  0.13

128   273.64

256   485.04

512   869.51

1024 1434.99

2048 2208.12

4096 3055.67

8192 3896.93

16384  89.29

32768 252.59

65536 614.42

131072  22878.74

262144  23846.93

524288  24256.23

1048576 24498.27

2097152 24615.64

4194304 24632.58


export OMPI_MCA_pml=ucx

# OSU MPI Bi-Directional Bandwidth Test v5.1

# Size  Bandwidth (MB/s)

1   4.57

2   8.95

4  17.67

8  35.99

16 71.99

32141.56

64208.86

128   410.32

256   495.56

512  1455.98

1024 2414.78

2048 3008.19

4096 5351.62

8192 5563.66

16384   5945.16

32768   6061.33

65536   21376.89

131072  23462.99

262144  24064.56

524288  24366.84

1048576 24550.75

2097152 24649.03

4194304 24693.77

You can get ucx off of GitHub

https://github.com/openucx/ucx/releases


There is also a pre-release version of UCX (1.3.0RCX?) packaged as an RPM

available in MOFED 4.3.  See


http://www.mellanox.com/page/products_dyn?product_family=26=linux_sw_drivers


I was using UCX 1.2.2 for the results above.


Good luck,


Howard




2018-04-05 1:12 GMT-06:00 Ben Menadue :

> Hi,
>
> Another interesting point. I noticed that the last two message sizes
> tested (2MB and 4MB) are lower than expected for both osu_bw and osu_bibw.
> Increasing the minimum size to use the RDMA pipeline to above these sizes
> brings those two data-points up to scratch for both benchmarks:
>
> *3.0.0, osu_bw, no rdma for large messages*
>
> > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by
> ppr:1:node -np 2 -H r6,r7 ./osu_bw -m 2097152:4194304
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 2097152  6133.22
> 4194304  6054.06
>
> *3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages*
>
> > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca
> btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m
> 2097152:4194304
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 2097152 11397.85
> 4194304 11389.64
>
> This makes me think something odd is going on in the RDMA pipeline.
>
> Cheers,
> Ben
>
>
>
> On 5 Apr 2018, at 5:03 pm, Ben Menadue  wrote:
>
> Hi,
>
> We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed
> that *osu_bibw* gives nowhere near the bandwidth I’d expect (this is on
> FDR IB). However, *osu_bw* is fine.
>
> If I disable eager RDMA, then *osu_bibw* gives the expected
> numbers. Similarly, if I increase the number of eager RDMA buffers, it
> gives the expected results.
>
> OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings,
> but they’re not as good as 3.0.0 (when tuned) for large buffers. The same
> option changes produce no difference in performance for 1.10.7.
>
> I was wondering if anyone else has noticed anything similar, and if this
> is unexpected, if anyone has a suggestion on how to investigate further?
>
> Thanks,
> Ben
>
>
> Here are the numbers:
>
> *3.0.0, osu_bw, default settings*
>
> > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw
> # OSU MPI Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 1   1.13
> 2   2.29
> 4   4.63
> 8   9.21
> 16 18.18
> 32 36.46
> 64 69.95
> 128   128.55
> 256   250.74
> 512   451.54
> 1024  829.44
> 2048 1475.87
> 4096 2119.99
> 8192 3452.37
> 16384    2866.51
> 32768    4048.17
> 65536    5030.54
> 131072   5573.81
> 262144   5861.61
> 524288   6015.15
> 1048576

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 11:32 AM, Edgar Gabriel  wrote:
> 
> is the file I/O that you mentioned using MPI I/O? If yes, what file 
> system are you writing to?

No MPI I/O.  Just MPI calls to gather the data, and plain Fortran I/O on the 
head node only.  

I should also say that in lots of other circumstances (different node numbers, 
computational systems, etc) it works fine.  But the hang is completely 
repeatable for this particular set of parameters (MPI and physical simulation). 
 I haven’t explored to see what variations do/don’t lead to this kind of 
hanging.


Noam



Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
> On Apr 5, 2018, at 11:03 AM, Reuti  wrote:
> 
> Hi,
> 
>> On 05.04.2018 at 16:16, Noam Bernstein wrote:
>> 
>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange 
>> way.  Basically, there’s a Cartesian communicator, 4x16 (64 processes 
>> total), and despite the fact that the communication pattern is rather 
>> regular, one particular send/recv pair hangs consistently.  Basically, 
>> across each row of 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0. 
>>  On most of the 16 such sets all those send/recv pairs complete.  However, 
>> on 2 of them, it hangs (both the send and recv).  I have stack traces (with 
>> gdb -p on the running processes) from what I believe are corresponding 
>> send/recv pairs.  
>> 
>> 
>> 
>> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), 
>> Intel compilers (17.2.174). It seems to be independent of which nodes, 
>> always happens on this pair of calls and happens after the code has been 
>> running for a while, and the same code for the other 14 sets of 4 work fine, 
>> suggesting that it’s an MPI issue, rather than an obvious bug in this code 
>> or a hardware problem.  Does anyone have any ideas, either about possible 
>> causes or how to debug things further?
> 
> Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with 
> the Intel compilers for VASP and found that using a self-compiled 
> scaLAPACK in addition works fine in combination with Open MPI. Using 
> Intel scaLAPACK and Intel MPI is also working fine. What I never got working 
> was the combination of Intel scaLAPACK and Open MPI: at one point one process 
> got a message from a wrong rank, IIRC. I tried both the Intel-supplied Open 
> MPI version of scaLAPACK and compiling the necessary interface on my own 
> for Open MPI in $MKLROOT/interfaces/mklmpi, with identical results.

MKL BLAS/LAPACK, with my own self-compiled scalapack, but in this run I set 
LSCALAPACK=.FALSE. I suppose I could try compiling without it just to test.  In 
any case, this is when it’s writing out the wavefunctions, which I would assume 
to be unrelated to scalapack operations (unless they’re corrupting some low-level 
MPI thing, I guess).


Noam

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Reuti
Hi,

> On 05.04.2018 at 16:16, Noam Bernstein wrote:
> 
> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange 
> way.  Basically, there’s a Cartesian communicator, 4x16 (64 processes total), 
> and despite the fact that the communication pattern is rather regular, one 
> particular send/recv pair hangs consistently.  Basically, across each row of 
> 4, task 0 receives from 1,2,3, and tasks 1,2,3 send to 0.  On most of the 16 
> such sets all those send/recv pairs complete.  However, on 2 of them, it 
> hangs (both the send and recv).  I have stack traces (with gdb -p on the 
> running processes) from what I believe are corresponding send/recv pairs.  
> 
> 
> 
> This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), 
> Intel compilers (17.2.174). It seems to be independent of which nodes, always 
> happens on this pair of calls and happens after the code has been running for 
> a while, and the same code for the other 14 sets of 4 work fine, suggesting 
> that it’s an MPI issue, rather than an obvious bug in this code or a hardware 
> problem.  Does anyone have any ideas, either about possible causes or how to 
> debug things further?

Do you use scaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL with the 
Intel compilers for VASP and found that using a self-compiled scaLAPACK in 
addition works fine in combination with Open MPI. Using Intel scaLAPACK 
and Intel MPI is also working fine. What I never got working was the 
combination of Intel scaLAPACK and Open MPI: at one point one process got a 
message from a wrong rank, IIRC. I tried both the Intel-supplied Open MPI 
version of scaLAPACK and compiling the necessary interface on my own for 
Open MPI in $MKLROOT/interfaces/mklmpi, with identical results.

-- Reuti
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] mpi send/recv pair hangin

2018-04-05 Thread Noam Bernstein
Hi all - I have a code that uses MPI (vasp), and it’s hanging in a strange way. 
 Basically, there’s a Cartesian communicator, 4x16 (64 processes total), and 
despite the fact that the communication pattern is rather regular, one 
particular send/recv pair hangs consistently.  Basically, across each row of 4, 
task 0 receives from 1,2,3, and tasks 1,2,3 send to 0.  On most of the 16 such 
sets all those send/recv pairs complete.  However, on 2 of them, it hangs (both 
the send and recv).  I have stack traces (with gdb -p on the running processes) 
from what I believe are corresponding send/recv pairs.  

receiving:
0x2b06eeed0eb2 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#0  0x2b06eeed0eb2 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1  0x2b06f0a5d2de in poll_device () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#2  0x2b06f0a5e0af in btl_openib_component_progress () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#3  0x2b06dd3c00b0 in opal_progress () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40
#4  0x2b06f1c9232d in mca_pml_ob1_recv () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_pml_ob1.so
#5  0x2b06dce56bb7 in PMPI_Recv () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
#6  0x2b06dcbd1e0b in pmpi_recv__ () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40
#7  0x0042887b in m_recv_z (comm=..., node=-858993460, zvec=) at 
mpi.F:680
#8  0x0123e0b7 in fileio::outwav (io=..., wdes=..., w=) at fileio.F:952
#9  0x02abfccf in vamp () at main.F:4204
#10 0x004139de in main ()
#11 0x00314561ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x004138e9 in _start ()
sending:
0x2abc32ed0ea1 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#0  0x2abc32ed0ea1 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1  0x2abc34a5d2de in poll_device () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#2  0x2abc34a5e0af in btl_openib_component_progress () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_btl_openib.so
#3  0x2abc238800b0 in opal_progress () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libopen-pal.so.40
#4  0x2abc35c95955 in mca_pml_ob1_send () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/openmpi/mca_pml_ob1.so
#5  0x2abc2331c412 in PMPI_Send () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi.so.40
#6  0x2abc230927e0 in pmpi_send__ () from 
/usr/local/openmpi/3.0.1/x86_64/ib/intel/11.1.080/lib/libmpi_mpifh.so.40
#7  0x00428798 in m_send_z (comm=..., node=) at mpi.F:655
#8  0x0123d0a9 in fileio::outwav (io=..., wdes=) at fileio.F:942
#9  0x02abfccf in vamp () at main.F:4204
#10 0x004139de in main ()
#11 0x003cec81ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x004138e9 in _start ()

This is with OpenMPI 3.0.1 (same for 3.0.0, haven’t checked older versions), 
Intel compilers (17.2.174). It seems to be independent of which nodes, always 
happens on this pair of calls and happens after the code has been running for a 
while, and the same code for the other 14 sets of 4 work fine, suggesting that 
it’s an MPI issue, rather than an obvious bug in this code or a hardware 
problem.  Does anyone have any ideas, either about possible causes or how to 
debug things further?


thanks,

Noam___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Ben Menadue
Hi,

Another interesting point. I noticed that the last two message sizes tested 
(2MB and 4MB) are lower than expected for both osu_bw and osu_bibw. Increasing 
the minimum size to use the RDMA pipeline to above these sizes brings those two 
data-points up to scratch for both benchmarks:

3.0.0, osu_bw, no rdma for large messages

> mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by ppr:1:node -np 
> 2 -H r6,r7 ./osu_bw -m 2097152:4194304
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
2097152  6133.22
4194304  6054.06

3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages

> mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca 
> btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m 
> 2097152:4194304
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
2097152 11397.85
4194304 11389.64

This makes me think something odd is going on in the RDMA pipeline.

Cheers,
Ben



> On 5 Apr 2018, at 5:03 pm, Ben Menadue  wrote:
> 
> Hi,
> 
> We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed 
> that osu_bibw gives nowhere near the bandwidth I’d expect (this is on FDR 
> IB). However, osu_bw is fine.
> 
> If I disable eager RDMA, then osu_bibw gives the expected numbers. Similarly, 
> if I increase the number of eager RDMA buffers, it gives the expected results.
> 
> OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings, 
> but they’re not as good as 3.0.0 (when tuned) for large buffers. The same 
> option changes produce no difference in performance for 1.10.7.
> 
> I was wondering if anyone else has noticed anything similar, and if this is 
> unexpected, if anyone has a suggestion on how to investigate further?
> 
> Thanks,
> Ben
> 
> 
> Here are the numbers:
> 
> 3.0.0, osu_bw, default settings
> 
> > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw
> # OSU MPI Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 1   1.13
> 2   2.29
> 4   4.63
> 8   9.21
> 16 18.18
> 32 36.46
> 64 69.95
> 128   128.55
> 256   250.74
> 512   451.54
> 1024  829.44
> 2048 1475.87
> 4096 2119.99
> 8192 3452.37
> 16384    2866.51
> 32768    4048.17
> 65536    5030.54
> 131072   5573.81
> 262144   5861.61
> 524288   6015.15
> 1048576  6099.46
> 2097152   989.82
> 4194304   989.81
> 
> 3.0.0, osu_bibw, default settings
> 
> > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 1   0.00
> 2   0.01
> 4   0.01
> 8   0.02
> 16  0.04
> 32  0.09
> 64  0.16
> 128   135.30
> 256   265.35
> 512   499.92
> 1024  949.22
> 2048 1440.27
> 4096 1960.09
> 8192 3166.97
> 16384 127.62
> 32768 165.12
> 65536 312.80
> 131072   1120.03
> 262144   4724.01
> 524288   4545.93
> 1048576  5186.51
> 2097152   989.84
> 4194304   989.88
> 
> 3.0.0, osu_bibw, eager RDMA disabled
> 
> > mpirun -mca btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 
> > ./osu_bibw
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 1   1.49
> 2   2.97
> 4   5.96
> 8  11.98
> 16 23.95
> 32 47.39
> 64 93.57
> 128   153.82
> 256   304.69
> 512   572.30
> 1024 1003.52
> 2048 1083.89
> 4096 1879.32
> 8192 2785.18
> 16384    3535.77
> 32768    5614.72
> 65536    8113.69
> 131072   9666.74
> 262144  10738.97
> 524288  11247.02
> 1048576 11416.50
> 2097152   989.88
> 4194304   989.88
> 
> 3.0.0, osu_bibw, increased eager RDMA buffer count
> 
> > mpirun -mca btl_openib_eager_rdma_num 32768 -map-by ppr:1:node -np 2 -H 
> > r6,r7 ./osu_bibw
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 1   1.42
> 2   2.84
> 4   5.67
> 8  11.18
> 16 22.46
> 32 

[OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Ben Menadue
Hi,

We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed that 
osu_bibw gives nowhere near the bandwidth I’d expect (this is on FDR IB). 
However, osu_bw is fine.

If I disable eager RDMA, then osu_bibw gives the expected numbers. Similarly, 
if I increase the number of eager RDMA buffers, it gives the expected results.

OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings, but 
they’re not as good as 3.0.0 (when tuned) for large buffers. The same option 
changes produce no difference in performance for 1.10.7.

I was wondering if anyone else has noticed anything similar, and if this is 
unexpected, if anyone has a suggestion on how to investigate further?

Thanks,
Ben


Here are the numbers:

3.0.0, osu_bw, default settings

> mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw
# OSU MPI Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
1   1.13
2   2.29
4   4.63
8   9.21
16 18.18
32 36.46
64 69.95
128   128.55
256   250.74
512   451.54
1024  829.44
2048 1475.87
4096 2119.99
8192 3452.37
16384    2866.51
32768    4048.17
65536    5030.54
131072   5573.81
262144   5861.61
524288   6015.15
1048576  6099.46
2097152   989.82
4194304   989.81

3.0.0, osu_bibw, default settings

> mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
1   0.00
2   0.01
4   0.01
8   0.02
16  0.04
32  0.09
64  0.16
128   135.30
256   265.35
512   499.92
1024  949.22
2048 1440.27
4096 1960.09
8192 3166.97
16384 127.62
32768 165.12
65536 312.80
131072   1120.03
262144   4724.01
524288   4545.93
1048576  5186.51
2097152   989.84
4194304   989.88

3.0.0, osu_bibw, eager RDMA disabled

> mpirun -mca btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 
> ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
1   1.49
2   2.97
4   5.96
8  11.98
16 23.95
32 47.39
64 93.57
128   153.82
256   304.69
512   572.30
1024 1003.52
2048 1083.89
4096 1879.32
8192 2785.18
16384    3535.77
32768    5614.72
65536    8113.69
131072   9666.74
262144  10738.97
524288  11247.02
1048576 11416.50
2097152   989.88
4194304   989.88

3.0.0, osu_bibw, increased eager RDMA buffer count

> mpirun -mca btl_openib_eager_rdma_num 32768 -map-by ppr:1:node -np 2 -H r6,r7 
> ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
1   1.42
2   2.84
4   5.67
8  11.18
16 22.46
32 44.65
64 83.10
128   154.00
256   291.63
512   537.66
1024  942.35
2048 1433.09
4096 2356.40
8192 1998.54
16384    3584.82
32768    5523.08
65536    7717.63
131072   9419.50
262144  10564.77
524288  11104.71
1048576 11130.75
2097152  7943.89
4194304  5270.00

1.10.7, osu_bibw, default settings

> mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.4.0
# Size  Bandwidth (MB/s)
1   1.70
2   3.45
4   6.95
8  13.68
16 27.41
32 53.80
64105.34
128   164.40
256   324.63
512   623.95
1024 1127.35
2048 1784.58
4096 3305.45
8192 3697.55
16384    4935.75
32768    7186.28
65536    8996.94
131072   9301.78
262144   4691.36
524288   7039.18
1048576  7213.33
2097152  9601.41
4194304  9281.31