Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Nathan Hjelm via users
> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users wrote: > Nathan, Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations on both Aries and InfiniBand systems: 1) There seems to be a significant …

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Nathan Hjelm via users
I will try to take a look at it today. -Nathan > On May 9, 2019, at 12:37 AM, Joseph Schuchart via users wrote: > Nathan, Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations on both Aries and InfiniBand systems …

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Joseph Schuchart via users
Benson, I just gave 4.0.1 a shot and the behavior is the same (the reason I'm stuck with 3.1.2 is a regression with `osc_rdma_acc_single_intrinsic` on 4.0 [1]). The IB cluster has both Mellanox ConnectX-3 (w/ Haswell CPU) and ConnectX-4 (w/ Skylake CPU) nodes; the effect is visible on both …

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Benson Muite via users
Hi, Have you tried anything with Open MPI 4.0.1? What are the specifications of the InfiniBand system you are using? Benson On 5/9/19 9:37 AM, Joseph Schuchart via users wrote: Nathan, Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations …

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-08 Thread Joseph Schuchart via users
Nathan, Over the last couple of weeks I made some more interesting observations regarding the latencies of accumulate operations on both Aries and InfiniBand systems: 1) There seems to be a significant difference between 64-bit and 32-bit operations: on Aries, the average latency for compare-exchange …
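For readers who want to reproduce this kind of comparison, the sketch below times MPI_Compare_and_swap on a 32-bit versus a 64-bit target inside a passive-target epoch. The window setup, iteration count, and choice of measuring rank are assumptions for illustration; this is not the benchmark used in the thread.

```
/* Sketch: average MPI_Compare_and_swap latency for 32-bit vs. 64-bit targets.
 * Iteration count, window layout and the reporting rank are illustrative. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define ITER 10000

static double time_cas(MPI_Win win, MPI_Datatype dtype, int target, int rank)
{
    /* 64-bit buffers are large enough for either datatype */
    uint64_t origin = 1, compare = 0, result = 0;
    double t = MPI_Wtime();
    if (rank != target) {
        for (int i = 0; i < ITER; i++) {
            MPI_Compare_and_swap(&origin, &compare, &result, dtype, target, 0, win);
            MPI_Win_flush(target, win);
        }
    }
    return (MPI_Wtime() - t) / ITER;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    uint64_t *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0;                          /* initialize local window memory */
    MPI_Barrier(MPI_COMM_WORLD);        /* make sure init is done before access */

    MPI_Win_lock_all(0, win);
    double t32 = time_cas(win, MPI_UINT32_T, 0, rank);
    double t64 = time_cas(win, MPI_UINT64_T, 0, rank);
    MPI_Win_unlock_all(win);

    if (rank == 1)
        printf("avg CAS latency: 32-bit %.2f us, 64-bit %.2f us\n",
               t32 * 1e6, t64 * 1e6);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```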

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Nathan Hjelm via users
Ok, then it sounds like a regression. I will try to track it down today or tomorrow. -Nathan On Nov 08, 2018, at 01:41 PM, Joseph Schuchart wrote: Sorry for the delay, I wanted to make sure that I test the same version on both Aries and IB: git master bbe5da4. I realized that I had previously tested with 3.1.3 on the IB cluster …

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Joseph Schuchart
Sorry for the delay; I wanted to make sure that I test the same version on both Aries and IB: git master bbe5da4. I realized that I had previously tested with 3.1.3 on the IB cluster, which ran fine. If I use the same version I run into the same problem on both systems (with --mca btl_openib_al…

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Nathan Hjelm via users
Quick scan of the program and it looks ok to me. I will dig deeper and see if I can determine the underlying cause. What Open MPI version are you using? -Nathan On Nov 08, 2018, at 11:10 AM, Joseph Schuchart wrote: While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out …

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Joseph Schuchart
While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out: it appears that on the Aries network the accumulate operations are not atomic anymore. I am attaching a test program that shows the problem: all processes but one continuously …
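The attached test program is not reproduced in the archive, but an atomicity check of the kind described could look like the sketch below: all ranks except the last repeatedly add 1 to a 64-bit counter on rank 0, and the final value is compared against the expected total. The structure and iteration count are assumptions for illustration, not the author's actual test.

```
/* Sketch of an atomicity check: lost updates show up as a counter value
 * below the expected total. Illustrative only. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define INCS_PER_RANK 100000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint64_t *counter;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    if (rank == 0) *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_lock_all(0, win);
    if (rank != size - 1) {             /* all processes but one increment */
        uint64_t one = 1, prev;
        for (int i = 0; i < INCS_PER_RANK; i++) {
            MPI_Fetch_and_op(&one, &prev, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
    }
    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        uint64_t expected = (uint64_t)(size - 1) * INCS_PER_RANK;
        /* read the local window memory inside an epoch on our own window */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        uint64_t final = *counter;
        MPI_Win_unlock(0, win);
        printf("counter = %llu, expected = %llu -> %s\n",
               (unsigned long long)final, (unsigned long long)expected,
               final == expected ? "atomic" : "updates lost");
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```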

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart
Thanks a lot for the quick reply; setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2 µs per operation. I hope that the info key will make it into the next version of the standard; I certainly have use for it :) Cheers, Joseph
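For illustration only: an info-key version of such a hint would be passed per window at creation time rather than globally through an MCA parameter. The key name used below is an assumption made for this sketch; it is not a standardized key, nor one confirmed anywhere in this thread.

```
/* Hypothetical illustration of requesting the optimization per window via an
 * info key at window creation. The key name "acc_single_intrinsic" is an
 * assumption for illustration only. */
#include <mpi.h>
#include <stdint.h>

static MPI_Win create_counter_window(MPI_Comm comm, uint64_t **base)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* hint: only single-element, same-op atomics will be used on this window */
    MPI_Info_set(info, "acc_single_intrinsic", "true");

    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), info, comm,
                     base, &win);
    MPI_Info_free(&info);
    return win;
}
```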

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Nathan Hjelm via users
All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter: osc_rdma_acc_single_intrinsic
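An MCA parameter like this is typically set on the mpirun command line or through Open MPI's environment-variable convention; a brief example (the executable name is a placeholder):

```
# set per run on the command line
mpirun --mca osc_rdma_acc_single_intrinsic true ./osc_benchmark

# or via Open MPI's OMPI_MCA_ environment-variable prefix
export OMPI_MCA_osc_rdma_acc_single_intrinsic=true
```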

[OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart
All, I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet: ``` if (rank == 1) { uint64_t re…
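The snippet is cut off in the archive listing; the sketch below shows a self-contained fetch-and-op latency loop of the general shape being described. The window setup, iteration count, and the exclusive-lock epoch are assumptions for illustration, not the original benchmark code.

```
/* Sketch of a fetch-and-op latency loop: rank 1 repeatedly performs an
 * atomic add on a 64-bit counter owned by rank 0 and reports the average
 * per-operation time. Illustrative only. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define ITER 10000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    uint64_t *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    if (rank == 0) *base = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        uint64_t result, one = 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        double start = MPI_Wtime();
        for (int i = 0; i < ITER; i++) {
            /* atomically add 1 to the counter on rank 0, fetch the old value */
            MPI_Fetch_and_op(&one, &result, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
        double avg = (MPI_Wtime() - start) / ITER;
        MPI_Win_unlock(0, win);
        printf("avg MPI_Fetch_and_op latency: %.2f us\n", avg * 1e6);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```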