Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Saliya Ekanayake
Thank you, Nathan. This makes more sense now.

On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm wrote:
> What Ralph said. You just blow memory on a queue that is not recovered in the current implementation.
>
> Also, moving to Allreduce will resolve the issue as now every call is effectively also a barrier.

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Nathan Hjelm via devel
What Ralph said. You just blow memory on a queue that is not recovered in the current implementation. Also, moving to Allreduce will resolve the issue, as now every call is effectively also a barrier. I have found with some benchmarks and collective implementations it can be faster than reduce.
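A rough sketch of that swap (a hypothetical fragment, not code from the thread; the buffer names are made up). Inside the same loop, the rooted reduce becomes:

    /* Every rank receives the result, and no rank can complete the call
     * before all ranks have contributed, so nobody races ahead of the root. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);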

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
Not exactly. The problem is that rank=0 initially falls behind because it is doing more work - i.e., it has to receive all the buffers and do something with them. As a result, it doesn’t get to post the next allreduce before the messages from the other participants arrive - which means that rank 0 has to stash those early arrivals on its unexpected-message queue, and that backlog keeps growing.

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Saliya Ekanayake
Thank you, Nathan. Could you elaborate a bit on what happens internally? From your answer it seems the program will still produce the correct output at the end, but it'll use more resources.

On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <devel@lists.open-mpi.org> wrote:
> If you do that it may run out of resources and deadlock or crash. […]

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
There is a coll/sync component that will automatically inject those barriers for you so you don’t have to add them to your code. It is controlled by the MCA params:

coll_sync_barrier_before: Do a synchronization before each Nth collective
coll_sync_barrier_after: Do a synchronization after each Nth collective
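For example, assuming the usual mpirun MCA-parameter syntax, enabling a barrier after every 100th collective might look like the following (the interval of 100 and the executable name are illustrative, and exact defaults can vary between Open MPI versions):

    mpirun --mca coll_sync_barrier_after 100 -np 64 ./reduce_loop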

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Nathan Hjelm via devel
If you do that it may run out of resources and deadlock or crash. I recommend either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2 is probably the easiest option and, depending on how large you run, may not be any slower.
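A minimal sketch of option 1, assuming the simple reduce loop from the original question (niters and the buffer names are placeholders):

    double local = 1.0, global = 0.0;
    for (int i = 0; i < niters; ++i) {
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        /* Let the root catch up every 100 iterations so unexpected messages
         * don't keep piling up on its queue. */
        if ((i + 1) % 100 == 0)
            MPI_Barrier(MPI_COMM_WORLD);
    }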

[OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Saliya Ekanayake
Hi Devs,

When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct understanding that ranks other than root (0 in this case) will pass the collective as soon as their data is written to MPI buffers, without waiting for all of them to be received at the root? If that's the case, then …
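For concreteness, a minimal, hypothetical sketch of the loop pattern being asked about (not code from the original post):

    /* MPI_Reduce in a tight loop, collecting on rank 0.  Non-root ranks can
     * return from MPI_Reduce as soon as their contribution is handed off,
     * so they may race far ahead of the root. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 1.0, global = 0.0;
        for (int i = 0; i < 1000000; ++i) {
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (rank == 0) {
                /* Only the root touches the result, doing extra work and
                 * slowly falling behind the other ranks. */
                local = global / 1000000.0;
            }
        }

        if (rank == 0)
            printf("final global = %f\n", global);

        MPI_Finalize();
        return 0;
    }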