Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Saliya Ekanayake
Thank you, Nathan. This makes more sense now. On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm wrote: > What Ralph said. You just blow memory on a queue that is not recovered in > the current implementation. > > Also, moving to Allreduce will resolve the issue as now every call is > effectively also a barrier. …

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Nathan Hjelm via devel
What Ralph said. You just blow memory on a queue that is not recovered in the current implementation. Also, moving to Allreduce will resolve the issue as now every call is effectively also a barrier. I have found with some benchmarks and collective implementations it can be faster than reduce a…
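The Allreduce substitution described above can be sketched as follows (buffer names, counts, and the iteration count are illustrative, not from the thread). Because MPI_Allreduce delivers the result to every rank, no rank can begin iteration i+1 until all ranks have contributed to iteration i, so each call acts as an implicit barrier and the unexpected-message queue cannot grow without bound:

```c
/* Hedged sketch: a reduction loop using MPI_Allreduce instead of a
 * rooted MPI_Reduce. All names here are illustrative. */
#include <mpi.h>

enum { N = 1024, ITERS = 100000 };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double send[N], recv[N];
    for (int i = 0; i < N; i++)
        send[i] = (double)rank;

    /* Every rank receives the reduced result, so every call
     * implicitly synchronizes all participants. */
    for (int iter = 0; iter < ITERS; iter++)
        MPI_Allreduce(send, recv, N, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

The trade-off Nathan alludes to: an allreduce moves more data than a rooted reduce, but the built-in synchronization can make it no slower in practice on many collective implementations.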

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
Not exactly. The problem is that rank=0 initially falls behind because it is doing more work - i.e., it has to receive all the buffers and do something with them. As a result, it doesn’t get to post the next allreduce before the messages from the other participants arrive - which means that rank …
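The pattern Ralph is describing looks roughly like the following sketch (buffer names and sizes are assumptions, not from the thread). Rank 0, as the root of every reduction, does the final combining work and falls behind, while the other ranks race ahead and their messages for future iterations pile up on rank 0's unexpected-message queue:

```c
/* Hedged sketch of the problematic pattern: a tight loop of rooted
 * reductions with no flow control. All names here are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    double *send = malloc(n * sizeof(double));
    double *recv = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        send[i] = (double)rank;

    /* Nothing stops the non-root ranks from running many iterations
     * ahead of rank 0, which must receive and combine every buffer. */
    for (int iter = 0; iter < 100000; iter++)
        MPI_Reduce(send, recv, n, MPI_DOUBLE, MPI_SUM,
                   0 /* root */, MPI_COMM_WORLD);

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}
```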

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Saliya Ekanayake
Thank you, Nathan. Could you elaborate a bit on what happens internally? From your answer it seems the program will still produce the correct output at the end, but it'll use more resources. On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel < devel@lists.open-mpi.org> wrote: > If you do that …

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
There is a coll/sync component that will automatically inject those barriers for you so you don’t have to add them to your code. Controlled by MCA params:

coll_sync_barrier_before: Do a synchronization before each Nth collective
coll_sync_barrier_after: Do a synchronization after each Nth collective
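One way to enable this from the command line might look like the following (the two barrier params are quoted above; coll_sync_priority is the usual Open MPI per-component priority knob and is an assumption here, as is the application name):

```shell
# Raise coll/sync's priority so it is selected, and have it inject
# a barrier before every 100th collective operation.
mpirun --mca coll_sync_priority 100 \
       --mca coll_sync_barrier_before 100 \
       ./my_app
```

The same parameters can also be set via environment variables or an MCA params file rather than on the mpirun command line.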

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Nathan Hjelm via devel
If you do that it may run out of resources and deadlock or crash. I recommend either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2 is probably the easiest option and, depending on how large you run, may not be any slower …
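Option 1 above (a barrier every 100 iterations) can be sketched as follows; buffer names, sizes, and the iteration count are illustrative. The periodic MPI_Barrier bounds how far any rank can run ahead of the root, which keeps the unexpected-message queue at rank 0 from growing without limit:

```c
/* Hedged sketch: rooted reductions with manual flow control.
 * All names here are illustrative. */
#include <mpi.h>

enum { N = 1024, ITERS = 100000 };

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double send[N], recv[N];
    for (int i = 0; i < N; i++)
        send[i] = (double)rank;

    for (int iter = 0; iter < ITERS; iter++) {
        MPI_Reduce(send, recv, N, MPI_DOUBLE, MPI_SUM,
                   0 /* root */, MPI_COMM_WORLD);

        /* Every 100 iterations, synchronize so no rank can be
         * more than 100 reductions ahead of the root. */
        if (iter % 100 == 99)
            MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

This is the same effect the coll/sync component achieves automatically, without touching the application code.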