Thank you, Nathan. This makes more sense now.
On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm wrote:
> What Ralph said. You just blow memory on a queue that is not recovered in
> the current implementation.
>
> Also, moving to Allreduce will resolve the issue as now every call is
> effectively also
What Ralph said. You just blow memory on a queue that is not recovered in the
current implementation.
Also, moving to Allreduce will resolve the issue as now every call is
effectively also a barrier. I have found with some benchmarks and collective
implementations it can be faster than reduce a
Not exactly. The problem is that rank=0 initially falls behind because it is
doing more work - i.e., it has to receive all the buffers and do something with
them. As a result, it doesn’t get to post the next allreduce before the
messages from the other participants arrive - which means that rank
Thank you, Nathan. Could you elaborate a bit on what happens internally?
From your answer, it seems the program will still produce the correct
output at the end, but it'll use more resources.
On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <
devel@lists.open-mpi.org> wrote:
> If you do tha
There is a coll/sync component that will automatically inject those barriers
for you so you don’t have to add them to your code. It is controlled by MCA params:
coll_sync_barrier_before: Do a synchronization before each Nth collective
coll_sync_barrier_after: Do a synchronization after each Nth collect
If you do that, it may run out of resources and deadlock or crash. I recommend
either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3)
enabling coll/sync (which essentially does 1). Honestly, 2 is probably the
easiest option and depending on how large you run may not be any slowe
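
To make the reasoning behind option 1 concrete, here is a rough toy model in Python of the backlog at the root. It is only a sketch under stated assumptions and does not reflect how Open MPI actually queues messages: the function name peak_backlog, the parameters senders and root_drain, and the idea that a barrier fully drains the backlog are all hypothetical simplifications chosen for illustration.

```python
def peak_backlog(iterations, senders, root_drain, barrier_every=0):
    """Return the largest number of unmatched messages queued at the root.

    Toy model (an assumption, not Open MPI's real behaviour): on every
    iteration each non-root rank sends one message and moves on, while
    the root can match at most `root_drain` messages. A barrier forces
    the senders to wait until the root has caught up completely.
    """
    backlog = 0
    peak = 0
    for i in range(1, iterations + 1):
        backlog += senders                   # sends complete locally, messages pile up at the root
        backlog -= min(backlog, root_drain)  # the root matches what it can this iteration
        peak = max(peak, backlog)
        if barrier_every and i % barrier_every == 0:
            backlog = 0                      # nobody proceeds until the root has drained its queue
    return peak


print(peak_backlog(1000, senders=7, root_drain=5))                     # 2000
print(peak_backlog(1000, senders=7, root_drain=5, barrier_every=100))  # 200
```

In this model the backlog grows with the iteration count when nothing synchronizes the ranks, while a barrier every 100 iterations caps it at what can build up within one 100-iteration window.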
Hi Devs,
When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct
understanding that ranks other than root (0 in this case) will pass the
collective as soon as their data is written to MPI buffers without waiting
for all of them to be received at the root?
If that's the case then