Thank you, Nathan. This makes more sense now.
On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm wrote:
> What Ralph said. You just blow memory on a queue that is not recovered in
> the current implementation.
>
> Also, moving to Allreduce will resolve the issue as now every call is
> effectively also a barrier. I have found with some benchmarks and collective
> implementations it can be faster than reduce.
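A minimal sketch of the swap being suggested here - the loop, counts, and
datatype are illustrative, not taken from the program under discussion:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        double local = 1.0, sum = 0.0;
        for (int i = 0; i < 1000000; i++) {
            /* Every rank receives the result, so no rank can leave the call
             * until all ranks have entered it; the call acts as a barrier and
             * rank 0's unexpected-message queue can no longer grow without
             * bound. */
            MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }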

Not exactly. The problem is that rank=0 initially falls behind because it is
doing more work - i.e., it has to receive all the buffers and do something with
them. As a result, it doesn’t get to post the next allreduce before the
messages from the other participants arrive - which means that rank=0 has to
queue them as unexpected messages, and that queue keeps growing.
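To make that concrete, here is a hypothetical reconstruction of the kind of
loop being discussed (the actual code never appears in the thread, so the
counts and the per-iteration work are made up):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 1.0, sum = 0.0;
        for (int i = 0; i < 1000000; i++) {
            /* Only rank 0 receives the result and does the follow-up work, so
             * it posts its next reduce later than everyone else; the other
             * ranks' contributions pile up in its unexpected-message queue. */
            MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0) {
                /* ... do something with sum ... */
            }
        }

        MPI_Finalize();
        return 0;
    }
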
Thank you, Nathan. Could you elaborate a bit on what happens internally?
From your answer it seems the program will still produce the correct
output at the end, but it'll use more resources.
On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <
devel@lists.open-mpi.org> wrote:
> If you do that it may run out of resources and deadlock or crash.

There is a coll/sync component that will automatically inject those barriers
for you so you don’t have to add them to your code. It is controlled by MCA
params:
coll_sync_barrier_before: Do a synchronization before each Nth collective
coll_sync_barrier_after: Do a synchronization after each Nth collective
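For example, assuming an Open MPI build that still ships the coll/sync
component, enabling it from the command line could look like this (the
priority parameter name follows the usual MCA naming convention and the
interval of 100 is just an example; neither comes from this thread):

    # hypothetical invocation: raise coll/sync's priority so it wraps the
    # active collective module, then barrier after every 100th collective
    mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 100 \
        -n 64 ./reduce_loop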

If you do that it may run out of resources and deadlock or crash. I recommend
either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3)
enabling coll/sync (which essentially does 1). Honestly, 2 is probably the
easiest option and, depending on how large you run, may not be any slower.
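A sketch of option 1, assuming the loop keeps MPI_Reduce (the interval of 100
follows the suggestion above; everything else is illustrative):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        double local = 1.0, sum = 0.0;
        for (int i = 0; i < 1000000; i++) {
            MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            /* Re-synchronize every 100 iterations so the non-root ranks cannot
             * run arbitrarily far ahead of rank 0. */
            if ((i + 1) % 100 == 0)
                MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }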