Not exactly. The problem is that rank=0 initially falls behind because it is doing more work - i.e., it has to receive all the incoming buffers and apply the reduction operation to them. As a result, it doesn’t get to post the next reduce before the messages from the other participants arrive - which means that rank=0 has to stick those messages into the “unexpected message” queue. As iterations go by, the memory consumed by that queue grows and grows, causing rank=0 to run slower and slower…until it runs out of memory and aborts.
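For concreteness, here is a minimal sketch of the kind of loop being discussed, assuming a simple MPI_SUM reduction; the buffer size, iteration count, and data are illustrative, not taken from the original program. Non-root ranks may return from each MPI_Reduce as soon as their contribution has been handed off, so they can race many iterations ahead of rank 0.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 16;                    /* illustrative buffer size */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) sendbuf[i] = (double)rank;

    /* Reduce in a tight loop with no synchronization: non-root ranks can
     * return from each MPI_Reduce long before rank 0 has processed it, so
     * their next rounds of messages arrive "unexpected" at the root. */
    for (int iter = 0; iter < 100000; iter++) {
        MPI_Reduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        /* rank 0 would consume recvbuf here; the other ranks move straight
         * on to the next iteration. */
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}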
Adding the occasional barrier resolves the problem by letting rank=0 catch up and release the memory in the “unexpected message” queue. (A sketch of that workaround, and of the allreduce alternative, follows the quoted thread below.)

> On Apr 15, 2019, at 1:33 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>
> Thank you, Nathan. Could you elaborate a bit on what happens internally? From your answer it seems the program will still produce the correct output at the end, but it will use more resources.
>
> On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <devel@lists.open-mpi.org> wrote:
> If you do that it may run out of resources and deadlock or crash. I recommend either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2 is probably the easiest option and, depending on how large you run, may not be any slower than 1 or 3.
>
> -Nathan
>
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
> >
> > Hi Devs,
> >
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct understanding that ranks other than the root (0 in this case) will pass the collective as soon as their data is written to MPI buffers, without waiting for all of them to be received at the root?
> >
> > If that's the case, then what would happen (semantically) if we execute MPI_Reduce in a loop without a barrier, allowing non-root ranks to hit the collective multiple times while the root is still processing an earlier reduce? For example, the root can be in the first reduce invocation while another rank is in the second reduce invocation.
> >
> > Thank you,
> > Saliya
> >
> > --
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
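For reference, here is a minimal, self-contained sketch of options 1 and 2 from the list above; the iteration count, buffer size, barrier interval of 100, and use of MPI_SUM are illustrative. Option 3 (the coll/sync component) is configured through Open MPI's MCA parameters rather than application code, so it is not shown here.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 16;
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) sendbuf[i] = (double)rank;

    /* Option 1: keep MPI_Reduce but barrier every 100 iterations so rank 0
     * can drain its unexpected-message queue before the other ranks run
     * ahead again.  The interval is a tuning knob, not a magic number. */
    for (int iter = 0; iter < 100000; iter++) {
        MPI_Reduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (iter % 100 == 99) {
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    /* Option 2: use MPI_Allreduce instead.  Every rank participates in
     * completing each reduction, so no rank can get arbitrarily far ahead
     * of rank 0 and flood it with unexpected messages. */
    for (int iter = 0; iter < 100000; iter++) {
        MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}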