Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Saliya Ekanayake
Thank you, Nathan. This makes more sense now.

On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm  wrote:

> What Ralph said. You just blow memory on a queue that is not recovered in
> the current implementation.
>
> Also, moving to Allreduce will resolve the issue as now every call is
> effectively also a barrier. I have found with some benchmarks and
> collective implementations it can be faster than reduce anyway. That is why
> it might be worth trying.
>
> -Nathan
>
> > On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake  wrote:
> >
> > Thank you, Nathan. Could you elaborate a bit on what happens internally?
> From your answer it seems the program will still produce the correct
> output at the end but it'll use more resources.
> >
> > On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <
> devel@lists.open-mpi.org> wrote:
> > If you do that it may run out of resources and deadlock or crash. I
> recommend either 1) adding a barrier every 100 iterations, 2) using
> allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2
> is probably the easiest option and depending on how large you run may not
> be any slower than 1 or 3.
> >
> > -Nathan
> >
> > > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake 
> wrote:
> > >
> > > Hi Devs,
> > >
> > > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the
> correct understanding that ranks other than root (0 in this case) will pass
> the collective as soon as their data is written to MPI buffers without
> waiting for all of them to be received at the root?
> > >
> > > If that's the case then what would happen (semantically) if we execute
> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the
> collective multiple times while the root will be processing an earlier
> reduce? For example, the root can be in the first reduce invocation, while
> another rank is in the second reduce invocation.
> > >
> > > Thank you,
> > > Saliya

-- 
Saliya Ekanayake, Ph.D
Postdoctoral Scholar
Performance and Algorithms Research (PAR) Group
Lawrence Berkeley National Laboratory
Phone: 510-486-5772

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Nathan Hjelm via devel
What Ralph said. You just blow memory on a queue that is not recovered in the 
current implementation.

Also, moving to Allreduce will resolve the issue as now every call is 
effectively also a barrier. I have found with some benchmarks and collective 
implementations it can be faster than reduce anyway. That is why it might be 
worth trying.

-Nathan
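
A minimal sketch of that swap, assuming the same iterative sum discussed below (names and types are illustrative; the reduce-based call it replaces is shown commented out):

    #include <mpi.h>

    /* Every rank receives the reduced result, and because no rank can leave
     * the call before all ranks have entered it, the ranks cannot drift more
     * than one iteration apart, so the call is effectively also a barrier. */
    void allreduce_loop(double *local, double *global, int n, int num_iters)
    {
        for (int iter = 0; iter < num_iters; iter++) {
            /* was: MPI_Reduce(local, global, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD); */
            MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }
    }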

> On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake  wrote:
> 
> Thank you, Nathan. Could you elaborate a bit on what happens internally? From 
> your answer it seems the program will still produce the correct output at 
> the end but it'll use more resources. 
> 
> On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel 
>  wrote:
> If you do that it may run out of resources and deadlock or crash. I recommend 
> either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
> enabling coll/sync (which essentially does 1). Honestly, 2 is probably the 
> easiest option and depending on how large you run may not be any slower than 
> 1 or 3.
> 
> -Nathan
> 
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> > 
> > Hi Devs,
> > 
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
> > understanding that ranks other than root (0 in this case) will pass the 
> > collective as soon as their data is written to MPI buffers without waiting 
> > for all of them to be received at the root?
> > 
> > If that's the case then what would happen (semantically) if we execute 
> > MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
> > collective multiple times while the root will be processing an earlier 
> > reduce? For example, the root can be in the first reduce invocation, while 
> > another rank is in the second reduce invocation.
> > 
> > Thank you,
> > Saliya
> > 


Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
Not exactly. The problem is that rank=0 initially falls behind because it is 
doing more work - i.e., it has to receive all the buffers and do something with 
them. As a result, it doesn’t get to post the next allreduce before the 
messages from the other participants arrive - which means that rank=0 has to 
stick those messages into the “unexpected message” queue. As iterations go by, 
the memory consumed by that queue gets bigger and bigger, causing rank=0 to run 
slower and slower…until you run out of memory and it aborts.

Adding the occasional barrier resolves the problem by letting rank=0 catch up 
and release the memory in the “unexpected message” queue.
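
A minimal sketch of that fix, assuming a simple iterative sum onto rank 0 (the interval, names, and types are illustrative):

    #include <mpi.h>

    #define SYNC_INTERVAL 100  /* Nathan's option 1: a barrier every 100 iterations */

    /* Reduce onto rank 0 every iteration; the periodic barrier lets rank 0
       drain its "unexpected message" queue before the other ranks get too
       far ahead of it. */
    void reduce_loop(double *local, double *global, int n, int num_iters)
    {
        for (int iter = 0; iter < num_iters; iter++) {
            MPI_Reduce(local, global, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if ((iter + 1) % SYNC_INTERVAL == 0)
                MPI_Barrier(MPI_COMM_WORLD);
        }
    }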


> On Apr 15, 2019, at 1:33 PM, Saliya Ekanayake  wrote:
> 
> Thank you, Nathan. Could you elaborate a bit on what happens internally? From 
> your answer it seems the program will still produce the correct output at 
> the end but it'll use more resources. 
> 
> On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel 
> <devel@lists.open-mpi.org> wrote:
> If you do that it may run out of resources and deadlock or crash. I recommend 
> either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
> enabling coll/sync (which essentially does 1). Honestly, 2 is probably the 
> easiest option and depending on how large you run may not be any slower than 
> 1 or 3.
> 
> -Nathan
> 
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake wrote:
> > 
> > Hi Devs,
> > 
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
> > understanding that ranks other than root (0 in this case) will pass the 
> > collective as soon as their data is written to MPI buffers without waiting 
> > for all of them to be received at the root?
> > 
> > If that's the case then what would happen (semantically) if we execute 
> > MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
> > collective multiple times while the root will be processing an earlier 
> > reduce? For example, the root can be in the first reduce invocation, while 
> > another rank is in the second reduce invocation.
> > 
> > Thank you,
> > Saliya
> > 


Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Saliya Ekanayake
Thank you, Nathan. Could you elaborate a bit on what happens internally?
From your answer it seems the program will still produce the correct
output at the end but it'll use more resources.

On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <
devel@lists.open-mpi.org> wrote:

> If you do that it may run out of resources and deadlock or crash. I
> recommend either 1) adding a barrier every 100 iterations, 2) using
> allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2
> is probably the easiest option and depending on how large you run may not
> be any slower than 1 or 3.
>
> -Nathan
>
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> >
> > Hi Devs,
> >
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the
> correct understanding that ranks other than root (0 in this case) will pass
> the collective as soon as their data is written to MPI buffers without
> waiting for all of them to be received at the root?
> >
> > If that's the case then what would happen (semantically) if we execute
> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the
> collective multiple times while the root will be processing an earlier
> reduce? For example, the root can be in the first reduce invocation, while
> another rank is in the second reduce invocation.
> >
> > Thank you,
> > Saliya


-- 
Saliya Ekanayake, Ph.D
Postdoctoral Scholar
Performance and Algorithms Research (PAR) Group
Lawrence Berkeley National Laboratory
Phone: 510-486-5772

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
There is a coll/sync component that will automatically inject those barriers 
for you, so you don’t have to add them to your code. Controlled by MCA params:

coll_sync_barrier_before: Do a synchronization before each Nth collective

coll_sync_barrier_after: Do a synchronization after each Nth collective

Ralph
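
For example, one way to enable this from the command line (an illustrative invocation; the interval, process count, and program name are placeholders, and some Open MPI versions may also require raising the component's priority, e.g. via coll_sync_priority):

    mpirun --mca coll_sync_barrier_before 100 -np 64 ./reduce_loop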


> On Apr 15, 2019, at 8:59 AM, Nathan Hjelm via devel 
>  wrote:
> 
> If you do that it may run out of resources and deadlock or crash. I recommend 
> either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
> enabling coll/sync (which essentially does 1). Honestly, 2 is probably the 
> easiest option and depending on how large you run may not be any slower than 
> 1 or 3.
> 
> -Nathan
> 
>> On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
>> 
>> Hi Devs,
>> 
>> When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
>> understanding that ranks other than root (0 in this case) will pass the 
>> collective as soon as their data is written to MPI buffers without waiting 
>> for all of them to be received at the root?
>> 
>> If that's the case then what would happen (semantically) if we execute 
>> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
>> collective multiple times while the root will be processing an earlier 
>> reduce? For example, the root can be in the first reduce invocation, while 
>> another rank is in the second reduce invocation.
>> 
>> Thank you,
>> Saliya
>> 


Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Nathan Hjelm via devel
If you do that it may run out of resources and deadlock or crash. I recommend 
either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
enabling coll/sync (which essentially does 1). Honestly, 2 is probably the 
easiest option and depending on how large you run may not be any slower than 1 
or 3.

-Nathan
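
For reference, a minimal sketch of the pattern asked about in the quoted message below (buffer sizes, iteration counts, and names are illustrative):

    #include <stdlib.h>
    #include <mpi.h>

    /* Rank 0 collects a sum every iteration; there is no other
     * synchronization, so non-root ranks can return from MPI_Reduce and
     * post later iterations while rank 0 is still completing earlier ones. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int n = 1024, num_iters = 100000;
        double *local  = calloc(n, sizeof(double));
        double *global = calloc(n, sizeof(double));

        for (int iter = 0; iter < num_iters; iter++) {
            /* no barrier: results are still correct, but rank 0's
             * "unexpected message" queue can grow without bound */
            MPI_Reduce(local, global, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        }

        free(local); free(global);
        MPI_Finalize();
        return 0;
    }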

> On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> 
> Hi Devs,
> 
> When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
> understanding that ranks other than root (0 in this case) will pass the 
> collective as soon as their data is written to MPI buffers without waiting 
> for all of them to be received at the root?
> 
> If that's the case then what would happen (semantically) if we execute 
> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
> collective multiple times while the root will be processing an earlier 
> reduce? For example, the root can be in the first reduce invocation, while 
> another rank is in the second reduce invocation.
> 
> Thank you,
> Saliya
> 
