Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Saliya Ekanayake
Thank you, Nathan. This makes more sense now.

On Tue, Apr 16, 2019 at 6:48 AM Nathan Hjelm  wrote:

> What Ralph said. You just blow memory on a queue that is not recovered in
> the current implementation.
>
> Also, moving to Allreduce will resolve the issue as now every call is
> effectively also a barrier. I have found with some benchmarks and
> collective implementations it can be faster than reduce anyway. That is why
> it might be worth trying.
>
> -Nathan
>
> > On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake  wrote:
> >
> > Thank you, Nathan. Could you elaborate a bit on what happens internally?
> From your answer it seems the program will still produce the correct
> output at the end, but it'll use more resources.
> >
> > On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <
> devel@lists.open-mpi.org> wrote:
> > If you do that it may run out of resources and deadlock or crash. I
> recommend either 1) adding a barrier every 100 iterations, 2) using
> allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2
> is probably the easiest option and, depending on how large you run, it may
> not be any slower than 1 or 3.
> >
> > -Nathan
> >
> > > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake 
> wrote:
> > >
> > > Hi Devs,
> > >
> > > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the
> correct understanding that ranks other than root (0 in this case) will pass
> the collective as soon as their data is written to MPI buffers without
> waiting for all of them to be received at the root?
> > >
> > > If that's the case then what would happen (semantically) if we execute
> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the
> collective multiple times while the root will be processing an earlier
> reduce? For example, the root can be in the first reduce invocation, while
> another rank is in the second reduce invocation.
> > >
> > > Thank you,
> > > Saliya
> > >
> > > --
> > > Saliya Ekanayake, Ph.D
> > > Postdoctoral Scholar
> > > Performance and Algorithms Research (PAR) Group
> > > Lawrence Berkeley National Laboratory
> > > Phone: 510-486-5772
> > >
> >
> >
> > --
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
> >
>
>

-- 
Saliya Ekanayake, Ph.D
Postdoctoral Scholar
Performance and Algorithms Research (PAR) Group
Lawrence Berkeley National Laboratory
Phone: 510-486-5772
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-16 Thread Nathan Hjelm via devel
What Ralph said. You just blow memory on a queue that is not recovered in the 
current implementation.

Also, moving to Allreduce will resolve the issue as now every call is 
effectively also a barrier. I have found with some benchmarks and collective 
implementations it can be faster than reduce anyway. That is why it might be 
worth trying.

-Nathan
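
For illustration, a minimal sketch of the Allreduce variant of such a loop (not
from the thread; the buffer size, iteration count, and reduction operation are
arbitrary placeholders):

#include <mpi.h>
#include <stdio.h>

#define N     1024
#define ITERS 1000

int main(int argc, char **argv)
{
    double local[N], global[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++)
        local[i] = 1.0;

    for (int iter = 0; iter < ITERS; iter++) {
        /* Every rank receives the result, so each iteration effectively
         * synchronizes the ranks; no rank can run many iterations ahead the
         * way non-root ranks can with MPI_Reduce. */
        MPI_Allreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("global[0] = %f after %d iterations\n", global[0], ITERS);

    MPI_Finalize();
    return 0;
}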

> On Apr 15, 2019, at 2:33 PM, Saliya Ekanayake  wrote:
> 
> Thank you, Nathan. Could you elaborate a bit on what happens internally? From 
> your answer it seems the program will still produce the correct output at
> the end, but it'll use more resources.
> 
> On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel 
>  wrote:
> If you do that it may run out of resources and deadlock or crash. I recommend 
> either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
> enabling coll/sync (which essentially does 1). Honestly, 2 is probably the
> easiest option and, depending on how large you run, it may not be any slower
> than 1 or 3.
> 
> -Nathan
> 
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> > 
> > Hi Devs,
> > 
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
> > understanding that ranks other than root (0 in this case) will pass the 
> > collective as soon as their data is written to MPI buffers without waiting 
> > for all of them to be received at the root?
> > 
> > If that's the case then what would happen (semantically) if we execute 
> > MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
> > collective multiple times while the root will be processing an earlier 
> > reduce? For example, the root can be in the first reduce invocation, while 
> > another rank is in the second reduce invocation.
> > 
> > Thank you,
> > Saliya
> > 
> > -- 
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
> > 
> 
> 
> -- 
> Saliya Ekanayake, Ph.D
> Postdoctoral Scholar
> Performance and Algorithms Research (PAR) Group
> Lawrence Berkeley National Laboratory
> Phone: 510-486-5772
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
Not exactly. The problem is that rank=0 initially falls behind because it is 
doing more work - i.e., it has to receive all the buffers and do something with 
them. As a result, it doesn’t get to post the next reduce before the 
messages from the other participants arrive - which means that rank=0 has to 
stick those messages into the “unexpected message” queue. As iterations go by, 
the memory consumed by that queue gets bigger and bigger, causing rank=0 to run 
slower and slower…until you run out of memory and it aborts.

Adding the occasional barrier resolves the problem by letting rank=0 catch up 
and release the memory in the “unexpected message” queue.
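
For illustration, a minimal sketch of the loop with such an occasional barrier
(not from the thread; the buffer size, iteration count, and the interval of 100
are placeholders taken from Nathan's suggestion):

#include <mpi.h>

#define N          1024
#define ITERS      10000
#define SYNC_EVERY 100   /* placeholder interval */

int main(int argc, char **argv)
{
    double local[N], global[N];

    MPI_Init(&argc, &argv);

    for (int i = 0; i < N; i++)
        local[i] = 1.0;

    for (int iter = 0; iter < ITERS; iter++) {
        MPI_Reduce(local, global, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        /* The occasional barrier lets rank 0 catch up and drain the
         * unexpected-message queue before the other ranks race ahead again. */
        if ((iter + 1) % SYNC_EVERY == 0)
            MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The interval trades memory growth against extra synchronization: a smaller
value bounds the queue more tightly, a larger one adds less overhead.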


> On Apr 15, 2019, at 1:33 PM, Saliya Ekanayake  wrote:
> 
> Thank you, Nathan. Could you elaborate a bit on what happens internally? From 
> your answer it seems the program will still produce the correct output at
> the end, but it'll use more resources.
> 
> On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel 
> <devel@lists.open-mpi.org> wrote:
> If you do that it may run out of resources and deadlock or crash. I recommend 
> either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
> enabling coll/sync (which essentially does 1). Honestly, 2 is probably the
> easiest option and, depending on how large you run, it may not be any slower
> than 1 or 3.
> 
> -Nathan
> 
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> > 
> > Hi Devs,
> > 
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
> > understanding that ranks other than root (0 in this case) will pass the 
> > collective as soon as their data is written to MPI buffers without waiting 
> > for all of them to be received at the root?
> > 
> > If that's the case then what would happen (semantically) if we execute 
> > MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
> > collective multiple times while the root will be processing an earlier 
> > reduce? For example, the root can be in the first reduce invocation, while 
> > another rank is in the second reduce invocation.
> > 
> > Thank you,
> > Saliya
> > 
> > -- 
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
> > 
> 
> 
> 
> -- 
> Saliya Ekanayake, Ph.D
> Postdoctoral Scholar
> Performance and Algorithms Research (PAR) Group
> Lawrence Berkeley National Laboratory
> Phone: 510-486-5772
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Saliya Ekanayake
Thank you, Nathan. Could you elaborate a bit on what happens internally?
From your answer it seems the program will still produce the correct
output at the end, but it'll use more resources.

On Mon, Apr 15, 2019 at 9:00 AM Nathan Hjelm via devel <
devel@lists.open-mpi.org> wrote:

> If you do that it may run out of resources and deadlock or crash. I
> recommend either 1) adding a barrier every 100 iterations, 2) using
> allreduce, or 3) enabling coll/sync (which essentially does 1). Honestly, 2
> is probably the easiest option and, depending on how large you run, it may
> not be any slower than 1 or 3.
>
> -Nathan
>
> > On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> >
> > Hi Devs,
> >
> > When doing MPI_Reduce in a loop (collecting on Rank 0), is it the
> correct understanding that ranks other than root (0 in this case) will pass
> the collective as soon as their data is written to MPI buffers without
> waiting for all of them to be received at the root?
> >
> > If that's the case then what would happen (semantically) if we execute
> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the
> collective multiple times while the root will be processing an earlier
> reduce? For example, the root can be in the first reduce invocation, while
> > another rank is in the second reduce invocation.
> >
> > Thank you,
> > Saliya
> >
> > --
> > Saliya Ekanayake, Ph.D
> > Postdoctoral Scholar
> > Performance and Algorithms Research (PAR) Group
> > Lawrence Berkeley National Laboratory
> > Phone: 510-486-5772
> >
>


-- 
Saliya Ekanayake, Ph.D
Postdoctoral Scholar
Performance and Algorithms Research (PAR) Group
Lawrence Berkeley National Laboratory
Phone: 510-486-5772
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Ralph H Castain
There is a coll/sync component that will automatically inject those barriers 
for you so you don’t have to add them to your code. Controlled by MCA params:

coll_sync_barrier_before: Do a synchronization before each Nth collective

coll_sync_barrier_after: Do a synchronization after each Nth collective

Ralph
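
For example, one way to set these on the command line (illustrative; ./my_app,
the process count, and the interval of 100 are placeholders, and depending on
the Open MPI version the coll/sync component may also need to be explicitly
enabled):

mpirun --mca coll_sync_barrier_after 100 -np 64 ./my_app

The same parameters can also be set through the environment, e.g.
OMPI_MCA_coll_sync_barrier_after=100, or in an MCA parameter file.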


> On Apr 15, 2019, at 8:59 AM, Nathan Hjelm via devel 
>  wrote:
> 
> If you do that it may run out of resources and deadlock or crash. I recommend 
> either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
> enabling coll/sync (which essentially does 1). Honestly, 2 is probably the
> easiest option and, depending on how large you run, it may not be any slower
> than 1 or 3.
> 
> -Nathan
> 
>> On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
>> 
>> Hi Devs,
>> 
>> When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
>> understanding that ranks other than root (0 in this case) will pass the 
>> collective as soon as their data is written to MPI buffers without waiting 
>> for all of them to be received at the root?
>> 
>> If that's the case then what would happen (semantically) if we execute 
>> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
>> collective multiple times while the root will be processing an earlier 
>> reduce? For example, the root can be in the first reduce invocation, while 
>> another rank is in the second reduce invocation.
>> 
>> Thank you,
>> Saliya
>> 
>> -- 
>> Saliya Ekanayake, Ph.D
>> Postdoctoral Scholar
>> Performance and Algorithms Research (PAR) Group
>> Lawrence Berkeley National Laboratory
>> Phone: 510-486-5772
>> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Nathan Hjelm via devel
If you do that it may run out of resources and deadlock or crash. I recommend 
either 1) adding a barrier every 100 iterations, 2) using allreduce, or 3) 
enabling coll/sync (which essentially does 1). Honestly, 2 is probably the
easiest option and, depending on how large you run, it may not be any slower
than 1 or 3.

-Nathan

> On Apr 15, 2019, at 9:53 AM, Saliya Ekanayake  wrote:
> 
> Hi Devs,
> 
> When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct 
> understanding that ranks other than root (0 in this case) will pass the 
> collective as soon as their data is written to MPI buffers without waiting 
> for all of them to be received at the root?
> 
> If that's the case then what would happen (semantically) if we execute 
> MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the 
> collective multiple times while the root will be processing an earlier 
> reduce? For example, the root can be in the first reduce invocation, while 
> another rank is in the second reduce invocation.
> 
> Thank you,
> Saliya
> 
> -- 
> Saliya Ekanayake, Ph.D
> Postdoctoral Scholar
> Performance and Algorithms Research (PAR) Group
> Lawrence Berkeley National Laboratory
> Phone: 510-486-5772
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


[OMPI devel] MPI Reduce Without a Barrier

2019-04-15 Thread Saliya Ekanayake
Hi Devs,

When doing MPI_Reduce in a loop (collecting on Rank 0), is it the correct
understanding that ranks other than root (0 in this case) will pass the
collective as soon as their data is written to MPI buffers without waiting
for all of them to be received at the root?

If that's the case then what would happen (semantically) if we execute
MPI_Reduce in a loop without a barrier allowing non-root ranks to hit the
collective multiple times while the root will be processing an earlier
reduce? For example, the root can be in the first reduce invocation, while
another rank is in the second reduce invocation.
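
A minimal sketch of the pattern in question (illustrative only; the buffer
size, iteration count, and reduction operation are arbitrary):

#include <mpi.h>

#define N     1024
#define ITERS 10000

int main(int argc, char **argv)
{
    double local[N], global[N];

    MPI_Init(&argc, &argv);

    for (int i = 0; i < N; i++)
        local[i] = 1.0;

    for (int iter = 0; iter < ITERS; iter++) {
        /* No synchronization between iterations: non-root ranks may return
         * from MPI_Reduce and start the next iteration while rank 0 is
         * still completing an earlier one. */
        MPI_Reduce(local, global, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}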

Thank you,
Saliya

-- 
Saliya Ekanayake, Ph.D
Postdoctoral Scholar
Performance and Algorithms Research (PAR) Group
Lawrence Berkeley National Laboratory
Phone: 510-486-5772
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel