Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Gilles Gouaillardet
Samuel,

The default MPI library on the K computer is Fujitsu MPI, and yes, it
is based on Open MPI.
/* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
From a support perspective, this should be reported to the HPCI
helpdesk: http://www.hpci-office.jp/pages/e_support

As far as I understand, the Fujitsu MPI currently available on K is not
based on the latest Open MPI.
I suspect most of the time is spent trying to find the new
communicator ID (CID) when a communicator is created (vs. figuring out
the new ranks).
IIRC, on older versions of Open MPI, that was implemented with as many
MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest common
unused CID for the newly created communicator.

So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
your program, only one MPI_Allreduce() should be involved per
MPI_Comm_dup().
But if you do the same thing in the middle of your run, after each
rank has ended up with a different lowest unused CID, the performance
can be (much) worse.
If I understand your description of the issue correctly, that would
explain the performance discrepancy between static and dynamic
communicator creation times.
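
For illustration, a rough sketch of that kind of CID search (simplified,
not the actual Open MPI code; the real loop differs in detail, and
cid_is_locally_free()/cid_mark_used() are hypothetical helpers):

#include <mpi.h>

extern int  cid_is_locally_free(int cid);  /* hypothetical helper */
extern void cid_mark_used(int cid);        /* hypothetical helper */

/* Agree on a CID that is unused on every rank of 'parent'.  Each
 * iteration costs collectives on the parent communicator; the further
 * apart the ranks' lowest free CIDs are, the more iterations it takes. */
static int allocate_cid(MPI_Comm parent)
{
    int start = 0;
    for (;;) {
        int candidate = start;
        while (!cid_is_locally_free(candidate))   /* lowest free CID on this rank */
            candidate++;

        int agreed;
        MPI_Allreduce(&candidate, &agreed, 1, MPI_INT, MPI_MAX, parent);

        int ok_local = cid_is_locally_free(agreed), ok_all;
        MPI_Allreduce(&ok_local, &ok_all, 1, MPI_INT, MPI_LAND, parent);

        if (ok_all) {
            cid_mark_used(agreed);
            return agreed;                        /* free everywhere: done */
        }
        start = agreed + 1;                       /* taken somewhere: search upward */
    }
}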

fwiw, this part has been (highly) improved in the latest releases of Open MPI.

If your benchmark is available for download, could you please post a link?


Cheers,

Gilles

On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams  wrote:
> Some of my collaborators have had issues with one of my benchmarks at high 
> concurrency (82K MPI procs) on the K machine in Japan.  I believe K uses 
> OpenMPI and the issues has been tracked to time in MPI_Comm_dup/Comm_split 
> increasing quadratically with process concurrency.  At 82K processes, each 
> call to dup/split is taking 15s to complete.  These high times restrict 
> comm_split/dup to be used statically (at the beginning) and not dynamically 
> in an application.
>
> I had a similar issue a few years ago on ANL/Mira/MPICH where they called 
> qsort to split the ranks.  Although qsort/quicksort has ideal computational 
> complexity of O(PlogP)  [P is the number of MPI ranks], it can have worst 
> case complexity of O(P^2)... at 82K, P/logP is a 5000x slowdown.
>
> Can you confirm whether qsort (or the like) is (still) used in these routines 
> in OpenMPI?  It seems mergesort (worst case complexity of PlogP) would be a 
> more scalable approach.  I have not observed this issue on the Cray MPICH 
> implementation and the Mira MPICH issues has since been resolved.
>
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Samuel Williams
I'll ask my collaborators if they've submitted a ticket.
(they have the accounts; built the code; ran the code; observed the issues)

I believe the issue on MPICH was a qsort issue and not an Allreduce issue.
Coupled with the fact that qsort appears to be called in ompi_comm_split
(https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
I wanted to raise the issue so that it can be investigated whether users can
naively blunder into worst-case computational complexity.

We've been running hpgmg-fv (not -fe).  They were using the flux variants
(which require a local.mk that builds operators.flux.c instead of
operators.fv4.c), and they are a couple of commits behind.  Regardless, this
issue has persisted on K for several years.  By default, hpgmg-fv builds
log(N) subcommunicators, where N is the problem size.  Weak scaling
experiments have shown comm_split/dup times growing consistent with the
worst-case complexity.  That being said, AMR codes might rebuild the
subcommunicators as they regrid/adapt.
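
To make the concern concrete, here is a rough sketch (not the actual
ompi_comm_split code) of the kind of (color, key) sort a comm_split has to
perform after gathering one entry per rank:

#include <stdlib.h>

typedef struct { int color; int key; int rank; } split_entry_t;

/* Order by color, then key, then original rank as the tie-breaker. */
static int split_cmp(const void *a, const void *b)
{
    const split_entry_t *x = a, *y = b;
    if (x->color != y->color) return (x->color > y->color) - (x->color < y->color);
    if (x->key   != y->key)   return (x->key   > y->key)   - (x->key   < y->key);
    return (x->rank > y->rank) - (x->rank < y->rank);
}

/* entries[] holds P triples, one per rank (e.g. from an allgather).  The C
 * standard makes no complexity guarantee for qsort(): a plain quicksort can
 * degrade toward O(P^2) on unfavourable inputs, whereas a mergesort stays
 * O(P log P) in the worst case.  At P = 82K, P / log2(P) is roughly
 * 82000 / 16.3, i.e. the ~5000x slowdown quoted above. */
void sort_split_entries(split_entry_t *entries, int nprocs)
{
    qsort(entries, (size_t)nprocs, sizeof(*entries), split_cmp);
}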








> On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet 
>  wrote:
> 
> Samuel,
> 
> The default MPI library on the K computer is Fujitsu MPI, and yes, it
> is based on Open MPI.
> /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> From a support perspective, this should be reported to the HPCI
> helpdesk http://www.hpci-office.jp/pages/e_support
> 
> As far as i understand, Fujitsu MPI currently available on K is not
> based on the latest Open MPI.
> I suspect most of the time is spent trying to find the new
> communicator ID (CID) when a communicator is created (vs figuring out
> the new ranks)
> iirc, on older versions of Open MPI, that was implemented with as many
> MPI_Allreduce(MPI_MAX) as needed to figure out the smallest common
> unused CID for the newly created communicator.
> 
> So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> your program, only one MPI_Allreduce() should be involved per
> MPI_Comm_dup().
> But if you do the same thing in the middle of your run, and after each
> rank has a different lower unused CID, the performances can be (much)
> worst.
> If i understand correctly your description of the issue, that would
> explain the performance discrepancy between static vs dynamic
> communicator creation time.
> 
> fwiw, this part has been (highly) improved in the latest releases of Open MPI.
> 
> If your benchmark is available for download, could you please post a link ?
> 
> 
> Cheers,
> 
> Gilles
> 
> On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams  wrote:
>> Some of my collaborators have had issues with one of my benchmarks at high 
>> concurrency (82K MPI procs) on the K machine in Japan.  I believe K uses 
>> OpenMPI and the issues has been tracked to time in MPI_Comm_dup/Comm_split 
>> increasing quadratically with process concurrency.  At 82K processes, each 
>> call to dup/split is taking 15s to complete.  These high times restrict 
>> comm_split/dup to be used statically (at the beginning) and not dynamically 
>> in an application.
>> 
>> I had a similar issue a few years ago on ANL/Mira/MPICH where they called 
>> qsort to split the ranks.  Although qsort/quicksort has ideal computational 
>> complexity of O(PlogP)  [P is the number of MPI ranks], it can have worst 
>> case complexity of O(P^2)... at 82K, P/logP is a 5000x slowdown.
>> 
>> Can you confirm whether qsort (or the like) is (still) used in these 
>> routines in OpenMPI?  It seems mergesort (worst case complexity of PlogP) 
>> would be a more scalable approach.  I have not observed this issue on the 
>> Cray MPICH implementation and the Mira MPICH issues has since been resolved.
>> 
>> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread George Bosilca
Samuel,

You are right, we use qsort to sort the keys, but the qsort only applies to
participants with the same color. So the complexity of the qsort should only
hit its worst case when most of the processes participate with the same
color.

What I think is OMPI's problem in this area is the selection of the next CID
for the newly created communicator. We do the selection of the CID on the
original communicator, and this accounts for a significant increase in the
duration, as we need to iterate longer to converge to a common CID.

We haven't made any improvements in this area in the last few years; we
simply transformed the code to use non-blocking communication instead of
blocking, but that has little impact on the performance of the split itself.

  George.


On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams  wrote:

> I'll ask my collaborators if they've submitted a ticket.
> (they have the accounts; built the code; ran the code; observed the issues)
>
> I believe the issue on MPICH was a qsort issue and not a Allreduce issue.
> When this is coupled with the fact that it looked like qsort is called in
> ompi_comm_split (https://github.com/open-mpi/ompi/blob/
> a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c), I
> wanted to raise the issue so that it may be investigated to understand
> whether users can naively blunder into worst case computational complexity
> issues.
>
> We've been running hpgmg-fv (not -fe).  They were using the flux variants
> (requires local.mk build operators.flux.c instead of operators.fv4.c) and
> they are a couple commits behind.  Regardless, this issue has persisted on
> K for several years.  By default, it will build log(N) subcommunicators
> where N is the problem size.  Weak scaling experiments has shown
> comm_split/dup times growing consistently with worst case complexity.  That
> being said, AMR codes might rebuild the sub communicators as they
> regrid/adapt.
>
>
>
>
>
>
>
>
> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> > Samuel,
> >
> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
> > is based on Open MPI.
> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> > From a support perspective, this should be reported to the HPCI
> > helpdesk http://www.hpci-office.jp/pages/e_support
> >
> > As far as i understand, Fujitsu MPI currently available on K is not
> > based on the latest Open MPI.
> > I suspect most of the time is spent trying to find the new
> > communicator ID (CID) when a communicator is created (vs figuring out
> > the new ranks)
> > iirc, on older versions of Open MPI, that was implemented with as many
> > MPI_Allreduce(MPI_MAX) as needed to figure out the smallest common
> > unused CID for the newly created communicator.
> >
> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> > your program, only one MPI_Allreduce() should be involved per
> > MPI_Comm_dup().
> > But if you do the same thing in the middle of your run, and after each
> > rank has a different lower unused CID, the performances can be (much)
> > worst.
> > If i understand correctly your description of the issue, that would
> > explain the performance discrepancy between static vs dynamic
> > communicator creation time.
> >
> > fwiw, this part has been (highly) improved in the latest releases of
> Open MPI.
> >
> > If your benchmark is available for download, could you please post a
> link ?
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams 
> wrote:
> >> Some of my collaborators have had issues with one of my benchmarks at
> high concurrency (82K MPI procs) on the K machine in Japan.  I believe K
> uses OpenMPI and the issues has been tracked to time in
> MPI_Comm_dup/Comm_split increasing quadratically with process concurrency.
> At 82K processes, each call to dup/split is taking 15s to complete.  These
> high times restrict comm_split/dup to be used statically (at the beginning)
> and not dynamically in an application.
> >>
> >> I had a similar issue a few years ago on ANL/Mira/MPICH where they
> called qsort to split the ranks.  Although qsort/quicksort has ideal
> computational complexity of O(PlogP)  [P is the number of MPI ranks], it
> can have worst case complexity of O(P^2)... at 82K, P/logP is a 5000x
> slowdown.
> >>
> >> Can you confirm whether qsort (or the like) is (still) used in these
> routines in OpenMPI?  It seems mergesort (worst case complexity of PlogP)
> would be a more scalable approach.  I have not observed this issue on the
> Cray MPICH implementation and the Mira MPICH issues has since been resolved.
> >>
> >>

Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Samuel Williams
Although splitting into sqrt(P) teams of sqrt(P) is common for 2D SpMV (or 3D
for LU/SparseLU), splitting by coarse-grained parallelism (e.g. multiple
concurrent solves on multiple RHS) is also possible (a rough sketch of such a
split is at the end of this message).  Here, the teams are

On Nov 7, 2017, at 9:23 AM, George Bosilca  wrote:
> 
> Samuel,
> 
> You are right, we use qsort to sort the keys, but the qsort only applies on 
> participants with the same color. So while the complexity of the qsort might 
> reach bottom only when most of the processes participate with the same color.
> 
> What I think is OMPI problem in this are is the selection of the next cid for 
> the newly created communicator. We are doing the selection of the cid on the 
> original communicator, and this basically counts for a significant increase 
> in the duration, as will need to iterate a longer to converge to a common cid.
> 
> We haven't made any improvement in this area for the last few years, we 
> simply transformed the code to use non-blocking communications instead of 
> blocking, but this has little impact on the performance of the split itself.
> 
>   George.
> 
> 
> On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams  wrote:
> I'll ask my collaborators if they've submitted a ticket.
> (they have the accounts; built the code; ran the code; observed the issues)
> 
> I believe the issue on MPICH was a qsort issue and not a Allreduce issue.  
> When this is coupled with the fact that it looked like qsort is called in 
> ompi_comm_split 
> (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
>  I wanted to raise the issue so that it may be investigated to understand 
> whether users can naively blunder into worst case computational complexity 
> issues.
> 
> We've been running hpgmg-fv (not -fe).  They were using the flux variants 
> (requires local.mk build operators.flux.c instead of operators.fv4.c) and 
> they are a couple commits behind.  Regardless, this issue has persisted on K 
> for several years.  By default, it will build log(N) subcommunicators where N 
> is the problem size.  Weak scaling experiments has shown comm_split/dup times 
> growing consistently with worst case complexity.  That being said, AMR codes 
> might rebuild the sub communicators as they regrid/adapt.
> 
> 
> 
> 
> 
> 
> 
> 
> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet 
> >  wrote:
> >
> > Samuel,
> >
> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
> > is based on Open MPI.
> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> > From a support perspective, this should be reported to the HPCI
> > helpdesk http://www.hpci-office.jp/pages/e_support
> >
> > As far as i understand, Fujitsu MPI currently available on K is not
> > based on the latest Open MPI.
> > I suspect most of the time is spent trying to find the new
> > communicator ID (CID) when a communicator is created (vs figuring out
> > the new ranks)
> > iirc, on older versions of Open MPI, that was implemented with as many
> > MPI_Allreduce(MPI_MAX) as needed to figure out the smallest common
> > unused CID for the newly created communicator.
> >
> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> > your program, only one MPI_Allreduce() should be involved per
> > MPI_Comm_dup().
> > But if you do the same thing in the middle of your run, and after each
> > rank has a different lower unused CID, the performances can be (much)
> > worst.
> > If i understand correctly your description of the issue, that would
> > explain the performance discrepancy between static vs dynamic
> > communicator creation time.
> >
> > fwiw, this part has been (highly) improved in the latest releases of Open 
> > MPI.
> >
> > If your benchmark is available for download, could you please post a link ?
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams  wrote:
> >> Some of my collaborators have had issues with one of my benchmarks at high 
> >> concurrency (82K MPI procs) on the K machine in Japan.  I believe K uses 
> >> OpenMPI and the issues has been tracked to time in MPI_Comm_dup/Comm_split 
> >> increasing quadratically with process concurrency.  At 82K processes, each 
> >> call to dup/split is taking 15s to complete.  These high times restrict 
> >> comm_split/dup to be used statically (at the beginning) and not 
> >> dynamically in an application.
> >>
> >> I had a similar issue a few years ago on ANL/Mira/MPICH where they called 
> >> qsort to split the ranks.  Although qsort/quicksort has ideal 
> >> computational complexity of O(PlogP)  [P is the number of MPI ranks], it 
> >> can have worst case complexity of O(P^2)... at 82K, P/logP is a 5000x 
> >> slowdown.
> >>
> >> Can you confirm whether qsort (or the like) is (still) used in these 
> >> routines in OpenMPI?  It seems mergesort (worst case complexity of PlogP) 
> >> would be a more scalable approach.  I have not observed this issue on the 
>
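
For reference (as mentioned above), a minimal illustration of the
sqrt(P)-team split, with hypothetical naming and assuming P is a perfect
square:

#include <math.h>
#include <mpi.h>

/* Group P ranks into sqrt(P) teams of sqrt(P) ranks each with a single
 * MPI_Comm_split; color picks the team, key orders ranks inside it. */
MPI_Comm make_team_comm(MPI_Comm world)
{
    int rank, size;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);

    int team_size = (int)lround(sqrt((double)size));
    int color = rank / team_size;   /* which team this rank joins */
    int key   = rank % team_size;   /* rank order within the team */

    MPI_Comm team;
    MPI_Comm_split(world, color, key, &team);
    return team;
}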

Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Edgar Gabriel
My guess would be that both aspects (sorting + CID allocation) could be 
a problem.


A long time back there was an effort to convert the sequence of allgather +
qsort into a distributed sort (based on a paper by Moody et al., where he
demonstrated the benefits of this approach).  Unfortunately we didn't get to
implement that part due to timing constraints.


Also, we used to have a block-allocation algorithm for CIDs that did save
quite a bit in communication costs, but we had to abandon it because the
implementation was wasting CIDs by not properly reusing them. It is fairly
clear how to implement it correctly; it just hasn't been done for lack of
resources.
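
Purely as an illustration of the block-allocation idea (this is not the
abandoned Open MPI code; agree_on_cid_block() stands in for whatever
collective agreement would be used), the shape would be roughly:

#include <mpi.h>

#define CID_BLOCK_SIZE 64

/* Hypothetical collective: returns the first CID of a block of 'count'
 * contiguous CIDs that is free on every rank of 'comm'. */
extern int agree_on_cid_block(MPI_Comm comm, int count);

static int pool_next = -1, pool_remaining = 0;

/* Hand out CIDs from a locally cached block; only when the block is
 * exhausted do we pay for another round of collective agreement, so most
 * communicator creations need no communication for the CID.  The part
 * that was not handled properly before: freed CIDs should be returned
 * for reuse instead of being lost. */
int next_cid(MPI_Comm comm)
{
    if (pool_remaining == 0) {
        pool_next = agree_on_cid_block(comm, CID_BLOCK_SIZE);
        pool_remaining = CID_BLOCK_SIZE;
    }
    pool_remaining--;
    return pool_next++;
}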


Maybe we should revisit those items at some point.
Thanks
Edgar

On 11/7/2017 10:23 AM, George Bosilca wrote:

Samuel,

You are right, we use qsort to sort the keys, but the qsort only 
applies on participants with the same color. So while the complexity 
of the qsort might reach bottom only when most of the processes 
participate with the same color.


What I think is OMPI problem in this are is the selection of the next 
cid for the newly created communicator. We are doing the selection of 
the cid on the original communicator, and this basically counts for a 
significant increase in the duration, as will need to iterate a longer 
to converge to a common cid.


We haven't made any improvement in this area for the last few years, 
we simply transformed the code to use non-blocking communications 
instead of blocking, but this has little impact on the performance of 
the split itself.


  George.


On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams > wrote:


I'll ask my collaborators if they've submitted a ticket.
(they have the accounts; built the code; ran the code; observed
the issues)

I believe the issue on MPICH was a qsort issue and not a Allreduce
issue.  When this is coupled with the fact that it looked like
qsort is called in ompi_comm_split

(https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c

),
I wanted to raise the issue so that it may be investigated to
understand whether users can naively blunder into worst case
computational complexity issues.

We've been running hpgmg-fv (not -fe).  They were using the flux
variants (requires local.mk  build
operators.flux.c instead of operators.fv4.c) and they are a couple
commits behind.  Regardless, this issue has persisted on K for
several years.  By default, it will build log(N) subcommunicators
where N is the problem size.  Weak scaling experiments has shown
comm_split/dup times growing consistently with worst case
complexity.  That being said, AMR codes might rebuild the sub
communicators as they regrid/adapt.








> On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet
mailto:gilles.gouaillar...@gmail.com>> wrote:
>
> Samuel,
>
> The default MPI library on the K computer is Fujitsu MPI, and
yes, it
> is based on Open MPI.
> /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> From a support perspective, this should be reported to the HPCI
> helpdesk http://www.hpci-office.jp/pages/e_support

>
> As far as i understand, Fujitsu MPI currently available on K is not
> based on the latest Open MPI.
> I suspect most of the time is spent trying to find the new
> communicator ID (CID) when a communicator is created (vs
figuring out
> the new ranks)
> iirc, on older versions of Open MPI, that was implemented with
as many
> MPI_Allreduce(MPI_MAX) as needed to figure out the smallest common
> unused CID for the newly created communicator.
>
> So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> your program, only one MPI_Allreduce() should be involved per
> MPI_Comm_dup().
> But if you do the same thing in the middle of your run, and
after each
> rank has a different lower unused CID, the performances can be
(much)
> worst.
> If i understand correctly your description of the issue, that would
> explain the performance discrepancy between static vs dynamic
> communicator creation time.
>
> fwiw, this part has been (highly) improved in the latest
releases of Open MPI.
>
> If your benchmark is available for download, could you please
post a link ?
>
>
> Cheers,
>
> Gilles
>
> On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams
mailto:swwilli...@lbl.gov>> wrote:
>> Some of my collaborators have had issues with one of my
benchmarks at high concurrency (82K MPI procs) on the K machine in
Japan.  I believe K uses OpenMPI and the issues has been

Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Gilles Gouaillardet
Folks,

FWIW, I wrote a simple test program that mimics Open MPI's qsort usage, and
a single qsort invocation on 1 million keys always takes less than one
second on K.

So qsort() is unlikely to be the culprit here.
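
A minimal stand-in for that kind of test (not the actual program; the
nearly-sorted key pattern is just one plausible choice) would be something
like:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NKEYS 1000000

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int *keys = malloc(NKEYS * sizeof(*keys));
    if (!keys) return 1;

    /* Nearly sorted keys, similar to the (key ~ rank) pattern a typical
     * MPI_Comm_split sees. */
    for (int i = 0; i < NKEYS; i++)
        keys[i] = i;

    clock_t t0 = clock();
    qsort(keys, NKEYS, sizeof(*keys), cmp_int);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("qsort of %d keys: %.3f s\n", NKEYS, secs);
    free(keys);
    return 0;
}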

Cheers,

Gilles

On Wed, Nov 8, 2017 at 3:24 AM, George Bosilca  wrote:
> Samuel,
>
> We did some work in this area last year, but we never pursued it to push the
> code back in OMPI. We would be interested to give a more well-defined
> context to a followup study, especially if there is a large scale app as a
> target. Ping me back if you, or your collaborators, share the same interest.
>
>   George.
>
>
>
> On Tue, Nov 7, 2017 at 11:55 AM, Edgar Gabriel 
> wrote:
>>
>> My guess would be that both aspects (sorting + CID allocation) could be a
>> problem.
>>
>> There was a loong time back an effort to convert the sequence of allgather
>> + qsort into a distributed sort (based on a paper by Moody et. al. where he
>> demonstrated the benefits of this approach).  We didn't get unfortunately to
>> implement this part due to timing constraints.
>>
>> Also, we used to have a block allocation algorithm for CIDs that did save
>> quite a bit in communication costs, but we had to abandon it because the
>> implementation was wasting CIDs by not properly reusing them. It is fairly
>> clear on how to implement it correctly, it just hasn't been done for a lack
>> of resources.
>>
>> Maybe we should revisit those items at one point in time.
>> Thanks
>> Edgar
>>
>>
>> On 11/7/2017 10:23 AM, George Bosilca wrote:
>>
>> Samuel,
>>
>> You are right, we use qsort to sort the keys, but the qsort only applies
>> on participants with the same color. So while the complexity of the qsort
>> might reach bottom only when most of the processes participate with the same
>> color.
>>
>> What I think is OMPI problem in this are is the selection of the next cid
>> for the newly created communicator. We are doing the selection of the cid on
>> the original communicator, and this basically counts for a significant
>> increase in the duration, as will need to iterate a longer to converge to a
>> common cid.
>>
>> We haven't made any improvement in this area for the last few years, we
>> simply transformed the code to use non-blocking communications instead of
>> blocking, but this has little impact on the performance of the split itself.
>>
>>   George.
>>
>>
>> On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams 
>> wrote:
>>>
>>> I'll ask my collaborators if they've submitted a ticket.
>>> (they have the accounts; built the code; ran the code; observed the
>>> issues)
>>>
>>> I believe the issue on MPICH was a qsort issue and not a Allreduce issue.
>>> When this is coupled with the fact that it looked like qsort is called in
>>> ompi_comm_split
>>> (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
>>> I wanted to raise the issue so that it may be investigated to understand
>>> whether users can naively blunder into worst case computational complexity
>>> issues.
>>>
>>> We've been running hpgmg-fv (not -fe).  They were using the flux variants
>>> (requires local.mk build operators.flux.c instead of operators.fv4.c) and
>>> they are a couple commits behind.  Regardless, this issue has persisted on K
>>> for several years.  By default, it will build log(N) subcommunicators where
>>> N is the problem size.  Weak scaling experiments has shown comm_split/dup
>>> times growing consistently with worst case complexity.  That being said, AMR
>>> codes might rebuild the sub communicators as they regrid/adapt.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet
>>> >  wrote:
>>> >
>>> > Samuel,
>>> >
>>> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
>>> > is based on Open MPI.
>>> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
>>> > From a support perspective, this should be reported to the HPCI
>>> > helpdesk http://www.hpci-office.jp/pages/e_support
>>> >
>>> > As far as i understand, Fujitsu MPI currently available on K is not
>>> > based on the latest Open MPI.
>>> > I suspect most of the time is spent trying to find the new
>>> > communicator ID (CID) when a communicator is created (vs figuring out
>>> > the new ranks)
>>> > iirc, on older versions of Open MPI, that was implemented with as many
>>> > MPI_Allreduce(MPI_MAX) as needed to figure out the smallest common
>>> > unused CID for the newly created communicator.
>>> >
>>> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
>>> > your program, only one MPI_Allreduce() should be involved per
>>> > MPI_Comm_dup().
>>> > But if you do the same thing in the middle of your run, and after each
>>> > rank has a different lower unused CID, the performances can be (much)
>>> > worst.
>>> > If i understand correctly your description of the issue, that would
>>> > explain the performance discrepancy between static vs dynamic
>>> > communicator crea

Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Kawashima, Takahiro
Samuel,

I am a developer of Fujitsu MPI. Thanks for using the K computer.
For official support, please consult with the helpdesk of K,
as Gilles said. The helpdesk may have information based on past
inquiries. If not, the inquiry will be forwarded to our team.

As other people said, the Fujitsu MPI used on K is based on an old
Open MPI (v1.6.3 with bug fixes). We don't have a plan to update it
to a newer version because the system software is in a maintenance
phase. At first glance, I also suspect the cost of the multiple
allreduces.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Some of my collaborators have had issues with one of my benchmarks at high 
> concurrency (82K MPI procs) on the K machine in Japan.  I believe K uses 
> OpenMPI and the issues has been tracked to time in MPI_Comm_dup/Comm_split 
> increasing quadratically with process concurrency.  At 82K processes, each 
> call to dup/split is taking 15s to complete.  These high times restrict 
> comm_split/dup to be used statically (at the beginning) and not dynamically 
> in an application.
> 
> I had a similar issue a few years ago on ANL/Mira/MPICH where they called 
> qsort to split the ranks.  Although qsort/quicksort has ideal computational 
> complexity of O(PlogP)  [P is the number of MPI ranks], it can have worst 
> case complexity of O(P^2)... at 82K, P/logP is a 5000x slowdown.  
> 
> Can you confirm whether qsort (or the like) is (still) used in these routines 
> in OpenMPI?  It seems mergesort (worst case complexity of PlogP) would be a 
> more scalable approach.  I have not observed this issue on the Cray MPICH 
> implementation and the Mira MPICH issues has since been resolved.

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Christopher Samuel
On 08/11/17 12:30, Kawashima, Takahiro wrote:

> As other people said, Fujitsu MPI used in K is based on old
> Open MPI (v1.6.3 with bug fixes). 

I guess the obvious question is: will vanilla Open MPI work on K?

-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Gilles Gouaillardet
Vanilla Open MPI has no support for the Tofu interconnect, nor for the
Fujitsu batch manager.

Also, and IIRC, TCP communication between two compute nodes is not always
possible on the K computer.

So the simple answer is no.


Cheers,


Gilles


On 11/8/2017 10:37 AM, Christopher Samuel wrote:

On 08/11/17 12:30, Kawashima, Takahiro wrote:


As other people said, Fujitsu MPI used in K is based on old
Open MPI (v1.6.3 with bug fixes).

I guess the obvious question is will the vanilla Open-MPI work on K?



___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] subcommunicator OpenMPI issues on K

2017-11-07 Thread Kawashima, Takahiro
> > As other people said, Fujitsu MPI used in K is based on old
> > Open MPI (v1.6.3 with bug fixes). 
> 
> I guess the obvious question is will the vanilla Open-MPI work on K?

Unfortunately no. Support for Tofu and the Fujitsu resource manager
is not included in Open MPI.

Takahiro Kawashima,
MPI development team,
Fujitsu

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel