Hi Junchao,

Unfortunately these modifications did not work on our cluster (see output below). However, I am not asking you to spend any more time on this, as we are able to avoid the problem by setting appropriate sysctl parameters in /etc/sysctl.conf.
Thank you again for all your help on this.

Randy

Output of test program:

mpiexec -np 1280 -hostfile machines ./test -nsubs 160 -nx 100 -ny 100 -nz 10 -max_pending_isends 64
 Started
 ind2 max 31999999 nis 33600
 begin VecScatter create
[1175]PETSC ERROR: #1 PetscCommBuildTwoSided_Ibarrier() line 102 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
[1175]PETSC ERROR: #2 PetscCommBuildTwoSided() line 313 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
[1175]PETSC ERROR: #3 PetscSFSetUp_Basic() line 33 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/impls/basic/sfbasic.c
[1175]PETSC ERROR: #4 PetscSFSetUp() line 253 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/interface/sf.c
[1175]PETSC ERROR: #5 VecScatterSetUp_SF() line 747 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/impls/sf/vscatsf.c
[1175]PETSC ERROR: #6 VecScatterSetUp() line 208 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscatfce.c
[1175]PETSC ERROR: #7 VecScatterCreate() line 287 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscreate.c

> On Apr 27, 2020, at 9:59 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> Randy,
> You are absolutely right. The AOApplicationToPetsc could not be removed. Since the excessive communication is inevitable, I made two changes in petsc to ease it. One is that I skewed the communication so that each rank sends to ranks greater than itself first. The other is an option, -max_pending_isend, to control the number of pending isends. The current default is 512.
> I have an MR at https://gitlab.com/petsc/petsc/-/merge_requests/2757. I tested it dozens of times with your example at 5120 ranks and it worked fine. Please try it in your environment and let me know the result. Since the failure is random, you may need to run multiple times.
>
> BTW, if there is no objection, I'd like to add your excellent example to the petsc repo.
>
> Thanks
> --Junchao Zhang
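The throttling idea amounts to never having more than a fixed number of isends in flight: post sends until the cap is reached, then wait for one to complete before posting the next. A minimal generic MPI sketch of that pattern follows; the names (throttled_isends, max_pending) are invented for illustration, and this is not the code in the MR, which additionally orders destinations so that each rank sends to higher-numbered ranks first.

/* Illustrative sketch only: cap the number of outstanding MPI_Isend calls.
 * Sends one int to each rank in dest[0..nto-1], keeping at most max_pending
 * isends in flight at any time. */
#include <mpi.h>
#include <stdlib.h>

static void throttled_isends(const int *dest, const int *payload, int nto,
                             int max_pending, int tag, MPI_Comm comm)
{
  MPI_Request *reqs = (MPI_Request *)malloc(sizeof(MPI_Request) * max_pending);
  int          inflight = 0, i, idx;

  for (i = 0; i < nto; i++) {
    if (inflight == max_pending) {
      /* Throttle: wait for one pending isend to finish before posting more */
      MPI_Waitany(inflight, reqs, &idx, MPI_STATUS_IGNORE);
      reqs[idx] = reqs[--inflight];            /* compact the request array */
    }
    MPI_Isend((void *)&payload[i], 1, MPI_INT, dest[i], tag, comm, &reqs[inflight++]);
  }
  MPI_Waitall(inflight, reqs, MPI_STATUSES_IGNORE);  /* drain the remainder */
  free(reqs);
}

With the cap set to something like 64 or 512, a rank with over a thousand neighbors never has more than that many sends open at once, which is the behavior the -max_pending_isend option is meant to provide.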
> On Fri, Apr 24, 2020 at 5:32 PM Randall Mackie <rlmackie...@gmail.com> wrote:
> Hi Junchao,
>
> I tested by commenting out the AOApplicationToPetsc calls as you suggest, but it doesn't work because it doesn't maintain the proper order of the elements in the scattered vectors.
>
> I attach a modified version of the test code where I put elements into the global vector, then carry out the scatter, and check on the subcomms that they are correct.
>
> You can see everything is fine with the AOApplicationToPetsc calls, but the comparison fails when those are commented out.
>
> If there is some way I can achieve the right VecScatters without those calls, I would be happy to know how to do that.
>
> Thank you again for your help.
>
> Randy
>
> ps. I suggest you run this test with nx=ny=nz=10 and only a couple of subcomms and maybe 4 processes to demonstrate the behavior.
>
>> On Apr 20, 2020, at 2:45 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>> Hello, Randy,
>> I looked further at the problem and believe it was due to overwhelming traffic. The code sometimes fails at MPI_Waitall. I printed out the MPI error strings of the bad MPI Statuses. One of them is like "MPID_nem_tcp_connpoll(1845): Communication error with rank 25: Connection reset by peer", which is a tcp error and has nothing to do with petsc. Further investigation shows that in the case of 5120 ranks with 320 sub-communicators, during VecScatterSetUp each rank has around 640 isend/irecv neighbors, and quite a few ranks have 1280 isend neighbors. I guess these overwhelming isends occasionally crashed the connection.
>> The piece of code in VecScatterSetUp is there to calculate the communication pattern. With index sets "having good locality", the calculation itself incurs less traffic. Here good locality means the indices in an index set mostly point to local entries. However, the AOApplicationToPetsc() call in your code unnecessarily ruined the good petsc ordering. If we remove AOApplicationToPetsc() (the vecscatter result is still correct), then each rank uniformly has around 320 isends/irecvs.
>> So, test with this modification and see if it really works in your environment. If not applicable, we can provide options in petsc to carry out the communication in phases to avoid flooding the network (though it is better done by MPI).
>>
>> Thanks.
>> --Junchao Zhang
>>
>> On Fri, Apr 17, 2020 at 10:47 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>> Hi Junchao,
>>
>> Thank you for your efforts.
>> We tried petsc-3.13.0 but it made no difference.
>> We now think the issue is with sysctl parameters, and increasing those seems to have cleared up the problem.
>> This also most likely explains how different clusters had different behaviors with our test code.
>>
>> We are now running our code and will report back once we are sure that there are no further issues.
>>
>> Thanks again for your help.
>>
>> Randy M.
>>
>>> On Apr 17, 2020, at 8:09 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>> On Thu, Apr 16, 2020 at 11:13 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>> Randy,
>>> I reproduced your error with petsc-3.12.4 and 5120 mpi ranks. I also found the error went away with petsc-3.13. However, I have not figured out what the bug is and which commit fixed it :). So on your side, it is better to use the latest petsc.
>>> I want to add that even with petsc-3.12.4 the error is random. I was only able to reproduce the error once, so I cannot claim petsc-3.13 actually fixed it (or that the bug is really in petsc).
>>>
>>> --Junchao Zhang
>>>
>>> On Thu, Apr 16, 2020 at 9:06 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>> Randy,
>>> Up to now I could not reproduce your error, even with the biggest run:
>>> mpirun -n 5120 ./test -nsubs 320 -nx 100 -ny 100 -nz 100
>>> While I continue testing, you can try other options. It looks like you want to duplicate a vector to subcomms. I don't think you need these two lines:
>>> call AOApplicationToPetsc(aoParent,nis,ind1,ierr)
>>> call AOApplicationToPetsc(aoSub,nis,ind2,ierr)
>>> In addition, you can use simpler and more memory-efficient index sets. There is a petsc example for this task; see case 3 in https://gitlab.com/petsc/petsc/-/blob/master/src/vec/vscat/tests/ex9.c
>>> BTW, it is good to use petsc master so we are on the same page.
>>> --Junchao Zhang
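The "simpler and more memory-efficient index sets" suggestion can be read as: when the entries to move form contiguous ranges, describe them with a stride index set instead of an explicit list of indices. A rough sketch under that contiguity assumption follows; the routine name and arguments (make_contiguous_scatter, first, n) are invented here and it is not taken from ex9.c.

/* Sketch: scatter a contiguous block [first, first+n) of a parent vector into
 * entries [0, n) of a sub-communicator vector, using stride index sets so no
 * explicit index arrays need to be stored. */
#include <petscvec.h>

PetscErrorCode make_contiguous_scatter(Vec parent, Vec sub, PetscInt first, PetscInt n, VecScatter *scat)
{
  IS             is_from, is_to;
  PetscErrorCode ierr;

  ierr = ISCreateStride(PETSC_COMM_SELF, n, first, 1, &is_from); CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_SELF, n, 0, 1, &is_to); CHKERRQ(ierr);
  ierr = VecScatterCreate(parent, is_from, sub, is_to, scat); CHKERRQ(ierr);
  ierr = ISDestroy(&is_from); CHKERRQ(ierr);
  ierr = ISDestroy(&is_to); CHKERRQ(ierr);
  return 0;
}

A stride IS stores only its length, start, and step, rather than an array of n indices.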
>>> On Wed, Apr 15, 2020 at 10:28 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>>> Hi Junchao,
>>>
>>> So I was able to create a small test code that duplicates the issue we have been having, and it is attached to this email in a zip file. Included are the test.F90 code, the commands to duplicate the crash and a successful run, the output errors, and our petsc configuration.
>>>
>>> Our findings to date include:
>>> - The error is reproducible in a very short time with this script.
>>> - It is related to nproc*nsubs and (although to a lesser extent) to DM grid size.
>>> - It happens regardless of MPI implementation (mpich, intel mpi 2018, 2019, openmpi) or compiler (gfortran/gcc, intel 2018).
>>> - Changing vecscatter_type to mpi1 or mpi3 has no effect. Mpi1 seems to slightly increase the limit, but it still fails on the full machine set.
>>> - Nothing looks interesting on valgrind.
>>>
>>> Our initial tests were carried out on an Azure cluster, but we also tested on our smaller cluster, and we found the following:
>>>
>>> Works:
>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1280 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>
>>> Crashes (this works on Azure):
>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 2560 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>
>>> So it looks like it may also be related to the physical number of nodes.
>>>
>>> In any case, even with 2560 processes on 192 cores the memory does not go above 3.5 GBytes, so you don't need a huge cluster to test.
>>>
>>> Thanks,
>>>
>>> Randy M.
>>>
>>>> On Apr 14, 2020, at 12:23 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>
>>>> There is an MPI_Allreduce in PetscGatherNumberOfMessages; that is why I suspected it was the problem. Even if users configure petsc with 64-bit indices, we use PetscMPIInt in MPI calls, so it is not a problem.
>>>> Try -vecscatter_type mpi1 to restore the original VecScatter implementation. If the problem still remains, could you provide a test example for me to debug?
>>>>
>>>> --Junchao Zhang
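The point about PetscMPIInt is that a 64-bit PetscInt is never handed to MPI directly; it is first converted to a plain int with an overflow check. A minimal sketch of that pattern (routine and variable names invented here, not the actual PETSc source):

/* Sketch: reduce a possibly 64-bit local count over a communicator.
 * PetscMPIIntCast errors out if the value does not fit in an int. */
#include <petscsys.h>

PetscErrorCode max_local_count(MPI_Comm comm, PetscInt nlocal, PetscMPIInt *nmax)
{
  PetscMPIInt    n_mpi;
  PetscErrorCode ierr;

  ierr = PetscMPIIntCast(nlocal, &n_mpi); CHKERRQ(ierr);
  ierr = MPI_Allreduce(&n_mpi, nmax, 1, MPI_INT, MPI_MAX, comm); CHKERRQ(ierr);
  return 0;
}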
>>>> On Tue, Apr 14, 2020 at 12:13 PM Randall Mackie <rlmackie...@gmail.com> wrote:
>>>> Hi Junchao,
>>>>
>>>> We have tried your two suggestions but the problem remains. And the problem seems to be at the MPI_Isend on line 117 in PetscGatherMessageLengths, and not at the MPI_Allreduce.
>>>>
>>>> We have now tried Intel MPI, Mpich, and OpenMPI, and so are thinking the problem must be elsewhere and not MPI.
>>>>
>>>> Given that this is a 64-bit indices build of PETSc, is there some possible incompatibility between PETSc and MPI calls?
>>>>
>>>> We are open to any other possible suggestions to try, as other than valgrind on thousands of processes we seem to have run out of ideas.
>>>>
>>>> Thanks, Randy M.
>>>>
>>>>> On Apr 13, 2020, at 8:54 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>> On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>> Randy,
>>>>> Someone reported a similar problem before. It turned out to be an Intel MPI MPI_Allreduce bug. A workaround is setting the environment variable I_MPI_ADJUST_ALLREDUCE=1.
>>>>> But you mentioned mpich also had the error, so maybe the problem is not the same. Let's try the workaround first. If it doesn't work, add another petsc option, -build_twosided allreduce, which is a workaround for Intel MPI_Ibarrier bugs we have met.
>>>>> Thanks.
>>>>> --Junchao Zhang
>>>>>
>>>>> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>>>>> Dear PETSc users,
>>>>>
>>>>> We are trying to understand an issue that has come up in running our code on a large cloud cluster with a large number of processes and subcomms. This is code that we use daily on multiple clusters without problems, and that runs valgrind clean for small test problems.
>>>>>
>>>>> The run generates the following messages, but doesn't crash; it just seems to hang, with all processes continuing to show activity:
>>>>>
>>>>> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
>>>>> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
>>>>> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
>>>>> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>>>>>
>>>>> Looking at line 117 in PetscGatherMessageLengths we find the offending statement is the MPI_Isend:
>>>>>
>>>>>   /* Post the Isends with the message length-info */
>>>>>   for (i=0,j=0; i<size; ++i) {
>>>>>     if (ilengths[i]) {
>>>>>       ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>>>>>       j++;
>>>>>     }
>>>>>   }
>>>>>
>>>>> We have tried this with Intel MPI 2018, 2019, and mpich, all giving the same problem.
>>>>>
>>>>> We suspect there is some limit being set on this cloud cluster on the number of file connections or something, but we don't know.
>>>>>
>>>>> Anyone have any ideas? We are sort of grasping at straws at this point.
>>>>>
>>>>> Thanks, Randy M.
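On the suspected connection limit: the per-process open-file/descriptor limit is one of the ceilings a run with this many simultaneous TCP connections can hit, and it can be read with getrlimit. A small generic check (not part of the test code above):

/* Print the soft and hard open-file limits for the calling process. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
    printf("open files: soft %llu, hard %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
  return 0;
}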