Hi Junchao,

Unfortunately these modifications did not work on our cluster (see output below). However, I am not asking you to spend any more time on this, as we are able to avoid the problem by setting appropriate sysctl parameters in /etc/sysctl.conf.
Thank you again for all your help on this.

Randy

Output of test program:

mpiexec -np 1280 -hostfile machines ./test -nsubs 160 -nx 100 -ny 100 -nz 10 -max_pending_isends 64
 Started
 ind2 max 31999999 nis 33600
 begin VecScatter create
[1175]PETSC ERROR: #1 PetscCommBuildTwoSided_Ibarrier() line 102 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
[1175]PETSC ERROR: #2 PetscCommBuildTwoSided() line 313 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
[1175]PETSC ERROR: #3 PetscSFSetUp_Basic() line 33 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/impls/basic/sfbasic.c
[1175]PETSC ERROR: #4 PetscSFSetUp() line 253 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/interface/sf.c
[1175]PETSC ERROR: #5 VecScatterSetUp_SF() line 747 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/impls/sf/vscatsf.c
[1175]PETSC ERROR: #6 VecScatterSetUp() line 208 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscatfce.c
[1175]PETSC ERROR: #7 VecScatterCreate() line 287 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscreate.c

> On Apr 27, 2020, at 9:59 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> Randy,
> You are absolutely right. The AOApplicationToPetsc could not be removed. Since the excessive communication is inevitable, I made two changes in petsc to ease it. One is that I skewed the communication so that each rank sends to ranks greater than itself first. The other is an option, -max_pending_isend, to control the number of pending isends. The current default is 512.
> I have an MR at https://gitlab.com/petsc/petsc/-/merge_requests/2757. I tested it dozens of times with your example at 5120 ranks and it worked fine. Please try it in your environment and let me know the result. Since the failure is random, you may need to run multiple times.
>
> BTW, if there is no objection, I'd like to add your excellent example to the petsc repo.
>
> Thanks
> --Junchao Zhang
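The throttling idea amounts to never having more than a fixed number of isends in flight: post sends until the cap is reached, then wait for one to complete before posting the next. A minimal generic MPI sketch of that pattern follows; the names (throttled_isends, max_pending) are invented for illustration, and this is not the code in the MR, which additionally orders destinations so that each rank sends to higher-numbered ranks first.

/* Illustrative sketch only: cap the number of outstanding MPI_Isend calls.
 * Sends one int to each rank in dest[0..nto-1], keeping at most max_pending
 * isends in flight at any time. */
#include <mpi.h>
#include <stdlib.h>

static void throttled_isends(const int *dest, const int *payload, int nto,
                             int max_pending, int tag, MPI_Comm comm)
{
  MPI_Request *reqs = (MPI_Request *)malloc(sizeof(MPI_Request) * max_pending);
  int          inflight = 0, i, idx;

  for (i = 0; i < nto; i++) {
    if (inflight == max_pending) {
      /* Throttle: wait for one pending isend to finish before posting more */
      MPI_Waitany(inflight, reqs, &idx, MPI_STATUS_IGNORE);
      reqs[idx] = reqs[--inflight];            /* compact the request array */
    }
    MPI_Isend((void *)&payload[i], 1, MPI_INT, dest[i], tag, comm, &reqs[inflight++]);
  }
  MPI_Waitall(inflight, reqs, MPI_STATUSES_IGNORE);  /* drain the remainder */
  free(reqs);
}

With the cap set to something like 64 or 512, a rank with over a thousand neighbors never has more than that many sends open at once, which is the behavior the -max_pending_isend option is meant to provide.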
> On Fri, Apr 24, 2020 at 5:32 PM Randall Mackie <rlmackie...@gmail.com> wrote:
> Hi Junchao,
>
> I tested by commenting out the AOApplicationToPetsc calls as you suggest, but it doesn't work because it doesn't maintain the proper order of the elements in the scattered vectors.
>
> I attach a modified version of the test code where I put elements into the global vector, then carry out the scatter, and check on the subcomms that they are correct.
>
> You can see everything is fine with the AOApplicationToPetsc calls, but the comparison fails when those are commented out.
>
> If there is some way I can achieve the right VecScatters without those calls, I would be happy to know how to do that.
>
> Thank you again for your help.
>
> Randy
>
> ps. I suggest you run this test with nx=ny=nz=10 and only a couple of subcomms and maybe 4 processes to demonstrate the behavior.
>
>> On Apr 20, 2020, at 2:45 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>> Hello, Randy,
>> I looked further at the problem and believe it was due to overwhelming traffic. The code sometimes fails at MPI_Waitall. I printed out the MPI error strings of the bad MPI Statuses. One of them is like "MPID_nem_tcp_connpoll(1845): Communication error with rank 25: Connection reset by peer", which is a tcp error and has nothing to do with petsc. Further investigation shows that in the case of 5120 ranks with 320 sub-communicators, during VecScatterSetUp each rank has around 640 isend/irecv neighbors, and quite a few ranks have 1280 isend neighbors. I guess these overwhelming isends occasionally crashed the connection.
>> The piece of code in VecScatterSetUp is there to calculate the communication pattern. With index sets "having good locality", the calculation itself incurs less traffic. Here good locality means the indices in an index set mostly point to local entries. However, the AOApplicationToPetsc() call in your code unnecessarily ruined the good petsc ordering. If we remove AOApplicationToPetsc() (the vecscatter result is still correct), then each rank uniformly has around 320 isends/irecvs.
>> So, test with this modification and see if it really works in your environment. If not applicable, we can provide options in petsc to carry out the communication in phases to avoid flooding the network (though it is better done by MPI).
>>
>> Thanks.
>> --Junchao Zhang
>>
>> On Fri, Apr 17, 2020 at 10:47 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>> Hi Junchao,
>>
>> Thank you for your efforts.
>> We tried petsc-3.13.0 but it made no difference.
>> We now think the issue is with sysctl parameters, and increasing those seems to have cleared up the problem.
>> This also most likely explains how different clusters had different behaviors with our test code.
>>
>> We are now running our code and will report back once we are sure that there are no further issues.
>>
>> Thanks again for your help.
>>
>> Randy M.
>>
>>> On Apr 17, 2020, at 8:09 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>> On Thu, Apr 16, 2020 at 11:13 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>> Randy,
>>> I reproduced your error with petsc-3.12.4 and 5120 mpi ranks. I also found the error went away with petsc-3.13. However, I have not figured out what the bug is and which commit fixed it :). So on your side, it is better to use the latest petsc.
>>> I want to add that even with petsc-3.12.4 the error is random. I was only able to reproduce the error once, so I cannot claim petsc-3.13 actually fixed it (or that the bug is really in petsc).
>>>
>>> --Junchao Zhang
>>>
>>> On Thu, Apr 16, 2020 at 9:06 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>> Randy,
>>> Up to now I could not reproduce your error, even with the biggest run:
>>> mpirun -n 5120 ./test -nsubs 320 -nx 100 -ny 100 -nz 100
>>> While I continue testing, you can try other options. It looks like you want to duplicate a vector to subcomms. I don't think you need these two lines:
>>> call AOApplicationToPetsc(aoParent,nis,ind1,ierr)
>>> call AOApplicationToPetsc(aoSub,nis,ind2,ierr)
>>> In addition, you can use simpler and more memory-efficient index sets. There is a petsc example for this task; see case 3 in https://gitlab.com/petsc/petsc/-/blob/master/src/vec/vscat/tests/ex9.c
>>> BTW, it is good to use petsc master so we are on the same page.
>>> --Junchao Zhang
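The "simpler and more memory-efficient index sets" suggestion can be read as: when the entries to move form contiguous ranges, describe them with a stride index set instead of an explicit list of indices. A rough sketch under that contiguity assumption follows; the routine name and arguments (make_contiguous_scatter, first, n) are invented here and it is not taken from ex9.c.

/* Sketch: scatter a contiguous block [first, first+n) of a parent vector into
 * entries [0, n) of a sub-communicator vector, using stride index sets so no
 * explicit index arrays need to be stored. */
#include <petscvec.h>

PetscErrorCode make_contiguous_scatter(Vec parent, Vec sub, PetscInt first, PetscInt n, VecScatter *scat)
{
  IS             is_from, is_to;
  PetscErrorCode ierr;

  ierr = ISCreateStride(PETSC_COMM_SELF, n, first, 1, &is_from); CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_SELF, n, 0, 1, &is_to); CHKERRQ(ierr);
  ierr = VecScatterCreate(parent, is_from, sub, is_to, scat); CHKERRQ(ierr);
  ierr = ISDestroy(&is_from); CHKERRQ(ierr);
  ierr = ISDestroy(&is_to); CHKERRQ(ierr);
  return 0;
}

A stride IS stores only its length, start, and step, rather than an array of n indices.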
>>> On Wed, Apr 15, 2020 at 10:28 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>>> Hi Junchao,
>>>
>>> So I was able to create a small test code that duplicates the issue we have been having, and it is attached to this email in a zip file. Included are the test.F90 code, the commands to duplicate the crash and a successful run, the output errors, and our petsc configuration.
>>>
>>> Our findings to date include:
>>> - The error is reproducible in a very short time with this script.
>>> - It is related to nproc*nsubs and (although to a lesser extent) to DM grid size.
>>> - It happens regardless of MPI implementation (mpich, intel mpi 2018, 2019, openmpi) or compiler (gfortran/gcc, intel 2018).
>>> - Changing vecscatter_type to mpi1 or mpi3 has no effect. Mpi1 seems to slightly increase the limit, but it still fails on the full machine set.
>>> - Nothing looks interesting on valgrind.
>>>
>>> Our initial tests were carried out on an Azure cluster, but we also tested on our smaller cluster, and we found the following:
>>>
>>> Works:
>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1280 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>
>>> Crashes (this works on Azure):
>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 2560 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>
>>> So it looks like it may also be related to the physical number of nodes.
>>>
>>> In any case, even with 2560 processes on 192 cores the memory does not go above 3.5 GBytes, so you don't need a huge cluster to test.
>>>
>>> Thanks,
>>>
>>> Randy M.
>>>
>>>> On Apr 14, 2020, at 12:23 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>
>>>> There is an MPI_Allreduce in PetscGatherNumberOfMessages; that is why I suspected it was the problem. Even if users configure petsc with 64-bit indices, we use PetscMPIInt in MPI calls, so it is not a problem.
>>>> Try -vecscatter_type mpi1 to restore the original VecScatter implementation. If the problem still remains, could you provide a test example for me to debug?
>>>>
>>>> --Junchao Zhang
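The point about PetscMPIInt is that a 64-bit PetscInt is never handed to MPI directly; it is first converted to a plain int with an overflow check. A minimal sketch of that pattern (routine and variable names invented here, not the actual PETSc source):

/* Sketch: reduce a possibly 64-bit local count over a communicator.
 * PetscMPIIntCast errors out if the value does not fit in an int. */
#include <petscsys.h>

PetscErrorCode max_local_count(MPI_Comm comm, PetscInt nlocal, PetscMPIInt *nmax)
{
  PetscMPIInt    n_mpi;
  PetscErrorCode ierr;

  ierr = PetscMPIIntCast(nlocal, &n_mpi); CHKERRQ(ierr);
  ierr = MPI_Allreduce(&n_mpi, nmax, 1, MPI_INT, MPI_MAX, comm); CHKERRQ(ierr);
  return 0;
}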
>>>> On Tue, Apr 14, 2020 at 12:13 PM Randall Mackie <rlmackie...@gmail.com> wrote:
>>>> Hi Junchao,
>>>>
>>>> We have tried your two suggestions but the problem remains. And the problem seems to be at the MPI_Isend on line 117 in PetscGatherMessageLengths, and not at the MPI_Allreduce.
>>>>
>>>> We have now tried Intel MPI, Mpich, and OpenMPI, and so are thinking the problem must be elsewhere and not MPI.
>>>>
>>>> Given that this is a 64-bit indices build of PETSc, is there some possible incompatibility between PETSc and MPI calls?
>>>>
>>>> We are open to any other possible suggestions to try, as other than valgrind on thousands of processes we seem to have run out of ideas.
>>>>
>>>> Thanks, Randy M.
>>>>
>>>>> On Apr 13, 2020, at 8:54 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>> On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>> Randy,
>>>>> Someone reported a similar problem before. It turned out to be an Intel MPI MPI_Allreduce bug. A workaround is setting the environment variable I_MPI_ADJUST_ALLREDUCE=1.
>>>>> But you mentioned mpich also had the error, so maybe the problem is not the same. Let's try the workaround first. If it doesn't work, add another petsc option, -build_twosided allreduce, which is a workaround for Intel MPI_Ibarrier bugs we have met.
>>>>> Thanks.
>>>>> --Junchao Zhang
>>>>>
>>>>> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>>>>> Dear PETSc users,
>>>>>
>>>>> We are trying to understand an issue that has come up in running our code on a large cloud cluster with a large number of processes and subcomms. This is code that we use daily on multiple clusters without problems, and that runs valgrind clean for small test problems.
>>>>>
>>>>> The run generates the following messages, but doesn't crash; it just seems to hang, with all processes continuing to show activity:
>>>>>
>>>>> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
>>>>> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
>>>>> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
>>>>> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>>>>>
>>>>> Looking at line 117 in PetscGatherMessageLengths we find the offending statement is the MPI_Isend:
>>>>>
>>>>>   /* Post the Isends with the message length-info */
>>>>>   for (i=0,j=0; i<size; ++i) {
>>>>>     if (ilengths[i]) {
>>>>>       ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>>>>>       j++;
>>>>>     }
>>>>>   }
>>>>>
>>>>> We have tried this with Intel MPI 2018, 2019, and mpich, all giving the same problem.
>>>>>
>>>>> We suspect there is some limit being set on this cloud cluster on the number of file connections or something, but we don't know.
>>>>>
>>>>> Anyone have any ideas? We are sort of grasping at straws at this point.
>>>>>
>>>>> Thanks, Randy M.
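On the suspected connection limit: the per-process open-file/descriptor limit is one of the ceilings a run with this many simultaneous TCP connections can hit, and it can be read with getrlimit. A small generic check (not part of the test code above):

/* Print the soft and hard open-file limits for the calling process. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
    printf("open files: soft %llu, hard %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
  return 0;
}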