I guess you can fix that with an additional option, -build_twosided allreduce.
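For example, if it is awkward to pass command-line options in your batch environment, a minimal sketch of setting it from the code is below (this assumes, as I believe is the case, that the option is only read lazily the first time PetscCommBuildTwoSided is needed, so setting it right after PetscInitialize is early enough):

  #include <petscsys.h>

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    /* Same effect as running with -build_twosided allreduce:
       PetscCommBuildTwoSided will use the allreduce algorithm instead of ibarrier. */
    ierr = PetscOptionsSetValue(NULL, "-build_twosided", "allreduce");CHKERRQ(ierr);

    /* ... create DMs, index sets, and VecScatters as usual ... */

    ierr = PetscFinalize();
    return ierr;
  }

Running with -build_twosided allreduce on the mpiexec command line is equivalent.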
We have two algorithms for PetscCommBuildTwoSided: ibarrier (used when the number of ranks is > 1024) and allreduce (otherwise). The flow control with ibarrier is much weaker than that in allreduce, though in my tests both worked.

Thanks.
--Junchao Zhang

On Thu, Apr 30, 2020 at 7:46 PM Randall Mackie <rlmackie...@gmail.com> wrote:

> Hi Junchao,
>
> Unfortunately these modifications did not work on our cluster (see output below).
> However, I am not asking you to spend any more time on this, as we are able to avoid the problem by setting appropriate sysctl parameters in /etc/sysctl.conf.
>
> Thank you again for all your help on this.
>
> Randy
>
> Output of test program:
>
> mpiexec -np 1280 -hostfile machines ./test -nsubs 160 -nx 100 -ny 100 -nz 10 -max_pending_isends 64
> Started
>
> ind2 max 31999999
> nis 33600
> begin VecScatter create
> [1175]PETSC ERROR: #1 PetscCommBuildTwoSided_Ibarrier() line 102 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
> [1175]PETSC ERROR: #2 PetscCommBuildTwoSided() line 313 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/sys/utils/mpits.c
> [1175]PETSC ERROR: #3 PetscSFSetUp_Basic() line 33 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/impls/basic/sfbasic.c
> [1175]PETSC ERROR: #4 PetscSFSetUp() line 253 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/is/sf/interface/sf.c
> [1175]PETSC ERROR: #5 VecScatterSetUp_SF() line 747 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/impls/sf/vscatsf.c
> [1175]PETSC ERROR: #6 VecScatterSetUp() line 208 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscatfce.c
> [1175]PETSC ERROR: #7 VecScatterCreate() line 287 in /state/std2/FEMI/PETSc/petsc-jczhang-throttle-pending-isends/src/vec/vscat/interface/vscreate.c
>
>
> On Apr 27, 2020, at 9:59 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> Randy,
> You are absolutely right: the AOApplicationToPetsc calls could not be removed. Since the excessive communication is inevitable, I made two changes in petsc to ease it. One is that I skewed the communication so that each rank sends to ranks greater than itself first. The other is an option, -max_pending_isend, to control the number of pending isends; the current default is 512.
> I have an MR at https://gitlab.com/petsc/petsc/-/merge_requests/2757. I tested it dozens of times with your example at 5120 ranks and it worked fine. Please try it in your environment and let me know the result. Since the failure is random, you may need to run multiple times.
>
> BTW, if there is no objection, I'd like to add your excellent example to the petsc repo.
>
> Thanks
> --Junchao Zhang
>
>
> On Fri, Apr 24, 2020 at 5:32 PM Randall Mackie <rlmackie...@gmail.com> wrote:
>
>> Hi Junchao,
>>
>> I tested by commenting out the AOApplicationToPetsc calls as you suggested, but it doesn't work because it doesn't maintain the proper order of the elements in the scattered vectors.
>>
>> I attach a modified version of the test code where I put elements into the global vector, then carry out the scatter, and check on the subcomms that they are correct.
>>
>> You can see everything is fine with the AOApplicationToPetsc calls, but the comparison fails when those are commented out.
>>
>> If there is some way I can achieve the right VecScatters without those calls, I would be happy to know how to do that.
>>
>> Thank you again for your help.
>>
>> Randy
>>
>> ps. I suggest you run this test with nx=ny=nz=10 and only a couple of subcomms and maybe 4 processes to demonstrate the behavior.
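To make the -max_pending_isend idea mentioned above concrete, here is a rough sketch of the throttling pattern it refers to. This is only an illustration, not the code in the MR; MAX_PENDING and throttled_isends are made-up names. The point is simply: post at most a fixed number of isends, drain them with MPI_Waitall, then continue.

  #include <mpi.h>

  /* Illustration only: send one small message to each destination in
     dests[0..ndests-1], but keep at most MAX_PENDING isends in flight.
     The matching receives are assumed to be progressed elsewhere. */
  #define MAX_PENDING 512

  static void throttled_isends(const int *dests, int ndests, const int *lens, MPI_Comm comm)
  {
    MPI_Request reqs[MAX_PENDING];
    int         i, pending = 0;

    for (i = 0; i < ndests; i++) {
      MPI_Isend((void *)&lens[i], 1, MPI_INT, dests[i], 0 /*tag*/, comm, &reqs[pending++]);
      if (pending == MAX_PENDING) {            /* batch is full: drain it */
        MPI_Waitall(pending, reqs, MPI_STATUSES_IGNORE);
        pending = 0;
      }
    }
    if (pending) MPI_Waitall(pending, reqs, MPI_STATUSES_IGNORE);
  }

The real setup code also has to service the matching irecvs while sends are pending; the sketch only shows the send-side batching.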
>>
>> On Apr 20, 2020, at 2:45 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>> Hello, Randy,
>> I further looked at the problem and believe it was due to overwhelming traffic. The code sometimes fails at MPI_Waitall. I printed out the MPI error strings of the bad MPI statuses. One of them is like "MPID_nem_tcp_connpoll(1845): Communication error with rank 25: Connection reset by peer", which is a tcp error and has nothing to do with petsc.
>> Further investigation shows that in the case of 5120 ranks with 320 sub communicators, during VecScatterSetUp each rank has around 640 isend/irecv neighbors, and quite a few ranks have 1280 isend neighbors. I guess these overwhelming isends occasionally crashed the connection.
>> The piece of code in VecScatterSetUp calculates the communication pattern. With index sets having good locality, the calculation itself incurs less traffic. Here good locality means the indices in an index set mostly point to local entries. However, the AOApplicationToPetsc() call in your code unnecessarily ruined the good petsc ordering. If we remove AOApplicationToPetsc() (the vecscatter result is still correct), then each rank uniformly has around 320 isends/irecvs.
>> So, test with this modification and see if it really works in your environment. If not applicable, we can provide options in petsc to carry out the communication in phases to avoid flooding the network (though it is better done by MPI).
>>
>> Thanks.
>> --Junchao Zhang
>>
>>
>> On Fri, Apr 17, 2020 at 10:47 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>>
>>> Hi Junchao,
>>>
>>> Thank you for your efforts.
>>> We tried petsc-3.13.0 but it made no difference.
>>> We now think the issue is with sysctl parameters, and increasing those seems to have cleared up the problem.
>>> This also most likely explains why different clusters showed different behaviors with our test code.
>>>
>>> We are now running our code and will report back once we are sure that there are no further issues.
>>>
>>> Thanks again for your help.
>>>
>>> Randy M.
>>>
>>> On Apr 17, 2020, at 8:09 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>> On Thu, Apr 16, 2020 at 11:13 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>>> Randy,
>>>> I reproduced your error with petsc-3.12.4 and 5120 mpi ranks. I also found the error went away with petsc-3.13. However, I have not figured out what the bug is and which commit fixed it :). So on your side, it is better to use the latest petsc.
>>>
>>> I want to add that even with petsc-3.12.4 the error is random. I was only able to reproduce it once, so I cannot claim petsc-3.13 actually fixed it (or that the bug is really in petsc).
>>>
>>>> --Junchao Zhang
>>>>
>>>> On Thu, Apr 16, 2020 at 9:06 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>
>>>>> Randy,
>>>>> Up to now I could not reproduce your error, even with the biggest case: mpirun -n 5120 ./test -nsubs 320 -nx 100 -ny 100 -nz 100
>>>>> While I continue testing, you can try other options. It looks like you want to duplicate a vector to subcomms.
>>>>> I don't think you need the two lines:
>>>>>
>>>>> call AOApplicationToPetsc(aoParent,nis,ind1,ierr)
>>>>> call AOApplicationToPetsc(aoSub,nis,ind2,ierr)
>>>>>
>>>>> In addition, you can use simpler and more memory-efficient index sets. There is a petsc example for this task; see case 3 in https://gitlab.com/petsc/petsc/-/blob/master/src/vec/vscat/tests/ex9.c
>>>>> BTW, it is good to use petsc master so we are on the same page.
>>>>> --Junchao Zhang
>>>>>
>>>>> On Wed, Apr 15, 2020 at 10:28 AM Randall Mackie <rlmackie...@gmail.com> wrote:
>>>>>
>>>>>> Hi Junchao,
>>>>>>
>>>>>> So I was able to create a small test code that duplicates the issue we have been having, and it is attached to this email in a zip file. Included are the test.F90 code, the commands to duplicate the crash and a successful run, the output errors, and our petsc configuration.
>>>>>>
>>>>>> Our findings to date include:
>>>>>>
>>>>>> The error is reproducible in a very short time with this script.
>>>>>> It is related to nproc*nsubs and (although to a lesser extent) to DM grid size.
>>>>>> It happens regardless of MPI implementation (mpich, intel mpi 2018, 2019, openmpi) or compiler (gfortran/gcc, intel 2018).
>>>>>> No effect changing vecscatter_type to mpi1 or mpi3. Mpi1 seems to slightly increase the limit, but still fails on the full machine set.
>>>>>> Nothing looks interesting on valgrind.
>>>>>>
>>>>>> Our initial tests were carried out on an Azure cluster, but we also tested on our smaller cluster, and we found the following:
>>>>>>
>>>>>> Works:
>>>>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1280 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>>>>
>>>>>> Crashes (this works on Azure):
>>>>>> $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 2560 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100
>>>>>>
>>>>>> So it looks like it may also be related to the physical number of nodes as well.
>>>>>>
>>>>>> In any case, even with 2560 processes on 192 cores the memory does not go above 3.5 GBytes, so you don't need a huge cluster to test.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Randy M.
>>>>>>
>>>>>> On Apr 14, 2020, at 12:23 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>
>>>>>> There is an MPI_Allreduce in PetscGatherNumberOfMessages, which is why I doubted it was the problem. Even if users configure petsc with 64-bit indices, we use PetscMPIInt in MPI calls, so it is not a problem.
>>>>>> Try -vecscatter_type mpi1 to restore the original VecScatter implementation. If the problem still remains, could you provide a test example for me to debug?
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>> On Tue, Apr 14, 2020 at 12:13 PM Randall Mackie <rlmackie...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> We have tried your two suggestions but the problem remains.
>>>>>>> And the problem seems to be on the MPI_Isend at line 117 in PetscGatherMessageLengths and not MPI_Allreduce.
>>>>>>>
>>>>>>> We have now tried Intel MPI, Mpich, and OpenMPI, and so are thinking the problem must be elsewhere and not MPI.
>>>>>>>
>>>>>>> Given that this is a 64-bit indices build of PETSc, is there some possible incompatibility between PETSc and MPI calls?
>>>>>>>
>>>>>>> We are open to any other possible suggestions to try, as other than valgrind on thousands of processes we seem to have run out of ideas.
>>>>>>>
>>>>>>> Thanks, Randy M.
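(A side note on the 64-bit indices question: as noted above in the Apr 14 reply, petsc uses PetscMPIInt, i.e. plain C int, for counts and ranks in MPI calls even in a 64-bit-indices build. A minimal sketch of that pattern is below; send_length and lenbuf are made-up names, and PetscMPIIntCast raises an error instead of silently truncating when a PetscInt does not fit in an int.)

  #include <petscsys.h>

  /* Sketch: send a message length stored as a (possibly 64-bit) PetscInt.
     'lenbuf' must stay valid until 'req' completes, since MPI_Isend is non-blocking. */
  PetscErrorCode send_length(PetscInt nlocal, PetscMPIInt *lenbuf, PetscMPIInt dest,
                             PetscMPIInt tag, MPI_Comm comm, MPI_Request *req)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = PetscMPIIntCast(nlocal, lenbuf);CHKERRQ(ierr); /* errors out if nlocal > INT_MAX */
    ierr = MPI_Isend(lenbuf, 1, MPI_INT, dest, tag, comm, req);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }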
>>>>>>> >>>>>>> Thanks, Randy M. >>>>>>> >>>>>>> On Apr 13, 2020, at 8:54 AM, Junchao Zhang <junchao.zh...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> --Junchao Zhang >>>>>>> >>>>>>> >>>>>>> On Mon, Apr 13, 2020 at 10:53 AM Junchao Zhang < >>>>>>> junchao.zh...@gmail.com> wrote: >>>>>>> >>>>>>>> Randy, >>>>>>>> Someone reported similar problem before. It turned out an Intel >>>>>>>> MPI MPI_Allreduce bug. A workaround is setting the environment >>>>>>>> variable >>>>>>>> I_MPI_ADJUST_ALLREDUCE=1.arr >>>>>>>> >>>>>>> Correct: I_MPI_ADJUST_ALLREDUCE=1 >>>>>>> >>>>>>>> But you mentioned mpich also had the error. So maybe the problem >>>>>>>> is not the same. So let's try the workaround first. If it doesn't >>>>>>>> work, add >>>>>>>> another petsc option -build_twosided allreduce, which is a workaround >>>>>>>> for >>>>>>>> Intel MPI_Ibarrier bugs we met. >>>>>>>> Thanks. >>>>>>>> --Junchao Zhang >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie < >>>>>>>> rlmackie...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Dear PETSc users, >>>>>>>>> >>>>>>>>> We are trying to understand an issue that has come up in running >>>>>>>>> our code on a large cloud cluster with a large number of processes and >>>>>>>>> subcomms. >>>>>>>>> This is code that we use daily on multiple clusters without >>>>>>>>> problems, and that runs valgrind clean for small test problems. >>>>>>>>> >>>>>>>>> The run generates the following messages, but doesn’t crash, just >>>>>>>>> seems to hang with all processes continuing to show activity: >>>>>>>>> >>>>>>>>> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in >>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c >>>>>>>>> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in >>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c >>>>>>>>> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in >>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c >>>>>>>>> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in >>>>>>>>> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c >>>>>>>>> >>>>>>>>> >>>>>>>>> Looking at line 117 in PetscGatherMessageLengths we find the >>>>>>>>> offending statement is the MPI_Isend: >>>>>>>>> >>>>>>>>> >>>>>>>>> /* Post the Isends with the message length-info */ >>>>>>>>> for (i=0,j=0; i<size; ++i) { >>>>>>>>> if (ilengths[i]) { >>>>>>>>> ierr = >>>>>>>>> MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr); >>>>>>>>> j++; >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> We have tried this with Intel MPI 2018, 2019, and mpich, all >>>>>>>>> giving the same problem. >>>>>>>>> >>>>>>>>> We suspect there is some limit being set on this cloud cluster on >>>>>>>>> the number of file connections or something, but we don’t know. >>>>>>>>> >>>>>>>>> Anyone have any ideas? We are sort of grasping for straws at this >>>>>>>>> point. >>>>>>>>> >>>>>>>>> Thanks, Randy M. >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>> >> >