Thanks, we'll try it and report back.

Randy M.
> On Apr 13, 2020, at 8:53 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> Randy,
> Someone reported a similar problem before. It turned out to be an Intel MPI MPI_Allreduce bug. A workaround is setting the environment variable I_MPI_ADJUST_ALLREDUCE=1.
> But you mentioned MPICH also had the error, so maybe the problem is not the same. Let's try the workaround first. If it doesn't work, add another PETSc option, -build_twosided allreduce, which is a workaround for Intel MPI_Ibarrier bugs we met.
> Thanks.
> --Junchao Zhang
>
>
> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie...@gmail.com> wrote:
> Dear PETSc users,
>
> We are trying to understand an issue that has come up in running our code on a large cloud cluster with a large number of processes and subcomms.
> This is code that we use daily on multiple clusters without problems, and that runs valgrind clean for small test problems.
>
> The run generates the following messages, but doesn't crash; it just seems to hang, with all processes continuing to show activity:
>
> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
>
> Looking at line 117 in PetscGatherMessageLengths, we find the offending statement is the MPI_Isend:
>
> /* Post the Isends with the message length-info */
> for (i=0,j=0; i<size; ++i) {
>   if (ilengths[i]) {
>     ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>     j++;
>   }
> }
>
> We have tried this with Intel MPI 2018, Intel MPI 2019, and MPICH, all giving the same problem.
>
> We suspect there is some limit being set on this cloud cluster on the number of file connections or something, but we don't know.
>
> Anyone have any ideas? We are sort of grasping at straws at this point.
>
> Thanks, Randy M.
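
For reference, a minimal sketch (not from the original thread) of how the two suggested workarounds could be applied from inside the application rather than the job script. It assumes Intel MPI reads I_MPI_ADJUST_ALLREDUCE from each rank's environment at MPI_Init time, and that PETSc consults -build_twosided the first time it builds a two-sided communication pattern, i.e. before the first VecScatterSetUp:

    #include <stdlib.h>
    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;

      /* Intel MPI Allreduce workaround: select algorithm 1. Normally this
         variable would be exported in the job script or passed via the MPI
         launcher instead of being set here. */
      setenv("I_MPI_ADJUST_ALLREDUCE", "1", 1);

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

      /* Equivalent to passing -build_twosided allreduce on the command line,
         avoiding the MPI_Ibarrier-based rendezvous. */
      ierr = PetscOptionsSetValue(NULL, "-build_twosided", "allreduce");CHKERRQ(ierr);

      /* ... create subcomms, VecScatters, etc. ... */

      ierr = PetscFinalize();
      return ierr;
    }

Exporting the environment variable and passing the option on the command line, as described above, is the simpler route; the sketch only shows the equivalent in-code calls.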