Thanks we’ll try and report back.

Randy M.

> On Apr 13, 2020, at 8:53 AM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
> 
> Randy,
>    Someone reported a similar problem before. It turned out to be an Intel 
> MPI MPI_Allreduce bug. A workaround is setting the environment variable 
> I_MPI_ADJUST_ALLREDUCE=1.
>    But you mentioned MPICH also had the error, so maybe the problem is not 
> the same. Let's try the workaround first. If it doesn't work, add the PETSc 
> option -build_twosided allreduce, which is a workaround for Intel 
> MPI_Ibarrier bugs we have met.
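> 
>    A minimal standalone sketch (not from PETSc, just an illustration) that 
> exercises MPI_Allreduce; running it at the same scale with and without 
> I_MPI_ADJUST_ALLREDUCE=1 could show whether the collective itself misbehaves:
> 
>   #include <mpi.h>
>   #include <stdio.h>
> 
>   int main(int argc,char **argv)
>   {
>     int rank,size,one = 1,sum = 0;
>     MPI_Init(&argc,&argv);
>     MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>     MPI_Comm_size(MPI_COMM_WORLD,&size);
>     /* Every rank contributes 1, so the reduced value must equal the communicator size */
>     MPI_Allreduce(&one,&sum,1,MPI_INT,MPI_SUM,MPI_COMM_WORLD);
>     if (rank == 0) printf("Allreduce result %d, expected %d\n",sum,size);
>     MPI_Finalize();
>     return 0;
>   }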
>    Thanks.
> --Junchao Zhang
> 
> 
> On Mon, Apr 13, 2020 at 10:38 AM Randall Mackie <rlmackie...@gmail.com> wrote:
> Dear PETSc users,
> 
> We are trying to understand an issue that has come up in running our code on 
> a large cloud cluster with a large number of processes and subcomms.
> This is code that we use daily on multiple clusters without problems, and 
> that runs valgrind clean for small test problems.
> 
> The run generates the following messages but doesn’t crash; it just seems to 
> hang, with all processes continuing to show activity:
> 
> [492]PETSC ERROR: #1 PetscGatherMessageLengths() line 117 in 
> /mnt/home/cgg/PETSc/petsc-3.12.4/src/sys/utils/mpimesg.c
> [492]PETSC ERROR: #2 VecScatterSetUp_SF() line 658 in 
> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/impls/sf/vscatsf.c
> [492]PETSC ERROR: #3 VecScatterSetUp() line 209 in 
> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscatfce.c
> [492]PETSC ERROR: #4 VecScatterCreate() line 282 in 
> /mnt/home/cgg/PETSc/petsc-3.12.4/src/vec/vscat/interface/vscreate.c
> 
> 
> Looking at line 117 in PetscGatherMessageLengths, we find that the offending 
> statement is the MPI_Isend:
> 
>   /* Post the Isends with the message length-info */
>   for (i=0,j=0; i<size; ++i) {
>     if (ilengths[i]) {
>       ierr = MPI_Isend((void*)(ilengths+i),1,MPI_INT,i,tag,comm,s_waits+j);CHKERRQ(ierr);
>       j++;
>     }
>   }
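> 
> A standalone sketch (not PETSc code, just an illustration) of the same 
> pattern, which could be run outside PETSc to check whether the MPI stack 
> handles this many single-int Isends and Irecvs at the same process count:
> 
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <stdlib.h>
> 
>   int main(int argc,char **argv)
>   {
>     int         i,rank,size,tag = 99;
>     int         *sendlen,*recvlen;
>     MPI_Request *reqs;
> 
>     MPI_Init(&argc,&argv);
>     MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>     MPI_Comm_size(MPI_COMM_WORLD,&size);
>     sendlen = (int*)malloc(size*sizeof(int));
>     recvlen = (int*)malloc(size*sizeof(int));
>     reqs    = (MPI_Request*)malloc(2*size*sizeof(MPI_Request));
>     /* Post a receive from every rank, then send one int to every rank,
>        mimicking the message-length exchange that appears to hang in PETSc */
>     for (i=0; i<size; i++) {
>       MPI_Irecv(recvlen+i,1,MPI_INT,i,tag,MPI_COMM_WORLD,reqs+i);
>     }
>     for (i=0; i<size; i++) {
>       sendlen[i] = rank;
>       MPI_Isend(sendlen+i,1,MPI_INT,i,tag,MPI_COMM_WORLD,reqs+size+i);
>     }
>     MPI_Waitall(2*size,reqs,MPI_STATUSES_IGNORE);
>     if (rank == 0) printf("All %d sends and receives completed\n",size);
>     free(sendlen); free(recvlen); free(reqs);
>     MPI_Finalize();
>     return 0;
>   }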
> 
> We have tried this with Intel MPI 2018, Intel MPI 2019, and MPICH, all giving 
> the same problem.
> 
> We suspect there is some limit being set on this cloud cluster, perhaps on 
> the number of open files or connections per process, but we don’t know.
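> 
> One way to check the suspected limit: a small sketch (not part of our code) 
> that prints the open-file-descriptor limit, which also bounds how many 
> sockets a rank can open; launching it with the same mpirun shows what the 
> ranks actually get on the cluster nodes:
> 
>   #include <stdio.h>
>   #include <sys/resource.h>
> 
>   int main(void)
>   {
>     struct rlimit rl;
>     /* Soft/hard limits on open file descriptors for this process */
>     if (getrlimit(RLIMIT_NOFILE,&rl) == 0) {
>       printf("RLIMIT_NOFILE: soft %llu, hard %llu\n",
>              (unsigned long long)rl.rlim_cur,(unsigned long long)rl.rlim_max);
>     }
>     return 0;
>   }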
> 
> Anyone have any ideas? We are sort of grasping at straws at this point.
> 
> Thanks, Randy M.
