Hi Junchao,

I was able to create a small test code that reproduces the issue we have been having; it is attached to this email in a zip file. Included are the test.F90 source, the commands to reproduce both the crash and a successful run, the output errors, and our PETSc configuration.

Our findings to date:

- The error is reproducible in a very short time with this script.
- It is related to nproc*nsubs and, to a lesser extent, to the DM grid size.
- It happens regardless of MPI implementation (MPICH, Intel MPI 2018/2019, Open MPI) or compiler (gfortran/gcc, Intel 2018).
- Changing -vecscatter_type to mpi1 or mpi3 has no effect; mpi1 seems to raise the failure threshold slightly, but the run still fails on the full machine set.
- Nothing looks interesting under valgrind.

Our initial tests were carried out on an Azure cluster, but we also tested on our smaller cluster, where we found the following:

Works:

  $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 1280 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100

Crashes (this works on Azure):

  $PETSC_DIR/lib/petsc/bin/petscmpiexec -n 2560 -hostfile hostfile ./test -nsubs 80 -nx 100 -ny 100 -nz 100

So it looks like it may also be related to the physical number of nodes. In any case, even with 2560 processes on 192 cores the memory usage does not go above 3.5 GBytes, so you don't need a huge cluster to test this.

Thanks,

Randy M.
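P.S. For anyone reading along without the attachment, the test is schematically a 3-D DMDA of size nx*ny*nz plus nsubs VecScatters from the DM global vector to small sequential vectors. The sketch below is a simplified stand-in, not the attached test.F90 itself; the subset size ns and the stride index sets are placeholders (the real code is in the zip):

      program main
#include <petsc/finclude/petsc.h>
      use petsc
      implicit none

      DM                da
      Vec               gvec, svec
      IS                is
      VecScatter        scat
      PetscInt          nx, ny, nz, nsubs, ns, i, one
      PetscBool         flg
      PetscErrorCode    ierr

      call PetscInitialize(PETSC_NULL_CHARACTER, ierr)

      nx = 100; ny = 100; nz = 100; nsubs = 80; one = 1
      call PetscOptionsGetInt(PETSC_NULL_OPTIONS, PETSC_NULL_CHARACTER, '-nx', nx, flg, ierr)
      call PetscOptionsGetInt(PETSC_NULL_OPTIONS, PETSC_NULL_CHARACTER, '-ny', ny, flg, ierr)
      call PetscOptionsGetInt(PETSC_NULL_OPTIONS, PETSC_NULL_CHARACTER, '-nz', nz, flg, ierr)
      call PetscOptionsGetInt(PETSC_NULL_OPTIONS, PETSC_NULL_CHARACTER, '-nsubs', nsubs, flg, ierr)

      ! 3-D structured grid distributed across all ranks
      call DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, &
           DM_BOUNDARY_NONE, DMDA_STENCIL_STAR, nx, ny, nz, &
           PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, one, one, &
           PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, da, ierr)
      call DMSetUp(da, ierr)
      call DMCreateGlobalVector(da, gvec, ierr)

      ! nsubs scatters from the DM global vector to sequential vectors;
      ! ns and the stride ISs below are placeholders for the real subsets
      ns = 1000
      do i = 1, nsubs
         call ISCreateStride(PETSC_COMM_WORLD, ns, (i-1)*ns, one, is, ierr)
         call VecCreateSeq(PETSC_COMM_SELF, ns, svec, ierr)
         call VecScatterCreate(gvec, is, svec, PETSC_NULL_IS, scat, ierr)
         call VecScatterDestroy(scat, ierr)
         call VecDestroy(svec, ierr)
         call ISDestroy(is, ierr)
      end do

      call VecDestroy(gvec, ierr)
      call DMDestroy(da, ierr)
      call PetscFinalize(ierr)
      end program main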
<<attachment: to_petsc-users.zip>>