Hi Junchao, thank you for replying. I compiled petsc in debug mode and this is what I get for the case:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x15264731ead0 in ???
#1  0x15264731dc35 in ???
#2  0x15264711551f in ???
#3  0x152647169a7c in ???
#4  0x152647115475 in ???
#5  0x1526470fb7f2 in ???
#6  0x152647678bbd in ???
#7  0x15264768424b in ???
#8  0x1526476842b6 in ???
#9  0x152647684517 in ???
#10 0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
#11 0x55bb46342ebb in _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_ at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
#12 0x55bb46342ebb in _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_ at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
#13 0x55bb46342ebb in _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_ at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
#14 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_ at /usr/local/cuda/include/thrust/detail/sort.inl:115
#15 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_ at /usr/local/cuda/include/thrust/detail/sort.inl:305
#16 0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
#17 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
#18 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
#19 0x55bb468e01cf in MatSetPreallocationCOO at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
#20 0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
#21 0x55bb469015e5 in MatProductSymbolic at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
#22 0x55bb4694ade2 in MatPtAP at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
#23 0x55bb4696d3ec in MatCoarsenApply_MISK_private at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
#24 0x55bb4696eb67 in MatCoarsenApply_MISK at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
#25 0x55bb4695bd91 in MatCoarsenApply at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
#26 0x55bb478294d8 in PCGAMGCoarsen_AGG at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
#27 0x55bb471d1cb4 in PCSetUp_GAMG at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
#28 0x55bb464022cf in PCSetUp at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
#29 0x55bb4718b8a7 in KSPSetUp at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
#30 0x55bb4718f22e in KSPSolve_Private at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
#31 0x55bb47192c0c in KSPSolve at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
#32 0x55bb463efd35 in kspsolve_ at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
#33 0x55bb45e94b32 in ???
#34 0x55bb46048044 in ???
#35 0x55bb46052ea1 in ???
#36 0x55bb45ac5f8e in ???
#37 0x1526470fcd8f in ???
#38 0x1526470fce3f in ???
#39 0x55bb45aef55d in ???
#40 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
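From the stack, the failure seems to happen inside thrust's sort_by_key during MatSetPreallocationCOO (frame #19) on the cusparse backend, reached from GAMG's PtAP within KSPSolve. In case it helps to separate PETSc from my application, below is a rough, minimal sketch (not my FDS code; the matrix size and entries are made up for illustration) of a standalone COO assembly on an mpiaijcusparse matrix that I could try on the same machine:

/* Hedged sketch: standalone COO assembly on the code path the backtrace goes
 * through (MatSetPreallocationCOO on an aijcusparse matrix). Illustrative only.
 * Run e.g.:  mpirun -n 2 ./coo_test -mat_type mpiaijcusparse -vec_type mpicuda   */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat          A;
  PetscInt     n = 8;                         /* tiny global size, made up */
  PetscInt    *coo_i, *coo_j;
  PetscScalar *val;
  PetscCount   ncoo;
  PetscMPIInt  rank;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));

  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetFromOptions(A));            /* picks up -mat_type mpiaijcusparse */

  /* rank 0 provides all entries as COO triplets; the COO preallocation
     ships off-process entries to their owning ranks */
  ncoo = (rank == 0) ? n : 0;
  PetscCall(PetscMalloc3(ncoo, &coo_i, ncoo, &coo_j, ncoo, &val));
  for (PetscCount k = 0; k < ncoo; k++) {
    coo_i[k] = (PetscInt)k;                   /* diagonal entries only */
    coo_j[k] = (PetscInt)k;
    val[k]   = 2.0;
  }

  PetscCall(MatSetPreallocationCOO(A, ncoo, coo_i, coo_j)); /* frame #19 in the stack */
  PetscCall(MatSetValuesCOO(A, val, INSERT_VALUES));
  PetscCall(MatView(A, PETSC_VIEWER_STDOUT_WORLD));

  PetscCall(PetscFree3(coo_i, coo_j, val));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

If a small case like this runs cleanly here, the problem is more likely in how my application feeds the matrix (or in my GPU setup) than in the COO path itself.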
BTW, I'm curious: if I set n MPI processes, each of them building a part of the linear system, and g GPUs, how does PETSc distribute those n pieces of the system matrix and RHS among the g GPUs? Does it use some load-balancing algorithm? Where can I read about this?
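Just to make the question concrete, this is the kind of round-robin mapping of node-local ranks to devices I have in mind; it is a generic MPI + CUDA pattern I sketched, not a claim about what PETSc actually does internally:

/* Hedged sketch: map node-local MPI ranks to the visible GPUs round-robin.
 * Generic pattern for illustration, not PETSc's internal policy.            */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Comm node_comm;
  int      world_rank, local_rank, ndev = 0, dev = -1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* ranks sharing a node get consecutive local ranks */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &local_rank);

  cudaGetDeviceCount(&ndev);
  if (ndev > 0) cudaSetDevice(local_rank % ndev);  /* n ranks share the g visible GPUs */
  cudaGetDevice(&dev);
  printf("world rank %d -> local rank %d -> GPU %d of %d\n", world_rank, local_rank, dev, ndev);

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}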
I can also point you to my code repo on GitHub if you want to take a closer look.

Thank you and best regards,
Marcos

________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Friday, August 11, 2023 10:52 AM
To: Vanella, Marcos (Fed) <marcos.vane...@nist.gov>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

Hi, Marcos,
  Could you build petsc in debug mode and then copy and paste the whole error stack message?
  Thanks
--Junchao Zhang

On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
Hi, I'm trying to run a parallel matrix-vector build and linear solve with PETSc on 2 MPI processes + one V100 GPU. I have verified that the matrix build and solve succeed on CPUs only. I'm using CUDA 11.5, CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with the GPU enabled I get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

I'm new to submitting jobs in slurm that also use GPU resources, so I might be doing something wrong in my submission script. This is it:

#!/bin/bash
#SBATCH -J test
#SBATCH -e /home/Issues/PETSc/test.err
#SBATCH -o /home/Issues/PETSc/test.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

export OMP_NUM_THREADS=1
module load cuda/11.5
module load openmpi/4.1.1

cd /home/Issues/PETSc
mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg

If anyone has any suggestions on how to troubleshoot this, please let me know. Thanks!
Marcos
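P.S. As a further sanity check outside of FDS, I sketched this small standalone Laplacian solve (sizes and the executable name are arbitrary) that can be launched with the same runtime flags, e.g. mpirun -n 2 ./ksp_test -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg, to see whether GAMG + cusparse fails independently of my code:

/* Hedged sketch: 1-D Laplacian assembled with MatSetValue and solved with KSP,
 * so the same -mat_type/-vec_type/-pc_type options from the submission script apply. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt n = 100, rstart, rend;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetFromOptions(A));                    /* -mat_type mpiaijcusparse */
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
  for (PetscInt i = rstart; i < rend; i++) {          /* standard 1-D Laplacian stencil */
    if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &b));                /* vectors compatible with A's type */
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));                  /* -pc_type gamg */
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}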