Hi Mark,

Thanks for the information.
@Junchao: Given that there are known issues with GPU-aware MPI, it might be best to wait until there is an updated version of cray-mpich (which hopefully contains the relevant fixes).

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Mark Adams <mfad...@lbl.gov>
Sent: Thursday, February 10, 2022 8:47 PM
To: Junchao Zhang <junchao.zh...@gmail.com>
Cc: Sajid Ali Syed <sas...@fnal.gov>; petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Perlmutter has problems with GPU-aware MPI. This is being actively worked on at NERSC.

Mark

On Thu, Feb 10, 2022 at 9:22 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:

Hi, Sajid Ali,

I have no clue, but I have access to Perlmutter and am thinking about how to debug this. If your app is open source and easy to build, I can build and debug it myself. Otherwise, if you build and install PETSc (with only the options your app needs) to a shared directory where I can access your executable (which uses RPATH for its libraries), then maybe I can debug it there (I would only need to install my own PETSc to that shared directory).

--Junchao Zhang

On Thu, Feb 10, 2022 at 6:04 PM Sajid Ali Syed <sas...@fnal.gov> wrote:

Hi Junchao,

With "-use_gpu_aware_mpi 0" there is no error. I'm attaching the log for this case with this email.

I also ran with GPU-aware MPI to see if I could reproduce the error, and got the error again, but from a different location. This logfile is also attached. This was with the newest cray-mpich (8.1.12) on NERSC Perlmutter.

Let me know if I can share further information to help with debugging this.
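For reference, no rebuild is needed to apply the workaround, since "-use_gpu_aware_mpi 0" is a PETSc runtime option. A minimal job-script sketch, assuming a hypothetical executable name `./poisson3d` and illustrative Slurm parameters (none of these specifics are taken from this thread):

```shell
#!/bin/bash
#SBATCH --nodes=8                 # illustrative; match your actual job size
#SBATCH --ntasks-per-node=8
#SBATCH --constraint=gpu

# Ask PETSc to stage GPU buffers through host memory instead of handing
# device pointers to MPI, sidestepping the GTL/cuIpcOpenMemHandle failure.
srun ./poisson3d -use_gpu_aware_mpi 0 -log_view
```

(With cray-mpich, GPU-aware transfers are, to my understanding, additionally gated by the MPICH_GPU_SUPPORT_ENABLED environment variable, which may be worth checking in the submit script.)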
Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Thursday, February 10, 2022 1:43 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Also, try "-use_gpu_aware_mpi 0" to see if there is a difference.

--Junchao Zhang

On Thu, Feb 10, 2022 at 1:40 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:

Did it fail without GPUs at 64 MPI ranks?

--Junchao Zhang

On Thu, Feb 10, 2022 at 1:22 PM Sajid Ali Syed <sas...@fnal.gov> wrote:

Hi PETSc-developers,

I'm seeing the following crash during the setup phase of the preconditioner when using multiple GPUs. The relevant error trace is shown below:

(GTL DEBUG: 26) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
[24]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[24]PETSC ERROR: General MPI error
[24]PETSC ERROR: MPI error 1 Invalid buffer pointer
[24]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[24]PETSC ERROR: Petsc Development GIT revision: f351d5494b5462f62c419e00645ac2e477b88cae GIT Date: 2022-02-08 15:08:19 +0000
...
[24]PETSC ERROR: #1 PetscSFLinkWaitRequests_MPI() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfmpi.c:54
[24]PETSC ERROR: #2 PetscSFLinkFinishCommunication() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/../src/vec/is/sf/impls/basic/sfpack.h:274
[24]PETSC ERROR: #3 PetscSFBcastEnd_Basic() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfbasic.c:218
[24]PETSC ERROR: #4 PetscSFBcastEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/sf.c:1499
[24]PETSC ERROR: #5 VecScatterEnd_Internal() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:87
[24]PETSC ERROR: #6 VecScatterEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:1366
[24]PETSC ERROR: #7 MatMult_MPIAIJCUSPARSE() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:302
[24]PETSC ERROR: #8 MatMult() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/interface/matrix.c:2438
[24]PETSC ERROR: #9 PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:730
[24]PETSC ERROR: #10 KSP_PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/petsc/private/kspimpl.h:421
[24]PETSC ERROR: #11 KSPGMRESCycle() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:162
[24]PETSC ERROR: #12 KSPSolve_GMRES() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:247
[24]PETSC ERROR: #13 KSPSolve_Private() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:925
[24]PETSC ERROR: #14 KSPSolve() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:1103
[24]PETSC ERROR: #15 PCGAMGOptProlongator_AGG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/agg.c:1127
[24]PETSC ERROR: #16 PCSetUp_GAMG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/gamg.c:626
[24]PETSC ERROR: #17 PCSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:1017
[24]PETSC ERROR: #18 KSPSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:417
[24]PETSC ERROR: #19 main() at poisson3d.c:69
[24]PETSC ERROR: PETSc Option Table entries:
[24]PETSC ERROR: -dm_mat_type aijcusparse
[24]PETSC ERROR: -dm_vec_type cuda
[24]PETSC ERROR: -ksp_monitor
[24]PETSC ERROR: -ksp_norm_type unpreconditioned
[24]PETSC ERROR: -ksp_type cg
[24]PETSC ERROR: -ksp_view
[24]PETSC ERROR: -log_view
[24]PETSC ERROR: -mg_levels_esteig_ksp_type cg
[24]PETSC ERROR: -mg_levels_ksp_type chebyshev
[24]PETSC ERROR: -mg_levels_pc_type jacobi
[24]PETSC ERROR: -pc_gamg_agg_nsmooths 1
[24]PETSC ERROR: -pc_gamg_square_graph 1
[24]PETSC ERROR: -pc_gamg_threshold 0.0
[24]PETSC ERROR: -pc_gamg_threshold_scale 0.0
[24]PETSC ERROR: -pc_gamg_type agg
[24]PETSC ERROR: -pc_type gamg
[24]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-ma...@mcs.anl.gov----------

Attached with this email are the full error log and the submit script for an 8-node/64-GPU/64-MPI-rank job. I'll also note that the same program did not crash when using either 2 or 4 nodes (with 8 and 16 GPUs/MPI ranks respectively); those logs are attached as well in case they help.

Could someone let me know what this error means and what can be done to prevent it?

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io
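For readers reconstructing the failing configuration, the option table in the error log above corresponds to a launch along these lines. This is a sketch only: the executable name `./poisson3d` and the srun arguments are assumptions; only the PETSc options are taken from the log.

```shell
# Assumed launch line; the PETSc options below are copied from the
# error log's option table (CG + GAMG with Chebyshev/Jacobi smoothing,
# with matrices and vectors placed on the GPU via cuSPARSE/CUDA types).
srun -n 64 ./poisson3d \
  -dm_mat_type aijcusparse -dm_vec_type cuda \
  -ksp_type cg -ksp_norm_type unpreconditioned -ksp_monitor -ksp_view \
  -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 \
  -pc_gamg_square_graph 1 -pc_gamg_threshold 0.0 -pc_gamg_threshold_scale 0.0 \
  -mg_levels_ksp_type chebyshev -mg_levels_esteig_ksp_type cg \
  -mg_levels_pc_type jacobi \
  -log_view
```

Note that the crash occurs inside the GAMG setup itself (frame #15, PCGAMGOptProlongator_AGG, runs a GMRES solve to estimate eigenvalues), which is why a GMRES stack appears even though the outer KSP is CG.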