Hi Mark,

Thanks for the information.

@Junchao: Given that there are known issues with GPU-aware MPI, it might be 
best to wait until there is an updated version of cray-mpich (which hopefully 
contains the relevant fixes).

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Mark Adams <mfad...@lbl.gov>
Sent: Thursday, February 10, 2022 8:47 PM
To: Junchao Zhang <junchao.zh...@gmail.com>
Cc: Sajid Ali Syed <sas...@fnal.gov>; petsc-users@mcs.anl.gov 
<petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Perlmutter has problems with GPU-aware MPI.
This is being actively worked on at NERSC.

Mark

On Thu, Feb 10, 2022 at 9:22 PM Junchao Zhang 
<junchao.zh...@gmail.com> wrote:
Hi, Sajid Ali,
  I have no clue. I have access to Perlmutter, and I am thinking about how to debug this.
  If your app is open source and easy to build, then I can build and debug it. 
Otherwise, if you build and install petsc (only with the options needed by your 
app) to a shared directory and I can access your executable (which uses RPATH 
for its libraries), then maybe I can debug it (I would only need to install my own 
petsc to the shared directory).

--Junchao Zhang


On Thu, Feb 10, 2022 at 6:04 PM Sajid Ali Syed 
<sas...@fnal.gov> wrote:
Hi Junchao,

With "-use_gpu_aware_mpi 0" there is no error. I'm attaching the log for this 
case with this email.

I also ran with gpu aware mpi to see if I could reproduce the error and got the 
error but from a different location. This logfile is also attached.

This was using the newest cray-mpich on NERSC-perlmutter (8.1.12). Let me know 
if I can share further information to help with debugging this.

Thank You,
Sajid Ali (he/him) | Research Associate
Scientific Computing Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Thursday, February 10, 2022 1:43 PM
To: Sajid Ali Syed <sas...@fnal.gov>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] GAMG crash during setup when using multiple GPUs

Also, try "-use_gpu_aware_mpi 0" to see if there is a difference.

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:40 PM Junchao Zhang 
<junchao.zh...@gmail.com> wrote:
Did it fail without GPU at 64 MPI ranks?

--Junchao Zhang


On Thu, Feb 10, 2022 at 1:22 PM Sajid Ali Syed 
<sas...@fnal.gov> wrote:

Hi PETSc-developers,

I'm seeing a crash during the setup phase of the preconditioner when using 
multiple GPUs. The relevant error trace is shown below:

(GTL DEBUG: 26) cuIpcOpenMemHandle: resource already mapped, 
CUDA_ERROR_ALREADY_MAPPED, line no 272
[24]PETSC ERROR: --------------------- Error Message 
--------------------------------------------------------------
[24]PETSC ERROR: General MPI error
[24]PETSC ERROR: MPI error 1 Invalid buffer pointer
[24]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[24]PETSC ERROR: Petsc Development GIT revision: 
f351d5494b5462f62c419e00645ac2e477b88cae  GIT Date: 2022-02-08 15:08:19 +0000
...
[24]PETSC ERROR: #1 PetscSFLinkWaitRequests_MPI() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfmpi.c:54
[24]PETSC ERROR: #2 PetscSFLinkFinishCommunication() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/../src/vec/is/sf/impls/basic/sfpack.h:274
[24]PETSC ERROR: #3 PetscSFBcastEnd_Basic() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfbasic.c:218
[24]PETSC ERROR: #4 PetscSFBcastEnd() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/sf.c:1499
[24]PETSC ERROR: #5 VecScatterEnd_Internal() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:87
[24]PETSC ERROR: #6 VecScatterEnd() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:1366
[24]PETSC ERROR: #7 MatMult_MPIAIJCUSPARSE() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:302
[24]PETSC ERROR: #8 MatMult() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/interface/matrix.c:2438
[24]PETSC ERROR: #9 PCApplyBAorAB() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:730
[24]PETSC ERROR: #10 KSP_PCApplyBAorAB() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/petsc/private/kspimpl.h:421
[24]PETSC ERROR: #11 KSPGMRESCycle() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:162
[24]PETSC ERROR: #12 KSPSolve_GMRES() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:247
[24]PETSC ERROR: #13 KSPSolve_Private() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:925
[24]PETSC ERROR: #14 KSPSolve() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:1103
[24]PETSC ERROR: #15 PCGAMGOptProlongator_AGG() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/agg.c:1127
[24]PETSC ERROR: #16 PCSetUp_GAMG() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/gamg.c:626
[24]PETSC ERROR: #17 PCSetUp() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:1017
[24]PETSC ERROR: #18 KSPSetUp() at 
/tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:417
[24]PETSC ERROR: #19 main() at poisson3d.c:69
[24]PETSC ERROR: PETSc Option Table entries:
[24]PETSC ERROR: -dm_mat_type aijcusparse
[24]PETSC ERROR: -dm_vec_type cuda
[24]PETSC ERROR: -ksp_monitor
[24]PETSC ERROR: -ksp_norm_type unpreconditioned
[24]PETSC ERROR: -ksp_type cg
[24]PETSC ERROR: -ksp_view
[24]PETSC ERROR: -log_view
[24]PETSC ERROR: -mg_levels_esteig_ksp_type cg
[24]PETSC ERROR: -mg_levels_ksp_type chebyshev
[24]PETSC ERROR: -mg_levels_pc_type jacobi
[24]PETSC ERROR: -pc_gamg_agg_nsmooths 1
[24]PETSC ERROR: -pc_gamg_square_graph 1
[24]PETSC ERROR: -pc_gamg_threshold 0.0
[24]PETSC ERROR: -pc_gamg_threshold_scale 0.0
[24]PETSC ERROR: -pc_gamg_type agg
[24]PETSC ERROR: -pc_type gamg
[24]PETSC ERROR: ----------------End of Error Message -------send entire error 
message to petsc-ma...@mcs.anl.gov----------


Attached to this email are the full error log and the submit script for an 
8-node/64-GPU/64-MPI-rank job. I'll also note that the same program did not 
crash when using either 2 or 4 nodes (with 8 and 16 GPUs/MPI ranks, respectively), 
and I can attach those logs as well if that helps. Could someone let me know what 
this error means and what can be done to prevent it?
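
For context, the driver is essentially a standard DMDA-based 3D Poisson solve. Below 
is a minimal sketch of that kind of setup (illustrative only, not the actual 
poisson3d.c; the grid size, right-hand side, and assembly details here are 
assumptions), with the matrix/vector types and the solver configuration all coming 
from the command-line options listed in the error message above:

/* Illustrative sketch of a DMDA-based 3D Poisson driver of the kind described
   above; NOT the actual poisson3d.c (grid size and assembly are assumed). */
#include <petscdmda.h>
#include <petscksp.h>

int main(int argc, char **argv)
{
  DM             da;
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscInt       i, j, k, xs, ys, zs, xm, ym, zm, M, N, P;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Structured 3D grid; -dm_mat_type aijcusparse and -dm_vec_type cuda are
     picked up in DMSetFromOptions() */
  ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 128, 128, 128, PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetFromOptions(da);CHKERRQ(ierr);
  ierr = DMSetUp(da);CHKERRQ(ierr);
  ierr = DMDAGetInfo(da, NULL, &M, &N, &P, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL);CHKERRQ(ierr);

  /* Assemble a 7-point Laplacian and a constant right-hand side */
  ierr = DMCreateMatrix(da, &A);CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(da, &b);CHKERRQ(ierr);
  ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
  ierr = DMDAGetCorners(da, &xs, &ys, &zs, &xm, &ym, &zm);CHKERRQ(ierr);
  for (k = zs; k < zs + zm; k++) {
    for (j = ys; j < ys + ym; j++) {
      for (i = xs; i < xs + xm; i++) {
        MatStencil  row = {0}, col[7];
        PetscScalar v[7];
        PetscInt    n = 0;
        row.i = i; row.j = j; row.k = k;
        col[n] = row; v[n] = 6.0; n++;
        if (i > 0)     { col[n] = row; col[n].i = i - 1; v[n] = -1.0; n++; }
        if (i < M - 1) { col[n] = row; col[n].i = i + 1; v[n] = -1.0; n++; }
        if (j > 0)     { col[n] = row; col[n].j = j - 1; v[n] = -1.0; n++; }
        if (j < N - 1) { col[n] = row; col[n].j = j + 1; v[n] = -1.0; n++; }
        if (k > 0)     { col[n] = row; col[n].k = k - 1; v[n] = -1.0; n++; }
        if (k < P - 1) { col[n] = row; col[n].k = k + 1; v[n] = -1.0; n++; }
        ierr = MatSetValuesStencil(A, 1, &row, n, col, v, INSERT_VALUES);CHKERRQ(ierr);
      }
    }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  /* -ksp_type cg, -pc_type gamg, etc. are picked up in KSPSetFromOptions();
     the crash reported above occurs inside KSPSetUp() during GAMG setup */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running a sketch like this with the options from the table above (-dm_mat_type 
aijcusparse -dm_vec_type cuda -ksp_type cg -pc_type gamg ...) should exercise the 
same GAMG setup path that appears in the trace.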

Thank You,
Sajid Ali (he/him) | Research Associate

Scientific Computing Division

Fermi National Accelerator Laboratory

s-sajid-ali.github.io