Ok, thanks Junchao. So is GPU 0 actually allocating memory for the meshes of all 
8 MPI processes, but only doing compute work for 2 of them?
The nvidia-smi output above shows it has allocated about 2.4 GB.
Best,
Marcos
________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Monday, August 21, 2023 3:29 PM
To: Vanella, Marcos (Fed) <marcos.vane...@nist.gov>
Cc: PETSc users list <petsc-users@mcs.anl.gov>; Guan, Collin X. (Fed) 
<collin.g...@nist.gov>
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi 
processes and 1 GPU

Hi, Marcos,
  If you look at the PIDs in the nvidia-smi output, you will only find 8 unique 
PIDs, which is expected since you allocated 8 MPI ranks per node.
  The duplicate PIDs are usually threads spawned by the MPI runtime (for 
example, progress threads in the MPI implementation), so your job script and 
output are all good.
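  For example, one quick way to confirm the unique PID count (just a check you 
could run; the exact query flags below are one of several ways to do it) is:

# count unique compute-process PIDs across all GPUs on the node
nvidia-smi --query-compute-apps=pid --format=csv,noheader | sort -u | wc -l

  On your node this should print 8, one PID per MPI rank.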

  Thanks.

On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) 
<marcos.vane...@nist.gov<mailto:marcos.vane...@nist.gov>> wrote:
Hi Junchao, something I'm noticing when running with the CUDA-enabled linear 
solvers (CG+HYPRE, CG+GAMG) is that for multi-CPU, multi-GPU calculations, GPU 0 
on the node seems to be taking the sub-matrices corresponding to all of the MPI 
processes on the node. This is the result of the nvidia-smi command on a node 
with 8 MPI processes (each advancing the same number of unknowns in the 
calculation) and 4 V100 GPUs:

Mon Aug 21 14:36:07 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000004:04:00.0 Off |                    0 |
| N/A   34C    P0              63W / 300W |   2488MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0              56W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  | 00000035:03:00.0 Off |                    0 |
| N/A   35C    P0              52W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           On  | 00000035:04:00.0 Off |                    0 |
| N/A   38C    P0              53W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    214626      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    0   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214630      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    0   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    0   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      308MiB |
|    1   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    1   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    2   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    2   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    3   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
|    3   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux      318MiB |
+---------------------------------------------------------------------------------------+


You can see that GPU 0 is connected to all 8 MPI processes, each taking about 
300 MiB on it, whereas GPUs 1, 2 and 3 are each working with 2 MPI processes. 
I'm wondering if this is expected, or if there are changes I need to make to my 
submission script or runtime parameters (I sketch one possible change of that 
kind right after the script below, just to illustrate what I mean).
This is the script in this case (2 nodes, 8 MPI processes/node, 4 GPUs/node):

#!/bin/bash
# ../../Utilities/Scripts/qfds.sh -p 2  -T db -d test.fds
#SBATCH -J test
#SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
#SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
#SBATCH --partition=gpu
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=1
# modules
module load cuda/11.7
module load gcc/11.2.1/toolset
module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7

cd /home/mnv/Firemodels_fork/fds/Issues/PETSc

srun -N 2 -n 16 /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda
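
Just to show the kind of change I had in mind (a sketch only, not something I'm 
running yet; the wrapper name gpu_bind.sh and the rank-to-GPU modulo mapping are 
my own guesses), I could launch each rank through a small wrapper that restricts 
it to one of the node's 4 GPUs using SLURM_LOCALID:

#!/bin/bash
# gpu_bind.sh (hypothetical wrapper, not part of the run above): pin each local
# MPI rank to one of the node's 4 GPUs, then exec the real command and its
# arguments. Needs chmod +x before use.
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
exec "$@"

and then call srun through it:

srun -N 2 -n 16 ./gpu_bind.sh /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda

I believe newer Slurm versions also have an srun --gpu-bind option that might do 
something similar, but I haven't tried it. I don't know if anything like this is 
needed, or if PETSc and the MPI runtime are expected to handle the assignment on 
their own.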

Thank you for the advice,
Marcos


