Dear Sandra,

Unfortunately, PW still doesn't predict the amount of GPU memory that will be used during the simulation, but the RAM estimate is also a reasonable guess for the GPU memory.

The error message that you see is actually a failed allocation on the GPU side, not in host RAM.

Even if you had NVIDIA cards with 16 GB of memory, the prediction, which in your case reads

Estimated max dynamical RAM per process >      13.87 GB

is, as you can see, generally an underestimate; I wouldn't be surprised by a 15-20% inaccuracy. Adding 15-20% to 13.87 GB gives roughly 16-16.6 GB per process, which would not fit even on a 16 GB card.
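
If you want to confirm this on your side, a quick check (assuming nvidia-smi is available on the GPU node) is to watch the device memory while pw.x runs:

               # report total, used and free memory for each GPU (values in MiB)
               nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free --format=csv

You should see the used memory climb towards the card's capacity right before the failed allocation.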

Hope this helps,
best,
Pietro

--
Pietro Bonfà
Department of Mathematical, Physical and Computer Sciences
University of Parma


On 1/26/21 12:58 PM, Romero Molina, Sandra wrote:
Dear Community,

I have compiled Quantum ESPRESSO (Program PWSCF v.6.7MaX) for GPU acceleration (hybrid MPI/OpenMP) with the following options:

               module load compiler/intel/2020.1
               module load hpc_sdk/20.9
               ./configure F90=pgf90 CC=pgcc MPIF90=mpif90 --with-cuda=yes --enable-cuda-env-check=no --with-cuda-runtime=11.0 --with-cuda-cc=70 --enable-openmp BLAS_LIBS='-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core'
               make -j8 pw

The compilation apparently ends successfully, and then I execute the program:

               module load compiler/intel/2020.1
               module load hpc_sdk/20.9
               export OMP_NUM_THREADS=1
               mpirun -n 2 /home/my_user/q-e-gpu-qe-gpu-6.7/bin/pw.x < silverslab32.in > silver4.out

Then the program starts and outputs:

     Parallel version (MPI & OpenMP), running on       8 processor cores
     Number of MPI processes:                 2
     Threads/MPI process:                     4
     ...
     GPU acceleration is ACTIVE
     ...
     Estimated max dynamical RAM per process >      13.87 GB
     Estimated total dynamical RAM >      27.75 GB

But after about 2 minutes of execution the job ends with this error:

0: ALLOCATE: 4345479360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 4345482096 bytes requested; status = 2(out of memory)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

   Process name: [[47946,1],1]
   Exit code:    127
--------------------------------------------------------------------------

This node has > 180 GB of available RAM. According to the top command, this is the memory consumption:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 89681 my_user   20   0   30.1g   3.6g   2.1g R 100.0  1.9   1:39.45 pw.x
 89682 my_user   20   0   29.8g   3.2g   2.0g R 100.0  1.7   1:39.30 pw.x

When the RES memory reaches about 4 GB, the processes stop and the error above is displayed.

These are the characteristics of the node:

(base) [my_user@gpu001]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 95313 MB
node 0 free: 41972 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 96746 MB
node 1 free: 70751 MB
node distances:
node   0   1
   0:  10  21
   1:  21  10

(base) [my_user@gpu001]$ free -lm
              total        used        free      shared  buff/cache   available
Mem:         192059        2561      112716         260       76781      188505
Low:         192059       79342      112716
High:             0           0           0
Swap:          8191           0        8191

(base) [my_user@gpu001]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 768049
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The MPI version is Open MPI 3.1.5.

This node is a compute node in a cluster, but whether I submit the job with SLURM or run it directly on the node, the error is the same.

Note that I compile on the login node and run on this GPU node; the difference is that the login node has no GPU attached.

I would really appreciate it if you could help me figure out what could be going on.

Thank you.

M.Sc. Sandra Romero Molina
Ph.D. student
Computational Biochemistry

T03 R01 D48
Faculty of Biology
University of Duisburg-Essen
Universitätsstr. 2, 45117 Essen
email: [email protected]

Phone: +49 176 2341 8772
ORCID: https://orcid.org/0000-0002-4990-1649


_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users
