Dear Sandra,

Unfortunately, PW still doesn't predict the amount of GPU memory that will be used during the simulation, but the RAM estimate is also a reasonable guess for the GPU memory.

The error message that you see is actually a failed allocation on the GPU side, not in host RAM.

Even if you had NVIDIA cards with 16 GB of memory, the prediction, which in your case reads

Estimated max dynamical RAM per process >      13.87 GB

is, as you can see, generally an underestimate; I wouldn't be surprised by a 15-20% inaccuracy. Adding 15-20% to 13.87 GB gives roughly 16-16.6 GB per process, which would not fit even on a 16 GB card.
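
If you want to confirm this on your side, a quick check (assuming nvidia-smi is available on the GPU node) is to watch the device memory while pw.x runs:

               # report total, used and free memory for each GPU (values in MiB)
               nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free --format=csv

You should see the used memory climb towards the card's capacity right before the failed allocation.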

Hope this helps,
best,
Pietro

--
Pietro Bonfà
Department of Mathematical, Physical and Computer Sciences
University of Parma


On 1/26/21 12:58 PM, Romero Molina, Sandra wrote:
Dear Community,

I have compiled Quantum ESPRESSO (Program PWSCF v.6.7MaX) for GPU acceleration (hybrid MPI/OpenMP) with the following options:

               module load compiler/intel/2020.1
               module load hpc_sdk/20.9
               ./configure F90=pgf90 CC=pgcc MPIF90=mpif90 --with-cuda=yes --enable-cuda-env-check=no --with-cuda-runtime=11.0 --with-cuda-cc=70 --enable-openmp BLAS_LIBS='-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core'
               make -j8 pw

The compilation apparently ends successfully, and then I execute the program:

               module load compiler/intel/2020.1
               module load hpc_sdk/20.9
               export OMP_NUM_THREADS=1
               mpirun -n 2 /home/my_user/q-e-gpu-qe-gpu-6.7/bin/pw.x < silverslab32.in > silver4.out

Then the program starts and outputs:

     Parallel version (MPI & OpenMP), running on       8 processor cores
     Number of MPI processes:                 2
     Threads/MPI process:                     4
     ...
     GPU acceleration is ACTIVE
     ...
     Estimated max dynamical RAM per process >      13.87 GB
     Estimated total dynamical RAM >      27.75 GB

But after about 2 minutes of execution the job ends with this error:

0: ALLOCATE: 4345479360 bytes requested; status = 2(out of memory)
0: ALLOCATE: 4345482096 bytes requested; status = 2(out of memory)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

   Process name: [[47946,1],1]
   Exit code:    127
--------------------------------------------------------------------------

This node has > 180 GB of available RAM. According to the top command, this is the memory consumption:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 89681 my_user   20   0   30.1g   3.6g   2.1g R 100.0  1.9   1:39.45 pw.x
 89682 my_user   20   0   29.8g   3.2g   2.0g R 100.0  1.7   1:39.30 pw.x

When the RES memory reaches about 4 GB, the processes stop and the error above is displayed.

These are the characteristics of the node:

(base) [my_user@gpu001]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 95313 MB
node 0 free: 41972 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 96746 MB
node 1 free: 70751 MB
node distances:
node   0   1
   0:  10  21
   1:  21  10

(base) [my_user@gpu001]$ free -lm
              total        used        free      shared  buff/cache   available
Mem:         192059        2561      112716         260       76781      188505
Low:         192059       79342      112716
High:             0           0           0
Swap:          8191           0        8191

(base) [my_user@gpu001]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 768049
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 100000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The MPI version is Open MPI 3.1.5.

This node is a compute node in a cluster, but whether I submit the job with SLURM or run it directly on the node, the error is the same.

Note that I compile on the login node and run on this GPU node; the difference is that the login node has no GPU attached.

I would really appreciate it if you could help me figure out what could be going on.

Thank you.

M.Sc. Sandra Romero Molina
Ph.D. student
Computational Biochemistry

T03 R01 D48
Faculty of Biology
University of Duisburg-Essen
Universitätsstr. 2, 45117 Essen
email: [email protected]

Phone: +49 176 2341 8772
ORCID: https://orcid.org/0000-0002-4990-1649


_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users
