Hi,

I've run 30 tests with the -notunepme option, and one of them still failed with the same *cudaStreamSynchronize failed* error:
```
DD  step 1422999  vol min/aver 0.639  load imb.: force  1.1%  pme mesh/force 1.079

           Step           Time
        1423000     2846.00000

   Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.79755e+04    1.78943e+05    1.22798e+05    2.83835e+03   -9.19303e+02
          LJ-14     Coulomb-14        LJ (SR)   Coulomb (SR)   Coul. recip.
    2.56547e+04    5.11714e+05    9.77218e+03   -2.07148e+06    8.64504e+03
      Potential    Kinetic En.   Total Energy  Conserved En.    Temperature
    7.64126e+13    4.79398e+05    7.64126e+13    7.64126e+13    3.58009e+02
 Pressure (bar)   Constr. rmsd
   -6.03201e+01    4.56399e-06

-------------------------------------------------------
Program:     gmx mdrun, version 2019.4
Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
MPI rank:    2 (out of 8)

Fatal error:
cudaStreamSynchronize failed: an illegal memory access was encountered

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
```

Here is the command and the driver info:

```
Command line:
  gmx mdrun -v -s md_seed_fixed.tpr -deffnm md_seed_fixed -ntmpi 8 -pin on
  -nb gpu -ntomp 3 -pme gpu -pmefft gpu -notunepme -npme 1 -gputasks 00112233
  -maxh 2 -cpt 60 -cpi -noappend

GROMACS version:    2019.4
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.2
Tracing support:    disabled
C compiler:         /packages/7x/gcc/gcc-7.3.0/bin/gcc GNU 7.3.0
C compiler flags:   -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /packages/7x/gcc/gcc-7.3.0/bin/g++ GNU 7.3.0
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /packages/7x/cuda/9.2.88.1/bin/nvcc
                    nvcc: NVIDIA (R) Cuda compiler driver
                    Copyright (c) 2005-2018 NVIDIA Corporation
                    Built on Wed_Apr_11_23:16:29_CDT_2018
                    Cuda compilation tools, release 9.2, V9.2.88
CUDA compiler flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math; -mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        9.20
CUDA runtime:       9.20

Running on 1 node with total 24 cores, 24 logical cores, 4 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: Intel
    Brand:  Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
    Family: 6   Model: 79   Stepping: 1
    Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt intel
              lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse
              rdrnd rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
  Hardware topology: Full, with devices
    Sockets, cores, and logical processors:
      Socket 0: [ 0] [ 1] [ 2] [ 3] [ 4] [ 5] [ 6] [ 7] [ 8] [ 9] [ 10] [ 11]
      Socket 1: [ 12] [ 13] [ 14] [ 15] [ 16] [ 17] [ 18] [ 19] [ 20] [ 21] [ 22] [ 23]
    Numa nodes:
      Node 0 (34229563392 bytes mem): 0 1 2 3 4 5 6 7 8 9 10 11
      Node 1 (34359738368 bytes mem): 12 13 14 15 16 17 18 19 20 21 22 23
      Latency:
               0     1
         0  1.00  2.10
         1  2.10  1.00
    Caches:
      L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
      L3: 31457280 bytes, linesize 64 bytes, assoc. 20, shared 12 ways
    PCI devices:
      0000:01:00.0  Id: 15b3:1007  Class: 0x0200  Numa: 0
      0000:02:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 0
      0000:03:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 0
      0000:00:11.4  Id: 8086:8d62  Class: 0x0106  Numa: 0
      0000:06:00.0  Id: 1a03:2000  Class: 0x0300  Numa: 0
      0000:00:1f.2  Id: 8086:8d02  Class: 0x0106  Numa: 0
      0000:81:00.0  Id: 8086:1521  Class: 0x0200  Numa: 1
      0000:81:00.1  Id: 8086:1521  Class: 0x0200  Numa: 1
      0000:82:00.0  Id: 15b3:1007  Class: 0x0280  Numa: 1
      0000:83:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 1
      0000:84:00.0  Id: 10de:1b06  Class: 0x0300  Numa: 1
  GPU info:
    Number of GPUs detected: 4
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
    #1: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
    #2: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
    #3: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
```

Note that this run had gone for about 2.8 ns when it crashed, and the potential energy reported at the last step is anomalously high (7.64126e+13 kJ/mol).
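To pin down when the potential started to diverge, I plan to extract the potential-energy time series from that run's energy file. A minimal sketch (gmx energy reads the term selection from stdin; the part-numbered .edr filename below is a guess, since -noappend numbers the output parts):

```
# Pull the "Potential" time series out of the energy file into an .xvg
# file for plotting. The part number in the filename is a placeholder;
# -noappend produces md_seed_fixed.partNNNN.edr files.
echo "Potential" | gmx energy -f md_seed_fixed.part0001.edr -o potential.xvg
```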
On Mon, Dec 2, 2019 at 2:13 PM Mark Abraham <mark.j.abra...@gmail.com> wrote:

> Hi,
>
> What driver version is reported in the respective log files? Does the
> error persist if mdrun -notunepme is used?
>
> Mark
>
> On Mon., 2 Dec. 2019, 21:18 Chenou Zhang, <czhan...@asu.edu> wrote:
>
> > Hi Gromacs developers,
> >
> > I'm currently running gromacs 2019.4 on our university's HPC cluster. To
> > fully utilize the GPU nodes, I followed the notes at
> > http://manual.gromacs.org/documentation/current/user-guide/mdrun-performance.html.
> >
> > Here is the command I used for my runs:
> > ```
> > gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu
> >     -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233
> >     -maxh $HOURS -cpt 60 -cpi -noappend
> > ```
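(Side note on that command: with -ntmpi 8 and -gputasks 00112233, the eight GPU tasks map two per card across the four GPUs, one of them being the separate PME task from -npme 1. While a run is up, we sanity-check the mapping with nvidia-smi; a sketch, with the caveat that thread-MPI ranks all live in a single process, so each GPU should show one gmx entry rather than two:)

```
# List the compute processes resident on each GPU. With thread-MPI, all
# eight ranks belong to one gmx process, so the expected picture is the
# same PID appearing once per GPU on all four GTX 1080 Ti cards.
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv
```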
> > Some of those runs failed with the following error:
> > ```
> > -------------------------------------------------------
> > Program:     gmx mdrun, version 2019.4
> > Source file: src/gromacs/gpu_utils/cudautils.cuh (line 229)
> > MPI rank:    3 (out of 8)
> >
> > Fatal error:
> > cudaStreamSynchronize failed: an illegal memory access was encountered
> >
> > For more information and tips for troubleshooting, please check the
> > GROMACS website at http://www.gromacs.org/Documentation/Errors
> > -------------------------------------------------------
> > ```
> >
> > We also got a different error from the slurm system:
> > ```
> > step 4400: timed with pme grid 96 96 60, coulomb cutoff 1.446: 467.9 M-cycles
> > step 4600: timed with pme grid 96 96 64, coulomb cutoff 1.372: 451.4 M-cycles
> > /var/spool/slurmd/job2321134/slurm_script: line 44: 29866 Segmentation fault
> >     gmx mdrun -v -s $TPR -deffnm md_seed_fixed -ntmpi 8 -pin on -nb gpu
> >     -ntomp 3 -pme gpu -pmefft gpu -npme 1 -gputasks 00112233 -maxh $HOURS
> >     -cpt 60 -cpi -noappend
> > ```
> >
> > We first thought this could be due to a compiler issue, so we built
> > GROMACS with several different compiler/CUDA combinations:
> > ===test1===
> > <source>
> > module load cuda/9.2.88.1
> > module load gcc/7.3.0
> > . /home/rsexton2/Library/gromacs/2019.4/test1/bin/GMXRC
> > </source>
> > ===test2===
> > <source>
> > module load cuda/9.2.88.1
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test2/bin/GMXRC
> > </source>
> > ===test3===
> > <source>
> > module load cuda/9.2.148
> > module load gcc/7.3.0
> > . /home/rsexton2/Library/gromacs/2019.4/test3/bin/GMXRC
> > </source>
> > ===test4===
> > <source>
> > module load cuda/9.2.148
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test4/bin/GMXRC
> > </source>
> > ===test5===
> > <source>
> > module load cuda/9.1.85
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test5/bin/GMXRC
> > </source>
> > ===test6===
> > <source>
> > module load cuda/9.0.176
> > module load gcc/6x
> > . /home/rsexton2/Library/gromacs/2019.4/test6/bin/GMXRC
> > </source>
> > ===test7===
> > <source>
> > module load cuda/9.2.88.1
> > module load gccgpu/7.4.0
> > . /home/rsexton2/Library/gromacs/2019.4/test7/bin/GMXRC
> > </source>
> >
> > However, we still ended up with the same errors shown above. Does anyone
> > know where this cudaStreamSynchronize error comes from? Or am I using the
> > GPU-related mdrun options incorrectly?
> >
> > Any input will be appreciated!
> >
> > Thanks!
> > Chenou
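In the meantime, I'll try to get a backtrace for the illegal access by running a short stretch of the same system under cuda-memcheck, which ships with the CUDA toolkits we load. A sketch, not the exact job: the single-rank setup and the -nsteps cap are just to keep the heavily slowed-down memcheck run manageable (possibly combined with -cpi to restart from a checkpoint near the crash point), and memcheck_test is a placeholder output name:

```
# Run a short single-rank segment under cuda-memcheck to localize the
# "illegal memory access". With one thread-MPI rank, -gputasks 00 puts
# both GPU tasks (nonbonded and PME) on GPU 0. Expect a large slowdown.
cuda-memcheck gmx mdrun -v -s md_seed_fixed.tpr -deffnm memcheck_test \
    -ntmpi 1 -ntomp 3 -nb gpu -pme gpu -pmefft gpu -notunepme \
    -gputasks 00 -nsteps 10000
```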