We are unable to run a GROMACS (2019.1) test on Power9. The test procedure is simple, using Slurm:

1. Request an interactive session:
   srun -N 1 -n 20 --pty --partition=debug --time=1:00:00 --gres=gpu:1 bash
2. Load the CUDA library:
   module load cuda
3. Run the test batch. This starts with a CPU-only steepest-descent energy minimization (EM), which, despite the thread-count options passed to mdrun (-ntmpi 4 -ntomp 4, see the command line in the log below), runs on a single thread.

Any help will be highly appreciated.
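A minimal sketch of an alternative request, in case the problem is that the 20x1 task allocation leaves mdrun with a one-core CPU mask; the --cpus-per-task value and the OMP_NUM_THREADS export are guesses at what is needed, not a confirmed fix:

   # Sketch: request 4 tasks x 4 cores so the allocation matches -ntmpi 4 -ntomp 4
   srun -N 1 -n 4 --cpus-per-task=4 --pty --partition=debug --time=1:00:00 --gres=gpu:1 bash
   module load cuda
   export OMP_NUM_THREADS=4   # make the OpenMP thread count explicit to mdrun
   gmx mdrun -pin on -ntmpi 4 -ntomp 4 -pme cpu -nb cpu -s em.tpr -o traj.trr -g md.log -c after_em.pdb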
md.log below:

GROMACS:      gmx mdrun, version 2019.1
Executable:   /home/reida/ppc64le/stow/gromacs/bin/gmx
Data prefix:  /home/reida/ppc64le/stow/gromacs
Working dir:  /home/smolyan/gmx_test1
Process ID:   115831
Command line:
  gmx mdrun -pin on -pinstride 2 -ntomp 4 -ntmpi 4 -pme cpu -nb cpu -s em.tpr -o traj.trr -g md.log -c after_em.pdb

GROMACS version:    2019.1
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  IBM_VSX
FFT library:        fftw-3.3.8
RDTSCP usage:       disabled
TNG support:        enabled
Hwloc support:      hwloc-1.11.8
Tracing support:    disabled
C compiler:         /opt/rh/devtoolset-7/root/usr/bin/cc GNU 7.3.1
C compiler flags:   -mcpu=power9 -mtune=power9 -mvsx -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /opt/rh/devtoolset-7/root/usr/bin/c++ GNU 7.3.1
C++ compiler flags: -mcpu=power9 -mtune=power9 -mvsx -std=c++11 -O2 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /usr/local/cuda-10.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2018 NVIDIA Corporation; Built on Sat_Aug_25_21:10:00_CDT_2018; Cuda compilation tools, release 10.0, V10.0.130
CUDA compiler flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math; -mcpu=power9;-mtune=power9;-mvsx;-std=c++11;-O2;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast
CUDA driver:        10.10
CUDA runtime:       10.0

Running on 1 node with total 160 cores, 160 logical cores, 1 compatible GPU
Hardware detected:
  CPU info:
    Vendor: IBM
    Brand:  POWER9, altivec supported
    Family: 0   Model: 0   Stepping: 0
    Features: vmx vsx
  Hardware topology: Only logical processor count
  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA Tesla V100-SXM2-16GB, compute cap.: 7.0, ECC: yes, stat: compatible

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
*SKIPPED*

Input Parameters:
   integrator                     = steep
   tinit                          = 0
   dt                             = 0.001
   nsteps                         = 50000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = 1941752878
   emtol                          = 100
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 0
   nstvout                        = 0
   nstfout                        = 0
   nstlog                         = 1000
   nstcalcenergy                  = 100
   nstenergy                      = 1000
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 1
   ns-type                        = Grid
   pbc                            = xyz
   periodic-molecules             = true
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1.2
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-shift
   rvdw-switch                    = 0
   rvdw                           = 1.2
   DispCorr                       = No
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 52
   fourier-ny                     = 52
   fourier-nz                     = 52
   pme-order                      = 4
   ewald-rtol                     = 1e-05
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   tcoupl                         = No
   nsttcouple                     = -1
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = false
   pcoupl                         = No
   pcoupltype                     = Isotropic
   nstpcouple                     = -1
   tau-p                          = 1
   compressibility (3x3):
      compressibility[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      compressibility[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   ref-p (3x3):
      ref-p[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      ref-p[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   refcoord-scaling               = No
   posres-com (3):
      posres-com[0]= 0.00000e+00
      posres-com[1]= 0.00000e+00
      posres-com[2]= 0.00000e+00
   posres-comB (3):
      posres-comB[0]= 0.00000e+00
      posres-comB[1]= 0.00000e+00
      posres-comB[2]= 0.00000e+00
   QMMM                           = false
   QMconstraints                  = 0
   QMMMscheme                     = 0
   MMChargeScaleFactor            = 1
   qm-opts:
      ngQM                        = 0
   constraint-algorithm           = Lincs
   continuation                   = false
   Shake-SOR                      = false
   shake-tol                      = 0.0001
   lincs-order                    = 4
   lincs-iter                     = 1
   lincs-warnangle                = 30
   nwall                          = 0
   wall-type                      = 9-3
   wall-r-linpot                  = -1
   wall-atomtype[0]               = -1
   wall-atomtype[1]               = -1
   wall-density[0]                = 0
   wall-density[1]                = 0
   wall-ewald-zfac                = 3
   pull                           = false
   awh                            = false
   rotation                       = false
   interactiveMD                  = false
   disre                          = No
   disre-weighting                = Conservative
   disre-mixed                    = false
   dr-fc                          = 1000
   dr-tau                         = 0
   nstdisreout                    = 100
   orire-fc                       = 0
   orire-tau                      = 0
   nstorireout                    = 100
   free-energy                    = no
   cos-acceleration               = 0
   deform (3x3):
      deform[    0]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    1]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
      deform[    2]={ 0.00000e+00,  0.00000e+00,  0.00000e+00}
   simulated-tempering            = false
   swapcoords                     = no
   userint1                       = 0
   userint2                       = 0
   userint3                       = 0
   userint4                       = 0
   userreal1                      = 0
   userreal2                      = 0
   userreal3                      = 0
   userreal4                      = 0
   applied-forces:
     electric-field:
       x:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       y:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
       z:
         E0                       = 0
         omega                    = 0
         t0                       = 0
         sigma                    = 0
grpopts:
   nrdf:       47805
   ref-t:           0
   tau-t:           0
annealing:          No
annealing-npoints:           0
   acc:               0           0           0
   nfreeze:           N           N           N
   energygrp-flags[  0]: 0

Initializing Domain Decomposition on 4 ranks

NOTE: disabling dynamic load balancing as it is only supported with dynamics, not with integrator 'steep'.

Dynamic load balancing: auto
Using update groups, nr 10529, average size 2.5 atoms, max. radius 0.078 nm
Minimum cell size due to atom displacement: 0.000 nm

NOTE: Periodic molecules are present in this system. Because of this, the
domain decomposition algorithm cannot easily determine the minimum cell size
that it requires for treating bonded interactions. Instead, domain
decomposition will assume that half the non-bonded cut-off will be a suitable
lower bound.

Minimum cell size due to bonded interactions: 0.678 nm
Using 0 separate PME ranks, as there are too few total ranks for efficient splitting
Optimizing the DD grid for 4 cells with a minimum initial size of 0.678 nm
The maximum allowed number of cells is: X 8 Y 8 Z 8
Domain decomposition grid 1 x 4 x 1, separate PME ranks 0
PME domain decomposition: 1 x 4 x 1
Domain decomposition rank 0, coordinates 0 0 0

The initial number of communication pulses is: Y 1
The initial domain decomposition cell size is: Y 1.50 nm

The maximum allowed distance for atom groups involved in interactions is:
            non-bonded interactions                   1.356 nm
            two-body bonded interactions     (-rdd)   1.356 nm
            multi-body bonded interactions   (-rdd)   1.356 nm
            virtual site constructions       (-rcon)  1.503 nm

Using 4 MPI threads
Using 4 OpenMP threads per tMPI thread

Overriding thread affinity set outside gmx mdrun
Pinning threads with a user-specified logical core stride of 2
NOTE: Thread affinity was not set.

System total charge: 0.000
Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Using a Gaussian width (1/beta) of 0.384195 nm for Ewald
Potential shift: LJ r^-12: -1.122e-01 r^-6: -3.349e-01, Ewald -8.333e-06
Initialized non-bonded Ewald correction tables, spacing: 1.02e-03 size: 1176
Generated table with 1100 data points for 1-4 COUL. Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ6. Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ12. Tabscale = 500 points/nm

Using SIMD 4x4 nonbonded short-range kernels
Using a 4x4 pair-list setup:
  updated every 1 steps, buffer 0.000 nm, rlist 1.200 nm
Using geometric Lennard-Jones combination rule

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------

Linking all bonded interactions to atoms
There are 5407 inter charge-group virtual sites,
will an extra communication step for selected coordinates and forces

Note that activating steepest-descent energy minimization via the integrator .mdp option and the command gmx mdrun may be available in a different form in a future version of GROMACS, e.g. gmx minimize and an .mdp option.

Initiating Steepest Descents

Atom distribution over 4 domains: av 6687 stddev 134 min 6515 max 6792

Started Steepest Descents on rank 0 Thu May 9 15:49:36 2019
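For anyone trying to reproduce this, a quick diagnostic sketch (standard Linux tools, nothing GROMACS-specific) to see what CPU set the interactive shell actually receives from Slurm:

   nproc                                     # core count visible to this shell
   taskset -cp $$                            # affinity mask of the current shell
   grep Cpus_allowed_list /proc/self/status  # same information from the kernel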