Hello:

Here is the information that you asked for.

gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 -gpu_id 0 -pin on
------------------------------------------------------------------------
GROMACS:      gmx mdrun, VERSION 5.1.3
Executable:   /soft/gromacs/5.1.3_intel/bin/gmx_mpi
Data prefix:  /soft/gromacs/5.1.3_intel
Command line:
gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 -gpu_id 0 -pin on

GROMACS version:    VERSION 5.1.3
Precision:          single
Memory model:       64 bit
MPI library:        MPI
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        enabled
OpenCL support:     disabled
invsqrt routine:    gmx_software_invsqrt(x)
SIMD instructions:  AVX_256
FFT library:        fftw-3.3.4-sse2
RDTSCP usage:       enabled
C++11 compilation:  disabled
TNG support:        enabled
Tracing support:    disabled
Built on:           Thu Aug 11 16:15:26 CEST 2016
Built by:           albert@cudaB [CMAKE]
Build OS/arch:      Linux 3.16.7-35-desktop x86_64
Build CPU vendor:   GenuineIntel
Build CPU brand:    Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Build CPU family:   6   Model: 62   Stepping: 4
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler:         /soft/intel/impi/5.1.3.223/bin64/mpicc GNU 4.8.3
C compiler flags:   -mavx -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler:       /soft/intel/impi/5.1.3.223/bin64/mpicxx GNU 4.8.3
C++ compiler flags: -mavx -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Boost version:      1.54.0 (external)
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Wed_May__4_21:01:56_CDT_2016;Cuda compilation tools, release 8.0, V8.0.26
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;-mavx;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:        8.0
CUDA runtime:       8.0

Running on 1 node with total 10 cores, 20 logical cores, 2 compatible GPUs
Hardware detected on host cudaB (the node of MPI rank 0):
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
    Family:  6  model: 62  stepping:  4
    CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
    SIMD instructions most likely to fit this hardware: AVX_256
    SIMD instructions selected at GROMACS compile time: AVX_256
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible
    #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible
------------------------------------------------------------------------

gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 -gpu_id 1 -pin on -cpi -append -pinoffset 8
------------------------------------------------------------------------
GROMACS:      gmx mdrun, VERSION 5.1.3
Executable:   /soft/gromacs/5.1.3_intel/bin/gmx_mpi
Data prefix:  /soft/gromacs/5.1.3_intel
Command line:
gmx_mpi mdrun -s 7.tpr -v -g 7.log -c 7.gro -x 7.xtc -ntomp 8 -gpu_id 1 -pin on -cpi -append -pinoffset 8


Running on 1 node with total 10 cores, 20 logical cores, 2 compatible GPUs
Hardware detected on host cudaB (the node of MPI rank 0):
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
    SIMD instructions most likely to fit this hardware: AVX_256
    SIMD instructions selected at GROMACS compile time: AVX_256
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible
    #1: NVIDIA GeForce GTX 780 Ti, compute cap.: 3.5, ECC: no, stat: compatible

Reading file 7.tpr, VERSION 5.1.3 (single precision)

Reading checkpoint file state.cpt generated: Wed Aug 17 09:01:46 2016


Using 1 MPI process
Using 8 OpenMP threads

1 GPU user-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 1

Applying core pinning offset 8
starting mdrun 'Title'
50000000 steps, 100000.0 ps (continuing from step 5746000, 11492.0 ps).
step 5746080: timed with pme grid 60 60 84, coulomb cutoff 1.000: 2451.9 M-cycles

On 08/16/2016 05:27 PM, Szilárd Páll wrote:
Most of that copy-pasted info is not what I asked for and overall not
very useful. You have still not shown any log files (or details on the
hardware). Share the *relevant* stuff, please!
--
Szilárd


On Tue, Aug 16, 2016 at 5:07 PM, Albert <mailmd2...@gmail.com> wrote:
Hello:

Here is my MDP file:

define                  = -DREST_ON -DSTEP6_4
integrator              = md
dt                      = 0.002
nsteps                  = 1000000
nstlog                  = 1000
nstxout                 = 0
nstvout                 = 0
nstfout                 = 0
nstcalcenergy           = 100
nstenergy               = 1000
nstxout-compressed      = 10000
;
cutoff-scheme           = Verlet
nstlist                 = 20
rlist                   = 1.0
coulombtype             = pme
rcoulomb                = 1.0
vdwtype                 = Cut-off
vdw-modifier            = Force-switch
rvdw_switch             = 0.9
rvdw                    = 1.0
;
tcoupl                  = berendsen
tc_grps                 = PROT   MEMB   SOL_ION
tau_t                   = 1.0    1.0    1.0
ref_t                   = 310   310   310
;
pcoupl                  = berendsen
pcoupltype              = semiisotropic
tau_p                   = 5.0
compressibility         = 4.5e-5  4.5e-5
ref_p                   = 1.0     1.0
;
constraints             = h-bonds
constraint_algorithm    = LINCS
continuation            = yes
;
nstcomm                 = 100
comm_mode               = linear
comm_grps               = PROT   MEMB   SOL_ION
;
refcoord_scaling        = com


I compiled GROMACS with the following settings, using Intel MPI:

env CC=mpicc CXX=mpicxx F77=mpif90 FC=mpif90 LDF90=mpif90 \
    CMAKE_PREFIX_PATH=/soft/gromacs/fftw-3.3.4:/soft/intel/impi/5.1.3.223 \
    cmake .. -DBUILD_SHARED_LIBS=OFF -DBUILD_TESTING=OFF \
    -DCMAKE_INSTALL_PREFIX=/soft/gromacs/5.1.3_intel -DGMX_MPI=ON -DGMX_GPU=ON \
    -DGMX_PREFER_STATIC_LIBS=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
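
The resulting build configuration (the header pasted at the top of this mail) can be reproduced with:

gmx_mpi -version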


I tried it again with one of the jobs, using the options:

-ntomp 8 -pin on -pinoffset 8


The two submitted jobs still use only 8 CPU cores in total and the speed is
extremely slow (10 ns/day). When I remove the "-pin on" option from one of
the jobs, it speeds up a lot (32 ns/day) and 16 CPU cores are used. If I
submit only one job with "-pin on", I get 52 ns/day.
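
To be explicit, the two concurrent jobs are started like this, each from its
own working directory (output flags omitted for brevity; the first job relies
on the default -pinoffset 0):

gmx_mpi mdrun -s 7.tpr -ntomp 8 -gpu_id 0 -pin on -pinoffset 0
gmx_mpi mdrun -s 7.tpr -ntomp 8 -gpu_id 1 -pin on -pinoffset 8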


thx a lot


On 08/16/2016 04:59 PM, Szilárd Páll wrote:
Hi,

Without logs and hardware configs, it's hard to tell what's happening.

By turning off pinning, the OS is free to move threads around and will try
to ensure the cores are utilized. However, by leaving threads unpinned you
risk taking a significant performance hit, so I'd recommend running with
correct pinning settings.

If you start with "-ntomp 8 -pin on -pinoffset 8" (and you indeed have
16 cores, no HT), you should be able to see in htop that the first eight
cores are empty while the last eight are occupied.
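
If you want to double-check the actual thread placement (a quick sketch,
assuming a Linux node with procps and pgrep available; not GROMACS-specific),
this lists which hardware thread each mdrun thread sits on:

ps -Lo pid,tid,psr,comm -p $(pgrep -d, gmx_mpi)

With both jobs pinned correctly, the psr column should show two disjoint
sets of cores.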

Cheers,
--
Szilárd
