> Persistence is enabled so I don't have to overclock again.

Sure, makes sense. Note that strictly speaking this is not an "overclock",
but a manual "boost clock" (to use the terminology CPU vendors use). Consumer
GPUs automatically scale their clock speeds above their nominal/base clock
(just as CPUs do), whereas Tesla GPUs do not; instead they leave that choice
to the user (or put the burden on the user, depending on how you look at it).

> To be honest, I
> am still not entirely comfortable with the notion of ranks, after reading
> the acceleration document a bunch of times.

Feel free to ask if you need clarification.
Briefly: ranks are the execution units, typically MPI processes, that tasks
get assigned to when decomposing work across multiple compute units (nodes,
processors). In general, either data or tasks can be decomposed (data- and
task-parallelization, respectively), and GROMACS employs both: the former
for the spatial domain decomposition, the latter for offloading PME work to
a subset of the ranks.
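
As a rough illustration of how a total thread count splits into ranks and
OpenMP threads per rank, here is a minimal sketch. It is not GROMACS's actual
thread-assignment logic: the function name rank_layout is made up for this
example, and it assumes the thread count divides evenly and that every rank
gets the same number of OpenMP threads.

```python
def rank_layout(nt, ntmpi, npme=0):
    """Split nt total threads over ntmpi thread-MPI ranks, npme of them PME-only.

    Simplified illustration (hypothetical helper, not part of GROMACS):
    assumes nt is divisible by ntmpi and all ranks get equal OpenMP threads.
    With npme=0 every rank does both PP and PME work.
    """
    if ntmpi < 1 or not 0 <= npme < ntmpi:
        raise ValueError("need ntmpi >= 1 and 0 <= npme < ntmpi")
    if nt % ntmpi != 0:
        raise ValueError("this sketch only handles nt divisible by ntmpi")
    return {
        "pp_ranks": ntmpi - npme,       # ranks doing particle-particle work
        "pme_ranks": npme,              # ranks dedicated to PME
        "omp_threads_per_rank": nt // ntmpi,
    }


# Two ranks on 4 threads, one of them PME-only (-nt 4 -ntmpi 2 -npme 1)
print(rank_layout(4, 2, 1))
# A single rank using all 4 threads (-ntmpi 1)
print(rank_layout(4, 1))
```

The first call corresponds to one PP rank and one PME rank with 2 OpenMP
threads each; the second to a single rank that does everything with 4 threads.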

> Parts of log file below and I
> will obviously appreciate suggestions/clarifications:

In the future, please share the full log by uploading it somewhere.

> Command line:
>   gmx mdrun -nt 4 -ntmpi 2 -npme 1 -pme gpu -nb gpu -s run_unstretch.tpr -o
> traj_unstretch.trr -g md.log -c unstretched.gro

As noted before, I doubt that you benefit from using a separate PME rank
with a single GPU.

I suggest that instead you simply run:
gmx mdrun -ntmpi 1 -pme gpu -nb gpu
Optionally, you can pass -ntomp 4, but that's the default so it's not
necessary.
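
To compare two such setups after a short test run each, it is enough to pull
the ns/day figure from the end of each log. A minimal sketch, assuming the
"Performance:" line has the same two-column layout as in your log; the helper
name parse_performance is made up for this example:

```python
import re

# Matches e.g. "Performance:       88.927        0.270" at the start of a line.
PERF_RE = re.compile(r"^\s*Performance:\s+([0-9.]+)\s+([0-9.]+)", re.MULTILINE)


def parse_performance(log_text):
    """Return (ns_per_day, hours_per_ns) from the last Performance line, or None."""
    matches = PERF_RE.findall(log_text)
    if not matches:
        return None
    ns_day, hour_ns = matches[-1]
    return float(ns_day), float(hour_ns)


# Sample text copied from the tail of the log quoted in this thread.
sample = """\
                 (ns/day)    (hour/ns)
Performance:       88.927        0.270
"""
print(parse_performance(sample))
```

For real use, read each md.log into a string (for example with
open(path).read()) and compare the first element of the returned tuples.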

> GROMACS version:    2018
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:        CUDA
> SIMD instructions:  SSE4.1
> FFT library:        fftw-3.3.5-sse2
> RDTSCP usage:       enabled
> TNG support:        enabled
> Hwloc support:      disabled
> Tracing support:    disabled
> Built on:           2018-02-13 19:43:29
> Built by:           smolyan@MINTbox [CMAKE]
> Build OS/arch:      Linux 4.4.0-112-generic x86_64
> Build CPU vendor:   Intel
> Build CPU brand:    Intel(R) Xeon(R) CPU           W3530  @ 2.80GHz
> Build CPU family:   6   Model: 26   Stepping: 5
> Build CPU features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> C compiler:         /usr/bin/cc GNU 5.4.0
> C compiler flags:    -msse4.1     -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler:       /usr/bin/c++ GNU 5.4.0
> C++ compiler flags:  -msse4.1    -std=c++11   -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> driver;Copyright (c) 2005-2017 NVIDIA Corporation;Built on
> Fri_Nov__3_21:07:56_CDT_2017;Cuda compilation tools, release 9.1, V9.1.85
> CUDA compiler
> flags:-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;;
> ;-msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> CUDA driver:        9.10
> CUDA runtime:       9.10
> Running on 1 node with total 4 cores, 4 logical cores, 1 compatible GPU
> Hardware detected:
>   CPU info:
>     Vendor: Intel
>     Brand:  Intel(R) Xeon(R) CPU           W3530  @ 2.80GHz
>     Family: 6   Model: 26   Stepping: 5
>     Features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc
> pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
>   Hardware topology: Basic
>     Sockets, cores, and logical processors:
>       Socket  0: [   0]
>       Socket  1: [   1]
>       Socket  2: [   2]
>       Socket  3: [   3]
>   GPU info:
>     Number of GPUs detected: 1
>     #0: NVIDIA Tesla K40c, compute cap.: 3.5, ECC:  no, stat: compatible
> ................
> M E G A - F L O P S   A C C O U N T I N G
>  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>  Computing:                               M-Number         M-Flops  % Flops
> -----------------------------------------------------------------------------
>  Pair Search distance check          547029.956656     4923269.610     0.0
>  NxN Ewald Elec. + LJ [F]         485658021.416832 32053429413.511    98.0
>  NxN Ewald Elec. + LJ [V&F]         4905656.839680   524905281.846     1.6
>  1,4 nonbonded interactions          140625.005625    12656250.506     0.0
>  Reset In Box                          4599.000000       13797.000     0.0
>  CG-CoM                                4599.018396       13797.055     0.0
>  Bonds                                48000.001920     2832000.113     0.0
>  Angles                               94650.003786    15901200.636     0.0
>  RB-Dihedrals                        186600.007464    46090201.844     0.1
>  Pos. Restr.                           2600.000104      130000.005     0.0
>  Virial                                4610.268441       82984.832     0.0
>  Stop-CM                                 91.998396         919.984     0.0
>  Calc-Ekin                            45990.036792     1241730.993     0.0
>  Constraint-V                        318975.012759     2551800.102     0.0
>  Constraint-Vir                        3189.762759       76554.306     0.0
>  Settle                              106325.004253    34342976.374     0.1
>  Virtual Site 3                      107388.258506     3973365.565     0.0
> -----------------------------------------------------------------------------
>  Total                                             32703165544.282   100.0
> -----------------------------------------------------------------------------
>     D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>  av. #atoms communicated per step for force:  2 x 0.0
>  av. #atoms communicated per step for vsites: 3 x 0.0
>  av. #atoms communicated per step for LINCS:  2 x 0.0
>  Average PME mesh/force load: 1.193
>  Part of the total run time spent waiting due to PP/PME imbalance: 5.1 %
> NOTE: 5.1 % performance was lost because the PME ranks
>       had more work to do than the PP ranks.
>       You might want to increase the number of PME ranks
>       or increase the cut-off and the grid spacing.
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> On 1 MPI rank doing PP, using 2 OpenMP threads, and
> on 1 MPI rank doing PME, using 2 OpenMP threads
>  Computing:          Num   Num      Call    Wall time         Giga-Cycles
>                      Ranks Threads  Count      (s)         total sum    %
> -----------------------------------------------------------------------------
>  Domain decomp.         1    2     250000     975.157       5461.106   1.0
>  DD comm. load          1    2      25002       0.009          0.053   0.0
>  Vsite constr.          1    2   25000001    2997.638      16787.470   3.1
>  Send X to PME          1    2   25000001     806.884       4518.740   0.8
>  Neighbor search        1    2     250001    1351.275       7567.455   1.4
>  Launch GPU ops.        1    2   50000002    7767.373      43499.093   8.0
>  Comm. coord.           1    2   24750000       4.359         24.410   0.0
>  Force                  1    2   25000001    8994.482      50371.185   9.3
>  Wait + Comm. F         1    2   25000001       3.992         22.355   0.0
>  PME mesh *             1    2   25000001   30757.016     172246.434  31.7
>  PME wait for PP *                          17821.979      99807.221  18.3
>  Wait + Recv. PME F     1    2   25000001    3355.753      18792.998   3.5
>  Wait PME GPU gather    1    2   25000001   25539.917     143029.467  26.3
>  Wait GPU NB nonloc.    1    2   25000001      61.503        344.432   0.1
>  Wait GPU NB local      1    2   25000001   15384.720      86158.005  15.8
>  NB X/F buffer ops.     1    2   99500002    1817.951      10180.950   1.9
>  Vsite spread           1    2   25250002    3417.205      19137.139   3.5
>  Write traj.            1    2       2554      18.100        101.362   0.0
>  Update                 1    2   25000001    1832.047      10259.890   1.9
>  Constraints            1    2   25000001    3232.961      18105.330   3.3
>  Comm. energies         1    2    1250001       5.858         32.805   0.0
> -----------------------------------------------------------------------------
>  Total                                      48578.997     544107.322 100.0
> -----------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums to
>     twice the total reported, but the cycle count total and % are correct.
> -----------------------------------------------------------------------------
>                Core t (s)   Wall t (s)        (%)
>        Time:   194315.986    48578.997      400.0
>                          13h29:38
>                  (ns/day)    (hour/ns)
> Performance:       88.927        0.270
> Finished mdrun on rank 0 Mon Jun 18 07:42:59 2018
> > > Thanks for the heads up. With the K40c instead of GTX 960 here's what I
> > > did and here are the results:
> > >
> > > 1. Enabled persistence mode and overclocked the card via nvidia-smi:
> > > http://acceleware.com/blog/gpu-boost-nvidias-tesla-k40-gpus
> >
> >
> > > Note that persistence mode is only for convenience.
> >
> >
> > > 2. Offloaded PME's FFT to GPU (which wasn't the case with GTX 960), this
> > > brought the "pme mesh / force" ratio to something like 1.07.
> > >
> >
> > I still think you are running multiple ranks which is unlikely to be ideal,
> > but without seeing a log file, it's hard to tell.
> >
> > > The result is a solid increase in performance on a small-ish system (20K
> > > atoms): 90 ns/day instead of 65-70. I don't use this box for anything
> > > except prototyping, but still the swap + tweaks were pretty useful.
> >
> >
> > >
> > > Alex
> > >
> > >
> > >
> > >
> > >> Hi,
> > >>
> > >> Regarding the K40 vs GTX 960 question, the K40 will likely be a bit
> > >> faster (though it'll consume more power if that matters). The
> > >> difference will be at most 20% in total performance, I think -- and
> > >> with small systems likely negligible (as a smaller card with higher
> > >> clocks is more efficient at small tasks than a large card with lower
> > >> clocks).
> > >>
> > >> Regarding the load balance note, you are correct, the "pme mesh/force"
> > >> means the ratio of time spent in computing PME forces on a separate
> > >> task/rank and the rest of the forces (including nonbonded, bonded,
> > >> etc.). With GPU offload this is a bit more tricky as the observed time
> > >> is the time spent waiting for the GPU results, but the take-away is
> > >> the same: when a run shows "pme mesh/force" far from 1, there is
> > >> imbalance affecting performance.
> > >>
> > >> However, note that with a single GPU I've yet to see a case where you
> > >> get better performance by running multiple ranks rather than simply
> > >> running OpenMP-only. Also note that what counts as a "weak GPU" varies
> > >> case-by-case, so I recommend taking the 1-2 minutes to do a short run
> > >> and check whether, for a given hardware + simulation setup, it is better
> > >> to offload all of PME or keep the FFTs on the CPU.
> > >>
> > >> We'll do our best to automate more of these choices, but for now if
> > >> you care about performance it's useful to test before doing long runs.
> > >>
> > >> Cheers,
> > >> --
> > >> Szilárd
> > >>
> > >>
> > >>
> > >>> Question: in the DD output (md.log) that looks like "DD  step xxxxxx pme
> > >>> mesh/force 1.229," what is the ratio? Does it mean the pme calculations
> > >>> take longer by the shown factor than the nonbonded interactions?
> > >>> With GTX 960, the ratio is consistently ~0.85, with Tesla K40 it's ~1.25.
> > >>> My mdrun line contains  -pmefft cpu (per Szilard's advice for weak GPUs,
> > >>> I believe). Would it then make sense to offload the fft to the K40?
> > >>>
> > >>> Thank you,
> > >>>
> > >>> Alex
> > >>>
> > >>>
> > >>> So, swap, then? Thank you!
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>   flops trumps clock speed…..
> > >>>>>
> > >>>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> I have an old "prototyping" box with a 4-core Xeon and an old GTX 960.
> > >>>>>> We have a Tesla K40 laying around and there's only one PCIE slot
> > >>>>>> available in this machine. Would it make sense to swap the cards, or is
> > >>>>>> it already bottlenecked by the CPU? I compared the specs and 960 has a
> > >>>>>> higher clock speed, while K40's FP performance is better. Should I swap
> > >>>>>> the GPUs?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Alex
> > >>>>
> > >>>>
