On Mon, Jun 18, 2018 at 11:35 PM Alex <nedoma...@gmail.com> wrote:

> Persistence is enabled so I don't have to overclock again.

Sure, makes sense. Note that strictly speaking this is not an "overclock"
but a manual "boost clock" (to use the terminology of CPU vendors).
Consumer GPUs automatically scale their clock speed above the nominal/base
clock (just as CPUs do); Tesla GPUs do not, but instead give the user the
option (or put the burden on the user, if we want to look at it
differently) of raising the application clocks manually.
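For reference, the manual boost amounts to a couple of nvidia-smi calls.
A minimal sketch -- the clock pair below is the highest one a K40
typically advertises, but query your own card rather than trusting these
numbers:

   # keep the driver loaded between runs so the clock setting is not reset
   sudo nvidia-smi -pm 1

   # list the supported <memory,graphics> application clock pairs
   nvidia-smi -q -d SUPPORTED_CLOCKS

   # apply the highest supported pair (example values for a K40)
   sudo nvidia-smi -ac 3004,875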
> To be honest, I am still not entirely comfortable with the notion of
> ranks, after reading the acceleration document a bunch of times.

Feel free to ask if you need clarification. Briefly: ranks are the
execution units, typically MPI processes, to which tasks get assigned
when work is decomposed across multiple compute units (nodes,
processors). In general, either data or tasks can be decomposed (also
called data-/task-parallelization), and GROMACS employs both: the former
for the spatial domain decomposition, the latter for offloading PME work
to a subset of the ranks.

> Parts of log file below and I will obviously appreciate
> suggestions/clarifications:

In the future, please share the full log by uploading it somewhere.

> Command line:
>   gmx mdrun -nt 4 -ntmpi 2 -npme 1 -pme gpu -nb gpu -s run_unstretch.tpr
>   -o traj_unstretch.trr -g md.log -c unstretched.gro

As noted before, I doubt that you benefit from using a separate PME rank
with a single GPU. I suggest that instead you simply run:

  gmx mdrun -ntmpi 1 -pme gpu -nb gpu

Optionally, you can pass -ntomp 4, but that's the default so it's not
needed.
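To see the difference on your machine, two short runs with the timing
counters reset halfway are enough; a sketch (the bench_*.log file names
are just placeholders):

   # current setup: two thread-MPI ranks, one of them a separate PME rank
   gmx mdrun -ntmpi 2 -npme 1 -pme gpu -nb gpu -s run_unstretch.tpr \
       -nsteps 10000 -resethway -noconfout -g bench_2ranks.log

   # suggested setup: one rank, 4 OpenMP threads, all offload on one GPU
   gmx mdrun -ntmpi 1 -pme gpu -nb gpu -s run_unstretch.tpr \
       -nsteps 10000 -resethway -noconfout -g bench_1rank.log

   # compare the ns/day reported at the bottom of the two logs
   grep Performance bench_*.log

The -resethway option restarts the performance counters halfway through
the run, so startup and load-balancing overhead does not skew the numbers.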
> GROMACS version:    2018
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:        CUDA
> SIMD instructions:  SSE4.1
> FFT library:        fftw-3.3.5-sse2
> RDTSCP usage:       enabled
> TNG support:        enabled
> Hwloc support:      disabled
> Tracing support:    disabled
> Built on:           2018-02-13 19:43:29
> Built by:           smolyan@MINTbox [CMAKE]
> Build OS/arch:      Linux 4.4.0-112-generic x86_64
> Build CPU vendor:   Intel
> Build CPU brand:    Intel(R) Xeon(R) CPU W3530 @ 2.80GHz
> Build CPU family:   6   Model: 26   Stepping: 5
> Build CPU features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
>   nonstop_tsc pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> C compiler:         /usr/bin/cc GNU 5.4.0
> C compiler flags:   -msse4.1 -O3 -DNDEBUG -funroll-all-loops
>   -fexcess-precision=fast
> C++ compiler:       /usr/bin/c++ GNU 5.4.0
> C++ compiler flags: -msse4.1 -std=c++11 -O3 -DNDEBUG -funroll-all-loops
>   -fexcess-precision=fast
> CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
>   compiler driver; Copyright (c) 2005-2017 NVIDIA Corporation; Built on
>   Fri_Nov__3_21:07:56_CDT_2017; Cuda compilation tools, release 9.1, V9.1.85
> CUDA compiler flags:
>   -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;
>   -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;
>   -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;
>   -gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;
>   -gencode;arch=compute_70,code=compute_70;-use_fast_math;-D_FORCE_INLINES;
>   -msse4.1;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast
> CUDA driver:        9.10
> CUDA runtime:       9.10
>
> Running on 1 node with total 4 cores, 4 logical cores, 1 compatible GPU
> Hardware detected:
>   CPU info:
>     Vendor: Intel
>     Brand:  Intel(R) Xeon(R) CPU W3530 @ 2.80GHz
>     Family: 6   Model: 26   Stepping: 5
>     Features: apic clfsh cmov cx8 cx16 htt intel lahf mmx msr nonstop_tsc
>       pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
>   Hardware topology: Basic
>     Sockets, cores, and logical processors:
>       Socket 0: [   0]
>       Socket 1: [   1]
>       Socket 2: [   2]
>       Socket 3: [   3]
>   GPU info:
>     Number of GPUs detected: 1
>     #0: NVIDIA Tesla K40c, compute cap.: 3.5, ECC: no, stat: compatible
>
> ................
>
>      M E G A - F L O P S   A C C O U N T I N G
>
>  NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>
>  Computing:                      M-Number          M-Flops   % Flops
> -----------------------------------------------------------------------------
>  Pair Search distance check     547029.956656      4923269.610     0.0
>  NxN Ewald Elec. + LJ [F]    485658021.416832  32053429413.511    98.0
>  NxN Ewald Elec. + LJ [V&F]    4905656.839680    524905281.846     1.6
>  1,4 nonbonded interactions     140625.005625     12656250.506     0.0
>  Reset In Box                     4599.000000        13797.000     0.0
>  CG-CoM                           4599.018396        13797.055     0.0
>  Bonds                           48000.001920      2832000.113     0.0
>  Angles                          94650.003786     15901200.636     0.0
>  RB-Dihedrals                   186600.007464     46090201.844     0.1
>  Pos. Restr.                      2600.000104       130000.005     0.0
>  Virial                           4610.268441        82984.832     0.0
>  Stop-CM                            91.998396          919.984     0.0
>  Calc-Ekin                       45990.036792      1241730.993     0.0
>  Constraint-V                   318975.012759      2551800.102     0.0
>  Constraint-Vir                   3189.762759        76554.306     0.0
>  Settle                         106325.004253     34342976.374     0.1
>  Virtual Site 3                 107388.258506      3973365.565     0.0
> -----------------------------------------------------------------------------
>  Total                                         32703165544.282   100.0
> -----------------------------------------------------------------------------
>
>     D O M A I N   D E C O M P O S I T I O N   S T A T I S T I C S
>
>  av. #atoms communicated per step for force:  2 x 0.0
>  av. #atoms communicated per step for vsites: 3 x 0.0
>  av. #atoms communicated per step for LINCS:  2 x 0.0
>
>  Average PME mesh/force load: 1.193
>  Part of the total run time spent waiting due to PP/PME imbalance: 5.1 %
>
> NOTE: 5.1 % performance was lost because the PME ranks
>       had more work to do than the PP ranks.
>       You might want to increase the number of PME ranks
>       or increase the cut-off and the grid spacing.
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 1 MPI rank doing PP, using 2 OpenMP threads, and
> on 1 MPI rank doing PME, using 2 OpenMP threads
>
>  Computing:            Num   Num      Call    Wall time    Giga-Cycles
>                        Ranks Threads  Count      (s)       total sum    %
> -----------------------------------------------------------------------------
>  Domain decomp.          1    2     250000      975.157      5461.106   1.0
>  DD comm. load           1    2      25002        0.009         0.053   0.0
>  Vsite constr.           1    2   25000001     2997.638     16787.470   3.1
>  Send X to PME           1    2   25000001      806.884      4518.740   0.8
>  Neighbor search         1    2     250001     1351.275      7567.455   1.4
>  Launch GPU ops.         1    2   50000002     7767.373     43499.093   8.0
>  Comm. coord.            1    2   24750000        4.359        24.410   0.0
>  Force                   1    2   25000001     8994.482     50371.185   9.3
>  Wait + Comm. F          1    2   25000001        3.992        22.355   0.0
>  PME mesh *              1    2   25000001    30757.016    172246.434  31.7
>  PME wait for PP *                            17821.979     99807.221  18.3
>  Wait + Recv. PME F      1    2   25000001     3355.753     18792.998   3.5
>  Wait PME GPU gather     1    2   25000001    25539.917    143029.467  26.3
>  Wait GPU NB nonloc.     1    2   25000001       61.503       344.432   0.1
>  Wait GPU NB local       1    2   25000001    15384.720     86158.005  15.8
>  NB X/F buffer ops.      1    2   99500002     1817.951     10180.950   1.9
>  Vsite spread            1    2   25250002     3417.205     19137.139   3.5
>  Write traj.             1    2       2554       18.100       101.362   0.0
>  Update                  1    2   25000001     1832.047     10259.890   1.9
>  Constraints             1    2   25000001     3232.961     18105.330   3.3
>  Comm. energies          1    2    1250001        5.858        32.805   0.0
> -----------------------------------------------------------------------------
>  Total                                        48578.997    544107.322 100.0
> -----------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums to
>     twice the total reported, but the cycle count total and % are correct.
> -----------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:   194315.986    48578.997      400.0
>                          13h29:38
>                  (ns/day)    (hour/ns)
> Performance:       88.927        0.270
> Finished mdrun on rank 0 Mon Jun 18 07:42:59 2018
>
> On Mon, Jun 18, 2018 at 3:23 PM, Szilárd Páll <pall.szil...@gmail.com> wrote:
>
>> On Mon, Jun 18, 2018 at 2:22 AM, Alex <nedoma...@gmail.com> wrote:
>>
>>> Thanks for the heads up. With the K40c instead of GTX 960 here's what
>>> I did and here are the results:
>>>
>>> 1. Enabled persistence mode and overclocked the card via nvidia-smi:
>>> http://acceleware.com/blog/gpu-boost-nvidias-tesla-k40-gpus
>>
>> Note that persistence mode is only for convenience.
>>
>>> 2. Offloaded PME's FFT to GPU (which wasn't the case with GTX 960);
>>> this brought the "pme mesh/force" ratio to something like 1.07.
>>
>> I still think you are running multiple ranks, which is unlikely to be
>> ideal, but without seeing a log file it's hard to tell.
>>
>>> The result is a solid increase in performance on a small-ish system
>>> (20K atoms): 90 ns/day instead of 65-70. I don't use this box for
>>> anything except prototyping, but the swap + tweaks were still pretty
>>> useful.
>>>
>>> Alex
>>>
>>> On 6/15/2018 1:20 PM, Szilárd Páll wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding the K40 vs GTX 960 question, the K40 will likely be a bit
>>>> faster (though it'll consume more power, if that matters). The
>>>> difference will be at most 20% in total performance, I think -- and
>>>> with small systems likely negligible (as a smaller card with higher
>>>> clocks is more efficient at small tasks than a large card with lower
>>>> clocks).
>>>>
>>>> Regarding the load balance note, you are correct: the "pme mesh/force"
>>>> value is the ratio of the time spent computing PME forces on a
>>>> separate task/rank to the time spent on the rest of the forces
>>>> (including nonbonded, bonded, etc.). With GPU offload this is a bit
>>>> more tricky, as the observed time is the time spent waiting for the
>>>> GPU results, but the take-away is the same: when a run shows a
>>>> "pme mesh/force" far from 1, there is imbalance affecting performance.
>>>>
>>>> However, note that with a single GPU I've yet to see a case where you
>>>> get better performance by running multiple ranks rather than simply
>>>> running OpenMP-only. Also note that what counts as a "weak GPU" varies
>>>> case-by-case, so I recommend taking the 1-2 minutes to do a short run
>>>> and check whether, for a given hardware + simulation setup, it is
>>>> better to offload all of PME or keep the FFTs on the CPU.
>>>>
>>>> We'll do our best to automate more of these choices, but for now, if
>>>> you care about performance it's useful to test before doing long runs.
>>>>
>>>> Cheers,
>>>> --
>>>> Szilárd
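(To make the "short run" check concrete: it amounts to two brief runs that
differ only in where the PME FFTs execute; a sketch, with placeholder log
names:

   # all of PME offloaded, including the 3D FFTs
   gmx mdrun -ntmpi 1 -nb gpu -pme gpu -pmefft gpu -s run_unstretch.tpr \
       -nsteps 10000 -resethway -noconfout -g pmefft_gpu.log

   # mixed mode: PME spread/gather on the GPU, FFTs on the CPU
   gmx mdrun -ntmpi 1 -nb gpu -pme gpu -pmefft cpu -s run_unstretch.tpr \
       -nsteps 10000 -resethway -noconfout -g pmefft_cpu.log

Whichever log reports the higher ns/day wins for that hardware + system
combination.)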
>>>>
>>>> On Thu, Jun 14, 2018 at 2:09 AM, Alex <nedoma...@gmail.com> wrote:
>>>>
>>>>> Question: in the DD output (md.log) that looks like "DD step xxxxxx
>>>>> pme mesh/force 1.229," what is the ratio? Does it mean the pme
>>>>> calculations take longer than the nonbonded interactions by the
>>>>> shown factor? With GTX 960, the ratio is consistently ~0.85; with
>>>>> Tesla K40 it's ~1.25. My mdrun line contains -pmefft cpu (per
>>>>> Szilard's advice for weak GPUs, I believe). Would it then make
>>>>> sense to offload the fft to the K40?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Alex
>>>>>
>>>>> On Wed, Jun 13, 2018 at 4:53 PM, Alex <nedoma...@gmail.com> wrote:
>>>>>
>>>>>> So, swap, then? Thank you!
>>>>>>
>>>>>> On Wed, Jun 13, 2018 at 4:49 PM, paul buscemi <pbusc...@q.com> wrote:
>>>>>>
>>>>>>> flops trumps clock speed…..
>>>>>>>
>>>>>>>> On Jun 13, 2018, at 3:45 PM, Alex <nedoma...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have an old "prototyping" box with a 4-core Xeon and an old
>>>>>>>> GTX 960. We have a Tesla K40 lying around and there's only one
>>>>>>>> PCIE slot available in this machine. Would it make sense to swap
>>>>>>>> the cards, or is it already bottlenecked by the CPU? I compared
>>>>>>>> the specs: the 960 has a higher clock speed, while the K40's FP
>>>>>>>> performance is better. Should I swap the GPUs?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Alex
--
Gromacs Users mailing list

* Please search the archive at
  http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
  https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
  send a mail to gmx-users-requ...@gromacs.org.