Kevin, Mark, thanks for sharing your advice and experience. I am seeing some strange behaviour when trying to run on the two GPUs: some combinations simply make the system crash (the workstation switches off after a few seconds of running). In particular, the following run:

gmx mdrun -deffnm run (-gpu_id 01) -pin on
which produces the following log:

GROMACS version:    2019.2
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:        CUDA
SIMD instructions:  AVX2_256
FFT library:        fftw-3.3.8-sse2-avx-avx2-avx2_128
RDTSCP usage:       enabled
TNG support:        enabled
Hwloc support:      disabled
Tracing support:    disabled
C compiler:         /usr/bin/cc GNU 4.8.5
C compiler flags:   -mavx2 -mfma -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
C++ compiler:       /usr/bin/c++ GNU 4.8.5
C++ compiler flags: -mavx2 -mfma -std=c++11 -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast
CUDA compiler:      /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2019 NVIDIA Corporation; Built on Wed_Apr_24_19:10:27_PDT_2019; Cuda compilation tools, release 10.1, V10.1.168
CUDA compiler flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=compute_75;-use_fast_math; -mavx2;-mfma;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:        10.10
CUDA runtime:       10.10

Running on 1 node with total 32 cores, 64 logical cores, 2 compatible GPUs
Hardware detected:
  CPU info:
    Vendor: AMD
    Brand:  AMD Ryzen Threadripper 2990WX 32-Core Processor
    Family: 23   Model: 8   Stepping: 2
    Features: aes amd apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf misalignsse mmx msr nonstop_tsc pclmuldq pdpe1gb popcnt pse rdrnd rdtscp sha sse2 sse3 sse4a sse4.1 sse4.2 ssse3
  Hardware topology: Basic
    Sockets, cores, and logical processors:
      Socket 0: [ 0 32] [ 1 33] [ 2 34] [ 3 35] [ 4 36] [ 5 37] [ 6 38] [ 7 39] [ 16 48] [ 17 49] [ 18 50] [ 19 51] [ 20 52] [ 21 53] [ 22 54] [ 23 55] [ 8 40] [ 9 41] [ 10 42] [ 11 43] [ 12 44] [ 13 45] [ 14 46] [ 15 47] [ 24 56] [ 25 57] [ 26 58] [ 27 59] [ 28 60] [ 29 61] [ 30 62] [ 31 63]
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat: compatible
    #1: NVIDIA GeForce RTX 2080 Ti, compute cap.: 7.5, ECC: no, stat: compatible
...
Using 16 MPI threads
Using 4 OpenMP threads per tMPI thread
On host pcpharm018 2 GPUs selected for this run.
Mapping of GPU IDs to the 16 GPU tasks in the 16 ranks on this node:
  PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:0,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1,PP:1
PP tasks will do (non-perturbed) short-ranged and most bonded interactions on the GPU
Pinning threads with an auto-selected logical core stride of 1

The following command, on the other hand, seems to run without crashing, using 1 tMPI rank and 32 OpenMP threads on one GPU only:

gmx mdrun -deffnm run -gpu_id 01 -pin on -pinstride 1 -pinoffset 32 -ntmpi 1

The most efficient single run so far comes from:

gmx mdrun -deffnm run -gpu_id 0 -ntmpi 1 -ntomp 28

which gives 86 ns/day for a system of about 100K atoms (a 1000-residue protein with membrane and water).
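Following Kevin's advice about scanning thread-MPI rank / OpenMP thread combinations, I plan to pick these settings with a small benchmarking script rather than by hand. A rough, untested sketch (the tpr name and the short -nsteps/-resethway timing trick are just my assumptions for getting a quick comparison):

#!/bin/bash
# Untested sketch: short benchmark runs on GPU 0, scanning rank/thread decompositions
# that all use 28 threads in total; ns/day is read from the "Performance:" line of each log.
total=28
for ntmpi in 1 2 4; do
    ntomp=$(( total / ntmpi ))
    gmx mdrun -s run.tpr -deffnm bench_${ntmpi}x${ntomp} -gpu_id 0 -pin on \
        -ntmpi ${ntmpi} -ntomp ${ntomp} -nsteps 20000 -resethway -noconfout
done
grep -H "Performance:" bench_*x*.log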
I also tried to run two independent simulations. With the following commands the system again crashes:

gmx mdrun -deffnm run1 -gpu_id 1 -pin on -pinstride 1 -pinoffset 32 -ntomp 32 -ntmpi 1
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -pinstride 1 -pinoffset 0 -ntomp 32 -ntmpi 1

with the following log (the hardware detection section is identical to the one above):

Using 1 MPI thread
Using 32 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 32.

Two runs can be carried out with:

gmx mdrun -deffnm run1 -gpu_id 1 -pin on -pinstride 1 -pinoffset 14 -ntmpi 1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -pinstride 1 -pinoffset 0 -ntmpi 1 -ntomp 28

which log:

Using 1 MPI thread
Using 28 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Applying core pinning offset 14
Pinning threads with a user-specified logical core stride of 1

or with:

gmx mdrun -deffnm run1 -gpu_id 1 -pin on -ntmpi 1 -ntomp 28
gmx mdrun -deffnm run0 -gpu_id 0 -pin on -ntmpi 1 -ntomp 28

which log:

Using 1 MPI thread
Using 28 OpenMP threads
1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
  PP:1,PME:1
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
Pinning threads with an auto-selected logical core stride of 2

Somewhat disappointingly, in both cases performance degraded substantially: about 35-40 ns/day per run for the same system, with a GPU usage of 25-30% (versus 50-55% for the single run on a single GPU), and well below the power cap.
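(In case it helps interpret the utilization numbers above: an easy way to watch both cards while the runs are going is to poll nvidia-smi, for example

nvidia-smi --query-gpu=index,utilization.gpu,power.draw --format=csv -l 5

which prints utilization and power draw for each GPU every 5 seconds.)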
I hope this has not been too confusing; I will be grateful for any suggestions.
Thanks,
Stefano

On Fri, Jul 26, 2019 at 3:00 PM Kevin Boyd <kevin.b...@uconn.edu> wrote:

> Sure - you can do it 2 ways with normal Gromacs. Either run the simulations in separate terminals, or use ampersands to run them in the background of 1 terminal.
>
> I'll give a concrete example for your threadripper, using 32 of your cores, so that you could run some other computation on the other 32. I typically make a bash variable with all the common arguments.
>
> Given tprs run1.tpr ... run4.tpr
>
> gmx_common="gmx mdrun -ntomp 8 -ntmpi 1 -pme gpu -nb gpu -pin on -pinstride 1"
> $gmx_common -deffnm run1 -pinoffset 32 -gputasks 00 &
> $gmx_common -deffnm run2 -pinoffset 40 -gputasks 00 &
> $gmx_common -deffnm run3 -pinoffset 48 -gputasks 11 &
> $gmx_common -deffnm run4 -pinoffset 56 -gputasks 11
>
> So run1 will run on cores 32-39, on GPU 0, run2 on cores 40-47 on the same GPU, and the other 2 runs will use GPU 1. Note the ampersands on the first 3 runs, so they'll go off in the background.
>
> I should also have mentioned one peculiarity with running with -ntmpi 1 and -pme gpu, in that even though there's now only one rank (with nonbonded and PME both running on it), you still need 2 gpu tasks for that one rank, one for each type of interaction.
>
> As for multidir, I forget what troubles I ran into exactly, but I was unable to run some subset of simulations. Anyhow if you aren't running on a cluster, I see no reason to compile with MPI and have to use srun or slurm, and need to use gmx_mpi rather than gmx. The built-in thread-mpi gives you up to 64 threads, and can have a minor (<5% in my experience) performance benefit over MPI.
>
> Kevin
>
> On Fri, Jul 26, 2019 at 8:21 AM Gregory Man Kai Poon <gp...@gsu.edu> wrote:
>
> > Hi Kevin,
> > Thanks for your very useful post. Could you give a few command-line examples on how to start multiple runs at different times (e.g., allocate a subset of CPU/GPU to one run, and start another run later using another subset of yet-unallocated CPU/GPU)? Also, could you elaborate on the drawbacks of the MPI compilation that you hinted at?
> > Gregory
> >
> > From: Kevin Boyd <kevin.b...@uconn.edu>
> > Sent: Thursday, July 25, 2019 10:31 PM
> > To: gmx-us...@gromacs.org
> > Subject: Re: [gmx-users] simulation on 2 gpus
> >
> > Hi,
> >
> > I've done a lot of research/experimentation on this, so I can maybe get you started - if anyone has any questions about the essay to follow, feel free to email me personally, and I'll link it to the email thread if it ends up being pertinent.
> >
> > First, there are some more internet resources to check out. See Mark's talk at
> > https://bioexcel.eu/webinar-performance-tuning-and-optimization-of-gromacs/
> > Gromacs development moves fast, but a lot of it is still relevant.
> >
> > I'll expand a bit here, with the caveat that Gromacs GPU development is moving very fast and so the correct commands for optimal performance are both system-dependent and a moving target between versions. This is a good thing - GPUs have revolutionized the field, and with each iteration we make better use of them. The downside is that it's unclear exactly what sort of CPU-GPU balance you should look to purchase to take advantage of future developments, though the trend is certainly that more and more computation is being offloaded to the GPUs.
> >
> > The most important consideration is that to get maximum total throughput performance, you should be running not one but multiple simulations simultaneously.
> > You can do this through the -multidir option, but I don't recommend that in this case, as it requires compiling with MPI and limits some of your options. My run scripts usually use "gmx mdrun ... &" to initiate subprocesses, with combinations of -ntomp, -ntmpi, -pin, -pinoffset, and -gputasks. I can give specific examples if you're interested.
> >
> > Another important point is that you can run more simulations than the number of GPUs you have. Depending on CPU-GPU balance and quality, you won't double your throughput by e.g. putting 4 simulations on 2 GPUs, but you might increase it up to 1.5x. This would involve targeting the same GPU with -gputasks.
> >
> > Within a simulation, you should set up a benchmarking script to figure out the best combination of thread-mpi ranks and open-mp threads - this can have pretty drastic effects on performance. For example, if you want to use your entire machine for one simulation (not recommended for maximal efficiency), you have a lot of decomposition options (ignoring PME - which is important, see below):
> >
> > -ntmpi 2 -ntomp 32 -gputasks 01
> > -ntmpi 4 -ntomp 16 -gputasks 0011
> > -ntmpi 8 -ntomp 8 -gputasks 00001111
> > -ntmpi 16 -ntomp 4 -gputasks 0000000011111111
> > (and a few others - note that ntmpi * ntomp = total threads available)
> >
> > In my experience, you need to scan the options in a benchmarking script for each simulation size/content you want to simulate, and the difference between the best and the worst can be up to a factor of 2-4 in terms of performance. If you're splitting your machine among multiple simulations, I suggest running 1 mpi thread (-ntmpi 1) per simulation, unless your benchmarking suggests that the optimal performance lies elsewhere.
> >
> > Things get more complicated when you start putting PME on the GPUs. For the machines I work on, putting PME on GPUs absolutely improves performance, but I'm not fully confident in that assessment without testing your specific machine - you have a lot of cores with that threadripper, and this is another area where I expect Gromacs 2020 might shift the GPU-CPU optimal balance.
> >
> > The issue with PME on GPUs is that we can (currently) only have one rank doing GPU PME work. So, if we have a machine with say 20 cores and 2 gpus, if I run the following
> >
> > gmx mdrun .... -ntomp 10 -ntmpi 2 -pme gpu -npme 1 -gputasks 01
> >
> > two ranks will be started - one, with cores 0-9, will work on the short-range interactions, offloading where it can to GPU 0, and the PME rank (cores 10-19) will offload to GPU 1. There is one significant problem (and one minor problem) with this setup. First, it is massively inefficient in terms of load balance. In a typical system (there are exceptions), PME takes up ~1/3 of the computation that short-range interactions take. So, we are offloading 1/4 of our interactions to one GPU and 3/4 to the other, which leads to imbalance. In this specific case (2 GPUs and sufficient cores), the most optimal solution is often (but not always) to run with -ntmpi 4 (in this example, then -ntomp 5), as the PME rank then gets 1/4 of the GPU instructions, proportional to the computation needed.
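If I follow this correctly, on my 2990WX a more balanced single-run layout over both cards, keeping to the 28 threads I have been using, would presumably look something like the line below - untested on my side, and the -gputasks split is only my guess (two PP ranks on GPU 0, one PP rank plus the PME rank on GPU 1):

gmx mdrun -deffnm run -ntmpi 4 -ntomp 7 -npme 1 -nb gpu -pme gpu -gputasks 0011 -pin on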
> > The second (less critical - don't worry about this unless you're CPU-limited) problem is that PME-GPU mpi ranks only use 1 CPU core in their calculations. So, with a node of 20 cores and 2 GPUs, if I run a simulation with -ntmpi 4 -ntomp 5 -pme gpu -npme 1, each one of those ranks will have 5 CPUs, but the PME rank will only use one of them. You can specify the number of PME cores per rank with -ntomp_pme. This is useful in restricted cases. For example, given the above architecture setup (20 cores, 2 GPUs), I could maximally exploit my CPUs with the following commands:
> >
> > gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks 0000 -pin on -pinoffset 0 &
> > gmx mdrun .... -ntmpi 4 -ntomp 3 -ntomp_pme 1 -pme gpu -npme 1 -gputasks 1111 -pin on -pinoffset 10
> >
> > where the 1st 10 cores are (0-2 - PP) (3-5 - PP) (6-8 - PP) (9 - PME), and similar for the other 10 cores.
> >
> > There are a few other parameters to scan for minor improvements - for example nstlist, which I typically scan in a range between 80-140 for GPU simulations, with an effect between 2-5% on performance.
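As a side note from my end: this nstlist scan should be easy to bolt onto the benchmark loop further up. A rough, untested sketch using mdrun's -nstlist override:

for nst in 80 100 120 140; do
    gmx mdrun -s run.tpr -deffnm bench_nstlist${nst} -gpu_id 0 -ntmpi 1 -ntomp 28 \
        -nstlist ${nst} -pin on -nsteps 20000 -resethway -noconfout
done
grep -H "Performance:" bench_nstlist*.log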
> > I'm happy to expand the discussion with anyone who's interested.
> >
> > Kevin
> >
> > On Thu, Jul 25, 2019 at 1:47 PM Stefano Guglielmo <stefano.guglie...@unito.it> wrote:
> >
> > > Dear all,
> > > I am trying to run simulations with Gromacs 2019.2 on a workstation with an AMD Threadripper CPU (32 cores, 64 threads), 128 GB RAM and two RTX 2080 Ti cards with an NVLink bridge. I read the user's guide section on performance and I am exploring some possible combinations of CPU/GPU work to run as fast as possible. I was wondering if some of you have experience of running on more than one GPU with several cores and can give some hints as a starting point.
> > > Thanks
> > > Stefano
> > >
> > > --
> > > Stefano GUGLIELMO PhD
> > > Assistant Professor of Medicinal Chemistry
> > > Department of Drug Science and Technology
> > > Via P. Giuria 9
> > > 10125 Turin, ITALY
> > > ph. +39 (0)11 6707178

--
Stefano GUGLIELMO PhD
Assistant Professor of Medicinal Chemistry
Department of Drug Science and Technology
Via P. Giuria 9
10125 Turin, ITALY
ph. +39 (0)11 6707178

--
Gromacs Users mailing list

* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a mail to gmx-users-requ...@gromacs.org.