Hi,

The SIMD-accelerated RB dihedrals were implemented a few days ago, and
since the change turned out to be a relatively minor addition, we
accepted it for the 5.0 series, and it even made it into today's
release!

Expect a considerable performance improvement in GPU-accelerated
simulations of:
* systems that contain a large number of RB dihedrals;
* inhomogeneous systems that contain some RB dihedrals when running in
  parallel (due to the decreased load imbalance).
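For the curious: the per-dihedral work in question is just a
fifth-order polynomial in cos(psi), V(phi) = sum_{n=0..5} C_n *
cos(psi)^n with psi = phi - 180 deg. A minimal scalar sketch of that
evaluation (illustrative only; this is not the actual GROMACS kernel
code, and the function name is made up):

    #include <math.h>

    /* Ryckaert-Bellemans energy of one dihedral with angle phi (in
     * radians) and coefficients c[0..5]. */
    double rb_energy(double phi, const double c[6])
    {
        double cos_psi = cos(phi - M_PI); /* psi = phi - 180 deg */
        double v       = c[5];
        for (int n = 4; n >= 0; n--)
        {
            v = v*cos_psi + c[n]; /* Horner step */
        }
        return v;
    }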
Cheers,
--
Szilárd


On Thu, Sep 18, 2014 at 10:28 AM, Michael Brunsteiner <mbx0...@yahoo.com> wrote:
>
> Dear Szilard,
> thanks for your reply!
> one more question ... you wrote that SIMD-optimized RB dihedrals
> might get implemented soon ... is there perhaps a link on
> gerrit.gromacs.org that I can use to follow the progress there?
> cheers
> michael
>
> From: Szilárd Páll <pall.szil...@gmail.com>
> To: Michael Brunsteiner <mbx0...@yahoo.com>
> Cc: Discussion list for GROMACS users <gmx-us...@gromacs.org>;
>     "gromacs.org_gmx-users@maillist.sys.kth.se" <gromacs.org_gmx-users@maillist.sys.kth.se>
> Sent: Wednesday, September 17, 2014 4:18 PM
> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>
> Dear Michael,
>
> I checked and indeed, the Ryckaert-Bellemans dihedrals are not SIMD
> accelerated - that's why they are quite slow. While your CPU is the
> bottleneck, and you're quite right that the PP-PME balancing can't do
> much about this kind of imbalance, the good news is that it can be
> faster - even without a new CPU.
>
> With SIMD this will accelerate quite well and will likely cut down
> your bonded time by a lot (I'd guess at least 3-4x with AVX, maybe
> more with FMA). This code has not been SIMD-optimized yet, mostly
> because in typical runs the RB computation takes relatively little
> time, and additionally the way these kernels need to be
> written/rewritten for SIMD acceleration is not very
> developer-friendly. However, it will likely get implemented soon,
> which in your case will bring big improvements.
>
> Cheers,
> --
> Szilárd
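To illustrate what "the way these kernels need to be written/rewritten
for SIMD acceleration" means in practice: the per-dihedral loop is
restructured so that a whole batch of dihedrals is processed in
lockstep, one per SIMD lane. A schematic sketch only (the real kernels
go through GROMACS's SIMD abstraction layer rather than plain C, and
the names here are made up):

    #define BATCH 4  /* e.g. 4 doubles per AVX register */

    /* Energy of BATCH dihedrals at once. cos_psi[] and v[] are laid
     * out struct-of-arrays, so each pass over the inner loop below
     * corresponds to a single SIMD fused multiply-add acting on all
     * BATCH dihedrals simultaneously. */
    void rb_energy_batch(const double cos_psi[BATCH], const double c[6],
                         double v[BATCH])
    {
        for (int s = 0; s < BATCH; s++)
        {
            v[s] = c[5];                  /* broadcast c[5] to all lanes */
        }
        for (int n = 4; n >= 0; n--)      /* Horner, highest order first */
        {
            for (int s = 0; s < BATCH; s++)
            {
                v[s] = v[s]*cos_psi[s] + c[n]; /* one packed FMA */
            }
        }
    }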
> On Wed, Sep 17, 2014 at 3:01 PM, Michael Brunsteiner <mbx0...@yahoo.com> wrote:
>>
>> Dear Szilard,
>> yes it seems I just should have done a bit more research regarding
>> the optimal CPU/GPU combination ... and as you point out, the bonded
>> interactions are the culprits ... most often people probably
>> simulate aqueous systems, in which LINCS does most of this work;
>> here I have a polymer glass ... a different story ...
>> the flops table you're missing was in my previous mail (see below
>> for another copy), and indeed it tells me that 65% of the CPU load
>> is "Force" while only 15.5% is for PME mesh, and I assume only the
>> latter is what can be modified by dynamic load balancing ... I
>> assume this means there is no way to improve things ... I guess I
>> just have to live with the fact that for this type of system my slow
>> CPU is the bottleneck ... if you have any other ideas please let me
>> know...
>> regards
>> mic
>>
>> Computing:          Num   Num      Call    Wall time   Giga-Cycles
>>                     Ranks Threads  Count      (s)      total sum    %
>> -----------------------------------------------------------------------------
>> Neighbor search       1    12       251       0.574       23.403   2.1
>> Launch GPU ops.       1    12     10001       0.627       25.569   2.3
>> Force                 1    12     10001      17.392      709.604  64.5
>> PME mesh              1    12     10001       4.172      170.234  15.5
>> Wait GPU local        1    12     10001       0.206        8.401   0.8
>> NB X/F buffer ops.    1    12     19751       0.239        9.736   0.9
>> Write traj.           1    12        11       0.381       15.554   1.4
>> Update                1    12     10001       0.303       12.365   1.1
>> Constraints           1    12     10001       1.458       59.489   5.4
>> Rest                                          1.621       66.139   6.0
>> -----------------------------------------------------------------------------
>> Total                                        26.973     1100.493 100.0
>>
>> On Tue, 9/16/14, Szilárd Páll <pall.szil...@gmail.com> wrote:
>>
>> Subject: Re: [gmx-users] GPU waits for CPU, any remedies?
>> To: "Michael Brunsteiner" <mbx0...@yahoo.com>
>> Cc: "Discussion list for GROMACS users" <gmx-us...@gromacs.org>,
>>     "gromacs.org_gmx-users@maillist.sys.kth.se" <gromacs.org_gmx-users@maillist.sys.kth.se>
>> Date: Tuesday, September 16, 2014, 6:52 PM
>>
>> Well, it looks like you are i) unlucky and ii) limited by the huge
>> bonded workload.
>>
>> i) As your system is quite small, mdrun thinks that there are no
>> convenient grids between 32x32x32 and 28x28x28 (see the PP-PME
>> tuning output). As the latter corresponds to quite a big jump in
>> cut-off (from 1.296 to 1.482 nm), which more than doubles the
>> non-bonded workload and is slower than the former, mdrun sticks to
>> using 1.296 nm as Coulomb cut-off. You may be able to gain some
>> performance by tweaking your fourier grid spacing a bit to help
>> mdrun generate some additional grids that could give more cut-off
>> settings in the 1.3-1.48 nm range. However, on second thought,
>> there aren't more convenient grid sizes between 28 and 32, I guess.
>>
>> ii) The primary issue is, however, that your bonded workload is much
>> higher than it normally is. I'm not fully familiar with the
>> implementation, but I think this may be due to the RB term, which is
>> quite slow. This time it's the flops table that could confirm this,
>> but as you still have not shared the entire log file, we/I can't
>> tell.
>>
>> Cheers,
>> --
>> Szilárd
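P.S. For anyone finding this thread in the archives: the grid-spacing
tweak suggested in the quoted mail above amounts to a one-line change
in the .mdp file, along these lines (the numbers are illustrative, not
a tested recommendation for this system):

    ; illustrative .mdp fragment
    rcoulomb        = 1.0    ; base Coulomb cut-off; the PP-PME tuner
                             ; scales it together with the grid
    fourierspacing  = 0.125  ; slightly coarser than the 0.12 nm
                             ; default, giving the tuner a different
                             ; set of base grids to choose from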