Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
The paper looks good. Do some more work and publish.

Sent from my iPhone

On 17-Jan-2013, at 8:18 PM, "James Starlight" wrote:
> Could you tell me how I could increase performance on my second
> station (i.e. reduce the GPU/CPU ratio)? I've attached the log for
> that simulation here: http://www.sendspace.com/file/x0e3z8
RE: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
Hi,

Please use the fix I put on the redmine issue, as that's even faster and you can use sd again.

We should probably rephrase the note a bit for the case where the GPU has more work to do than the CPU. In your case there is simply no work left for the CPU. Ideally we would let the CPU handle some of the non-bonded work, but that probably won't happen in 4.6. A solution might be buying a 3x as fast GPU.

Cheers,

Berk

> Date: Thu, 17 Jan 2013 18:48:17 +0400
> Subject: Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
> From: jmsstarli...@gmail.com
> To: gmx-users@gromacs.org
>
> Force evaluation time GPU/CPU: 6.835 ms/2.026 ms = 3.373 (on the
> first station with the GTX 670 I obtained a GPU/CPU ratio close to 1).
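For reference, the imbalance Berk is describing can be read off the "Force evaluation time GPU/CPU" line that mdrun prints. A minimal sketch of the arithmetic (Python; the numbers are the ones James quotes, the helper name is ours, not a GROMACS API):

```python
# Estimate the GPU/CPU force-task balance from the per-step times
# that mdrun reports as "Force evaluation time GPU/CPU".
def balance(gpu_ms, cpu_ms):
    """Return (ratio, idle_fraction): how much longer the GPU force
    task takes than the CPU one, and roughly what fraction of the
    force phase the CPU spends waiting for the GPU result."""
    ratio = gpu_ms / cpu_ms
    idle = (gpu_ms - cpu_ms) / gpu_ms  # CPU finishes early, then waits
    return ratio, idle

# GT 640 workstation numbers from the log quoted above:
r, idle = balance(6.835, 2.026)
print(f"ratio = {r:.3f}, CPU idle ~{idle:.0%} of the force phase")
```

A ratio near 1 means the CPU (PME, bondeds) and GPU (non-bondeds) finish their per-step force work at about the same time; at ~3.4 the CPU idles for most of the force phase, which is why shifting work or a faster GPU is suggested.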
Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
Dear Gromacs Developers!

Using the sd1 integrator I've obtained good performance with the Core i5 + GTX 670 (13 ns/day) for a system of 60k atoms. That is about 30% better than with the sd integrator.

But on my other workstation, which differs only in having a slower GPU (GT 640), I see a GPU/CPU mismatch:

Force evaluation time GPU/CPU: 6.835 ms/2.026 ms = 3.373

(On the first station with the GTX 670 I obtained a GPU/CPU ratio close to 1.)

In both cases I'm using the same simulation parameters with 0.8 nm cutoffs (note that in the second case I simulated a different system of 33k atoms with umbrella-sampling pulling). Could you tell me how I could increase performance on my second station (i.e. reduce the GPU/CPU ratio)? I've attached the log for that simulation here: http://www.sendspace.com/file/x0e3z8

James
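For anyone reproducing the comparison, the md-vs-sd difference is a one-line change in the .mdp file. A minimal sketch (the thermostat settings and group name are illustrative placeholders, not taken from James's run):

```
; Langevin (stochastic) dynamics: the integrator doubles as the
; thermostat, so no separate tcoupl is needed.
integrator    = sd       ; sd1 is cheaper but slightly less accurate

; Leap-frog MD with an explicit thermostat instead:
; integrator  = md
; tcoupl      = v-rescale
; tc_grps     = System
; tau_t       = 0.1
; ref_t       = 300

cutoff-scheme = Verlet   ; required for the GPU non-bonded path in 4.6
```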
Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
Hi,

Just to note for the users who might read this: the report is valid, some non-thread-parallel code is the reason, and we hope to have a fix for 4.6.0.

For updates, follow issue #1121.

Cheers,

--
Szilárd
RE: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
The issue I'm referring to is about a factor of 2 in update and constraints, but here it's much more.

I just found out that the SD update is not OpenMP threaded (and I had even noted in the code why this is). I reopened the issue and will find a solution.

Cheers,

Berk
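Berk's factor-of-2 point generalizes via Amdahl's law: a serial (non-OpenMP-threaded) update occupies a growing share of wall time as the rest of the step is threaded and GPU-accelerated. A back-of-the-envelope sketch with illustrative numbers (not measurements from this thread):

```python
# Amdahl's law: a fraction s of the step is serial (e.g. a
# non-threaded SD update); the rest parallelizes over n threads.
def amdahl_speedup(s, n):
    """Overall speedup versus a single thread."""
    return 1.0 / (s + (1.0 - s) / n)

def serial_share(s, n):
    """Share of wall time the serial part occupies on n threads."""
    return s / (s + (1.0 - s) / n)

# An update costing 5% of single-core step time balloons on 8 threads:
print(f"speedup: {amdahl_speedup(0.05, 8):.2f}x")
print(f"serial share: {serial_share(0.05, 8):.0%}")
```

With these assumed numbers a serial update grows from 5% to roughly 30% of the step, the same pattern as in Floris's profile, where Update stands out once everything else is fast.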
Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
We should probably note this effect on the wiki somewhere?

Mark

On Wed, Jan 16, 2013 at 3:44 PM, Berk Hess wrote:
> Unfortunately this is not a bug, but a feature! We made the
> non-bondeds so fast on the GPU that integration and constraints take
> more time.
RE: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
Hi,

Unfortunately this is not a bug, but a feature!

We made the non-bondeds so fast on the GPU that integration and constraints take more time. The sd1 integrator is almost as fast as the md integrator, but slightly less accurate. In most cases that's a good solution.

I closed the redmine issue:
http://redmine.gromacs.org/issues/1121

Cheers,

Berk
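The ">60%" in the subject line can be checked directly against the ns/day figures in Floris Buelens's original post. A quick sketch of the arithmetic (Python):

```python
# ns/day reported by Floris for the 47000-atom system.
timings = {
    ("gpu", "verlet"): {"sd": 11.0, "md": 29.8},
    ("cpu", "verlet"): {"sd": 6.0,  "md": 9.2},
    ("cpu", "group"):  {"sd": 10.0, "md": 11.4},
}
for setup, t in timings.items():
    slowdown = 1.0 - t["sd"] / t["md"]  # fraction lost by using sd
    print(setup, f"sd is {slowdown:.0%} slower than md")
```

That gives ~63% on GPU/Verlet (the headline number), ~35% on CPU/Verlet, and only ~12% on CPU/group, consistent with the serial update being hidden behind slower non-bondeds in the group-scheme runs.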
Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
Hi all!

I've also done some calculations with the SD integrator used as the thermostat (without tcoupl). With a system of 65k atoms I obtained 10 ns/day on a GTX 670 and a 4-core i5. I haven't run any simulations with the md integrator yet, so I should test that.

James
Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator
Hi Floris,

Great feedback, this needs to be looked into. Could you please file a bug report, preferably with a tpr (and/or all inputs) as well as log files.

Thanks,

--
Szilárd
[gmx-users] >60% slowdown with GPU / verlet and sd integrator
Hi,

I'm seeing MD simulation running a lot slower with the sd integrator than with md - ca. 10 vs. 30 ns/day for my 47000-atom system. I found no documented indication that this should be the case.

Timings and logs are pasted in below - wall time seems to be accumulating in Update and Rest, adding up to >60% of the total. The effect is still there without a GPU: ca. 40% slowdown when switching from group to Verlet with the SD integrator.

System: Xeon E5-1620, 1x GTX 680, gromacs 4.6-beta3-dev-20130107-e66851a-unknown, GCC 4.4.6 and 4.7.0

I didn't file a bug report yet as I don't have much variety of testing conditions available right now - I hope someone else has a moment to try to reproduce?

Timings:

cpu (ns/day)
sd / verlet: 6
sd / group:  10
md / verlet: 9.2
md / group:  11.4

gpu (ns/day)
sd / verlet: 11
md / verlet: 29.8

**MD integrator, GPU / verlet

 M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels   NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field   VdW=Van der Waals   QSTab=quadratic-spline table
 W3=SPC/TIP3p   W4=TIP4p (single or pairs)
 V&F=Potential and force   V=Potential only   F=Force only

 Computing:                       M-Number        M-Flops  % Flops
 -----------------------------------------------------------------
 Pair Search distance check    1244.988096      11204.893      0.1
 NxN QSTab Elec. + VdW [F]   194846.615488    7988711.235     91.9
 NxN QSTab Elec. + VdW [V&F]   2009.923008     118585.457      1.4
 1,4 nonbonded interactions      31.616322       2845.469      0.0
 Calc Weights                   703.010574      25308.381      0.3
 Spread Q Bspline             14997.558912      29995.118      0.3
 Gather F Bspline             14997.558912      89985.353      1.0
 3D-FFT                       47658.567884     381268.543      4.4
 Solve PME                       20.580896       1317.177      0.0
 Shift-X                          9.418458         56.511      0.0
 Angles                          21.879375       3675.735      0.0
 Propers                         48.599718      11129.335      0.1
 Virial                          23.498403        422.971      0.0
 Stop-CM                          2.436616         24.366      0.0
 Calc-Ekin                       93.809716       2532.862      0.0
 Lincs                           12.147284        728.837      0.0
 Lincs-Mat                      131.328750        525.315      0.0
 Constraint-V                   246.633614       1973.069      0.0
 Constraint-Vir                  23.486379        563.673      0.0
 Settle                          74.129451      23943.813      0.3
 -----------------------------------------------------------------
 Total                                        8694798.114    100.0
 -----------------------------------------------------------------

 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:          Nodes  Th.  Count  Wall t (s)  G-Cycles      %
 -----------------------------------------------------------------
 Neighbor search         1    8    201       0.944    27.206    3.3
 Launch GPU ops.         1    8   5001       0.371    10.690    1.3
 Force                   1    8   5001       2.185    62.987    7.7
 PME mesh                1    8   5001      15.033   433.441   52.9
 Wait GPU local          1    8   5001       1.551    44.719    5.5
 NB X/F buffer ops.      1    8   9801       0.538    15.499    1.9
 Write traj.             1    8      2       0.725    20.912    2.6
 Update                  1    8   5001       2.318    66.826    8.2
 Constraints             1    8   5001       2.898    83.551   10.2
 Rest                    1                   1.832    52.828    6.5
 -----------------------------------------------------------------
 Total                   1                  28.394   818.659  100.0
 -----------------------------------------------------------------
 PME spread/gather       1    8  10002       8.745   252.144   30.8
 PME 3D-FFT              1    8  10002       5.392   155.458   19.0
 PME solve               1    8   5001       0.869    25.069    3.1
 -----------------------------------------------------------------

 GPU timings
 -----------------------------------------------------------------
 Computing:            Count  Wall t (s)  ms/step      %
 -----------------------------------------------------------------
 Pair list H2D           201       0.080    0.397    0.4
 X / q H2D              5001
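Floris's "Update and Rest" observation comes straight out of the cycle-accounting table; for the md run shown the same categories are present but small. A sketch of the sum (Python; wall times copied from the table):

```python
# Wall times (s) from the MD-integrator cycle accounting in the post.
wall = {"Update": 2.318, "Constraints": 2.898, "Rest": 1.832}
total = 28.394  # total wall time (s)

share = sum(wall.values()) / total
print(f"Update+Constraints+Rest: {share:.1%} of wall time (md run)")
# For the sd runs Floris reports these categories growing to >60%.
```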