Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-19 Thread victor doss
The paper looks good. Do some more work and publish more.


Sent from my iPhone

On 17-Jan-2013, at 8:18 PM, "James Starlight"  wrote:


RE: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-17 Thread Berk Hess

Hi,

Please use the fix I put on the redmine issue, as that's even faster and you 
can use sd again.

We should probably rephrase the note a bit for the case where the GPU has more work
to do than the CPU.
In your case there is simply no work left for the CPU.
Ideally we would let the CPU handle some of the non-bonded work, but that probably 
won't happen in 4.6.

A solution might be buying a GPU that is about 3x as fast.

Cheers,

Berk



Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-17 Thread James Starlight
Dear Gromacs Developers!

Using the sd1 integrator I've obtained good performance with the Core i5 +
GTX 670 (13 ns/day) for a system of 60k atoms. That is about
30% better than with the sd integrator.

But on my other workstation, which differs only by a slower GPU (GT
640), I've observed a GPU/CPU mismatch:

 Force evaluation time GPU/CPU: 6.835 ms/2.026 ms = 3.373 (on
the first station with the GTX 670 I obtained a GPU/CPU ratio close to
1).

In both cases I'm using the same simulation parameters with 0.8 nm
cutoffs (it's also important that in the second case I simulated
another system, consisting of 33k atoms, by means of umbrella-sampling
pulling). Could you tell me how I could increase performance on my
second station (i.e. reduce the GPU/CPU ratio)? I've attached the log for that
simulation here: http://www.sendspace.com/file/x0e3z8

James
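The imbalance James quotes can be sanity-checked directly from the two numbers mdrun prints. A quick back-of-the-envelope script (the inputs are taken from the log line quoted above; nothing GROMACS-specific is assumed):

```python
# Rough check of the GPU/CPU balance mdrun reports, using the two numbers
# from the log line quoted above ("Force evaluation time GPU/CPU:
# 6.835 ms/2.026 ms = 3.373").
gpu_ms = 6.835  # average GPU force time per step (from the log)
cpu_ms = 2.026  # average CPU force time per step (from the log)

ratio = gpu_ms / cpu_ms
print(f"GPU/CPU ratio: {ratio:.2f}")  # ~3.37: the GPU has ~3.4x the CPU's work

# With the Verlet scheme the CPU waits for the GPU result each step, so the
# fraction of the force time the CPU spends idle is roughly:
cpu_idle = (gpu_ms - cpu_ms) / gpu_ms
print(f"CPU idle fraction: {cpu_idle:.2f}")  # ~0.70
```

A ratio close to 1 (as on the GTX 670 machine) means CPU and GPU finish their force work at about the same time; 3.4 means the GT 640 is the bottleneck and most of the CPU's force-time is spent waiting.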


Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-16 Thread Szilárd Páll
Hi,

Just to note for the users who might read this: the report is valid, some
non-thread-parallel code is the reason, and we hope to have a fix in 4.6.0.

For updates, follow issue #1211.

Cheers,

--
Szilárd



RE: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-16 Thread Berk Hess

The issue I'm referring to is about a factor of 2 in update and constraints, 
but here it's much more.
I just found out that the SD update is not OpenMP threaded (and I even noted in 
the code why this is).
I reopened the issue and will find a solution.

Cheers.

Berk
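For readers unfamiliar with what "not OpenMP threaded" means here: the SD update is a per-atom loop whose iterations are independent, so it can in principle be split across threads the way OpenMP's parallel-for construct splits loops in the C code. The sketch below is purely illustrative (Python, not GROMACS source); the real complication Berk alludes to is that the SD update also draws random numbers, which need separate per-thread streams to stay correct and reproducible.

```python
# Illustrative only -- not GROMACS code. A per-atom "kick-drift" update with
# independent iterations, split across threads the way a
# "#pragma omp parallel for" would split the loop in C.
from concurrent.futures import ThreadPoolExecutor

def update_chunk(x, v, f, start, stop, dt=0.002, invmass=1.0):
    # Update atoms [start, stop); no iteration touches another atom's data,
    # which is what makes the loop safe to parallelize.
    for i in range(start, stop):
        v[i] += f[i] * invmass * dt
        x[i] += v[i] * dt

def threaded_update(x, v, f, nthreads=4):
    n = len(x)
    bounds = [(k * n // nthreads, (k + 1) * n // nthreads)
              for k in range(nthreads)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = [pool.submit(update_chunk, x, v, f, lo, hi)
                   for lo, hi in bounds]
        for fut in futures:
            fut.result()  # wait and propagate any exception from a worker

x = [0.0] * 1000
v = [0.0] * 1000
f = [1.0] * 1000
threaded_update(x, v, f)
```

In CPython the GIL limits the actual speed-up of pure-Python threads; the point is only the loop decomposition, which in the C code recovers the factor Berk mentions.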



Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-16 Thread Mark Abraham
We should probably note this effect on the wiki somewhere?

Mark


RE: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-16 Thread Berk Hess

Hi,

Unfortunately this is not a bug, but a feature!
We made the non-bondeds so fast on the GPU that integration and constraints 
take more time.
The sd1 integrator is almost as fast as the md integrator, but slightly less 
accurate.
In most cases that's a good solution.

I closed the redmine issue:
http://redmine.gromacs.org/issues/1121

Cheers,

Berk
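For users who want to follow this advice, switching integrator is a one-line change in the .mdp file. A minimal sketch (the values are placeholders, not recommendations; check the 4.6 manual for your system):

```
; sd/sd1 integrate and thermostat in one step, so no separate tcoupl is set
integrator = sd1     ; cheaper SD variant mentioned above
tc-grps    = System
tau-t      = 1.0     ; inverse friction constant for sd/sd1 (ps)
ref-t      = 300
```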



Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-16 Thread James Starlight
Hi all!

I've also done some calculations with the SD integrator used as the
thermostat (without t_coupl). With a system of 65k atoms I obtained
10 ns/day on a GTX 670 and a quad-core i5.
I haven't run any simulations with the md integrator yet, so I should test that.

James


Re: [gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-15 Thread Szilárd Páll
Hi Floris,

Great feedback, this needs to be looked into. Could you please file a bug
report, preferably with a tpr (and/or all inputs) as well as log files.

Thanks,

--
Szilárd


On Tue, Jan 15, 2013 at 3:50 AM, Floris Buelens wrote:

> Hi,
>
>
> I'm seeing MD simulations running a lot slower with the sd integrator than
> with md - ca. 10 vs. 30 ns/day for my 47000-atom system. I found no
> documented indication that this should be the case.
> Timings and logs are pasted below - wall time seems to be accumulating
> in Update and Rest, adding up to >60% of the total. The effect is still there
> without GPU: ca. 40% slowdown when switching from group to Verlet with the
> SD integrator.
> System: Xeon E5-1620, 1x GTX 680, GROMACS
> 4.6-beta3-dev-20130107-e66851a-unknown, GCC 4.4.6 and 4.7.0
>
> I didn't file a bug report yet as I don't have much variety of testing
> conditions available right now; I hope someone else has a moment to try to
> reproduce?
>
> Timings:
>
> cpu (ns/day)
> sd / verlet: 6
> sd / group: 10
> md / verlet: 9.2
> md / group: 11.4
>
> gpu (ns/day)
> sd / verlet: 11
> md / verlet: 29.8
>
>
>
> **MD integrator, GPU / verlet
>
> M E G A - F L O P S   A C C O U N T I N G
>
>  NB=Group-cutoff nonbonded kernelsNxN=N-by-N cluster Verlet kernels
>  RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
>  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
>  V&F=Potential and force  V=Potential only  F=Force only
>
>  Computing:                               M-Number         M-Flops  % Flops
> -
>  Pair Search distance check            1244.988096       11204.893     0.1
>  NxN QSTab Elec. + VdW [F]           194846.615488     7988711.235    91.9
>  NxN QSTab Elec. + VdW [V&F]           2009.923008      118585.457     1.4
>  1,4 nonbonded interactions              31.616322        2845.469     0.0
>  Calc Weights                           703.010574       25308.381     0.3
>  Spread Q Bspline                     14997.558912       29995.118     0.3
>  Gather F Bspline                     14997.558912       89985.353     1.0
>  3D-FFT                               47658.567884      381268.543     4.4
>  Solve PME                               20.580896        1317.177     0.0
>  Shift-X                                  9.418458          56.511     0.0
>  Angles                                  21.879375        3675.735     0.0
>  Propers                                 48.599718       11129.335     0.1
>  Virial                                  23.498403         422.971     0.0
>  Stop-CM                                  2.436616          24.366     0.0
>  Calc-Ekin                               93.809716        2532.862     0.0
>  Lincs                                   12.147284         728.837     0.0
>  Lincs-Mat                              131.328750         525.315     0.0
>  Constraint-V                           246.633614        1973.069     0.0
>  Constraint-Vir                          23.486379         563.673     0.0
>  Settle                                  74.129451       23943.813     0.3
> -
>  Total                                                 8694798.114   100.0
> -
>
>
>  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
> -
>  Neighbor search        1    8        201       0.944       27.206     3.3
>  Launch GPU ops.        1    8       5001       0.371       10.690     1.3
>  Force                  1    8       5001       2.185       62.987     7.7
>  PME mesh               1    8       5001      15.033      433.441    52.9
>  Wait GPU local         1    8       5001       1.551       44.719     5.5
>  NB X/F buffer ops.     1    8       9801       0.538       15.499     1.9
>  Write traj.            1    8          2       0.725       20.912     2.6
>  Update                 1    8       5001       2.318       66.826     8.2
>  Constraints            1    8       5001       2.898       83.551    10.2
>  Rest                   1                       1.832       52.828     6.5
> -
>  Total                  1                      28.394      818.659   100.0
> -
> -
>  PME spread/gather      1    8      10002       8.745      252.144    30.8
>  PME 3D-FFT             1    8      10002       5.392      155.458    19.0
>  PME solve              1    8       5001       0.869       25.069     3.1
>
> 

[gmx-users] >60% slowdown with GPU / verlet and sd integrator

2013-01-15 Thread Floris Buelens
Hi,


I'm seeing MD simulations running a lot slower with the sd integrator than
with md: ca. 10 vs. 30 ns/day for my 47000-atom system. I found no documented
indication that this should be the case.
Timings and logs are pasted in below; wall time seems to be accumulating in
Update and Rest, adding up to >60% of the total. The effect is still there
without GPU: ca. 40% slowdown when switching from group to Verlet with the
SD integrator.
System: Xeon E5-1620, 1x GTX 680, gromacs 
4.6-beta3-dev-20130107-e66851a-unknown, GCC 4.4.6 and 4.7.0

I didn't file a bug report yet as I don't have much variety of testing
conditions available right now; I hope someone else has a moment to try to
reproduce.
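
In case it helps anyone trying to reproduce: the two knobs being varied here
are both plain .mdp settings (a minimal sketch only, with all other
parameters held constant; note that cutoff-scheme only exists from GROMACS
4.6 onward):

```
integrator    = sd        ; or: md
cutoff-scheme = Verlet    ; or: group
```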

Timings: 

cpu (ns/day)
sd / verlet: 6
sd / group: 10
md / verlet: 9.2
md / group: 11.4

gpu (ns/day)
sd / verlet: 11
md / verlet: 29.8
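
For reference, the slowdowns implied by the ns/day figures above can be
checked with a few lines of Python; the GPU md-to-sd drop is what the ">60%"
in the subject line refers to:

```python
# Percent slowdown when going from the faster to the slower configuration,
# using the ns/day figures quoted above.
def slowdown_pct(slow_nsday, fast_nsday):
    return 100.0 * (fast_nsday - slow_nsday) / fast_nsday

print(f"gpu, verlet, md -> sd:    {slowdown_pct(11, 29.8):.0f}%")  # >60%, as in the subject
print(f"cpu, verlet, md -> sd:    {slowdown_pct(6, 9.2):.0f}%")
print(f"cpu, sd, group -> verlet: {slowdown_pct(6, 10):.0f}%")     # the "ca. 40%" above
```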



**MD integrator, GPU / verlet

M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                               M-Number         M-Flops  % Flops
-
 Pair Search distance check            1244.988096       11204.893     0.1
 NxN QSTab Elec. + VdW [F]           194846.615488     7988711.235    91.9
 NxN QSTab Elec. + VdW [V&F]           2009.923008      118585.457     1.4
 1,4 nonbonded interactions              31.616322        2845.469     0.0
 Calc Weights                           703.010574       25308.381     0.3
 Spread Q Bspline                     14997.558912       29995.118     0.3
 Gather F Bspline                     14997.558912       89985.353     1.0
 3D-FFT                               47658.567884      381268.543     4.4
 Solve PME                               20.580896        1317.177     0.0
 Shift-X                                  9.418458          56.511     0.0
 Angles                                  21.879375        3675.735     0.0
 Propers                                 48.599718       11129.335     0.1
 Virial                                  23.498403         422.971     0.0
 Stop-CM                                  2.436616          24.366     0.0
 Calc-Ekin                               93.809716        2532.862     0.0
 Lincs                                   12.147284         728.837     0.0
 Lincs-Mat                              131.328750         525.315     0.0
 Constraint-V                           246.633614        1973.069     0.0
 Constraint-Vir                          23.486379         563.673     0.0
 Settle                                  74.129451       23943.813     0.3
-
 Total                                                 8694798.114   100.0
-


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

 Computing:         Nodes   Th.     Count  Wall t (s)     G-Cycles       %
-
 Neighbor search        1    8        201       0.944       27.206     3.3
 Launch GPU ops.        1    8       5001       0.371       10.690     1.3
 Force                  1    8       5001       2.185       62.987     7.7
 PME mesh               1    8       5001      15.033      433.441    52.9
 Wait GPU local         1    8       5001       1.551       44.719     5.5
 NB X/F buffer ops.     1    8       9801       0.538       15.499     1.9
 Write traj.            1    8          2       0.725       20.912     2.6
 Update                 1    8       5001       2.318       66.826     8.2
 Constraints            1    8       5001       2.898       83.551    10.2
 Rest                   1                       1.832       52.828     6.5
-
 Total                  1                      28.394      818.659   100.0
-
-
 PME spread/gather      1    8      10002       8.745      252.144    30.8
 PME 3D-FFT             1    8      10002       5.392      155.458    19.0
 PME solve              1    8       5001       0.869       25.069     3.1
-

 GPU timings
-
 Computing:                         Count  Wall t (s)      ms/step       %
-
 Pair list H2D                        201       0.080        0.397     0.4
 X / q H2D                           5001