Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-04-02 Thread Mark Abraham
Hi,

On Thu, 31 Mar 2016 19:00 Michael Shirts  wrote:

> > One more question: why does the free energy code have to use its own
> kernel?


There are some aspects that are only used in (some kinds of) free-energy
calculations, e.g. soft-core options. So maybe Coulomb-only transitions have
an optimization available.


> I realize that I'm going to sound like an idiot here, but why can't one
> just tweak parameters outside of the kernel and then have the optimized
> kernel do the dynamics? I presume that one has to step outside of the
> kernel to do the replica exchanges, so why can't the code just use the
> optimized kernel to do the dynamics between exchanges?
>

I think that would make good sense in some cases, but I don't know whether
it's little enough work to be worth doing, in the context of likely being
able to do a proper job of SIMD kernels in the future. To do the best job
would require just-in-time compilation. Once a .tpr is read, much of the
mdrun logic is constant - we know it's PME with whatever coupling, etc.
Every conditional branch is very expensive for the CPU - Chris may not have
been able to see the calls to gmx_waste_time_here(), but there is one every
few lines in the free-energy kernel :-) Compiling such that the compiler
knows one branch will always be taken lets it eliminate the check and thus
the speed penalty. gcc has early-stage support for such things, but we
haven't tried to use it yet. The alternative is to make kernels for
something like all-against-all combinations of options, and that runs to
hundreds of kernels.
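
For illustration, here is a toy C++ sketch - not GROMACS code, all names are
made up - of why a per-pair runtime branch hurts and why a compile-time
specialization removes it:

    #include <cmath>
    #include <vector>

    // Generic form: the soft-core decision is re-tested for every pair.
    double genericLoop(const std::vector<double> &r2, bool softcore, double alpha)
    {
        double e = 0.0;
        for (double d2 : r2)
        {
            if (softcore)                          // branch inside the hot loop
                e += 1.0 / std::sqrt(d2 + alpha);
            else
                e += 1.0 / std::sqrt(d2);
        }
        return e;
    }

    // Specialized form: the decision is a template parameter, so the compiler
    // emits two branch-free loops and the caller picks one, once per step:
    //   e = softcore ? specializedLoop<true>(r2, alpha) : specializedLoop<false>(r2, 0.0);
    template<bool softcore>
    double specializedLoop(const std::vector<double> &r2, double alpha)
    {
        double e = 0.0;
        for (double d2 : r2)
        {
            e += 1.0 / std::sqrt(softcore ? d2 + alpha : d2);  // folded at compile time
        }
        return e;
    }

The real kernels are effectively the second form, generated for each supported
combination of options, which is why covering every free-energy option that way
runs to hundreds of kernels; JIT compilation would let the same specialization
happen after the .tpr is read rather than at build time.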

> The optimized energy code only codes very specific interactions:
> vanilla Coulomb and Lennard-Jones. If it coded the more general
> interactions (such as soft-core) it would be significantly slower.
>

Nitpick - each of the ~100 kernels targets a specific algorithmic
combination for the hardware selected at cmake time. There is a generic
kernel, but it probably runs about as badly as the free-energy one.

> We've come up with some ways to handle free energy calculations with
> optimized inner loops (I think we're talking about the same thing, Mark
> -- though we should coordinate to be sure), but that will take a
> little time.
>

I was talking about software engineering along the lines of the existing
kernels. Your work to use tables is also attractive (and orthogonal).

> One alternate possibility would be to create N replicas of the system,
> with slightly different parameters, but where all systems use the
> standard functional form, and then combine the results internally. For
> something like REST2, this could work, since one is only linearly
> scaling the interactions between two parts A and B of the system; so
> it would be possible to represent each system as a separate physical
> system. This could work fairly well for small numbers of replicas,
> though it might have some problems with large numbers of replicas. If
> each replica was an entire new system, it could make replica exchange
> quite slow, since calculating the energies of a given configuration
> with a different replica's energy function would be very costly.
>

In principle, we should be able to have multiple model physics available
where the coordinates already are, to do such a calculation with fairly low
overhead. But making that work reliably needs a lot more code modularity
than we have.

Mark


Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-04-02 Thread Christopher Neale
Dear Nicolas and Mark:

Indeed, I have spent the last 2 days testing the plumed Hrex code for gromacs
5.1.2. This is likely the way that I will go. I was looking for an alternate
route because I am cautious of bugs, and I thought that if I could do it easily
in tested gromacs plus a bit of code that I modified myself, then I could trust
it more than a plugin whose size is epic enough that I will never read all of
the new code. Whether plumed knows about something like cmap (etc. -- not to get
caught up in details here) takes time to test.

Thank you both for your suggestions; I will move forward with plumed.

Chris.



Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-04-02 Thread Nicolas Cheron
See: https://groups.google.com/forum/#!topic/plumed-users/eJ0xpnHPb_s and
https://github.com/GiovanniBussi/plumed2/tree/v2.2-hrex

Nicolas


Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-04-02 Thread Mark Abraham
Hi,

On Fri, Apr 1, 2016 at 11:39 PM Christopher Neale <
chris.ne...@alum.utoronto.ca> wrote:

> Dear developers:
>
> Is it possible with only a small amount of work to modify calc_delta() in
> src/programs/mdrun/repl_ex.cpp to add another case to re->type such that
> the program sends the coordinates to the alternate .tpr information and
> evaluates the energy completely?


This is approximately what the implementation of plumed does.

> This would allow for arbitrary exchanges, including bizarre things like
> exchanging the cutoff distance or whatever (not what I want to do -- I'm
> just emphasizing the generality of this approach). It will certainly cost
> some communication and compute cycles to do this, but, for example, I currently
> have a 12x slowdown using the free energy architecture while decoupling 5K
> atoms. So if I do an exchange attempt every 500 steps, we're still breaking
> even if this complete energy re-evaluation takes 5,500x longer than a
> regular integration step -- and I don't see how it could possibly take
> anywhere near that long assuming bMultiEx==FALSE.
>

One can also implement such schemes as a script that calls grompp and mdrun
-rerun before relaunching whatever simulation. Horrible, but probably
better.

> Possible problems that I can imagine are:
> (a) the functions for a complete re-evaluation don't exist (either the
> communication of coordinates or the energy evaluation)
>

The code's not modular enough for that, yet.

> (b) some issues with changes in temperature or pressure
>

Yes, such state is not well organised.

> (c) the coordinates that are required to do this evaluation no longer exist
> because the energy evaluation functions are coupled to the timestep, so we
> really need to pass them the coordinates of the previous timestep.
>

No, those are separate stages, because there are multiple parts of the code
computing forces.

> Any suggestions are really appreciated
>

Updating plumed for 5.1 is easily the path of least resistance, if it can
do the job.

Mark


Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-04-01 Thread Christopher Neale
Dear developers:

Is it possible with only a small amount of work to modify calc_delta() in 
src/programs/mdrun/repl_ex.cpp to add another case to re->type such that the 
program sends the coordinates to the alternate .tpr information and evaluates 
the energy completely? This would allow for arbitrary exchanges, including 
bizarre things like exchanging the cutoff distance or whatever (not what I want 
to do -- I'm just emphasizing the generality of this approach). It will 
certainly cost some communication and compute cycles to do this, but, for example, I
currently have a 12x slowdown using the free energy architecture while 
decoupling 5K atoms. So if I do an exchange attempt every 500 steps, we're 
still breaking even if this complete energy re-evaluation takes 5,500x longer 
than a regular integration step -- and I don't see how it could possibly take 
anywhere near that long assuming bMultiEx==FALSE. 
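
(To spell out the arithmetic, and the quantity meant by evaluating the energy
completely: the generic Hamiltonian-exchange acceptance test needs the cross
energies, roughly

    \Delta = \beta [ U_i(x_j) + U_j(x_i) - U_i(x_i) - U_j(x_j) ],

i.e. one extra potential-energy evaluation per replica, under the neighbouring
replica's parameters, per attempt. Over 500 steps the free-energy-kernel route
costs the equivalent of 500 x 12 = 6000 normal steps, while the re-evaluation
route costs 500 + X, so it breaks even at X = 5,500 normal steps.)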

Possible problems that I can imagine are:
(a) the functions for a complete re-evaluation don't exist (either the 
communication of coordinates or the energy evaluation)
(b) some issues with changes in temperature or pressure 
(c) the coordinates that are required to do this evaluation no longer exist 
because the energy evaluation functions are coupled to the timestep, so we 
really need to pass them the coordinates of the previous timestep.

Any suggestions are really appreciated.

Thank you,
Chris.



Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-03-31 Thread Michael Shirts
> One more question: why does the free energy code have to use its own kernel? 
> I realize that I'm going to sound like an idiot here, but why can't one just 
> tweak parameters outside of the kernel and then have the optimized kernel do 
> the dynamics? I presume that one has to step outside of the kernel to do the 
> replica exchanges, so why can't the code just use the optimized kernel to do 
> the dynamics between exchanges?

The optimized energy code only codes very specific interactions:
vanilla Coulomb and Lennard-Jones. If it coded the more general
interactions (such as soft-core) it would be significantly slower.

We've come up with some ways to handle free energy calculations with
optimized inner loops (I think we're talking about the same thing, Mark
-- though we should coordinate to be sure), but that will take a
little time.

One alternate possibility would be to create N replicas of the system,
with slightly different parameters, but where all systems use the
standard functional form, and then combine the results internally. For
something like REST2, this could work, since one is only linearly
scaling the interactions between two parts A and B of the system; so
it would be possible to represent each system as a separate physical
system. This could work fairly well for small numbers of replicas,
though it might have some problems with large numbers of replicas. If
each replica was an entire new system, it could make replica exchange
quite slow, since calculating the energies of a given configuration
with a different replica's energy function would be very costly.
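
(For concreteness, the REST2-style scaling meant here is, schematically,

    E_m(x) = \lambda_m E_{AA}(x) + \sqrt{\lambda_m} E_{AB}(x) + E_{BB}(x),

where A is the scaled "hot" region, B is the rest, and \lambda_m = \beta_m / \beta_0.
Because only standard Coulomb and LJ terms appear, each replica m could in
principle be written as an ordinary topology with the region-A charges scaled
by \sqrt{\lambda_m} and the region-A epsilons by \lambda_m, assuming geometric
combination rules for epsilon.)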

Right now, for alchemical interactions, there are only two
representations of the system, and then lambda interpolates between
the two representations of the system to create the alchemical
intermediates. Changing the way the code is structured would again
require some significant changes and time.
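
(Schematically, the current alchemical machinery is the usual two-end-state
interpolation

    U(\lambda) = (1 - \lambda) U_A(x) + \lambda U_B(x),

with the soft-core options modifying the effective distances inside U_A and U_B
so that vanishing atoms do not produce singularities; there is no provision for
more than the two end states A and B.)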


Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-03-31 Thread Christopher Neale
Dear Szilárd, Mark, and Michael:

Thank you for your suggestions. 

1. I did use the icc compiler
2. I tested usage of nstlist=10 by setting verlet-buffer-tolerance=-1 (I had 
already set nstlist=10, but mdrun changes that to 25 in my previous setup) -- 
this did not improve the performance (the combination is shown as an .mdp 
fragment after this list)
3. I tested increasing .mdp option nstdhdl from 200 to 20,000 both with and 
without also increasing the mdrun -replex option from 200 to 20,000 -- this did 
not improve the performance
4. I tried to test more than one OpenMP thread per rank. I am fuzzy on this 
part of the usage, but I tried "ibrun -np 4 gmx_mpi mdrun -ntomp 6" and it was 
a lot slower (I think it only used 4 cores?); I also tried the same but with 
-ntmpi 4 added, and it failed with a message about not being able to set 
thread-MPI since I didn't compile with it. In any event, I doubted that an 
optimization here would get back most of the 12x performance loss, so I didn't 
pursue this further.
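
For reference, the combination in point 2 is just these two .mdp settings
(an illustrative fragment, not my complete file):

    nstlist                 = 10
    verlet-buffer-tolerance = -1   ; with the tolerance disabled, mdrun no longer
                                   ; bumps nstlist up (it was changing 10 to 25)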

The good news is that there is a beta version of the H-REMD fork of Plumed that 
works with gromacs 5.1, so I have a viable route for now.

One more question: why does the free energy code have to use its own kernel? I 
realize that I'm going to sound like an idiot here, but why can't one just 
tweak parameters outside of the kernel and then have the optimized kernel do 
the dynamics? I presume that one has to step outside of the kernel to do the 
replica exchanges, so why can't the code just use the optimized kernel to do 
the dynamics between exchanges? 

Thank you again for all of your help,
Chris.



Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-03-30 Thread Szilárd Páll
Chris,

I can suggest two possible tweaks; these won't do miracles, but they may give
you a bit better performance.

* icc is better than gcc at optimizing the naive free energy kernel code.
I observed in the past up to 1.5x faster free energy kernel performance
with icc 15 vs gcc 4.9 or so.

* The default nstlist heuristic/suggestion assumes fast non-bondeds. In
your case, depending on how long the list buffer is with the nstlist you
use (25?), you may be able to gain a bit of performance by shifting load
back to the search with a smaller nstlist, e.g. 10.

These two combined could give 2x in the best case, I guess.

What I find slightly unusual in the logs you posted on redmine is the 3-4x
slowdown in PME and constraints (I'd expect ~2x in PME and less in
constraints), but it could be that this too is simply because most
interactions in the system are perturbed, which triggers non-optimized
code-paths.

Cheers,
--
Szilárd


Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-03-30 Thread Mark Abraham
Hi,

Yeah, unfortunately Michael's pretty much right - the free-energy kernel is
currently that cousin nobody talks about (or to). It's essentially
unchanged since GROMACS 4.0 days, except that the Verlet scheme has some
kludge so that it can call the same kernel that the group scheme used to
call. But it has none of the optimizations and also some simplifying
pessimizations. The result is highly unsuitable for Chris's use case, but
not horrible for the more normal case of perturbing a small part of a
system. For now, I can only suggest trying a run with more than one OpenMP
thread per rank, but there's nothing in the log file snippets that fills me
with any hope that it would be noticeably faster.

We have a pile of infrastructure built and in the works that will lead to
being able to offer similarly optimized free-energy kernels, but they won't
see the light of day until next year, I'm afraid. A set of sample .tprs at
Redmine 742 would be most welcome, however - it's very good for us to know
when/that we're optimizing a workflow someone actually wants to run, but
currently has reason to avoid.

Mark



Re: [gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-03-30 Thread Michael Shirts
Hi, Chris: I'm pretty sure that it's because the nonbonded free-energy
calculations are much slower than the standard nonbonded calculations. You state:

> I took a look at gmxlib/nonbonded/nb_free_energy.c in v.5.1.2, but I was 
> unable to find a function called "gmx_waste_time_here()" and beyond that I 
> was out of my depth.

But it's much more the fact that the non-free energy nonbondeds are
SUPER optimized.

I don't see a particularly viable way around this for now. The only
thing I can think of is splitting the neighborlists into two force calls
and scaling the forces and energies coming out of those. That would
be a huge pain.
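
(Schematically, that would mean putting the scaled interactions on their own
pair list, evaluating both lists with the ordinary optimized kernels, and then
combining

    U = U_rest + \lambda U_scaled,    F = F_rest + \lambda F_scaled,

so the inner loops stay untouched and dU/d\lambda = U_scaled falls out for
free; the pain is all in the pair-list and bookkeeping changes.)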



[gmx-users] Looking for the source of the H-REMD slowdown when decoupling a lot of atoms

2016-03-29 Thread Christopher Neale
Dear Users: 

I am trying to do some Hamiltonian replica exchange (H-REMD) in gromacs 5.1.2 
and am running up against really large slowdowns when decoupling a large number 
of atoms. I am decoupling 5360 atoms out of the 15520 atoms in my system. The 
goal is not to get a PMF, but to enhance sampling using the REST approach to 
partially decouple lipids in a bilayer. This approach enhances lipid relaxation 
times ( http://pubs.acs.org/doi/pdf/10.1021/ct500305u ) though the authors of 
that paper modified the gromacs code to do their own H-REMD in order to avoid 
the really slow speed they also got when decoupling lots of atoms via the free 
energy code. 

I have already posted details here http://redmine.gromacs.org/issues/742 , 
which includes .mdp options and some timing output. I compare the timing output 
to a standard temperature REMD (T-REMD) run. For my usage, the slowdown is 
about 12x for H-REMD vs. T-REMD. 

I am motivated to find a solution within gromacs because the alternative is to 
use gromacs 4.6.7 with plumed (or with the aforementioned modified code, which 
is also gromacs v4). Normally that would be a viable option, but I am using the 
charmm force field and the charmm TIP3P water and I would rather not give up 
the speed boost that I see in gromacs v5.1.2, which allows the use of the 
verlet cutoff scheme and has been tested and shown to give the correct 
reproduction of charmm forces (vs. the forces one would get using the charmm 
simulation software).

I took a look at gmxlib/nonbonded/nb_free_energy.c in v.5.1.2, but I was unable 
to find a function called "gmx_waste_time_here()" and beyond that I was out of 
my depth.

Thank you for any pointers,
Chris.

