Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-03-09 Thread Szilárd Páll
Hi Andreas,

Sorry for the delay.

I can confirm the regression. It affects the energy calculation steps,
where the GPU bonded computation did get significantly slower (as a
side-effect of optimizations that mainly targeted the force-only kernels).

Can you please file an issue on redmine.gromacs.org and upload the data you
shared with me?

As a workaround, consider using nstcalcenergy > 1; bumping it to just ~10
would eliminate most of the regression and would improve the performance of
other computations too (the nonbonded force-only kernels are also at least
1.5x faster than the force+energy kernels).
Alternatively, I recall you have a decent CPU, so you could run the bonded
interactions on the CPU.

Side-note: you are using an overly fine PME grid that you did not scale
along with the (overly accurate) rather long cut-offs (see
http://manual.gromacs.org/documentation/current/user-guide/mdp-options.html#mdp-fourierspacing
).
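
For illustration, here is a minimal sketch of both suggestions; the file names
and the exact values are placeholders, not taken from the benchmark inputs:

  # In the .mdp, before regenerating the .tpr (placeholder values):
  #   nstcalcenergy  = 10     ; instead of 1; most steps then use the faster F-only kernels
  #   fourierspacing = 0.15   ; scale the PME grid spacing along with the longer cut-offs
  gmx grompp -f bench.mdp -c conf.gro -p topol.top -o bench.tpr

  # Or keep nstcalcenergy = 1 and move the bonded work back to the CPU:
  gmx mdrun -s bench.tpr -deffnm bench -nb gpu -pme gpu -bonded cpu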

Cheers,
--
Szilárd


On Fri, Feb 28, 2020 at 11:10 AM Andreas Baer  wrote:

> Hi,
>
> sorry for it!
>
> https://faubox.rrze.uni-erlangen.de/getlink/fiUpELsXokQr3a7vyeDSKdY3/benchmarks_2019-2020_all
>
> Cheers,
> Andreas
>
> On 27.02.20 17:59, Szilárd Páll wrote:
>
> On Thu, Feb 27, 2020 at 1:08 PM Andreas Baer  wrote:
>
>> Hi,
>>
>> On 27.02.20 12:34, Szilárd Páll wrote:
>> > Hi
>> >
>> > On Thu, Feb 27, 2020 at 11:31 AM Andreas Baer 
>> wrote:
>> >
>> >> Hi,
>> >>
>> >> with the link below, additional log files for runs with 1 GPU should be
>> >> accessible now.
>> >>
>> > I meant to ask you to run single-rank GPU runs, i.e. gmx mdrun -ntmpi 1.
>> >
>> > It would also help if you could share some input files in case if
>> further
>> > testing is needed.
>> Ok, there is now also an additional benchmark with `-ntmpi 1 -ntomp 4
>> -bonded gpu -update gpu` as parameters. However, it is run on the same
>> machine with smt disabled.
>> With the following link, I provide all the tests on this machine, I did
>> by now, along with a summary of the performance for the several input
>> parameters (both in `logfiles`), as well as input files (`C60xh.7z`) and
>> the scripts to run these.
>>
>
> Links seems to be missing.
> --
> Szilárd
>
>
>> I hope, this helps. If there is anything else, I can do to help, please
>> let me know!
>> >
>> >
>> >> Thank you for the comment with the rlist, I did not know, that this
>> will
>> >> affect the performance negatively.
>> >
>> > It does in multiple ways. First, you are using a rather long list buffer
>> > which will make the nonbonded pair-interaction calculation more
>> > computational expensive than it could be if you just used a tolerance
>> and
>> > let the buffer be calculated. Secondly, as setting a manual rlist
>> disables
>> > the automated verlet buffer calculation, it prevents mdrun from using a
>> > dual pairl-list setup (see
>> >
>> http://manual.gromacs.org/documentation/2018.1/release-notes/2018/major/features.html#dual-pair-list-buffer-with-dynamic-pruning
>> )
>> > which has additional performance benefits.
>> Ok, thank you for the explanation!
>> >
>> > Cheers,
>> > --
>> > Szilárd
>> Cheers,
>> Andreas
>> >
>> >
>> >
>> >> I know, about the nstcalcenergy, but
>> >> I need it for several of my simulations.
>> > Cheers,
>> >> Andreas
>> >>
>> >> On 26.02.20 16:50, Szilárd Páll wrote:
>> >>> Hi,
>> >>>
>> >>> Can you please check the performance when running on a single GPU
>> 2019 vs
>> >>> 2020 with your inputs?
>> >>>
>> >>> Also note that you are using some peculiar settings that will have an
>> >>> adverse effect on performance (like manually set rlist disallowing the
>> >> dual
>> >>> pair-list setup, and nstcalcenergy=1).
>> >>>
>> >>> Cheers,
>> >>>
>> >>> --
>> >>> Szilárd
>> >>>
>> >>>
>> >>> On Wed, Feb 26, 2020 at 4:11 PM Andreas Baer 
>> >> wrote:
>>  Hello,
>> 
>>  here is a link to the logfiles.
>> 
>> 
>> >>
>> https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020
>>  If necessary, I can also provide some more log or tpr/gro/... files.
>> 
>>  Cheers,
>>  Andreas
>> 
>> 
>>  On 26.02.20 16:09, Paul bauer wrote:
>> > Hello,
>> >
>> > you can't add attachments to the list, please upload the files
>> > somewhere to share them.
>> > This might be quite important to us, because the performance
>> > regression is not expected by us.
>> >
>> > Cheers
>> >
>> > Paul
>> >
>> > On 26/02/2020 15:54, Andreas Baer wrote:
>> >> Hello,
>> >>
>> >> from a set of benchmark tests with large systems using Gromacs
>> >> versions 2019.5 and 2020, I obtained some unexpected results:
>> >> With the same set of parameters and the 2020 version, I obtain a
>> >> performance that is about 2/3 of the 2019.5 version. Interestingly,
>> >> according to nvidia-smi, the GPU usage is about 20% higher for the
>> >> 2020 version.
>> >> Also from the log files it seems, that the 2020 version does the
>> >> computations more efficiently, but spends so much more time waiting,
>> >> that the overall performance drops.

Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-28 Thread Andreas Baer

Hi,

sorry about that!
https://faubox.rrze.uni-erlangen.de/getlink/fiUpELsXokQr3a7vyeDSKdY3/benchmarks_2019-2020_all

Cheers,
Andreas

On 27.02.20 17:59, Szilárd Páll wrote:
On Thu, Feb 27, 2020 at 1:08 PM Andreas Baer > wrote:


Hi,

On 27.02.20 12:34, Szilárd Páll wrote:
> Hi
>
> On Thu, Feb 27, 2020 at 11:31 AM Andreas Baer
mailto:andreas.b...@fau.de>> wrote:
>
>> Hi,
>>
>> with the link below, additional log files for runs with 1 GPU
should be
>> accessible now.
>>
> I meant to ask you to run single-rank GPU runs, i.e. gmx mdrun
-ntmpi 1.
>
> It would also help if you could share some input files in case
if further
> testing is needed.
Ok, there is now also an additional benchmark with `-ntmpi 1 -ntomp 4
-bonded gpu -update gpu` as parameters. However, it is run on the
same
machine with smt disabled.
With the following link, I provide all the tests on this machine,
I did
by now, along with a summary of the performance for the several input
parameters (both in `logfiles`), as well as input files
(`C60xh.7z`) and
the scripts to run these.


Links seems to be missing.
--
Szilárd

I hope, this helps. If there is anything else, I can do to help,
please
let me know!
>
>
>> Thank you for the comment with the rlist, I did not know, that
this will
>> affect the performance negatively.
>
> It does in multiple ways. First, you are using a rather long
list buffer
> which will make the nonbonded pair-interaction calculation more
> computational expensive than it could be if you just used a
tolerance and
> let the buffer be calculated. Secondly, as setting a manual
rlist disables
> the automated verlet buffer calculation, it prevents mdrun from
using a
> dual pairl-list setup (see
>

http://manual.gromacs.org/documentation/2018.1/release-notes/2018/major/features.html#dual-pair-list-buffer-with-dynamic-pruning)
> which has additional performance benefits.
Ok, thank you for the explanation!
>
> Cheers,
> --
> Szilárd
Cheers,
Andreas
>
>
>
>> I know, about the nstcalcenergy, but
>> I need it for several of my simulations.
> Cheers,
>> Andreas
>>
>> On 26.02.20 16:50, Szilárd Páll wrote:
>>> Hi,
>>>
>>> Can you please check the performance when running on a single
GPU 2019 vs
>>> 2020 with your inputs?
>>>
>>> Also note that you are using some peculiar settings that will
have an
>>> adverse effect on performance (like manually set rlist
disallowing the
>> dual
>>> pair-list setup, and nstcalcenergy=1).
>>>
>>> Cheers,
>>>
>>> --
>>> Szilárd
>>>
>>>
>>> On Wed, Feb 26, 2020 at 4:11 PM Andreas Baer
mailto:andreas.b...@fau.de>>
>> wrote:
 Hello,

 here is a link to the logfiles.


>>

https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020
 If necessary, I can also provide some more log or tpr/gro/...
files.

 Cheers,
 Andreas


 On 26.02.20 16:09, Paul bauer wrote:
> Hello,
>
> you can't add attachments to the list, please upload the files
> somewhere to share them.
> This might be quite important to us, because the performance
> regression is not expected by us.
>
> Cheers
>
> Paul
>
> On 26/02/2020 15:54, Andreas Baer wrote:
>> Hello,
>>
>> from a set of benchmark tests with large systems using Gromacs
>> versions 2019.5 and 2020, I obtained some unexpected results:
>> With the same set of parameters and the 2020 version, I
obtain a
>> performance that is about 2/3 of the 2019.5 version.
Interestingly,
>> according to nvidia-smi, the GPU usage is about 20% higher
for the
>> 2020 version.
>> Also from the log files it seems, that the 2020 version
does the
>> computations more efficiently, but spends so much more time
waiting,
>> that the overall performance drops.
>>
>> Some background info on the benchmarks:
>> - System contains about 2.1 million atoms.
>> - Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz =
16 cores +
>> SMT; 4x NVIDIA Tesla V100
>>     (similar results with less significant performance drop
(~15%) on a
>> different machine: 2 or 4 nodes with each [2x Intel Xeon
2660v2 („Ivy
>> Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
>> - Several options for -ntmpi, -ntomp, -bonded, -pme are used to find
>> the optimal set. However the performance drop seems to be persistent
>> for all such options.

Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-27 Thread Szilárd Páll
On Thu, Feb 27, 2020 at 1:08 PM Andreas Baer  wrote:

> Hi,
>
> On 27.02.20 12:34, Szilárd Páll wrote:
> > Hi
> >
> > On Thu, Feb 27, 2020 at 11:31 AM Andreas Baer 
> wrote:
> >
> >> Hi,
> >>
> >> with the link below, additional log files for runs with 1 GPU should be
> >> accessible now.
> >>
> > I meant to ask you to run single-rank GPU runs, i.e. gmx mdrun -ntmpi 1.
> >
> > It would also help if you could share some input files in case if further
> > testing is needed.
> Ok, there is now also an additional benchmark with `-ntmpi 1 -ntomp 4
> -bonded gpu -update gpu` as parameters. However, it is run on the same
> machine with smt disabled.
> With the following link, I provide all the tests on this machine, I did
> by now, along with a summary of the performance for the several input
> parameters (both in `logfiles`), as well as input files (`C60xh.7z`) and
> the scripts to run these.
>

The link seems to be missing.
--
Szilárd


> I hope, this helps. If there is anything else, I can do to help, please
> let me know!
> >
> >
> >> Thank you for the comment with the rlist, I did not know, that this will
> >> affect the performance negatively.
> >
> > It does in multiple ways. First, you are using a rather long list buffer
> > which will make the nonbonded pair-interaction calculation more
> > computational expensive than it could be if you just used a tolerance and
> > let the buffer be calculated. Secondly, as setting a manual rlist
> disables
> > the automated verlet buffer calculation, it prevents mdrun from using a
> > dual pairl-list setup (see
> >
> http://manual.gromacs.org/documentation/2018.1/release-notes/2018/major/features.html#dual-pair-list-buffer-with-dynamic-pruning
> )
> > which has additional performance benefits.
> Ok, thank you for the explanation!
> >
> > Cheers,
> > --
> > Szilárd
> Cheers,
> Andreas
> >
> >
> >
> >> I know, about the nstcalcenergy, but
> >> I need it for several of my simulations.
> > Cheers,
> >> Andreas
> >>
> >> On 26.02.20 16:50, Szilárd Páll wrote:
> >>> Hi,
> >>>
> >>> Can you please check the performance when running on a single GPU 2019
> vs
> >>> 2020 with your inputs?
> >>>
> >>> Also note that you are using some peculiar settings that will have an
> >>> adverse effect on performance (like manually set rlist disallowing the
> >> dual
> >>> pair-list setup, and nstcalcenergy=1).
> >>>
> >>> Cheers,
> >>>
> >>> --
> >>> Szilárd
> >>>
> >>>
> >>> On Wed, Feb 26, 2020 at 4:11 PM Andreas Baer 
> >> wrote:
>  Hello,
> 
>  here is a link to the logfiles.
> 
> 
> >>
> https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020
>  If necessary, I can also provide some more log or tpr/gro/... files.
> 
>  Cheers,
>  Andreas
> 
> 
>  On 26.02.20 16:09, Paul bauer wrote:
> > Hello,
> >
> > you can't add attachments to the list, please upload the files
> > somewhere to share them.
> > This might be quite important to us, because the performance
> > regression is not expected by us.
> >
> > Cheers
> >
> > Paul
> >
> > On 26/02/2020 15:54, Andreas Baer wrote:
> >> Hello,
> >>
> >> from a set of benchmark tests with large systems using Gromacs
> >> versions 2019.5 and 2020, I obtained some unexpected results:
> >> With the same set of parameters and the 2020 version, I obtain a
> >> performance that is about 2/3 of the 2019.5 version. Interestingly,
> >> according to nvidia-smi, the GPU usage is about 20% higher for the
> >> 2020 version.
> >> Also from the log files it seems, that the 2020 version does the
> >> computations more efficiently, but spends so much more time waiting,
> >> that the overall performance drops.
> >>
> >> Some background info on the benchmarks:
> >> - System contains about 2.1 million atoms.
> >> - Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores
> +
> >> SMT; 4x NVIDIA Tesla V100
> >> (similar results with less significant performance drop (~15%)
> on a
> >> different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2
> („Ivy
> >> Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
> >> - Several options for -ntmpi, -ntomp, -bonded, -pme are used to find
> >> the optimal set. However the performance drop seems to be persistent
> >> for all such options.
> >>
> >> Two representative log files are attached.
> >> Does anyone have an idea, where this drop comes from, and how to
> >> choose the parameters for the 2020 version to circumvent this?
> >>
> >> Regards,
> >> Andreas
> >>

Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-27 Thread Andreas Baer

Hi,

On 27.02.20 12:34, Szilárd Páll wrote:

Hi

On Thu, Feb 27, 2020 at 11:31 AM Andreas Baer  wrote:


Hi,

with the link below, additional log files for runs with 1 GPU should be
accessible now.


I meant to ask you to run single-rank GPU runs, i.e. gmx mdrun -ntmpi 1.

It would also help if you could share some input files in case if further
testing is needed.
Ok, there is now also an additional benchmark with `-ntmpi 1 -ntomp 4
-bonded gpu -update gpu` as parameters. However, it was run on the same
machine with SMT disabled.
With the following link I provide all the tests I have run on this machine so
far, along with a summary of the performance for the various input parameters
(both in `logfiles`), as well as the input files (`C60xh.7z`) and the scripts
to run these.
I hope this helps. If there is anything else I can do to help, please let me
know!
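
For completeness, a sketch of how such a single-rank, fully offloaded run is
typically launched (the .tpr name and output prefix below are placeholders,
not copied from the shared scripts):

  gmx mdrun -s bench.tpr -deffnm bench_1gpu \
            -ntmpi 1 -ntomp 4 \
            -nb gpu -pme gpu -bonded gpu -update gpu
  # -nb gpu and -pme gpu are written out for clarity; with a single rank and a
  # compatible GPU they are typically what the defaults resolve to anyway.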




Thank you for the comment with the rlist, I did not know, that this will
affect the performance negatively.


It does in multiple ways. First, you are using a rather long list buffer
which will make the nonbonded pair-interaction calculation more
computational expensive than it could be if you just used a tolerance and
let the buffer be calculated. Secondly, as setting a manual rlist disables
the automated verlet buffer calculation, it prevents mdrun from using a
dual pairl-list setup (see
http://manual.gromacs.org/documentation/2018.1/release-notes/2018/major/features.html#dual-pair-list-buffer-with-dynamic-pruning)
which has additional performance benefits.

Ok, thank you for the explanation!


Cheers,
--
Szilárd

Cheers,
Andreas





I know, about the nstcalcenergy, but
I need it for several of my simulations.

Cheers,

Andreas

On 26.02.20 16:50, Szilárd Páll wrote:

Hi,

Can you please check the performance when running on a single GPU 2019 vs
2020 with your inputs?

Also note that you are using some peculiar settings that will have an
adverse effect on performance (like manually set rlist disallowing the

dual

pair-list setup, and nstcalcenergy=1).

Cheers,

--
Szilárd


On Wed, Feb 26, 2020 at 4:11 PM Andreas Baer 

wrote:

Hello,

here is a link to the logfiles.



https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020

If necessary, I can also provide some more log or tpr/gro/... files.

Cheers,
Andreas


On 26.02.20 16:09, Paul bauer wrote:

Hello,

you can't add attachments to the list, please upload the files
somewhere to share them.
This might be quite important to us, because the performance
regression is not expected by us.

Cheers

Paul

On 26/02/2020 15:54, Andreas Baer wrote:

Hello,

from a set of benchmark tests with large systems using Gromacs
versions 2019.5 and 2020, I obtained some unexpected results:
With the same set of parameters and the 2020 version, I obtain a
performance that is about 2/3 of the 2019.5 version. Interestingly,
according to nvidia-smi, the GPU usage is about 20% higher for the
2020 version.
Also from the log files it seems, that the 2020 version does the
computations more efficiently, but spends so much more time waiting,
that the overall performance drops.

Some background info on the benchmarks:
- System contains about 2.1 million atoms.
- Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores +
SMT; 4x NVIDIA Tesla V100
(similar results with less significant performance drop (~15%) on a
different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2 („Ivy
Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
- Several options for -ntmpi, -ntomp, -bonded, -pme are used to find
the optimal set. However the performance drop seems to be persistent
for all such options.

Two representative log files are attached.
Does anyone have an idea, where this drop comes from, and how to
choose the parameters for the 2020 version to circumvent this?

Regards,
Andreas



Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-27 Thread Szilárd Páll
Hi

On Thu, Feb 27, 2020 at 11:31 AM Andreas Baer  wrote:

> Hi,
>
> with the link below, additional log files for runs with 1 GPU should be
> accessible now.
>

I meant to ask you to run single-rank GPU runs, i.e. gmx mdrun -ntmpi 1.

It would also help if you could share some input files in case further
testing is needed.


> Thank you for the comment with the rlist, I did not know, that this will
> affect the performance negatively.


It does in multiple ways. First, you are using a rather long list buffer,
which makes the nonbonded pair-interaction calculation more computationally
expensive than it would be if you just used a tolerance and let the buffer be
calculated. Secondly, as setting a manual rlist disables the automated Verlet
buffer calculation, it prevents mdrun from using a dual pair-list setup (see
http://manual.gromacs.org/documentation/2018.1/release-notes/2018/major/features.html#dual-pair-list-buffer-with-dynamic-pruning)
which has additional performance benefits.
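
A minimal sketch of that alternative (illustrative values; the file names are
placeholders):

  # In the .mdp: drop the manual rlist and set a tolerance instead, which lets
  # mdrun compute the buffer and re-enables the dual pair-list path:
  #   cutoff-scheme           = Verlet
  #   verlet-buffer-tolerance = 0.005   ; kJ/mol/ps, the default; leave rlist unset
  gmx grompp -f bench.mdp -c conf.gro -p topol.top -o bench.tpr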

Cheers,
--
Szilárd



> I know, about the nstcalcenergy, but
> I need it for several of my simulations.

Cheers,
> Andreas
>
> On 26.02.20 16:50, Szilárd Páll wrote:
> > Hi,
> >
> > Can you please check the performance when running on a single GPU 2019 vs
> > 2020 with your inputs?
> >
> > Also note that you are using some peculiar settings that will have an
> > adverse effect on performance (like manually set rlist disallowing the
> dual
> > pair-list setup, and nstcalcenergy=1).
> >
> > Cheers,
> >
> > --
> > Szilárd
> >
> >
> > On Wed, Feb 26, 2020 at 4:11 PM Andreas Baer 
> wrote:
> >
> >> Hello,
> >>
> >> here is a link to the logfiles.
> >>
> >>
> https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020
> >>
> >> If necessary, I can also provide some more log or tpr/gro/... files.
> >>
> >> Cheers,
> >> Andreas
> >>
> >>
> >> On 26.02.20 16:09, Paul bauer wrote:
> >>> Hello,
> >>>
> >>> you can't add attachments to the list, please upload the files
> >>> somewhere to share them.
> >>> This might be quite important to us, because the performance
> >>> regression is not expected by us.
> >>>
> >>> Cheers
> >>>
> >>> Paul
> >>>
> >>> On 26/02/2020 15:54, Andreas Baer wrote:
>  Hello,
> 
>  from a set of benchmark tests with large systems using Gromacs
>  versions 2019.5 and 2020, I obtained some unexpected results:
>  With the same set of parameters and the 2020 version, I obtain a
>  performance that is about 2/3 of the 2019.5 version. Interestingly,
>  according to nvidia-smi, the GPU usage is about 20% higher for the
>  2020 version.
>  Also from the log files it seems, that the 2020 version does the
>  computations more efficiently, but spends so much more time waiting,
>  that the overall performance drops.
> 
>  Some background info on the benchmarks:
>  - System contains about 2.1 million atoms.
>  - Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores +
>  SMT; 4x NVIDIA Tesla V100
> (similar results with less significant performance drop (~15%) on a
>  different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2 („Ivy
>  Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
>  - Several options for -ntmpi, -ntomp, -bonded, -pme are used to find
>  the optimal set. However the performance drop seems to be persistent
>  for all such options.
> 
>  Two representative log files are attached.
>  Does anyone have an idea, where this drop comes from, and how to
>  choose the parameters for the 2020 version to circumvent this?
> 
>  Regards,
>  Andreas
> 

Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-27 Thread Andreas Baer

Hi,

with the link below, additional log files for runs with 1 GPU should be 
accessible now.


Thank you for the comment about rlist; I did not know that this will
affect the performance negatively. I know about nstcalcenergy, but
I need it for several of my simulations.


Cheers,
Andreas

On 26.02.20 16:50, Szilárd Páll wrote:

Hi,

Can you please check the performance when running on a single GPU 2019 vs
2020 with your inputs?

Also note that you are using some peculiar settings that will have an
adverse effect on performance (like manually set rlist disallowing the dual
pair-list setup, and nstcalcenergy=1).

Cheers,

--
Szilárd


On Wed, Feb 26, 2020 at 4:11 PM Andreas Baer  wrote:


Hello,

here is a link to the logfiles.

https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020

If necessary, I can also provide some more log or tpr/gro/... files.

Cheers,
Andreas


On 26.02.20 16:09, Paul bauer wrote:

Hello,

you can't add attachments to the list, please upload the files
somewhere to share them.
This might be quite important to us, because the performance
regression is not expected by us.

Cheers

Paul

On 26/02/2020 15:54, Andreas Baer wrote:

Hello,

from a set of benchmark tests with large systems using Gromacs
versions 2019.5 and 2020, I obtained some unexpected results:
With the same set of parameters and the 2020 version, I obtain a
performance that is about 2/3 of the 2019.5 version. Interestingly,
according to nvidia-smi, the GPU usage is about 20% higher for the
2020 version.
Also from the log files it seems, that the 2020 version does the
computations more efficiently, but spends so much more time waiting,
that the overall performance drops.

Some background info on the benchmarks:
- System contains about 2.1 million atoms.
- Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores +
SMT; 4x NVIDIA Tesla V100
   (similar results with less significant performance drop (~15%) on a
different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2 („Ivy
Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
- Several options for -ntmpi, -ntomp, -bonded, -pme are used to find
the optimal set. However the performance drop seems to be persistent
for all such options.

Two representative log files are attached.
Does anyone have an idea, where this drop comes from, and how to
choose the parameters for the 2020 version to circumvent this?

Regards,
Andreas



Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-26 Thread Szilárd Páll
Hi,

Can you please check the performance of 2019 vs 2020 with your inputs when
running on a single GPU?

Also note that you are using some peculiar settings that will have an
adverse effect on performance (like a manually set rlist, which disallows the
dual pair-list setup, and nstcalcenergy=1).

Cheers,

--
Szilárd


On Wed, Feb 26, 2020 at 4:11 PM Andreas Baer  wrote:

> Hello,
>
> here is a link to the logfiles.
>
> https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020
>
> If necessary, I can also provide some more log or tpr/gro/... files.
>
> Cheers,
> Andreas
>
>
> On 26.02.20 16:09, Paul bauer wrote:
> > Hello,
> >
> > you can't add attachments to the list, please upload the files
> > somewhere to share them.
> > This might be quite important to us, because the performance
> > regression is not expected by us.
> >
> > Cheers
> >
> > Paul
> >
> > On 26/02/2020 15:54, Andreas Baer wrote:
> >> Hello,
> >>
> >> from a set of benchmark tests with large systems using Gromacs
> >> versions 2019.5 and 2020, I obtained some unexpected results:
> >> With the same set of parameters and the 2020 version, I obtain a
> >> performance that is about 2/3 of the 2019.5 version. Interestingly,
> >> according to nvidia-smi, the GPU usage is about 20% higher for the
> >> 2020 version.
> >> Also from the log files it seems, that the 2020 version does the
> >> computations more efficiently, but spends so much more time waiting,
> >> that the overall performance drops.
> >>
> >> Some background info on the benchmarks:
> >> - System contains about 2.1 million atoms.
> >> - Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores +
> >> SMT; 4x NVIDIA Tesla V100
> >>   (similar results with less significant performance drop (~15%) on a
> >> different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2 („Ivy
> >> Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
> >> - Several options for -ntmpi, -ntomp, -bonded, -pme are used to find
> >> the optimal set. However the performance drop seems to be persistent
> >> for all such options.
> >>
> >> Two representative log files are attached.
> >> Does anyone have an idea, where this drop comes from, and how to
> >> choose the parameters for the 2020 version to circumvent this?
> >>
> >> Regards,
> >> Andreas
> >>
> >
>

Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-26 Thread Andreas Baer

Hello,

here is a link to the logfiles.
https://faubox.rrze.uni-erlangen.de/getlink/fiX8wP1LwSBkHRoykw6ksjqY/benchmarks_2019-2020

If necessary, I can also provide some more log or tpr/gro/... files.

Cheers,
Andreas


On 26.02.20 16:09, Paul bauer wrote:

Hello,

you can't add attachments to the list, please upload the files 
somewhere to share them.
This might be quite important to us, because the performance 
regression is not expected by us.


Cheers

Paul

On 26/02/2020 15:54, Andreas Baer wrote:

Hello,

from a set of benchmark tests with large systems using Gromacs 
versions 2019.5 and 2020, I obtained some unexpected results:
With the same set of parameters and the 2020 version, I obtain a 
performance that is about 2/3 of the 2019.5 version. Interestingly, 
according to nvidia-smi, the GPU usage is about 20% higher for the 
2020 version.
Also from the log files it seems, that the 2020 version does the 
computations more efficiently, but spends so much more time waiting, 
that the overall performance drops.


Some background info on the benchmarks:
- System contains about 2.1 million atoms.
- Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores + 
SMT; 4x NVIDIA Tesla V100
  (similar results with less significant performance drop (~15%) on a 
different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2 („Ivy 
Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
- Several options for -ntmpi, -ntomp, -bonded, -pme are used to find 
the optimal set. However the performance drop seems to be persistent 
for all such options.


Two representative log files are attached.
Does anyone have an idea, where this drop comes from, and how to 
choose the parameters for the 2020 version to circumvent this?


Regards,
Andreas






Re: [gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-26 Thread Paul bauer

Hello,

you can't send attachments to the list; please upload the files somewhere
and share a link.
This might be quite important to us, because we did not expect this
performance regression.


Cheers

Paul

On 26/02/2020 15:54, Andreas Baer wrote:

Hello,

from a set of benchmark tests with large systems using Gromacs 
versions 2019.5 and 2020, I obtained some unexpected results:
With the same set of parameters and the 2020 version, I obtain a 
performance that is about 2/3 of the 2019.5 version. Interestingly, 
according to nvidia-smi, the GPU usage is about 20% higher for the 
2020 version.
Also from the log files it seems, that the 2020 version does the 
computations more efficiently, but spends so much more time waiting, 
that the overall performance drops.


Some background info on the benchmarks:
- System contains about 2.1 million atoms.
- Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores + 
SMT; 4x NVIDIA Tesla V100
  (similar results with less significant performance drop (~15%) on a 
different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2 („Ivy 
Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
- Several options for -ntmpi, -ntomp, -bonded, -pme are used to find 
the optimal set. However the performance drop seems to be persistent 
for all such options.


Two representative log files are attached.
Does anyone have an idea, where this drop comes from, and how to 
choose the parameters for the 2020 version to circumvent this?


Regards,
Andreas



--
Paul Bauer, PhD
GROMACS Development Manager
KTH Stockholm, SciLifeLab
0046737308594


[gmx-users] Performance issues with Gromacs 2020 on GPUs - slower than 2019.5

2020-02-26 Thread Andreas Baer

Hello,

from a set of benchmark tests with large systems using Gromacs versions
2019.5 and 2020, I obtained some unexpected results:
With the same set of parameters, the 2020 version reaches only about 2/3 of
the performance of 2019.5. Interestingly, according to nvidia-smi, the GPU
usage is about 20% higher for the 2020 version.
Also, from the log files it seems that the 2020 version does the
computations more efficiently but spends so much more time waiting that the
overall performance drops.


Some background info on the benchmarks:
- System contains about 2.1 million atoms.
- Hardware: 2x Intel Xeon Gold 6134 („Skylake“) @3.2 GHz = 16 cores + 
SMT; 4x NVIDIA Tesla V100
  (similar results with less significant performance drop (~15%) on a 
different machine: 2 or 4 nodes with each [2x Intel Xeon 2660v2 („Ivy 
Bridge“) @ 2.2GHz = 20 cores + SMT; 2x NVIDIA Kepler K20])
- Several options for -ntmpi, -ntomp, -bonded, -pme are used to find the
optimal set. However, the performance drop seems to persist for all such
options.


Two representative log files are attached.
Does anyone have an idea where this drop comes from, and how to choose the
parameters for the 2020 version to circumvent it?


Regards,
Andreas

[gmx-users] Performance GROMACS on GPU

2019-12-09 Thread Talarico Carmine
Hi,
I ran three simulations on 1 node with 3 GPUs, using an increasing number of
GPUs.

This is my system:
___
1 node with total 36 cores, 72 logical cores, 3 compatible GPUs

GROMACS version:2018.2
Precision:  single
Memory model:   64 bit
MPI library:thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:CUDA
SIMD instructions:  AVX2_256
FFT library:fftw-3.3.8-sse2-avx-avx2-avx2_128-avx512
RDTSCP usage:   enabled
TNG support:enabled
Hwloc support:  hwloc-1.11.0
Tracing support:disabled
Built on:   2018-08-01 09:03:03

GPU info:
Number of GPUs detected: 3
#0: NVIDIA Tesla V100-PCIE-32GB, compute cap.: 7.0, ECC: yes, stat: 
compatible
#1: NVIDIA Tesla V100-PCIE-32GB, compute cap.: 7.0, ECC: yes, stat: 
compatible
#2: NVIDIA Tesla V100-PCIE-32GB, compute cap.: 7.0, ECC: yes, stat: 
compatible
___

These are the commands launched on two systems of different size, and the
related performance:

Alcohol Dehydrogenase system (95561 atoms)                                        ns/day    h/ns
  gmx mdrun -deffnm topol -nb gpu -pme gpu -ntmpi 1 -ntomp 12                     53.355    0.45
  gmx mdrun -deffnm topol -nb gpu -pme gpu -ntmpi 2 -ntomp 12 -npme 1 -gputasks 01     53.176    0.451
  gmx mdrun -deffnm topol -nb gpu -pme gpu -ntmpi 3 -ntomp 12 -npme 1 -gputasks 012    50.024    0.48

Villin system (4723 atoms)                                                        ns/day    h/ns
  gmx mdrun -deffnm topol -nb gpu -pme gpu -ntmpi 1 -ntomp 12                    589.635    0.041
  gmx mdrun -deffnm topol -nb gpu -pme gpu -ntmpi 2 -ntomp 12 -npme 1 -gputasks 01    727.139    0.033
  gmx mdrun -deffnm topol -nb gpu -pme gpu -ntmpi 3 -ntomp 12 -npme 1 -gputasks 012   664.695    0.036


The performance seems very strange: increasing the number of GPUs gives no
speedup for the big system, while for the small system the performance peaks
with 2 GPUs. Can I ask all of you whether I'm using the GPU selection options
in the right way?

Moreover, I'm not sure about the right usage of the -ntomp option; I thought
to ask about that in another thread.
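
For reference, a hedged sketch of how the rank/thread options combine on a
node like this (36 physical cores, 3 GPUs; the numbers are illustrative, not a
tuning recommendation):

  # total threads = -ntmpi x -ntomp; keep it at or below the cores you want to use
  gmx mdrun -deffnm topol -nb gpu -pme gpu -pin on \
            -ntmpi 3 -ntomp 12 -npme 1 -gputasks 012
  # 3 ranks x 12 threads = 36 threads on the 36 physical cores; two PP ranks plus
  # one separate PME rank (-npme 1), each mapped to its own GPU via -gputasks.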

Thanks a lot!
Carmine




Re: [gmx-users] [Performance] poor performance with NV V100

2019-10-16 Thread Szilárd Páll
Hi,

Please keep the conversation on the mailing list.

GROMACS uses both CPUs and GPUs for computation. Your runs limit the core
count per rank, and do so in a way that leaves the rest of the cores idle.
This is not a suitable approach for realistic benchmarking, because clock
boosting will skew your scaling results.

Secondly, you should consider using PME offload as well; see the docs and
previous discussions on the list for how to do so.

Last, if you are evaluating hardware for some use-cases, do make sure you
set up your benchmarks such that they reflect the intended use cases (e.g.
scaling vs throughput), and please check out the best practices for how to
run GROMACS on GPU servers.

You might also be interested in a recent study we did:
https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.26011

Cheers,

--
Szilárd


On Tue, Oct 8, 2019 at 3:00 PM Jimmy Chen  wrote:

> Hi Szilard,
>
> Thanks for your help.
> Is md.log enough for you to clarify where the bottleneck is located?
> If you need another log, please let me know.
>
> I just checked the release note of 2019.4, I didn't see any major release
> impact the performance of intra-node.
>
> http://manual.gromacs.org/documentation/2020-beta1/release-notes/2019/2019.4.html
>
> anyway, I will have a try on 2019.4 later.
>
> looking forward to check new feature which will be on 2/3 beta release of
> 2020.
>
> Best regards,
> Jimmy
>
>
> Szilárd Páll  於 2019年10月8日 週二 下午8:34寫道:
>
>> Hi,
>>
>> Can you please share your log files? we may be able to help with spotting
>> performance issues or bottlenecks.
>> However, note that for NVIDIA are the best source to aid you with
>> reproducing their benchmark numbers, we
>>
>> Scaling across multiple GPUs requires some tuning of command line options,
>> please see the related discussion on the list ((briefly: use multiple
>> ranks
>> per GPU, and one separate PME rank with GPU offload).
>>
>> Also note that intra-node strong scaling optimization target of recent
>> releases (there are no p2p optimizations either), however new features
>> going into the 2020 release will improve things significantly. Keep an eye
>> out on the beta2/3 releases if you are interested in checking out the new
>> features.
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Mon, Oct 7, 2019 at 7:48 AM Jimmy Chen  wrote:
>>
>> > Hi,
>> >
>> > I'm using NV v100 to evaluate if it's suitable to do purchase.
>> > But I can't get similar test result as referenced performance data
>> > which was got from internet.
>> > https://developer.nvidia.com/hpc-application-performance
>> >
>> >
>> https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf
>> >
>> >
>> > No matter using docker tag 18.02 from
>> > https://ngc.nvidia.com/catalog/containers/hpc:gromacs/tags
>> >
>> > or gromacs source code from
>> > ftp://ftp.gromacs.org/pub/gromacs/gromacs-2019.3.tar.gz
>> >
>> > test data set is ADH dodec and water 1.5M
>> > gmx grompp -f pme_verlet.mdp
>> > gmx mdrun -ntmpi 1 -nb gpu -pin on -v -noconfout -nsteps 5000 -s
>> topol.tpr
>> > -ntomp 4
>> > and  gmx mdrun -ntmpi 2 -nb gpu -pin on -v -noconfout -nsteps 5000 -s
>> > topol.tpr -ntomp 4
>> >
>> > My CPU is Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
>> > and GPU is NV V100 16GB PCIE.
>> >
>> > For ADH dodec,
>> > The perf data of 2xV100 16GB PCIE in
>> > https://developer.nvidia.com/hpc-application-performance is 176
>> (ns/day).
>> > But I only can get 28 (ns/day). actually I can get 67(ns/day) with
>> 1xV100.
>> > I don't know why I got poorer result with 2xV100.
>> >
>> > For water 1.5M
>> > The perf data of 1xV100 16GB PCIE in
>> >
>> >
>> https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf
>> > is
>> > 9.83(ns/day) and 2xV100 is 10.41(ns/day).
>> > But what I got is 6.5(ns/day) with 1xV100 and 2(ns/day) with 2xV100.
>> >
>> > Could anyone give me some suggestions about how to clarify what's
>> problem
>> > to result to this perf data in my environment? Is my command to perform
>> the
>> > testing wrong? any suggested command to perform the testing?
>> > or which source code version is recommended to use now?
>> >
>> > btw, after checking the code, it seems MPI doesn't go through PCIE P2p
>> or
>> > RDMA, is it correct? any plan to implement this in MPI?
>> >
>> > Best regards,
>> > Jimmy

Re: [gmx-users] [Performance] poor performance with NV V100

2019-10-08 Thread Szilárd Páll
Hi,

Can you please share your log files? We may be able to help with spotting
performance issues or bottlenecks. However, note that NVIDIA themselves are
the best source to aid you with reproducing their benchmark numbers.

Scaling across multiple GPUs requires some tuning of command line options;
please see the related discussion on the list (briefly: use multiple ranks
per GPU, and one separate PME rank with GPU offload).
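
As a hedged illustration of that advice (the rank counts, thread counts and
GPU mapping below are placeholders for a 2-GPU node, not a tested recipe):

  # Several ranks per GPU plus one separate PME rank, all offloaded:
  gmx mdrun -s topol.tpr -pin on \
            -ntmpi 4 -ntomp 4 -npme 1 \
            -nb gpu -pme gpu \
            -gputasks 0011   # 3 PP tasks on GPUs 0,0,1 and the PME task on GPU 1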

Also note that intra-node strong scaling has not been an optimization target
of recent releases (there are no p2p optimizations either); however, new
features going into the 2020 release will improve things significantly. Keep
an eye on the beta2/3 releases if you are interested in checking out the new
features.

Cheers,
--
Szilárd


On Mon, Oct 7, 2019 at 7:48 AM Jimmy Chen  wrote:

> Hi,
>
> I'm using NV v100 to evaluate if it's suitable to do purchase.
> But I can't get similar test result as referenced performance data
> which was got from internet.
> https://developer.nvidia.com/hpc-application-performance
>
> https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf
>
>
> No matter using docker tag 18.02 from
> https://ngc.nvidia.com/catalog/containers/hpc:gromacs/tags
>
> or gromacs source code from
> ftp://ftp.gromacs.org/pub/gromacs/gromacs-2019.3.tar.gz
>
> test data set is ADH dodec and water 1.5M
> gmx grompp -f pme_verlet.mdp
> gmx mdrun -ntmpi 1 -nb gpu -pin on -v -noconfout -nsteps 5000 -s topol.tpr
> -ntomp 4
> and  gmx mdrun -ntmpi 2 -nb gpu -pin on -v -noconfout -nsteps 5000 -s
> topol.tpr -ntomp 4
>
> My CPU is Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
> and GPU is NV V100 16GB PCIE.
>
> For ADH dodec,
> The perf data of 2xV100 16GB PCIE in
> https://developer.nvidia.com/hpc-application-performance is 176 (ns/day).
> But I only can get 28 (ns/day). actually I can get 67(ns/day) with 1xV100.
> I don't know why I got poorer result with 2xV100.
>
> For water 1.5M
> The perf data of 1xV100 16GB PCIE in
>
> https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf
> is
> 9.83(ns/day) and 2xV100 is 10.41(ns/day).
> But what I got is 6.5(ns/day) with 1xV100 and 2(ns/day) with 2xV100.
>
> Could anyone give me some suggestions about how to clarify what's problem
> to result to this perf data in my environment? Is my command to perform the
> testing wrong? any suggested command to perform the testing?
> or which source code version is recommended to use now?
>
> btw, after checking the code, it seems MPI doesn't go through PCIE P2p or
> RDMA, is it correct? any plan to implement this in MPI?
>
> Best regards,
> Jimmy

[gmx-users] [Performance] poor performance with NV V100

2019-10-06 Thread Jimmy Chen
Hi,

I'm using an NV V100 to evaluate whether it's suitable for purchase.
But I can't get test results similar to the reference performance data
that I found on the internet.
https://developer.nvidia.com/hpc-application-performance
https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf


No matter using docker tag 18.02 from
https://ngc.nvidia.com/catalog/containers/hpc:gromacs/tags

or gromacs source code from
ftp://ftp.gromacs.org/pub/gromacs/gromacs-2019.3.tar.gz

test data set is ADH dodec and water 1.5M
gmx grompp -f pme_verlet.mdp
gmx mdrun -ntmpi 1 -nb gpu -pin on -v -noconfout -nsteps 5000 -s topol.tpr
-ntomp 4
and  gmx mdrun -ntmpi 2 -nb gpu -pin on -v -noconfout -nsteps 5000 -s
topol.tpr -ntomp 4

My CPU is Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
and GPU is NV V100 16GB PCIE.

For ADH dodec,
The perf data of 2xV100 16GB PCIE in
https://developer.nvidia.com/hpc-application-performance is 176 (ns/day).
But I can only get 28 ns/day; actually, I can get 67 ns/day with 1x V100.
I don't know why I get a poorer result with 2x V100.

For water 1.5M
The perf data of 1xV100 16GB PCIE in
https://www.hpc.co.jp/images/pdf/benchmark/Molecular-Dynamics-March-2018.pdf is
9.83(ns/day) and 2xV100 is 10.41(ns/day).
But what I got is 6.5(ns/day) with 1xV100 and 2(ns/day) with 2xV100.

Could anyone give me some suggestions on how to work out what causes this
performance in my environment? Is my command for running the tests wrong? Is
there a suggested command for the testing, or which source code version is
recommended to use now?

BTW, after checking the code, it seems MPI doesn't go through PCIe P2P or
RDMA; is that correct? Any plan to implement this in MPI?

Best regards,
Jimmy


[gmx-users] Performance with Epyc Rome

2019-08-28 Thread Jochen Hub

Dear Gromacs users,

does someone already have experience with the new AMD Epyc Rome? Can we
expect that 4 Epyc cores per Nvidia RTX 2080 on a CPU/GPU node are
sufficient for common simulations (as one would expect with a common
Intel Xeon)?


Many thanks,
Jochen


--
---
Dr. Jochen Hub
Computational Molecular Biophysics Group
Institute for Microbiology and Genetics
Georg-August-University of Göttingen
Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany.
Phone: +49-551-39-14189
http://cmb.bio.uni-goettingen.de/
---

Re: [gmx-users] Performance, gpu

2019-08-28 Thread Mark Abraham
Hi,

Your command chooses 44 PME ranks, thus 88-44=44 PP ranks. It gives each of
those 6 threads and 4 threads respectively. That's 44*6 + 44*4 = 440 threads,
which is far more than the 88 total cores in your 4 nodes, i.e.
over-subscription. The number of PME-only ranks just changes how much it's
over-subscribed.

I'd be starting my investigation with

aprun -n 8 gmx_mpi mdrun -npme 4

and let the defaults work out that there's 1 GPU and 11 OpenMP threads per
rank to achieve full utilization.
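
Spelled out as a hedged sketch (the per-node rank placement and the explicit
thread counts are assumptions about these 22-core nodes, written out only for
clarity):

  # 8 ranks over 4 nodes = 2 ranks per node; 4 of the 8 are PME-only ranks
  aprun -n 8 -N 2 -d 11 gmx_mpi mdrun -deffnm out -s out.tpr \
        -npme 4 -ntomp 11 -ntomp_pme 11 -nb gpu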

Mark

On Wed, 28 Aug 2019 at 17:31, Alex  wrote:

> Dear all,
> Whatever "-npme" likes 22, 44, 24, 48 ..  I use in below command, I always
> get the "WARNING: On rank 0: oversubscribing the available XXX logical CPU
> core per node with 88 threads, This will cause considerable performance
> loss."
>
> aprun -n 88 gmx_mpi mdrun -deffnm out -s out.tpr -g out.log -v -dlb yes
> -gcom 1 -nb gpu -npme 44 -ntomp 4 -ntomp_pme 6 -tunepme yes
>
> would you please help me choose a correct combinations of -npme and  ...
> to get a better performance, according to the attached case.log file in my
> previous email?
> Regards,
> Alex
>
> On Sat, Aug 24, 2019 at 11:21 AM Mark Abraham 
> wrote:
>
> > Hi,
> >
> > There's a thread oversubscription warning in your log file that you
> should
> > definitely have read and acted upon :-) I'd be running more like one PP
> > rank per gpu and 4 PME ranks, picking ntomp and ntomp_pme according to
> what
> > gives best performance (which could require configuring your MPI
> invocation
> > accordingly).
> >
> > Mark
> >
> > On Fri., 23 Aug. 2019, 21:00 Alex,  wrote:
> >
> > > Dear Gromacs user,
> > > Using a machine with below configurations and also below command I
> tried
> > to
> > > simulate a system with 479K atoms (mainly water) on CPU-GPU, the
> > > performance is around 1ns per 1 hour.
> > > According the information and also shared log file below, I would be so
> > > appreciated if you could comment on the submission command to improve
> the
> > > performance by involving better the GPU and CPU.
> > >
> > > %
> > > #PBS -l select=4:ncpus=22:mpiprocs=22:ngpus=1
> > > export OMP_NUM_THREADS=4
> > >
> > > aprun -n 88 gmx_mpi mdrun -deffnm out -s out.tpr -g out.log -v -dlb yes
> > > -gcom 1 -nb gpu -npme 44 -ntomp 4 -ntomp_pme 6 -tunepme yes
> > >
> > > Running on 4 nodes with total 88 cores, 176 logical cores, 4 compatible
> > > GPUs
> > >   Cores per node:   22
> > >   Logical cores per node:   44
> > >   Compatible GPUs per node:  1
> > >   All nodes have identical type(s) of GPUs
> > >
> > > %
> > > GROMACS version:2018.1
> > > Precision:  single
> > > Memory model:   64 bit
> > > MPI library:MPI
> > > OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> > > GPU support:CUDA
> > > SIMD instructions:  AVX2_256
> > > FFT library:
> commercial-fftw-3.3.6-pl1-fma-sse2-avx-avx2-avx2_128
> > > RDTSCP usage:   enabled
> > > TNG support:enabled
> > > Hwloc support:  hwloc-1.11.0
> > > Tracing support:disabled
> > > Built on:   2018-09-12 20:34:33
> > > Built by:   
> > > Build OS/arch:  Linux 3.12.61-52.111-default x86_64
> > > Build CPU vendor:   Intel
> > > Build CPU brand:Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > > Build CPU family:   6   Model: 79   Stepping: 1
> > > Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle
> > htt
> > > intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse
> > rdrnd
> > > rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> > > C compiler: /opt/cray/pe/craype/2.5.13/bin/cc GNU 5.3.0
> > > C compiler flags:-march=core-avx2 -O3 -DNDEBUG
> -funroll-all-loops
> > > -fexcess-precision=fast
> > > C++ compiler:   /opt/cray/pe/craype/2.5.13/bin/CC GNU 5.3.0
> > > C++ compiler flags:  -march=core-avx2-std=c++11   -O3 -DNDEBUG
> > > -funroll-all-loops -fexcess-precision=fast
> > > CUDA compiler:
> > > /opt/nvidia/cudatoolkit8.0/8.0.61_2.3.13_g32c34f9-2.1/bin/nvcc nvcc:
> > NVIDIA
> > > (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA
> Corporation;Built
> > > on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0,
> > > V8.0.61
> > > CUDA compiler
> > >
> > >
> >
> flags:-gencode;arch=compute_60,code=sm_60;-use_fast_math;-Wno-deprecated-gpu-targets;;;
> > >
> > >
> >
> ;-march=core-avx2;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> > > CUDA driver:9.20
> > > CUDA runtime:   8.0
> > > %-
> > > Log file:
> > > https://drive.google.com/open?id=1-myQ5rP85UWKb1262QDPa6kYhuzHPzLu
> > >
> > > Thank you,
> > > Alex

Re: [gmx-users] Performance, gpu

2019-08-28 Thread Alex
Dear all,
Whatever "-npme" likes 22, 44, 24, 48 ..  I use in below command, I always
get the "WARNING: On rank 0: oversubscribing the available XXX logical CPU
core per node with 88 threads, This will cause considerable performance
loss."

aprun -n 88 gmx_mpi mdrun -deffnm out -s out.tpr -g out.log -v -dlb yes
-gcom 1 -nb gpu -npme 44 -ntomp 4 -ntomp_pme 6 -tunepme yes

Would you please help me choose a correct combination of -npme and ... to
get better performance, according to the attached case.log file in my
previous email?
Regards,
Alex

On Sat, Aug 24, 2019 at 11:21 AM Mark Abraham 
wrote:

> Hi,
>
> There's a thread oversubscription warning in your log file that you should
> definitely have read and acted upon :-) I'd be running more like one PP
> rank per gpu and 4 PME ranks, picking ntomp and ntomp_pme according to what
> gives best performance (which could require configuring your MPI invocation
> accordingly).
>
> Mark
>
> On Fri., 23 Aug. 2019, 21:00 Alex,  wrote:
>
> > Dear Gromacs user,
> > Using a machine with below configurations and also below command I tried
> to
> > simulate a system with 479K atoms (mainly water) on CPU-GPU, the
> > performance is around 1ns per 1 hour.
> > According the information and also shared log file below, I would be so
> > appreciated if you could comment on the submission command to improve the
> > performance by involving better the GPU and CPU.
> >
> > %
> > #PBS -l select=4:ncpus=22:mpiprocs=22:ngpus=1
> > export OMP_NUM_THREADS=4
> >
> > aprun -n 88 gmx_mpi mdrun -deffnm out -s out.tpr -g out.log -v -dlb yes
> > -gcom 1 -nb gpu -npme 44 -ntomp 4 -ntomp_pme 6 -tunepme yes
> >
> > Running on 4 nodes with total 88 cores, 176 logical cores, 4 compatible
> > GPUs
> >   Cores per node:   22
> >   Logical cores per node:   44
> >   Compatible GPUs per node:  1
> >   All nodes have identical type(s) of GPUs
> >
> > %
> > GROMACS version:2018.1
> > Precision:  single
> > Memory model:   64 bit
> > MPI library:MPI
> > OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> > GPU support:CUDA
> > SIMD instructions:  AVX2_256
> > FFT library:commercial-fftw-3.3.6-pl1-fma-sse2-avx-avx2-avx2_128
> > RDTSCP usage:   enabled
> > TNG support:enabled
> > Hwloc support:  hwloc-1.11.0
> > Tracing support:disabled
> > Built on:   2018-09-12 20:34:33
> > Built by:   
> > Build OS/arch:  Linux 3.12.61-52.111-default x86_64
> > Build CPU vendor:   Intel
> > Build CPU brand:Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > Build CPU family:   6   Model: 79   Stepping: 1
> > Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt
> > intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> > rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> > C compiler: /opt/cray/pe/craype/2.5.13/bin/cc GNU 5.3.0
> > C compiler flags:-march=core-avx2 -O3 -DNDEBUG -funroll-all-loops
> > -fexcess-precision=fast
> > C++ compiler:   /opt/cray/pe/craype/2.5.13/bin/CC GNU 5.3.0
> > C++ compiler flags:  -march=core-avx2-std=c++11   -O3 -DNDEBUG
> > -funroll-all-loops -fexcess-precision=fast
> > CUDA compiler:
> > /opt/nvidia/cudatoolkit8.0/8.0.61_2.3.13_g32c34f9-2.1/bin/nvcc nvcc: NVIDIA
> > (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built
> > on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
> > CUDA compiler
> > flags:-gencode;arch=compute_60,code=sm_60;-use_fast_math;-Wno-deprecated-gpu-targets;;;
> > ;-march=core-avx2;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> > CUDA driver:9.20
> > CUDA runtime:   8.0
> > %-
> > Log file:
> > https://drive.google.com/open?id=1-myQ5rP85UWKb1262QDPa6kYhuzHPzLu
> >
> > Thank you,
> > Alex
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-requ...@gromacs.org.
> >
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
>
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read 

Re: [gmx-users] Performance, gpu

2019-08-24 Thread Mark Abraham
Hi,

There's a thread oversubscription warning in your log file that you should
definitely have read and acted upon :-) I'd be running more like one PP
rank per gpu and 4 PME ranks, picking ntomp and ntomp_pme according to what
gives best performance (which could require configuring your MPI invocation
accordingly).
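To put rough numbers on it (a sketch only; the exact aprun placement flags and
the best ntomp/ntomp_pme values need benchmarking on your site): the current
command places 22 ranks x 4 OpenMP threads = 88 threads on each node's 44
logical cores, which is what the warning reports. With 4 nodes x 1 GPU x 22
cores, one PP rank per GPU plus 4 PME ranks would look something like

aprun -n 8 -N 2 -d 11 gmx_mpi mdrun -deffnm out -s out.tpr -nb gpu -npme 4 -ntomp 11 -ntomp_pme 11

i.e. 8 ranks x 11 threads = 88 threads, one per physical core.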

Mark

On Fri., 23 Aug. 2019, 21:00 Alex,  wrote:

> Dear Gromacs user,
> Using a machine with below configurations and also below command I tried to
> simulate a system with 479K atoms (mainly water) on CPU-GPU, the
> performance is around 1ns per 1 hour.
> According the information and also shared log file below, I would be so
> appreciated if you could comment on the submission command to improve the
> performance by involving better the GPU and CPU.
>
> %
> #PBS -l select=4:ncpus=22:mpiprocs=22:ngpus=1
> export OMP_NUM_THREADS=4
>
> aprun -n 88 gmx_mpi mdrun -deffnm out -s out.tpr -g out.log -v -dlb yes
> -gcom 1 -nb gpu -npme 44 -ntomp 4 -ntomp_pme 6 -tunepme yes
>
> Running on 4 nodes with total 88 cores, 176 logical cores, 4 compatible
> GPUs
>   Cores per node:   22
>   Logical cores per node:   44
>   Compatible GPUs per node:  1
>   All nodes have identical type(s) of GPUs
>
> %
> GROMACS version:2018.1
> Precision:  single
> Memory model:   64 bit
> MPI library:MPI
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
> GPU support:CUDA
> SIMD instructions:  AVX2_256
> FFT library:commercial-fftw-3.3.6-pl1-fma-sse2-avx-avx2-avx2_128
> RDTSCP usage:   enabled
> TNG support:enabled
> Hwloc support:  hwloc-1.11.0
> Tracing support:disabled
> Built on:   2018-09-12 20:34:33
> Built by:   
> Build OS/arch:  Linux 3.12.61-52.111-default x86_64
> Build CPU vendor:   Intel
> Build CPU brand:Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Build CPU family:   6   Model: 79   Stepping: 1
> Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt
> intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /opt/cray/pe/craype/2.5.13/bin/cc GNU 5.3.0
> C compiler flags:-march=core-avx2 -O3 -DNDEBUG -funroll-all-loops
> -fexcess-precision=fast
> C++ compiler:   /opt/cray/pe/craype/2.5.13/bin/CC GNU 5.3.0
> C++ compiler flags:  -march=core-avx2-std=c++11   -O3 -DNDEBUG
> -funroll-all-loops -fexcess-precision=fast
> CUDA compiler:
> /opt/nvidia/cudatoolkit8.0/8.0.61_2.3.13_g32c34f9-2.1/bin/nvcc nvcc: NVIDIA
> (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built
> on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0,
> V8.0.61
> CUDA compiler
> flags:-gencode;arch=compute_60,code=sm_60;-use_fast_math;-Wno-deprecated-gpu-targets;;;
> ;-march=core-avx2;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
> CUDA driver:9.20
> CUDA runtime:   8.0
> %-
> Log file:
> https://drive.google.com/open?id=1-myQ5rP85UWKb1262QDPa6kYhuzHPzLu
>
> Thank you,
> Alex
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
>
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


[gmx-users] Performance, gpu

2019-08-23 Thread Alex
Dear Gromacs user,
Using the machine configuration and the command below, I tried to simulate a
system of 479K atoms (mainly water) on CPU+GPU; the performance is around
1 ns per hour.
Given this information and the shared log file below, I would appreciate it if
you could comment on the submission command to improve the performance by
making better use of the GPU and CPU.

%
#PBS -l select=4:ncpus=22:mpiprocs=22:ngpus=1
export OMP_NUM_THREADS=4

aprun -n 88 gmx_mpi mdrun -deffnm out -s out.tpr -g out.log -v -dlb yes
-gcom 1 -nb gpu -npme 44 -ntomp 4 -ntomp_pme 6 -tunepme yes

Running on 4 nodes with total 88 cores, 176 logical cores, 4 compatible GPUs
  Cores per node:   22
  Logical cores per node:   44
  Compatible GPUs per node:  1
  All nodes have identical type(s) of GPUs

%
GROMACS version:2018.1
Precision:  single
Memory model:   64 bit
MPI library:MPI
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 64)
GPU support:CUDA
SIMD instructions:  AVX2_256
FFT library:commercial-fftw-3.3.6-pl1-fma-sse2-avx-avx2-avx2_128
RDTSCP usage:   enabled
TNG support:enabled
Hwloc support:  hwloc-1.11.0
Tracing support:disabled
Built on:   2018-09-12 20:34:33
Built by:   
Build OS/arch:  Linux 3.12.61-52.111-default x86_64
Build CPU vendor:   Intel
Build CPU brand:Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Build CPU family:   6   Model: 79   Stepping: 1
Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma hle htt
intel lahf mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
rdtscp rtm sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /opt/cray/pe/craype/2.5.13/bin/cc GNU 5.3.0
C compiler flags:-march=core-avx2 -O3 -DNDEBUG -funroll-all-loops
-fexcess-precision=fast
C++ compiler:   /opt/cray/pe/craype/2.5.13/bin/CC GNU 5.3.0
C++ compiler flags:  -march=core-avx2-std=c++11   -O3 -DNDEBUG
-funroll-all-loops -fexcess-precision=fast
CUDA compiler:
/opt/nvidia/cudatoolkit8.0/8.0.61_2.3.13_g32c34f9-2.1/bin/nvcc nvcc: NVIDIA
(R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built
on Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
CUDA compiler
flags:-gencode;arch=compute_60,code=sm_60;-use_fast_math;-Wno-deprecated-gpu-targets;;;
;-march=core-avx2;-std=c++11;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;
CUDA driver:9.20
CUDA runtime:   8.0
%-
Log file:
https://drive.google.com/open?id=1-myQ5rP85UWKb1262QDPa6kYhuzHPzLu

Thank you,
Alex
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-30 Thread Szilárd Páll
On Tue, Jul 30, 2019 at 3:29 PM Carlos Navarro
 wrote:
>
> Hi all,
> First of all, thanks for all your valuable inputs!!.
> I tried Szilárd suggestion (multi simulations) with the following commands
> (using a single node):
>
> EXE="mpirun -np 4 gmx_mpi mdrun "
>
> cd $WORKDIR0
> #$DO_PARALLEL
> $EXE -s 4q.tpr -deffnm 4q -dlb yes -resethway -multidir 1 2 3 4
> And I noticed that the performance went from 37,32,23,22 ns/day to ~42
> ns/day in all four simulations. I check that the 80 processors were been
> used a 100% of the time, while the gpu was used about a 50% (from a 70%
> when running a single simulation in the node where I obtain a performance
> of ~50 ns/day).

Great!

Note that optimizing hardware utilization doesn't always maximize performance.

Also, manual launches with pinoffset/pinstride will give exactly the
same performance as the multi runs *if* you get the affinities right.
In your original commands you tried to use 20 of the 80 hardware threads
per run, but you offset the runs only by 10 (hardware threads), which means
that the runs were overlapping and interfering with each other as well as
ending up under-utilizing the hardware.
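To make that concrete (a sketch for this node type only): with 4 runs of 20
threads each on 80 hardware threads, non-overlapping placements would use
-pin on with -pinoffset 0, 20, 40 and 60 (one 20-thread block per run) rather
than offsets of 10.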

> So overall I'm quite happy with the performance I'm getting now; and
> honestly, I don't know if at some point I can get the same performance
> (running 4 jobs) that I'm getting running just one.

No, but you _may_ get a bit more aggregate performance if you run 8
concurrent jobs. Also, you can try 1 thread per core ("mpirun -np 4
gmx mdrun_mpi -multi 4 -ntomp 10 -pin on") to use only half of the
threads.

Cheers,
--
Szilárd

> Best regards,
> Carlos
>
> ——
> Carlos Navarro Retamal
> Bioinformatic Engineering. PhD.
> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> Simulations
> Universidad de Talca
> Av. Lircay S/N, Talca, Chile
> E: carlos.navarr...@gmail.com or cnava...@utalca.cl
>
> On July 29, 2019 at 6:11:31 PM, Mark Abraham (mark.j.abra...@gmail.com)
> wrote:
>
> Hi,
>
> Yes and the -nmpi I copied from Carlos's post is ineffective - use -ntmpi
>
> Mark
>
>
> On Mon., 29 Jul. 2019, 15:15 Justin Lemkul,  wrote:
>
> >
> >
> > On 7/29/19 8:46 AM, Carlos Navarro wrote:
> > > Hi Mark,
> > > I tried that before, but unfortunately in that case (removing
> —gres=gpu:1
> > > and including in each line the -gpu_id flag) for some reason the jobs
> are
> > > run one at a time (one after the other), so I can’t use properly the
> > whole
> > > node.
> > >
> >
> > You need to run all but the last mdrun process in the background (&).
> >
> > -Justin
> >
> > > ——
> > > Carlos Navarro Retamal
> > > Bioinformatic Engineering. PhD.
> > > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > > Simulations
> > > Universidad de Talca
> > > Av. Lircay S/N, Talca, Chile
> > > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> > >
> > > On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abra...@gmail.com)
> > > wrote:
> > >
> > > Hi,
> > >
> > > When you use
> > >
> > > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> > >
> > > then the environment seems to make sure only one GPU is visible. (The
> log
> > > files report only finding one GPU.) But it's probably the same GPU in
> > each
> > > case, with three remaining idle. I would suggest not using --gres unless
> > > you can specify *which* of the four available GPUs each run can use.
> > >
> > > Otherwise, don't use --gres and use the facilities built into GROMACS,
> > e.g.
> > >
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> > > -ntomp 20 -gpu_id 0
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset
> 10
> > > -ntomp 20 -gpu_id 1
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset
> 20
> > > -ntomp 20 -gpu_id 2
> > > etc.
> > >
> > > Mark
> > >
> > > On Mon, 29 Jul 2019 at 11:34, Carlos Navarro  > >
> > > wrote:
> > >
> > >> Hi Szilárd,
> > >> To answer your questions:
> > >> **are you trying to run multiple simulations concurrently on the same
> > >> node or are you trying to strong-scale?
> > >> I'm trying to run multiple simulations on the same node at the same
> > time.
> > >>
> > >> ** what are you simulating?
> > >> Regular and CompEl simulations
> > >>
> > >> ** can you provide log files of the runs?
> > >> In the following link are some logs files:
> > >> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> > >> In short, alone.log -> single run in the node (using 1 gpu).
> > >> multi1/2/3/4.log ->4 independent simulations ran at the same time in a
> > >> single node. In all cases, 20 cpus are used.
> > >> Best regards,
> > >> Carlos
> > >>
> > >> El jue., 25 jul. 2019 a las 10:59, Szilárd Páll (<
> > pall.szil...@gmail.com>)
> > >> escribió:
> > >>
> > >>> Hi,
> > >>>
> > >>> It is not clear to me how are you trying to set up your runs, so
> > >>> please provide some details:
> > >>> - are you trying to run multiple simulations concurrently on the same
> > >>> node or are you trying to 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-30 Thread Carlos Navarro
Hi all,
First of all, thanks for all your valuable input!
I tried Szilárd's suggestion (multi-simulations) with the following commands
(using a single node):

EXE="mpirun -np 4 gmx_mpi mdrun "

cd $WORKDIR0
#$DO_PARALLEL
$EXE -s 4q.tpr -deffnm 4q -dlb yes -resethway -multidir 1 2 3 4
And I noticed that the performance went from 37, 32, 23 and 22 ns/day to ~42
ns/day in all four simulations. I checked that the 80 processors were being
used 100% of the time, while the GPU was used at about 50% (down from ~70%
when running a single simulation on the node, where I obtain a performance
of ~50 ns/day).
So overall I'm quite happy with the performance I'm getting now; and
honestly, I don't know if at some point I can get the same performance
(running 4 jobs) that I'm getting running just one.
Best regards,
Carlos

——
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

On July 29, 2019 at 6:11:31 PM, Mark Abraham (mark.j.abra...@gmail.com)
wrote:

Hi,

Yes and the -nmpi I copied from Carlos's post is ineffective - use -ntmpi

Mark


On Mon., 29 Jul. 2019, 15:15 Justin Lemkul,  wrote:

>
>
> On 7/29/19 8:46 AM, Carlos Navarro wrote:
> > Hi Mark,
> > I tried that before, but unfortunately in that case (removing
—gres=gpu:1
> > and including in each line the -gpu_id flag) for some reason the jobs
are
> > run one at a time (one after the other), so I can’t use properly the
> whole
> > node.
> >
>
> You need to run all but the last mdrun process in the background (&).
>
> -Justin
>
> > ——
> > Carlos Navarro Retamal
> > Bioinformatic Engineering. PhD.
> > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > Simulations
> > Universidad de Talca
> > Av. Lircay S/N, Talca, Chile
> > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> >
> > On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abra...@gmail.com)
> > wrote:
> >
> > Hi,
> >
> > When you use
> >
> > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> >
> > then the environment seems to make sure only one GPU is visible. (The
log
> > files report only finding one GPU.) But it's probably the same GPU in
> each
> > case, with three remaining idle. I would suggest not using --gres unless
> > you can specify *which* of the four available GPUs each run can use.
> >
> > Otherwise, don't use --gres and use the facilities built into GROMACS,
> e.g.
> >
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> > -ntomp 20 -gpu_id 0
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset
10
> > -ntomp 20 -gpu_id 1
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset
20
> > -ntomp 20 -gpu_id 2
> > etc.
> >
> > Mark
> >
> > On Mon, 29 Jul 2019 at 11:34, Carlos Navarro  >
> > wrote:
> >
> >> Hi Szilárd,
> >> To answer your questions:
> >> **are you trying to run multiple simulations concurrently on the same
> >> node or are you trying to strong-scale?
> >> I'm trying to run multiple simulations on the same node at the same
> time.
> >>
> >> ** what are you simulating?
> >> Regular and CompEl simulations
> >>
> >> ** can you provide log files of the runs?
> >> In the following link are some logs files:
> >> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> >> In short, alone.log -> single run in the node (using 1 gpu).
> >> multi1/2/3/4.log ->4 independent simulations ran at the same time in a
> >> single node. In all cases, 20 cpus are used.
> >> Best regards,
> >> Carlos
> >>
> >> El jue., 25 jul. 2019 a las 10:59, Szilárd Páll (<
> pall.szil...@gmail.com>)
> >> escribió:
> >>
> >>> Hi,
> >>>
> >>> It is not clear to me how are you trying to set up your runs, so
> >>> please provide some details:
> >>> - are you trying to run multiple simulations concurrently on the same
> >>> node or are you trying to strong-scale?
> >>> - what are you simulating?
> >>> - can you provide log files of the runs?
> >>>
> >>> Cheers,
> >>>
> >>> --
> >>> Szilárd
> >>>
> >>> On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> >>>  wrote:
>  No one can give me an idea of what can be happening? Or how I can
> > solve
> >>> it?
>  Best regards,
>  Carlos
> 
>  ——
>  Carlos Navarro Retamal
>  Bioinformatic Engineering. PhD.
>  Postdoctoral Researcher in Center of Bioinformatics and Molecular
>  Simulations
>  Universidad de Talca
>  Av. Lircay S/N, Talca, Chile
>  E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> 
>  On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> >>> carlos.navarr...@gmail.com)
>  wrote:
> 
>  Dear gmx-users,
>  I’m currently working in a server where each node posses 40 physical
> >>> cores
>  (40 threads) and 4 Nvidia-V100.
>  When I launch a single job (1 simulation using a single gpu card) I
> >> get a
>  performance 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-29 Thread Mark Abraham
Hi,

Yes and the -nmpi I copied from Carlos's post is ineffective - use -ntmpi
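In other words, where the quoted commands say "-nmpi 1", a sketch of the
corrected form (assuming a thread-MPI gmx build) would be

$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -pin on -pinoffset 0 -ntomp 20 -gpu_id 0

with -ntmpi in place of -nmpi.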

Mark


On Mon., 29 Jul. 2019, 15:15 Justin Lemkul,  wrote:

>
>
> On 7/29/19 8:46 AM, Carlos Navarro wrote:
> > Hi Mark,
> > I tried that before, but unfortunately in that case (removing —gres=gpu:1
> > and including in each line the -gpu_id flag) for some reason the jobs are
> > run one at a time (one after the other), so I can’t use properly the
> whole
> > node.
> >
>
> You need to run all but the last mdrun process in the background (&).
>
> -Justin
>
> > ——
> > Carlos Navarro Retamal
> > Bioinformatic Engineering. PhD.
> > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > Simulations
> > Universidad de Talca
> > Av. Lircay S/N, Talca, Chile
> > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> >
> > On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abra...@gmail.com)
> > wrote:
> >
> > Hi,
> >
> > When you use
> >
> > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> >
> > then the environment seems to make sure only one GPU is visible. (The log
> > files report only finding one GPU.) But it's probably the same GPU in
> each
> > case, with three remaining idle. I would suggest not using --gres unless
> > you can specify *which* of the four available GPUs each run can use.
> >
> > Otherwise, don't use --gres and use the facilities built into GROMACS,
> e.g.
> >
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> > -ntomp 20 -gpu_id 0
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
> > -ntomp 20 -gpu_id 1
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20
> > -ntomp 20 -gpu_id 2
> > etc.
> >
> > Mark
> >
> > On Mon, 29 Jul 2019 at 11:34, Carlos Navarro  >
> > wrote:
> >
> >> Hi Szilárd,
> >> To answer your questions:
> >> **are you trying to run multiple simulations concurrently on the same
> >> node or are you trying to strong-scale?
> >> I'm trying to run multiple simulations on the same node at the same
> time.
> >>
> >> ** what are you simulating?
> >> Regular and CompEl simulations
> >>
> >> ** can you provide log files of the runs?
> >> In the following link are some logs files:
> >> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> >> In short, alone.log -> single run in the node (using 1 gpu).
> >> multi1/2/3/4.log ->4 independent simulations ran at the same time in a
> >> single node. In all cases, 20 cpus are used.
> >> Best regards,
> >> Carlos
> >>
> >> El jue., 25 jul. 2019 a las 10:59, Szilárd Páll (<
> pall.szil...@gmail.com>)
> >> escribió:
> >>
> >>> Hi,
> >>>
> >>> It is not clear to me how are you trying to set up your runs, so
> >>> please provide some details:
> >>> - are you trying to run multiple simulations concurrently on the same
> >>> node or are you trying to strong-scale?
> >>> - what are you simulating?
> >>> - can you provide log files of the runs?
> >>>
> >>> Cheers,
> >>>
> >>> --
> >>> Szilárd
> >>>
> >>> On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> >>>  wrote:
>  No one can give me an idea of what can be happening? Or how I can
> > solve
> >>> it?
>  Best regards,
>  Carlos
> 
>  ——
>  Carlos Navarro Retamal
>  Bioinformatic Engineering. PhD.
>  Postdoctoral Researcher in Center of Bioinformatics and Molecular
>  Simulations
>  Universidad de Talca
>  Av. Lircay S/N, Talca, Chile
>  E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> 
>  On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> >>> carlos.navarr...@gmail.com)
>  wrote:
> 
>  Dear gmx-users,
>  I’m currently working in a server where each node posses 40 physical
> >>> cores
>  (40 threads) and 4 Nvidia-V100.
>  When I launch a single job (1 simulation using a single gpu card) I
> >> get a
>  performance of about ~35ns/day in a system of about 300k atoms.
> > Looking
>  into the usage of the video card during the simulation I notice that
> >> the
>  card is being used about and ~80%.
>  The problems arise when I increase the number of jobs running at the
> >> same
>  time. If for instance 2 jobs are running at the same time, the
> >>> performance
>  drops to ~25ns/day each and the usage of the video cards also drops
> >>> during
>  the simulation to about a ~30-40% (and sometimes dropping to less than
> >>> 5%).
>  Clearly there is a communication problem between the gpu cards and the
> >>> cpu
>  during the simulations, but I don’t know how to solve this.
>  Here is the script I use to run the simulations:
> 
>  #!/bin/bash -x
>  #SBATCH --job-name=testAtTPC1
>  #SBATCH --ntasks-per-node=4
>  #SBATCH --cpus-per-task=20
>  #SBATCH --account=hdd22
>  #SBATCH --nodes=1
>  #SBATCH --mem=0
>  #SBATCH --output=sout.%j
>  #SBATCH --error=s4err.%j
>  #SBATCH --time=00:10:00
>  #SBATCH --partition=develgpus
>  #SBATCH --gres=gpu:4
> 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-29 Thread Szilárd Páll
Carlos,

You can accomplish the same using the multi-simulation feature of
mdrun and avoid having to manually manage the placement of runs, e.g.
instead of the above you just write
gmx mdrun_mpi -np N -multidir $WORKDIR1 $WORKDIR2 $WORKDIR3 ...
For more details see
http://manual.gromacs.org/documentation/current/user-guide/mdrun-features.html#running-multi-simulations
Note that if the different runs have different speed, just as with
your manual launch, your machine can end up partially utilized when
some of the runs finish.
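A minimal sketch of such a launch for the four directories from your script
(assuming an MPI-enabled build started through mpirun; the binary name and
per-run options would need adapting to your site):

mpirun -np 4 gmx_mpi mdrun -multidir $WORKDIR1 $WORKDIR2 $WORKDIR3 $WORKDIR4 -ntomp 20 -pin on

Each MPI rank then runs one of the four simulations in its own directory, and
mdrun takes care of the thread pinning across them.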

Cheers,
--
Szilárd

On Mon, Jul 29, 2019 at 2:46 PM Carlos Navarro
 wrote:
>
> Hi Mark,
> I tried that before, but unfortunately in that case (removing —gres=gpu:1
> and including in each line the -gpu_id flag) for some reason the jobs are
> run one at a time (one after the other), so I can’t use properly the whole
> node.
>
>
> ——
> Carlos Navarro Retamal
> Bioinformatic Engineering. PhD.
> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> Simulations
> Universidad de Talca
> Av. Lircay S/N, Talca, Chile
> E: carlos.navarr...@gmail.com or cnava...@utalca.cl
>
> On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abra...@gmail.com)
> wrote:
>
> Hi,
>
> When you use
>
> DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
>
> then the environment seems to make sure only one GPU is visible. (The log
> files report only finding one GPU.) But it's probably the same GPU in each
> case, with three remaining idle. I would suggest not using --gres unless
> you can specify *which* of the four available GPUs each run can use.
>
> Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.
>
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> -ntomp 20 -gpu_id 0
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
> -ntomp 20 -gpu_id 1
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20
> -ntomp 20 -gpu_id 2
> etc.
>
> Mark
>
> On Mon, 29 Jul 2019 at 11:34, Carlos Navarro 
> wrote:
>
> > Hi Szilárd,
> > To answer your questions:
> > **are you trying to run multiple simulations concurrently on the same
> > node or are you trying to strong-scale?
> > I'm trying to run multiple simulations on the same node at the same time.
> >
> > ** what are you simulating?
> > Regular and CompEl simulations
> >
> > ** can you provide log files of the runs?
> > In the following link are some logs files:
> > https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> > In short, alone.log -> single run in the node (using 1 gpu).
> > multi1/2/3/4.log ->4 independent simulations ran at the same time in a
> > single node. In all cases, 20 cpus are used.
> > Best regards,
> > Carlos
> >
> > El jue., 25 jul. 2019 a las 10:59, Szilárd Páll ()
> > escribió:
> >
> > > Hi,
> > >
> > > It is not clear to me how are you trying to set up your runs, so
> > > please provide some details:
> > > - are you trying to run multiple simulations concurrently on the same
> > > node or are you trying to strong-scale?
> > > - what are you simulating?
> > > - can you provide log files of the runs?
> > >
> > > Cheers,
> > >
> > > --
> > > Szilárd
> > >
> > > On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> > >  wrote:
> > > >
> > > > No one can give me an idea of what can be happening? Or how I can
> solve
> > > it?
> > > > Best regards,
> > > > Carlos
> > > >
> > > > ——
> > > > Carlos Navarro Retamal
> > > > Bioinformatic Engineering. PhD.
> > > > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > > > Simulations
> > > > Universidad de Talca
> > > > Av. Lircay S/N, Talca, Chile
> > > > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> > > >
> > > > On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> > > carlos.navarr...@gmail.com)
> > > > wrote:
> > > >
> > > > Dear gmx-users,
> > > > I’m currently working in a server where each node posses 40 physical
> > > cores
> > > > (40 threads) and 4 Nvidia-V100.
> > > > When I launch a single job (1 simulation using a single gpu card) I
> > get a
> > > > performance of about ~35ns/day in a system of about 300k atoms.
> Looking
> > > > into the usage of the video card during the simulation I notice that
> > the
> > > > card is being used about and ~80%.
> > > > The problems arise when I increase the number of jobs running at the
> > same
> > > > time. If for instance 2 jobs are running at the same time, the
> > > performance
> > > > drops to ~25ns/day each and the usage of the video cards also drops
> > > during
> > > > the simulation to about a ~30-40% (and sometimes dropping to less than
> > > 5%).
> > > > Clearly there is a communication problem between the gpu cards and the
> > > cpu
> > > > during the simulations, but I don’t know how to solve this.
> > > > Here is the script I use to run the simulations:
> > > >
> > > > #!/bin/bash -x
> > > > #SBATCH --job-name=testAtTPC1
> > > > #SBATCH --ntasks-per-node=4
> > > > #SBATCH --cpus-per-task=20
> > > > #SBATCH 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-29 Thread Justin Lemkul



On 7/29/19 8:46 AM, Carlos Navarro wrote:

Hi Mark,
I tried that before, but unfortunately in that case (removing —gres=gpu:1
and including in each line the -gpu_id flag) for some reason the jobs are
run one at a time (one after the other), so I can’t use properly the whole
node.



You need to run all but the last mdrun process in the background (&).
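A minimal sketch of what that looks like for the four runs (assuming the EXE
variable from the original script, -gpu_id instead of --gres as suggested
below, a thread-MPI build for -ntmpi, and pin offsets chosen so the four
20-thread runs do not overlap):

cd $WORKDIR1; $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 20 -pin on -pinoffset 0 -gpu_id 0 &> log &
cd $WORKDIR2; $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 20 -pin on -pinoffset 20 -gpu_id 1 &> log &
cd $WORKDIR3; $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 20 -pin on -pinoffset 40 -gpu_id 2 &> log &
cd $WORKDIR4; $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -ntomp 20 -pin on -pinoffset 60 -gpu_id 3 &> log &
wait   # keep the batch script alive until all four background runs have finished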

-Justin


——
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abra...@gmail.com)
wrote:

Hi,

When you use

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "

then the environment seems to make sure only one GPU is visible. (The log
files report only finding one GPU.) But it's probably the same GPU in each
case, with three remaining idle. I would suggest not using --gres unless
you can specify *which* of the four available GPUs each run can use.

Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.

$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
-ntomp 20 -gpu_id 0
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
-ntomp 20 -gpu_id 1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20
-ntomp 20 -gpu_id 2
etc.

Mark

On Mon, 29 Jul 2019 at 11:34, Carlos Navarro 
wrote:


Hi Szilárd,
To answer your questions:
**are you trying to run multiple simulations concurrently on the same
node or are you trying to strong-scale?
I'm trying to run multiple simulations on the same node at the same time.

** what are you simulating?
Regular and CompEl simulations

** can you provide log files of the runs?
In the following link are some logs files:
https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
In short, alone.log -> single run in the node (using 1 gpu).
multi1/2/3/4.log ->4 independent simulations ran at the same time in a
single node. In all cases, 20 cpus are used.
Best regards,
Carlos

El jue., 25 jul. 2019 a las 10:59, Szilárd Páll ()
escribió:


Hi,

It is not clear to me how are you trying to set up your runs, so
please provide some details:
- are you trying to run multiple simulations concurrently on the same
node or are you trying to strong-scale?
- what are you simulating?
- can you provide log files of the runs?

Cheers,

--
Szilárd

On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
 wrote:

No one can give me an idea of what can be happening? Or how I can

solve

it?

Best regards,
Carlos

——
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

On July 19, 2019 at 2:20:41 PM, Carlos Navarro (

carlos.navarr...@gmail.com)

wrote:

Dear gmx-users,
I’m currently working in a server where each node posses 40 physical

cores

(40 threads) and 4 Nvidia-V100.
When I launch a single job (1 simulation using a single gpu card) I

get a

performance of about ~35ns/day in a system of about 300k atoms.

Looking

into the usage of the video card during the simulation I notice that

the

card is being used about and ~80%.
The problems arise when I increase the number of jobs running at the

same

time. If for instance 2 jobs are running at the same time, the

performance

drops to ~25ns/day each and the usage of the video cards also drops

during

the simulation to about a ~30-40% (and sometimes dropping to less than

5%).

Clearly there is a communication problem between the gpu cards and the

cpu

during the simulations, but I don’t know how to solve this.
Here is the script I use to run the simulations:

#!/bin/bash -x
#SBATCH --job-name=testAtTPC1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --account=hdd22
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --output=sout.%j
#SBATCH --error=s4err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=develgpus
#SBATCH --gres=gpu:4

module use /gpfs/software/juwels/otherstages
module load Stages/2018b
module load Intel/2019.0.117-GCC-7.3.0
module load IntelMPI/2019.0.117
module load GROMACS/2018.3

WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
EXE=" gmx mdrun "

cd $WORKDIR1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset

0

-ntomp 20 &>log &
cd $WORKDIR2
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset

10

-ntomp 20 &>log &
cd $WORKDIR3
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset

20

-ntomp 20 &>log &
cd $WORKDIR4
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-29 Thread Carlos Navarro
Hi Mark,
I tried that before, but unfortunately in that case (removing --gres=gpu:1
and including the -gpu_id flag in each line) for some reason the jobs are
run one at a time (one after the other), so I can’t make proper use of the
whole node.


——
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

On July 29, 2019 at 11:48:21 AM, Mark Abraham (mark.j.abra...@gmail.com)
wrote:

Hi,

When you use

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "

then the environment seems to make sure only one GPU is visible. (The log
files report only finding one GPU.) But it's probably the same GPU in each
case, with three remaining idle. I would suggest not using --gres unless
you can specify *which* of the four available GPUs each run can use.

Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.

$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
-ntomp 20 -gpu_id 0
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
-ntomp 20 -gpu_id 1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 20
-ntomp 20 -gpu_id 2
etc.

Mark

On Mon, 29 Jul 2019 at 11:34, Carlos Navarro 
wrote:

> Hi Szilárd,
> To answer your questions:
> **are you trying to run multiple simulations concurrently on the same
> node or are you trying to strong-scale?
> I'm trying to run multiple simulations on the same node at the same time.
>
> ** what are you simulating?
> Regular and CompEl simulations
>
> ** can you provide log files of the runs?
> In the following link are some logs files:
> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> In short, alone.log -> single run in the node (using 1 gpu).
> multi1/2/3/4.log ->4 independent simulations ran at the same time in a
> single node. In all cases, 20 cpus are used.
> Best regards,
> Carlos
>
> El jue., 25 jul. 2019 a las 10:59, Szilárd Páll ()
> escribió:
>
> > Hi,
> >
> > It is not clear to me how are you trying to set up your runs, so
> > please provide some details:
> > - are you trying to run multiple simulations concurrently on the same
> > node or are you trying to strong-scale?
> > - what are you simulating?
> > - can you provide log files of the runs?
> >
> > Cheers,
> >
> > --
> > Szilárd
> >
> > On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> >  wrote:
> > >
> > > No one can give me an idea of what can be happening? Or how I can
solve
> > it?
> > > Best regards,
> > > Carlos
> > >
> > > ——
> > > Carlos Navarro Retamal
> > > Bioinformatic Engineering. PhD.
> > > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > > Simulations
> > > Universidad de Talca
> > > Av. Lircay S/N, Talca, Chile
> > > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> > >
> > > On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> > carlos.navarr...@gmail.com)
> > > wrote:
> > >
> > > Dear gmx-users,
> > > I’m currently working in a server where each node posses 40 physical
> > cores
> > > (40 threads) and 4 Nvidia-V100.
> > > When I launch a single job (1 simulation using a single gpu card) I
> get a
> > > performance of about ~35ns/day in a system of about 300k atoms.
Looking
> > > into the usage of the video card during the simulation I notice that
> the
> > > card is being used about and ~80%.
> > > The problems arise when I increase the number of jobs running at the
> same
> > > time. If for instance 2 jobs are running at the same time, the
> > performance
> > > drops to ~25ns/day each and the usage of the video cards also drops
> > during
> > > the simulation to about a ~30-40% (and sometimes dropping to less than
> > 5%).
> > > Clearly there is a communication problem between the gpu cards and the
> > cpu
> > > during the simulations, but I don’t know how to solve this.
> > > Here is the script I use to run the simulations:
> > >
> > > #!/bin/bash -x
> > > #SBATCH --job-name=testAtTPC1
> > > #SBATCH --ntasks-per-node=4
> > > #SBATCH --cpus-per-task=20
> > > #SBATCH --account=hdd22
> > > #SBATCH --nodes=1
> > > #SBATCH --mem=0
> > > #SBATCH --output=sout.%j
> > > #SBATCH --error=s4err.%j
> > > #SBATCH --time=00:10:00
> > > #SBATCH --partition=develgpus
> > > #SBATCH --gres=gpu:4
> > >
> > > module use /gpfs/software/juwels/otherstages
> > > module load Stages/2018b
> > > module load Intel/2019.0.117-GCC-7.3.0
> > > module load IntelMPI/2019.0.117
> > > module load GROMACS/2018.3
> > >
> > > WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> > > WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> > > WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> > > WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
> > >
> > > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> > > EXE=" gmx mdrun "
> > >
> > > cd $WORKDIR1
> > > $DO_PARALLEL $EXE -s eq6.tpr 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-29 Thread Mark Abraham
Hi,

When you use

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "

then the environment seems to make sure only one GPU is visible. (The log
files report only finding one GPU.) But it's probably the same GPU in each
case, with three remaining idle. I would suggest not using --gres unless
you can specify *which* of the four available GPUs each run can use.

Otherwise, don't use --gres and use the facilities built into GROMACS, e.g.

$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
-ntomp 20 -gpu_id 0
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
-ntomp 20 -gpu_id 1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20  -nmpi 1 -pin on -pinoffset 20
-ntomp 20 -gpu_id 2
etc.
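A sketch of the remaining line, following the same pattern (with -ntmpi rather
than -nmpi, as noted in the follow-up message, and backgrounded with & so the
four jobs actually run concurrently):

$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -ntmpi 1 -pin on -pinoffset 30 -ntomp 20 -gpu_id 3 &> log &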

Mark

On Mon, 29 Jul 2019 at 11:34, Carlos Navarro 
wrote:

> Hi Szilárd,
> To answer your questions:
> **are you trying to run multiple simulations concurrently on the same
> node or are you trying to strong-scale?
> I'm trying to run multiple simulations on the same node at the same time.
>
> ** what are you simulating?
> Regular and CompEl simulations
>
> ** can you provide log files of the runs?
> In the following link are some logs files:
> https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
> In short, alone.log -> single run in the node (using 1 gpu).
> multi1/2/3/4.log ->4 independent simulations ran at the same time in a
> single node. In all cases, 20 cpus are used.
> Best regards,
> Carlos
>
> El jue., 25 jul. 2019 a las 10:59, Szilárd Páll ()
> escribió:
>
> > Hi,
> >
> > It is not clear to me how are you trying to set up your runs, so
> > please provide some details:
> > - are you trying to run multiple simulations concurrently on the same
> > node or are you trying to strong-scale?
> > - what are you simulating?
> > - can you provide log files of the runs?
> >
> > Cheers,
> >
> > --
> > Szilárd
> >
> > On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
> >  wrote:
> > >
> > > No one can give me an idea of what can be happening? Or how I can solve
> > it?
> > > Best regards,
> > > Carlos
> > >
> > > ——
> > > Carlos Navarro Retamal
> > > Bioinformatic Engineering. PhD.
> > > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > > Simulations
> > > Universidad de Talca
> > > Av. Lircay S/N, Talca, Chile
> > > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> > >
> > > On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> > carlos.navarr...@gmail.com)
> > > wrote:
> > >
> > > Dear gmx-users,
> > > I’m currently working in a server where each node posses 40 physical
> > cores
> > > (40 threads) and 4 Nvidia-V100.
> > > When I launch a single job (1 simulation using a single gpu card) I
> get a
> > > performance of about ~35ns/day in a system of about 300k atoms. Looking
> > > into the usage of the video card during the simulation I notice that
> the
> > > card is being used about and ~80%.
> > > The problems arise when I increase the number of jobs running at the
> same
> > > time. If for instance 2 jobs are running at the same time, the
> > performance
> > > drops to ~25ns/day each and the usage of the video cards also drops
> > during
> > > the simulation to about a ~30-40% (and sometimes dropping to less than
> > 5%).
> > > Clearly there is a communication problem between the gpu cards and the
> > cpu
> > > during the simulations, but I don’t know how to solve this.
> > > Here is the script I use to run the simulations:
> > >
> > > #!/bin/bash -x
> > > #SBATCH --job-name=testAtTPC1
> > > #SBATCH --ntasks-per-node=4
> > > #SBATCH --cpus-per-task=20
> > > #SBATCH --account=hdd22
> > > #SBATCH --nodes=1
> > > #SBATCH --mem=0
> > > #SBATCH --output=sout.%j
> > > #SBATCH --error=s4err.%j
> > > #SBATCH --time=00:10:00
> > > #SBATCH --partition=develgpus
> > > #SBATCH --gres=gpu:4
> > >
> > > module use /gpfs/software/juwels/otherstages
> > > module load Stages/2018b
> > > module load Intel/2019.0.117-GCC-7.3.0
> > > module load IntelMPI/2019.0.117
> > > module load GROMACS/2018.3
> > >
> > > WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> > > WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> > > WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> > > WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
> > >
> > > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> > > EXE=" gmx mdrun "
> > >
> > > cd $WORKDIR1
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset
> 0
> > > -ntomp 20 &>log &
> > > cd $WORKDIR2
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset
> 10
> > > -ntomp 20 &>log &
> > > cd $WORKDIR3
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20  -nmpi 1 -pin on -pinoffset
> > 20
> > > -ntomp 20 &>log &
> > > cd $WORKDIR4
> > > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset
> 30
> > > -ntomp 20 &>log &
> > >
> > >
> > > Regarding to pinoffset, I first tried using 20 cores for each job but
> > then
> > > also tried with 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-29 Thread Carlos Navarro
Hi Szilárd,
To answer your questions:
**are you trying to run multiple simulations concurrently on the same
node or are you trying to strong-scale?
I'm trying to run multiple simulations on the same node at the same time.

** what are you simulating?
Regular and CompEl simulations

** can you provide log files of the runs?
Some log files are available at the following link:
https://www.dropbox.com/s/7q249vbqqwf5r03/Archive.zip?dl=0.
In short, alone.log -> a single run on the node (using 1 GPU);
multi1/2/3/4.log -> 4 independent simulations run at the same time on a
single node. In all cases, 20 CPUs are used.
Best regards,
Carlos

El jue., 25 jul. 2019 a las 10:59, Szilárd Páll ()
escribió:

> Hi,
>
> It is not clear to me how are you trying to set up your runs, so
> please provide some details:
> - are you trying to run multiple simulations concurrently on the same
> node or are you trying to strong-scale?
> - what are you simulating?
> - can you provide log files of the runs?
>
> Cheers,
>
> --
> Szilárd
>
> On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
>  wrote:
> >
> > No one can give me an idea of what can be happening? Or how I can solve
> it?
> > Best regards,
> > Carlos
> >
> > ——
> > Carlos Navarro Retamal
> > Bioinformatic Engineering. PhD.
> > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > Simulations
> > Universidad de Talca
> > Av. Lircay S/N, Talca, Chile
> > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> >
> > On July 19, 2019 at 2:20:41 PM, Carlos Navarro (
> carlos.navarr...@gmail.com)
> > wrote:
> >
> > Dear gmx-users,
> > I’m currently working in a server where each node posses 40 physical
> cores
> > (40 threads) and 4 Nvidia-V100.
> > When I launch a single job (1 simulation using a single gpu card) I get a
> > performance of about ~35ns/day in a system of about 300k atoms. Looking
> > into the usage of the video card during the simulation I notice that the
> > card is being used about and ~80%.
> > The problems arise when I increase the number of jobs running at the same
> > time. If for instance 2 jobs are running at the same time, the
> performance
> > drops to ~25ns/day each and the usage of the video cards also drops
> during
> > the simulation to about a ~30-40% (and sometimes dropping to less than
> 5%).
> > Clearly there is a communication problem between the gpu cards and the
> cpu
> > during the simulations, but I don’t know how to solve this.
> > Here is the script I use to run the simulations:
> >
> > #!/bin/bash -x
> > #SBATCH --job-name=testAtTPC1
> > #SBATCH --ntasks-per-node=4
> > #SBATCH --cpus-per-task=20
> > #SBATCH --account=hdd22
> > #SBATCH --nodes=1
> > #SBATCH --mem=0
> > #SBATCH --output=sout.%j
> > #SBATCH --error=s4err.%j
> > #SBATCH --time=00:10:00
> > #SBATCH --partition=develgpus
> > #SBATCH --gres=gpu:4
> >
> > module use /gpfs/software/juwels/otherstages
> > module load Stages/2018b
> > module load Intel/2019.0.117-GCC-7.3.0
> > module load IntelMPI/2019.0.117
> > module load GROMACS/2018.3
> >
> > WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> > WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> > WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> > WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
> >
> > DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> > EXE=" gmx mdrun "
> >
> > cd $WORKDIR1
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> > -ntomp 20 &>log &
> > cd $WORKDIR2
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
> > -ntomp 20 &>log &
> > cd $WORKDIR3
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20  -nmpi 1 -pin on -pinoffset
> 20
> > -ntomp 20 &>log &
> > cd $WORKDIR4
> > $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30
> > -ntomp 20 &>log &
> >
> >
> > Regarding to pinoffset, I first tried using 20 cores for each job but
> then
> > also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job 2,
> > pinoffset 8 for job 3 and pinoffset 12 for job) but at the end the
> problem
> > persist.
> >
> > Currently in this machine I’m not able to use more than 1 gpu per job, so
> > this is my only choice to use properly the whole node.
> > If you need more information please just let me know.
> > Best regards.
> > Carlos
> >
> > ——
> > Carlos Navarro Retamal
> > Bioinformatic Engineering. PhD.
> > Postdoctoral Researcher in Center of Bioinformatics and Molecular
> > Simulations
> > Universidad de Talca
> > Av. Lircay S/N, Talca, Chile
> > E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to 

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-25 Thread Szilárd Páll
Hi,

It is not clear to me how are you trying to set up your runs, so
please provide some details:
- are you trying to run multiple simulations concurrently on the same
node or are you trying to strong-scale?
- what are you simulating?
- can you provide log files of the runs?

Cheers,

--
Szilárd

On Tue, Jul 23, 2019 at 1:34 AM Carlos Navarro
 wrote:
>
> No one can give me an idea of what can be happening? Or how I can solve it?
> Best regards,
> Carlos
>
> ——
> Carlos Navarro Retamal
> Bioinformatic Engineering. PhD.
> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> Simulations
> Universidad de Talca
> Av. Lircay S/N, Talca, Chile
> E: carlos.navarr...@gmail.com or cnava...@utalca.cl
>
> On July 19, 2019 at 2:20:41 PM, Carlos Navarro (carlos.navarr...@gmail.com)
> wrote:
>
> Dear gmx-users,
> I’m currently working in a server where each node posses 40 physical cores
> (40 threads) and 4 Nvidia-V100.
> When I launch a single job (1 simulation using a single gpu card) I get a
> performance of about ~35ns/day in a system of about 300k atoms. Looking
> into the usage of the video card during the simulation I notice that the
> card is being used about and ~80%.
> The problems arise when I increase the number of jobs running at the same
> time. If for instance 2 jobs are running at the same time, the performance
> drops to ~25ns/day each and the usage of the video cards also drops during
> the simulation to about a ~30-40% (and sometimes dropping to less than 5%).
> Clearly there is a communication problem between the gpu cards and the cpu
> during the simulations, but I don’t know how to solve this.
> Here is the script I use to run the simulations:
>
> #!/bin/bash -x
> #SBATCH --job-name=testAtTPC1
> #SBATCH --ntasks-per-node=4
> #SBATCH --cpus-per-task=20
> #SBATCH --account=hdd22
> #SBATCH --nodes=1
> #SBATCH --mem=0
> #SBATCH --output=sout.%j
> #SBATCH --error=s4err.%j
> #SBATCH --time=00:10:00
> #SBATCH --partition=develgpus
> #SBATCH --gres=gpu:4
>
> module use /gpfs/software/juwels/otherstages
> module load Stages/2018b
> module load Intel/2019.0.117-GCC-7.3.0
> module load IntelMPI/2019.0.117
> module load GROMACS/2018.3
>
> WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
> WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
> WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
> WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4
>
> DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
> EXE=" gmx mdrun "
>
> cd $WORKDIR1
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
> -ntomp 20 &>log &
> cd $WORKDIR2
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
> -ntomp 20 &>log &
> cd $WORKDIR3
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20  -nmpi 1 -pin on -pinoffset 20
> -ntomp 20 &>log &
> cd $WORKDIR4
> $DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30
> -ntomp 20 &>log &
>
>
> Regarding to pinoffset, I first tried using 20 cores for each job but then
> also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job 2,
> pinoffset 8 for job 3 and pinoffset 12 for job) but at the end the problem
> persist.
>
> Currently in this machine I’m not able to use more than 1 gpu per job, so
> this is my only choice to use properly the whole node.
> If you need more information please just let me know.
> Best regards.
> Carlos
>
> ——
> Carlos Navarro Retamal
> Bioinformatic Engineering. PhD.
> Postdoctoral Researcher in Center of Bioinformatics and Molecular
> Simulations
> Universidad de Talca
> Av. Lircay S/N, Talca, Chile
> E: carlos.navarr...@gmail.com or cnava...@utalca.cl
> --
> Gromacs Users mailing list
>
> * Please search the archive at 
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
> mail to gmx-users-requ...@gromacs.org.
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-22 Thread Carlos Navarro
No one can give me an idea of what can be happening? Or how I can solve it?
Best regards,
Carlos

——
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

On July 19, 2019 at 2:20:41 PM, Carlos Navarro (carlos.navarr...@gmail.com)
wrote:

Dear gmx-users,
I’m currently working on a server where each node possesses 40 physical cores
(40 threads) and 4 Nvidia V100 GPUs.
When I launch a single job (1 simulation using a single GPU card) I get a
performance of about ~35 ns/day for a system of about 300k atoms. Looking at
the usage of the video card during the simulation, I notice that the card is
being used at about ~80%.
The problems arise when I increase the number of jobs running at the same
time. If, for instance, 2 jobs are running at the same time, the performance
drops to ~25 ns/day each and the usage of the video cards also drops during
the simulation to about ~30-40% (sometimes dropping to less than 5%).
Clearly there is a communication problem between the GPU cards and the CPU
during the simulations, but I don’t know how to solve this.
Here is the script I use to run the simulations:

#!/bin/bash -x
#SBATCH --job-name=testAtTPC1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --account=hdd22
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --output=sout.%j
#SBATCH --error=s4err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=develgpus
#SBATCH --gres=gpu:4

module use /gpfs/software/juwels/otherstages
module load Stages/2018b
module load Intel/2019.0.117-GCC-7.3.0
module load IntelMPI/2019.0.117
module load GROMACS/2018.3

WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
EXE=" gmx mdrun "

cd $WORKDIR1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
-ntomp 20 &>log &
cd $WORKDIR2
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
-ntomp 20 &>log &
cd $WORKDIR3
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20  -nmpi 1 -pin on -pinoffset 20
-ntomp 20 &>log &
cd $WORKDIR4
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30
-ntomp 20 &>log &


Regarding the pinoffset, I first tried using 20 cores for each job, but then
also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job 2,
pinoffset 8 for job 3 and pinoffset 12 for job 4), but in the end the problem
persists.

Currently on this machine I’m not able to use more than 1 GPU per job, so
this is my only way to use the whole node properly.
If you need more information, please just let me know.
Best regards.
Carlos

——
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl

[gmx-users] performance issues running gromacs with more than 1 gpu card in slurm

2019-07-19 Thread Carlos Navarro
Dear gmx-users,
I’m currently working on a server where each node has 40 physical cores
(40 threads) and 4 NVIDIA V100s.
When I launch a single job (1 simulation using a single GPU card) I get a
performance of about ~35 ns/day for a system of about 300k atoms. Looking
at the usage of the video card during the simulation, I notice that the
card is being used at about 80%.
The problems arise when I increase the number of jobs running at the same
time. If, for instance, 2 jobs are running at the same time, the performance
drops to ~25 ns/day each and the usage of the video cards also drops during
the simulation to about 30-40% (sometimes dropping to less than 5%).
Clearly there is a communication problem between the GPU cards and the CPU
during the simulations, but I don’t know how to solve this.
Here is the script I use to run the simulations:

#!/bin/bash -x
#SBATCH --job-name=testAtTPC1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=20
#SBATCH --account=hdd22
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --output=sout.%j
#SBATCH --error=s4err.%j
#SBATCH --time=00:10:00
#SBATCH --partition=develgpus
#SBATCH --gres=gpu:4

module use /gpfs/software/juwels/otherstages
module load Stages/2018b
module load Intel/2019.0.117-GCC-7.3.0
module load IntelMPI/2019.0.117
module load GROMACS/2018.3

WORKDIR1=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/1
WORKDIR2=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/2
WORKDIR3=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/3
WORKDIR4=/p/project/chdd22/gromacs/benchmark/AtTPC1/singlegpu/4

DO_PARALLEL=" srun --exclusive -n 1 --gres=gpu:1 "
EXE=" gmx mdrun "

cd $WORKDIR1
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 0
-ntomp 20 &>log &
cd $WORKDIR2
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 10
-ntomp 20 &>log &
cd $WORKDIR3
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20  -nmpi 1 -pin on -pinoffset 20
-ntomp 20 &>log &
cd $WORKDIR4
$DO_PARALLEL $EXE -s eq6.tpr -deffnm eq6-20 -nmpi 1 -pin on -pinoffset 30
-ntomp 20 &>log &


Regarding the pinoffset, I first tried using 20 cores for each job, but then
also tried with 8 cores (so pinoffset 0 for job 1, pinoffset 4 for job 2,
pinoffset 8 for job 3 and pinoffset 12 for job 4), but in the end the problem
persists.

Currently on this machine I’m not able to use more than 1 GPU per job, so
this is my only way to use the whole node properly.
If you need more information, please just let me know.
Best regards.
Carlos

——
Carlos Navarro Retamal
Bioinformatic Engineering. PhD.
Postdoctoral Researcher in Center of Bioinformatics and Molecular
Simulations
Universidad de Talca
Av. Lircay S/N, Talca, Chile
E: carlos.navarr...@gmail.com or cnava...@utalca.cl
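
As a side note on layouts like the one above: a minimal sketch (not from this thread) of how four concurrent single-GPU runs could be kept on separate cores and separate cards on a 40-core, 4-GPU node. The per-run directories, the eq6-10 naming and the 10-threads-per-run split are only illustrative assumptions:

# Hypothetical launcher: one GPU and ten non-overlapping cores per run.
for i in 0 1 2 3; do
  cd "$WORKDIR/run$i"                        # placeholder per-run directory
  gmx mdrun -s eq6.tpr -deffnm eq6-10 \
      -ntmpi 1 -ntomp 10 \
      -pin on -pinoffset $((i * 10)) -pinstride 1 \
      -gpu_id $i &> log &                    # pin to cores i*10..i*10+9, use GPU i
  cd - > /dev/null
done
wait                                         # block until all four runs finish

With -ntomp 10 and -pinoffset 0/10/20/30 the four runs together cover exactly the 40 physical cores, instead of the 4 x 20 = 80 pinned threads the script above requests; -gpu_id (or CUDA_VISIBLE_DEVICES) keeps each run on its own card.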

Re: [gmx-users] Performance of GROMACS on GPU's on ORNL Titan?

2019-02-13 Thread pbuscemi
For what it is worth: on our 32-core AMD 2990WX with 2x 2080 Ti we can run 100k
atoms at ~100 ns/day NVT and ~150 ns/day NPT, so 8-10 days to get that
microsecond. I'm curious to learn what kind of results you might obtain
from Oak Ridge and whether the cost/clock-time analysis makes it worthwhile.

Paul

-Original Message-
From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se
 On Behalf Of Michael
Shirts
Sent: Wednesday, February 13, 2019 1:28 PM
To: Discussion list for GROMACS users ; Michael R
Shirts 
Subject: [gmx-users] Performance of GROMACS on GPU's on ORNL Titan?

Does anyone have experience running GROMACS on GPU's on Oak Ridge National
Labs Titan or Summit machines, especially parallelization over multiple
GPUs? I'm looking at applying for allocations there, and am interested in
experiences that people have had. We're probably mostly looking at systems
in the 100-200K atoms range, but we need to get to long timescales (multiple
microseconds, at least) for some of the phenomena we are looking at.

Thanks!


[gmx-users] Performance of GROMACS on GPU's on ORNL Titan?

2019-02-13 Thread Michael Shirts
Does anyone have experience running GROMACS on GPU's on Oak Ridge National
Labs Titan or Summit machines, especially parallelization over multiple
GPUs? I'm looking at applying for allocations there, and am interested in
experiences that people have had. We're probably mostly looking at systems
in the 100-200K atoms range, but we need to get to long timescales
(multiple microseconds, at least) for some of the phenomena we are looking
at.

Thanks!


Re: [gmx-users] Performance

2018-04-03 Thread Szilárd Páll
Hi,

Your system is exploding: some atoms end up with coordinates of around
10^9, which then throws off the PBC code that tries to put atoms back in
the box. This will normally not happen, as constraining will already
fail with such huge coordinates, I think, so technically it is a bug;
we could handle this corner case better.

However, you need to verify your system setup, as it is unstable
(not well equilibrated, or the time step is too long).

--
Szilárd


On Thu, Mar 29, 2018 at 3:50 PM, Myunggi Yi  wrote:
> Dear Szilard,
>
> Can you run this simulation?
>
> The simulation doesn't crash and doesn't generate an error message.
> It takes forever without updating the log file or the other output files.
>
> Is this a bug?
>
>
>
> On Thu, Mar 29, 2018 at 7:58 AM, Szilárd Páll 
> wrote:
>
>> Thanks. Looks like the messages and error handling is somewhat
>> confusing; you must have the OMP_NUM_THREADS environment variable set
>> which (just as setting -ntomp), without setting -ntmpi too is not
>> supported.
>>
>> Either let mdrun decide about the thread count or set -ntmpi manually.
>>
>> --
>> Szilárd
>>
>>
>> On Wed, Mar 28, 2018 at 7:10 PM, Myunggi Yi  wrote:
>> > Does it work?
>> >
>> > https://drive.google.com/open?id=1n5m1tNGbnV7oZnuAEgZ7gSP6qA6HluNl
>> >
>> > How about this?
>> >
>> >
>> > Myunggi Yi
>> >
>> > On Wed, Mar 28, 2018 at 12:20 PM, Mark Abraham > >
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> Attachments can't be accepted on the list - please upload to a file
>> sharing
>> >> service and share links to those.
>> >>
>> >> Mark
>> >>
>> >> On Wed, Mar 28, 2018 at 6:16 PM Myunggi Yi 
>> wrote:
>> >>
>> >> > I am attaching the file.
>> >> >
>> >> > Thank you.
>> >> >
>> >> > Myunggi Yi
>> >> >
>> >> > On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll <
>> pall.szil...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > Again, please share the exact log files / description of inputs.
>> What
>> >> > > does "bad performance" mean?
>> >> > > --
>> >> > > Szilárd
>> >> > >
>> >> > >
>> >> > > On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi 
>> >> > wrote:
>> >> > > > Dear users,
>> >> > > >
>> >> > > > I have two questions.
>> >> > > >
>> >> > > >
>> >> > > > 1. I used to run typical simulations with the following command.
>> >> > > >
>> >> > > > gmx mdrun -deffnm md
>> >> > > >
>> >> > > > I had no problem.
>> >> > > >
>> >> > > >
>> >> > > > Now I am running a simulation with "Dry_Martini" FF with the
>> >> following
>> >> > > > input.
>> >> > > >
>> >> > > >
>> >> > > > integrator   = sd
>> >> > > > tinit= 0.0
>> >> > > > dt   = 0.040
>> >> > > > nsteps   = 100
>> >> > > >
>> >> > > > nstlog   = 5000
>> >> > > > nstenergy= 5000
>> >> > > > nstxout-compressed   = 5000
>> >> > > > compressed-x-precision   = 100
>> >> > > >
>> >> > > > cutoff-scheme= Verlet
>> >> > > > nstlist  = 10
>> >> > > > ns_type  = grid
>> >> > > > pbc  = xyz
>> >> > > > verlet-buffer-tolerance  = 0.005
>> >> > > >
>> >> > > > epsilon_r= 15
>> >> > > > coulombtype  = reaction-field
>> >> > > > rcoulomb = 1.1
>> >> > > > vdw_type = cutoff
>> >> > > > vdw-modifier = Potential-shift-verlet
>> >> > > > rvdw = 1.1
>> >> > > >
>> >> > > > tc-grps  = system
>> >> > > > tau_t= 4.0
>> >> > > > ref_t= 310
>> >> > > >
>> >> > > > ; Pressure coupling:
>> >> > > > Pcoupl   = no
>> >> > > >
>> >> > > > ; GENERATE VELOCITIES FOR STARTUP RUN:
>> >> > > > gen_vel  = yes
>> >> > > > gen_temp = 310
>> >> > > > gen_seed = 1521731368
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > If I use the same command to submit the job.
>> >> > > > I got the following error. I don't know why.
>> >> > > >
>> >> > > > ---
>> >> > > > Program: gmx mdrun, version 2018.1
>> >> > > > Source file: src/gromacs/taskassignment/resourcedivision.cpp
>> (line
>> >> 224)
>> >> > > >
>> >> > > > Fatal error:
>> >> > > > When using GPUs, setting the number of OpenMP threads without
>> >> > specifying
>> >> > > the
>> >> > > > number of ranks can lead to conflicting demands. Please specify
>> the
>> >> > > number
>> >> > > > of
>> >> > > > thread-MPI ranks as well (option -ntmpi).
>> >> > > >
>> >> > > > For more information and tips for troubleshooting, please check
>> the
>> >> > > GROMACS
>> >> > > > website at http://www.gromacs.org/Documentation/Errors
>> >> > > > ---
>> >> > > >
>> >> > > >
>> >> > > > So I did run simulation with the following command.
>> >> > > >
>> >> > > 

Re: [gmx-users] Performance

2018-03-29 Thread Myunggi Yi
Dear Szilard,

Can you run this simulation?

The simulation doesn't crash and doesn't generate an error message.
It takes forever without updating the log file or the other output files.

Is this a bug?



On Thu, Mar 29, 2018 at 7:58 AM, Szilárd Páll 
wrote:

> Thanks. Looks like the messages and error handling is somewhat
> confusing; you must have the OMP_NUM_THREADS environment variable set
> which (just as setting -ntomp), without setting -ntmpi too is not
> supported.
>
> Either let mdrun decide about the thread count or set -ntmpi manually.
>
> --
> Szilárd
>
>
> On Wed, Mar 28, 2018 at 7:10 PM, Myunggi Yi  wrote:
> > Does it work?
> >
> > https://drive.google.com/open?id=1n5m1tNGbnV7oZnuAEgZ7gSP6qA6HluNl
> >
> > How about this?
> >
> >
> > Myunggi Yi
> >
> > On Wed, Mar 28, 2018 at 12:20 PM, Mark Abraham  >
> > wrote:
> >
> >> Hi,
> >>
> >> Attachments can't be accepted on the list - please upload to a file
> sharing
> >> service and share links to those.
> >>
> >> Mark
> >>
> >> On Wed, Mar 28, 2018 at 6:16 PM Myunggi Yi 
> wrote:
> >>
> >> > I am attaching the file.
> >> >
> >> > Thank you.
> >> >
> >> > Myunggi Yi
> >> >
> >> > On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll <
> pall.szil...@gmail.com>
> >> > wrote:
> >> >
> >> > > Again, please share the exact log files / description of inputs.
> What
> >> > > does "bad performance" mean?
> >> > > --
> >> > > Szilárd
> >> > >
> >> > >
> >> > > On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi 
> >> > wrote:
> >> > > > Dear users,
> >> > > >
> >> > > > I have two questions.
> >> > > >
> >> > > >
> >> > > > 1. I used to run typical simulations with the following command.
> >> > > >
> >> > > > gmx mdrun -deffnm md
> >> > > >
> >> > > > I had no problem.
> >> > > >
> >> > > >
> >> > > > Now I am running a simulation with "Dry_Martini" FF with the
> >> following
> >> > > > input.
> >> > > >
> >> > > >
> >> > > > integrator   = sd
> >> > > > tinit= 0.0
> >> > > > dt   = 0.040
> >> > > > nsteps   = 100
> >> > > >
> >> > > > nstlog   = 5000
> >> > > > nstenergy= 5000
> >> > > > nstxout-compressed   = 5000
> >> > > > compressed-x-precision   = 100
> >> > > >
> >> > > > cutoff-scheme= Verlet
> >> > > > nstlist  = 10
> >> > > > ns_type  = grid
> >> > > > pbc  = xyz
> >> > > > verlet-buffer-tolerance  = 0.005
> >> > > >
> >> > > > epsilon_r= 15
> >> > > > coulombtype  = reaction-field
> >> > > > rcoulomb = 1.1
> >> > > > vdw_type = cutoff
> >> > > > vdw-modifier = Potential-shift-verlet
> >> > > > rvdw = 1.1
> >> > > >
> >> > > > tc-grps  = system
> >> > > > tau_t= 4.0
> >> > > > ref_t= 310
> >> > > >
> >> > > > ; Pressure coupling:
> >> > > > Pcoupl   = no
> >> > > >
> >> > > > ; GENERATE VELOCITIES FOR STARTUP RUN:
> >> > > > gen_vel  = yes
> >> > > > gen_temp = 310
> >> > > > gen_seed = 1521731368
> >> > > >
> >> > > >
> >> > > >
> >> > > > If I use the same command to submit the job.
> >> > > > I got the following error. I don't know why.
> >> > > >
> >> > > > ---
> >> > > > Program: gmx mdrun, version 2018.1
> >> > > > Source file: src/gromacs/taskassignment/resourcedivision.cpp
> (line
> >> 224)
> >> > > >
> >> > > > Fatal error:
> >> > > > When using GPUs, setting the number of OpenMP threads without
> >> > specifying
> >> > > the
> >> > > > number of ranks can lead to conflicting demands. Please specify
> the
> >> > > number
> >> > > > of
> >> > > > thread-MPI ranks as well (option -ntmpi).
> >> > > >
> >> > > > For more information and tips for troubleshooting, please check
> the
> >> > > GROMACS
> >> > > > website at http://www.gromacs.org/Documentation/Errors
> >> > > > ---
> >> > > >
> >> > > >
> >> > > > So I did run simulation with the following command.
> >> > > >
> >> > > >
> >> > > > gmx mdrun -deffnm md -ntmpi 1
> >> > > >
> >> > > >
> >> > > > Now the performance is extremely bad.
> >> > > > Since yesterday, the log file still reporting the first step's
> >> energy.
> >> > > >
> >> > > > 2. This is the second question. Why?
> >> > > >
> >> > > > Can anyone help?
> >> > > >
> >> > > >
> >> > > > Myunggi Yi
> >> > > > --
> >> > > > Gromacs Users mailing list
> >> > > >
> >> > > > * Please search the archive at http://www.gromacs.org/Support
> >> > > /Mailing_Lists/GMX-Users_List before posting!
> >> > > >
> >> > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >> > > >
> >> > > > * For (un)subscribe requests visit
> >> > > > 

Re: [gmx-users] Performance

2018-03-29 Thread Myunggi Yi
Thanks.

I used exactly the same program (same installation and same environment
variables).

How come this error depends on the *.mdp file?

I don't get it with typical simulations such as md with PME etc.,
but I get it with this Dry_Martini mdp file (using sd, reaction field, etc.).



Myunggi Yi

On Thu, Mar 29, 2018 at 7:58 AM, Szilárd Páll 
wrote:

> Thanks. Looks like the messages and error handling is somewhat
> confusing; you must have the OMP_NUM_THREADS environment variable set
> which (just as setting -ntomp), without setting -ntmpi too is not
> supported.
>
> Either let mdrun decide about the thread count or set -ntmpi manually.
>
> --
> Szilárd
>
>
> On Wed, Mar 28, 2018 at 7:10 PM, Myunggi Yi  wrote:
> > Does it work?
> >
> > https://drive.google.com/open?id=1n5m1tNGbnV7oZnuAEgZ7gSP6qA6HluNl
> >
> > How about this?
> >
> >
> > Myunggi Yi
> >
> > On Wed, Mar 28, 2018 at 12:20 PM, Mark Abraham  >
> > wrote:
> >
> >> Hi,
> >>
> >> Attachments can't be accepted on the list - please upload to a file
> sharing
> >> service and share links to those.
> >>
> >> Mark
> >>
> >> On Wed, Mar 28, 2018 at 6:16 PM Myunggi Yi 
> wrote:
> >>
> >> > I am attaching the file.
> >> >
> >> > Thank you.
> >> >
> >> > Myunggi Yi
> >> >
> >> > On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll <
> pall.szil...@gmail.com>
> >> > wrote:
> >> >
> >> > > Again, please share the exact log files / description of inputs.
> What
> >> > > does "bad performance" mean?
> >> > > --
> >> > > Szilárd
> >> > >
> >> > >
> >> > > On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi 
> >> > wrote:
> >> > > > Dear users,
> >> > > >
> >> > > > I have two questions.
> >> > > >
> >> > > >
> >> > > > 1. I used to run typical simulations with the following command.
> >> > > >
> >> > > > gmx mdrun -deffnm md
> >> > > >
> >> > > > I had no problem.
> >> > > >
> >> > > >
> >> > > > Now I am running a simulation with "Dry_Martini" FF with the
> >> following
> >> > > > input.
> >> > > >
> >> > > >
> >> > > > integrator   = sd
> >> > > > tinit= 0.0
> >> > > > dt   = 0.040
> >> > > > nsteps   = 100
> >> > > >
> >> > > > nstlog   = 5000
> >> > > > nstenergy= 5000
> >> > > > nstxout-compressed   = 5000
> >> > > > compressed-x-precision   = 100
> >> > > >
> >> > > > cutoff-scheme= Verlet
> >> > > > nstlist  = 10
> >> > > > ns_type  = grid
> >> > > > pbc  = xyz
> >> > > > verlet-buffer-tolerance  = 0.005
> >> > > >
> >> > > > epsilon_r= 15
> >> > > > coulombtype  = reaction-field
> >> > > > rcoulomb = 1.1
> >> > > > vdw_type = cutoff
> >> > > > vdw-modifier = Potential-shift-verlet
> >> > > > rvdw = 1.1
> >> > > >
> >> > > > tc-grps  = system
> >> > > > tau_t= 4.0
> >> > > > ref_t= 310
> >> > > >
> >> > > > ; Pressure coupling:
> >> > > > Pcoupl   = no
> >> > > >
> >> > > > ; GENERATE VELOCITIES FOR STARTUP RUN:
> >> > > > gen_vel  = yes
> >> > > > gen_temp = 310
> >> > > > gen_seed = 1521731368
> >> > > >
> >> > > >
> >> > > >
> >> > > > If I use the same command to submit the job.
> >> > > > I got the following error. I don't know why.
> >> > > >
> >> > > > ---
> >> > > > Program: gmx mdrun, version 2018.1
> >> > > > Source file: src/gromacs/taskassignment/resourcedivision.cpp
> (line
> >> 224)
> >> > > >
> >> > > > Fatal error:
> >> > > > When using GPUs, setting the number of OpenMP threads without
> >> > specifying
> >> > > the
> >> > > > number of ranks can lead to conflicting demands. Please specify
> the
> >> > > number
> >> > > > of
> >> > > > thread-MPI ranks as well (option -ntmpi).
> >> > > >
> >> > > > For more information and tips for troubleshooting, please check
> the
> >> > > GROMACS
> >> > > > website at http://www.gromacs.org/Documentation/Errors
> >> > > > ---
> >> > > >
> >> > > >
> >> > > > So I did run simulation with the following command.
> >> > > >
> >> > > >
> >> > > > gmx mdrun -deffnm md -ntmpi 1
> >> > > >
> >> > > >
> >> > > > Now the performance is extremely bad.
> >> > > > Since yesterday, the log file still reporting the first step's
> >> energy.
> >> > > >
> >> > > > 2. This is the second question. Why?
> >> > > >
> >> > > > Can anyone help?
> >> > > >
> >> > > >
> >> > > > Myunggi Yi
> >> > > > --
> >> > > > Gromacs Users mailing list
> >> > > >
> >> > > > * Please search the archive at http://www.gromacs.org/Support
> >> > > /Mailing_Lists/GMX-Users_List before posting!
> >> > > >
> >> > > > * Can't post? Read 

Re: [gmx-users] Performance

2018-03-29 Thread Szilárd Páll
Thanks. Looks like the messages and error handling are somewhat
confusing; you must have the OMP_NUM_THREADS environment variable set,
which (just like setting -ntomp) is not supported without also setting
-ntmpi.

Either let mdrun decide about the thread count or set -ntmpi manually.

--
Szilárd
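
As a minimal illustration of those two options (the -deffnm name and the 2 x 4 thread split below are just example values, not a recommendation for this particular system):

# Option 1: clear the environment override and let mdrun pick ranks/threads.
unset OMP_NUM_THREADS
gmx mdrun -deffnm md

# Option 2: keep an explicit OpenMP thread count, but then also fix the
# number of thread-MPI ranks so the two settings cannot conflict.
gmx mdrun -deffnm md -ntmpi 2 -ntomp 4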


On Wed, Mar 28, 2018 at 7:10 PM, Myunggi Yi  wrote:
> Does it work?
>
> https://drive.google.com/open?id=1n5m1tNGbnV7oZnuAEgZ7gSP6qA6HluNl
>
> How about this?
>
>
> Myunggi Yi
>
> On Wed, Mar 28, 2018 at 12:20 PM, Mark Abraham 
> wrote:
>
>> Hi,
>>
>> Attachments can't be accepted on the list - please upload to a file sharing
>> service and share links to those.
>>
>> Mark
>>
>> On Wed, Mar 28, 2018 at 6:16 PM Myunggi Yi  wrote:
>>
>> > I am attaching the file.
>> >
>> > Thank you.
>> >
>> > Myunggi Yi
>> >
>> > On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll 
>> > wrote:
>> >
>> > > Again, please share the exact log files / description of inputs. What
>> > > does "bad performance" mean?
>> > > --
>> > > Szilárd
>> > >
>> > >
>> > > On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi 
>> > wrote:
>> > > > Dear users,
>> > > >
>> > > > I have two questions.
>> > > >
>> > > >
>> > > > 1. I used to run typical simulations with the following command.
>> > > >
>> > > > gmx mdrun -deffnm md
>> > > >
>> > > > I had no problem.
>> > > >
>> > > >
>> > > > Now I am running a simulation with "Dry_Martini" FF with the
>> following
>> > > > input.
>> > > >
>> > > >
>> > > > integrator   = sd
>> > > > tinit= 0.0
>> > > > dt   = 0.040
>> > > > nsteps   = 100
>> > > >
>> > > > nstlog   = 5000
>> > > > nstenergy= 5000
>> > > > nstxout-compressed   = 5000
>> > > > compressed-x-precision   = 100
>> > > >
>> > > > cutoff-scheme= Verlet
>> > > > nstlist  = 10
>> > > > ns_type  = grid
>> > > > pbc  = xyz
>> > > > verlet-buffer-tolerance  = 0.005
>> > > >
>> > > > epsilon_r= 15
>> > > > coulombtype  = reaction-field
>> > > > rcoulomb = 1.1
>> > > > vdw_type = cutoff
>> > > > vdw-modifier = Potential-shift-verlet
>> > > > rvdw = 1.1
>> > > >
>> > > > tc-grps  = system
>> > > > tau_t= 4.0
>> > > > ref_t= 310
>> > > >
>> > > > ; Pressure coupling:
>> > > > Pcoupl   = no
>> > > >
>> > > > ; GENERATE VELOCITIES FOR STARTUP RUN:
>> > > > gen_vel  = yes
>> > > > gen_temp = 310
>> > > > gen_seed = 1521731368
>> > > >
>> > > >
>> > > >
>> > > > If I use the same command to submit the job.
>> > > > I got the following error. I don't know why.
>> > > >
>> > > > ---
>> > > > Program: gmx mdrun, version 2018.1
>> > > > Source file: src/gromacs/taskassignment/resourcedivision.cpp (line
>> 224)
>> > > >
>> > > > Fatal error:
>> > > > When using GPUs, setting the number of OpenMP threads without
>> > specifying
>> > > the
>> > > > number of ranks can lead to conflicting demands. Please specify the
>> > > number
>> > > > of
>> > > > thread-MPI ranks as well (option -ntmpi).
>> > > >
>> > > > For more information and tips for troubleshooting, please check the
>> > > GROMACS
>> > > > website at http://www.gromacs.org/Documentation/Errors
>> > > > ---
>> > > >
>> > > >
>> > > > So I did run simulation with the following command.
>> > > >
>> > > >
>> > > > gmx mdrun -deffnm md -ntmpi 1
>> > > >
>> > > >
>> > > > Now the performance is extremely bad.
>> > > > Since yesterday, the log file still reporting the first step's
>> energy.
>> > > >
>> > > > 2. This is the second question. Why?
>> > > >
>> > > > Can anyone help?
>> > > >
>> > > >
>> > > > Myunggi Yi
>> > > > --
>> > > > Gromacs Users mailing list
>> > > >
>> > > > * Please search the archive at http://www.gromacs.org/Support
>> > > /Mailing_Lists/GMX-Users_List before posting!
>> > > >
>> > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> > > >
>> > > > * For (un)subscribe requests visit
>> > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
>> or
>> > > send a mail to gmx-users-requ...@gromacs.org.
>> > > --
>> > > Gromacs Users mailing list
>> > >
>> > > * Please search the archive at http://www.gromacs.org/Support
>> > > /Mailing_Lists/GMX-Users_List before posting!
>> > >
>> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> > >
>> > > * For (un)subscribe requests visit
>> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> > > send a mail to gmx-users-requ...@gromacs.org.
>> > --
>> > Gromacs Users mailing list
>> >
>> 

Re: [gmx-users] Performance

2018-03-28 Thread Myunggi Yi
Does it work?

https://drive.google.com/open?id=1n5m1tNGbnV7oZnuAEgZ7gSP6qA6HluNl

How about this?


Myunggi Yi

On Wed, Mar 28, 2018 at 12:20 PM, Mark Abraham 
wrote:

> Hi,
>
> Attachments can't be accepted on the list - please upload to a file sharing
> service and share links to those.
>
> Mark
>
> On Wed, Mar 28, 2018 at 6:16 PM Myunggi Yi  wrote:
>
> > I am attaching the file.
> >
> > Thank you.
> >
> > Myunggi Yi
> >
> > On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll 
> > wrote:
> >
> > > Again, please share the exact log files / description of inputs. What
> > > does "bad performance" mean?
> > > --
> > > Szilárd
> > >
> > >
> > > On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi 
> > wrote:
> > > > Dear users,
> > > >
> > > > I have two questions.
> > > >
> > > >
> > > > 1. I used to run typical simulations with the following command.
> > > >
> > > > gmx mdrun -deffnm md
> > > >
> > > > I had no problem.
> > > >
> > > >
> > > > Now I am running a simulation with "Dry_Martini" FF with the
> following
> > > > input.
> > > >
> > > >
> > > > integrator   = sd
> > > > tinit= 0.0
> > > > dt   = 0.040
> > > > nsteps   = 100
> > > >
> > > > nstlog   = 5000
> > > > nstenergy= 5000
> > > > nstxout-compressed   = 5000
> > > > compressed-x-precision   = 100
> > > >
> > > > cutoff-scheme= Verlet
> > > > nstlist  = 10
> > > > ns_type  = grid
> > > > pbc  = xyz
> > > > verlet-buffer-tolerance  = 0.005
> > > >
> > > > epsilon_r= 15
> > > > coulombtype  = reaction-field
> > > > rcoulomb = 1.1
> > > > vdw_type = cutoff
> > > > vdw-modifier = Potential-shift-verlet
> > > > rvdw = 1.1
> > > >
> > > > tc-grps  = system
> > > > tau_t= 4.0
> > > > ref_t= 310
> > > >
> > > > ; Pressure coupling:
> > > > Pcoupl   = no
> > > >
> > > > ; GENERATE VELOCITIES FOR STARTUP RUN:
> > > > gen_vel  = yes
> > > > gen_temp = 310
> > > > gen_seed = 1521731368
> > > >
> > > >
> > > >
> > > > If I use the same command to submit the job.
> > > > I got the following error. I don't know why.
> > > >
> > > > ---
> > > > Program: gmx mdrun, version 2018.1
> > > > Source file: src/gromacs/taskassignment/resourcedivision.cpp (line
> 224)
> > > >
> > > > Fatal error:
> > > > When using GPUs, setting the number of OpenMP threads without
> > specifying
> > > the
> > > > number of ranks can lead to conflicting demands. Please specify the
> > > number
> > > > of
> > > > thread-MPI ranks as well (option -ntmpi).
> > > >
> > > > For more information and tips for troubleshooting, please check the
> > > GROMACS
> > > > website at http://www.gromacs.org/Documentation/Errors
> > > > ---
> > > >
> > > >
> > > > So I did run simulation with the following command.
> > > >
> > > >
> > > > gmx mdrun -deffnm md -ntmpi 1
> > > >
> > > >
> > > > Now the performance is extremely bad.
> > > > Since yesterday, the log file still reporting the first step's
> energy.
> > > >
> > > > 2. This is the second question. Why?
> > > >
> > > > Can anyone help?
> > > >
> > > >
> > > > Myunggi Yi
> > > > --
> > > > Gromacs Users mailing list
> > > >
> > > > * Please search the archive at http://www.gromacs.org/Support
> > > /Mailing_Lists/GMX-Users_List before posting!
> > > >
> > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > >
> > > > * For (un)subscribe requests visit
> > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or
> > > send a mail to gmx-users-requ...@gromacs.org.
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at http://www.gromacs.org/Support
> > > /Mailing_Lists/GMX-Users_List before posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > > send a mail to gmx-users-requ...@gromacs.org.
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-requ...@gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read 

Re: [gmx-users] Performance

2018-03-28 Thread Myunggi Yi
I see.

I am trying again.

Myunggi Yi  (이명기, Ph. D.), Professor

Department of Biomedical Engineering (의공학과 bme.pknu.ac.kr), College of
Engineering
Interdisciplinary Program of Biomedical Mechanical & Electrical Engineering
Center for Marine-Integrated Biomedical Technology (BK21+)
College of Engineering
Pukyong National University (부경대학교 www.pknu.ac.kr)
45 Yongso-ro, Nam-gu (남구 용소로 45)
Busan, 48513, South Korea
Phone: +82 51 629 5773
Fax: +82 51 629 5779

On Wed, Mar 28, 2018 at 12:20 PM, Mark Abraham 
wrote:

> Hi,
>
> Attachments can't be accepted on the list - please upload to a file sharing
> service and share links to those.
>
> Mark
>
> On Wed, Mar 28, 2018 at 6:16 PM Myunggi Yi  wrote:
>
> > I am attaching the file.
> >
> > Thank you.
> >
> > Myunggi Yi
> >
> > On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll 
> > wrote:
> >
> > > Again, please share the exact log files / description of inputs. What
> > > does "bad performance" mean?
> > > --
> > > Szilárd
> > >
> > >
> > > On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi 
> > wrote:
> > > > Dear users,
> > > >
> > > > I have two questions.
> > > >
> > > >
> > > > 1. I used to run typical simulations with the following command.
> > > >
> > > > gmx mdrun -deffnm md
> > > >
> > > > I had no problem.
> > > >
> > > >
> > > > Now I am running a simulation with "Dry_Martini" FF with the
> following
> > > > input.
> > > >
> > > >
> > > > integrator   = sd
> > > > tinit= 0.0
> > > > dt   = 0.040
> > > > nsteps   = 100
> > > >
> > > > nstlog   = 5000
> > > > nstenergy= 5000
> > > > nstxout-compressed   = 5000
> > > > compressed-x-precision   = 100
> > > >
> > > > cutoff-scheme= Verlet
> > > > nstlist  = 10
> > > > ns_type  = grid
> > > > pbc  = xyz
> > > > verlet-buffer-tolerance  = 0.005
> > > >
> > > > epsilon_r= 15
> > > > coulombtype  = reaction-field
> > > > rcoulomb = 1.1
> > > > vdw_type = cutoff
> > > > vdw-modifier = Potential-shift-verlet
> > > > rvdw = 1.1
> > > >
> > > > tc-grps  = system
> > > > tau_t= 4.0
> > > > ref_t= 310
> > > >
> > > > ; Pressure coupling:
> > > > Pcoupl   = no
> > > >
> > > > ; GENERATE VELOCITIES FOR STARTUP RUN:
> > > > gen_vel  = yes
> > > > gen_temp = 310
> > > > gen_seed = 1521731368
> > > >
> > > >
> > > >
> > > > If I use the same command to submit the job.
> > > > I got the following error. I don't know why.
> > > >
> > > > ---
> > > > Program: gmx mdrun, version 2018.1
> > > > Source file: src/gromacs/taskassignment/resourcedivision.cpp (line
> 224)
> > > >
> > > > Fatal error:
> > > > When using GPUs, setting the number of OpenMP threads without
> > specifying
> > > the
> > > > number of ranks can lead to conflicting demands. Please specify the
> > > number
> > > > of
> > > > thread-MPI ranks as well (option -ntmpi).
> > > >
> > > > For more information and tips for troubleshooting, please check the
> > > GROMACS
> > > > website at http://www.gromacs.org/Documentation/Errors
> > > > ---
> > > >
> > > >
> > > > So I did run simulation with the following command.
> > > >
> > > >
> > > > gmx mdrun -deffnm md -ntmpi 1
> > > >
> > > >
> > > > Now the performance is extremely bad.
> > > > Since yesterday, the log file still reporting the first step's
> energy.
> > > >
> > > > 2. This is the second question. Why?
> > > >
> > > > Can anyone help?
> > > >
> > > >
> > > > Myunggi Yi
> > > > --
> > > > Gromacs Users mailing list
> > > >
> > > > * Please search the archive at http://www.gromacs.org/Support
> > > /Mailing_Lists/GMX-Users_List before posting!
> > > >
> > > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > > >
> > > > * For (un)subscribe requests visit
> > > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users
> or
> > > send a mail to gmx-users-requ...@gromacs.org.
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at http://www.gromacs.org/Support
> > > /Mailing_Lists/GMX-Users_List before posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > > send a mail to gmx-users-requ...@gromacs.org.
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List 

Re: [gmx-users] Performance

2018-03-28 Thread Mark Abraham
Hi,

Attachments can't be accepted on the list - please upload to a file sharing
service and share links to those.

Mark

On Wed, Mar 28, 2018 at 6:16 PM Myunggi Yi  wrote:

> I am attaching the file.
>
> Thank you.
>
> Myunggi Yi
>
> On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll 
> wrote:
>
> > Again, please share the exact log files / description of inputs. What
> > does "bad performance" mean?
> > --
> > Szilárd
> >
> >
> > On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi 
> wrote:
> > > Dear users,
> > >
> > > I have two questions.
> > >
> > >
> > > 1. I used to run typical simulations with the following command.
> > >
> > > gmx mdrun -deffnm md
> > >
> > > I had no problem.
> > >
> > >
> > > Now I am running a simulation with "Dry_Martini" FF with the following
> > > input.
> > >
> > >
> > > integrator   = sd
> > > tinit= 0.0
> > > dt   = 0.040
> > > nsteps   = 100
> > >
> > > nstlog   = 5000
> > > nstenergy= 5000
> > > nstxout-compressed   = 5000
> > > compressed-x-precision   = 100
> > >
> > > cutoff-scheme= Verlet
> > > nstlist  = 10
> > > ns_type  = grid
> > > pbc  = xyz
> > > verlet-buffer-tolerance  = 0.005
> > >
> > > epsilon_r= 15
> > > coulombtype  = reaction-field
> > > rcoulomb = 1.1
> > > vdw_type = cutoff
> > > vdw-modifier = Potential-shift-verlet
> > > rvdw = 1.1
> > >
> > > tc-grps  = system
> > > tau_t= 4.0
> > > ref_t= 310
> > >
> > > ; Pressure coupling:
> > > Pcoupl   = no
> > >
> > > ; GENERATE VELOCITIES FOR STARTUP RUN:
> > > gen_vel  = yes
> > > gen_temp = 310
> > > gen_seed = 1521731368
> > >
> > >
> > >
> > > If I use the same command to submit the job.
> > > I got the following error. I don't know why.
> > >
> > > ---
> > > Program: gmx mdrun, version 2018.1
> > > Source file: src/gromacs/taskassignment/resourcedivision.cpp (line 224)
> > >
> > > Fatal error:
> > > When using GPUs, setting the number of OpenMP threads without
> specifying
> > the
> > > number of ranks can lead to conflicting demands. Please specify the
> > number
> > > of
> > > thread-MPI ranks as well (option -ntmpi).
> > >
> > > For more information and tips for troubleshooting, please check the
> > GROMACS
> > > website at http://www.gromacs.org/Documentation/Errors
> > > ---
> > >
> > >
> > > So I did run simulation with the following command.
> > >
> > >
> > > gmx mdrun -deffnm md -ntmpi 1
> > >
> > >
> > > Now the performance is extremely bad.
> > > Since yesterday, the log file still reporting the first step's energy.
> > >
> > > 2. This is the second question. Why?
> > >
> > > Can anyone help?
> > >
> > >
> > > Myunggi Yi
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at http://www.gromacs.org/Support
> > /Mailing_Lists/GMX-Users_List before posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-requ...@gromacs.org.
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at http://www.gromacs.org/Support
> > /Mailing_Lists/GMX-Users_List before posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-requ...@gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] Performance

2018-03-28 Thread Myunggi Yi
I am attaching the file.

Thank you.

Myunggi Yi

On Wed, Mar 28, 2018 at 11:40 AM, Szilárd Páll 
wrote:

> Again, please share the exact log files / description of inputs. What
> does "bad performance" mean?
> --
> Szilárd
>
>
> On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi  wrote:
> > Dear users,
> >
> > I have two questions.
> >
> >
> > 1. I used to run typical simulations with the following command.
> >
> > gmx mdrun -deffnm md
> >
> > I had no problem.
> >
> >
> > Now I am running a simulation with "Dry_Martini" FF with the following
> > input.
> >
> >
> > integrator   = sd
> > tinit= 0.0
> > dt   = 0.040
> > nsteps   = 100
> >
> > nstlog   = 5000
> > nstenergy= 5000
> > nstxout-compressed   = 5000
> > compressed-x-precision   = 100
> >
> > cutoff-scheme= Verlet
> > nstlist  = 10
> > ns_type  = grid
> > pbc  = xyz
> > verlet-buffer-tolerance  = 0.005
> >
> > epsilon_r= 15
> > coulombtype  = reaction-field
> > rcoulomb = 1.1
> > vdw_type = cutoff
> > vdw-modifier = Potential-shift-verlet
> > rvdw = 1.1
> >
> > tc-grps  = system
> > tau_t= 4.0
> > ref_t= 310
> >
> > ; Pressure coupling:
> > Pcoupl   = no
> >
> > ; GENERATE VELOCITIES FOR STARTUP RUN:
> > gen_vel  = yes
> > gen_temp = 310
> > gen_seed = 1521731368
> >
> >
> >
> > If I use the same command to submit the job.
> > I got the following error. I don't know why.
> >
> > ---
> > Program: gmx mdrun, version 2018.1
> > Source file: src/gromacs/taskassignment/resourcedivision.cpp (line 224)
> >
> > Fatal error:
> > When using GPUs, setting the number of OpenMP threads without specifying
> the
> > number of ranks can lead to conflicting demands. Please specify the
> number
> > of
> > thread-MPI ranks as well (option -ntmpi).
> >
> > For more information and tips for troubleshooting, please check the
> GROMACS
> > website at http://www.gromacs.org/Documentation/Errors
> > ---
> >
> >
> > So I did run simulation with the following command.
> >
> >
> > gmx mdrun -deffnm md -ntmpi 1
> >
> >
> > Now the performance is extremely bad.
> > Since yesterday, the log file still reporting the first step's energy.
> >
> > 2. This is the second question. Why?
> >
> > Can anyone help?
> >
> >
> > Myunggi Yi
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at http://www.gromacs.org/Support
> /Mailing_Lists/GMX-Users_List before posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support
> /Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] Performance

2018-03-28 Thread Szilárd Páll
Again, please share the exact log files / description of inputs. What
does "bad performance" mean?
--
Szilárd


On Wed, Mar 28, 2018 at 5:31 PM, Myunggi Yi  wrote:
> Dear users,
>
> I have two questions.
>
>
> 1. I used to run typical simulations with the following command.
>
> gmx mdrun -deffnm md
>
> I had no problem.
>
>
> Now I am running a simulation with "Dry_Martini" FF with the following
> input.
>
>
> integrator   = sd
> tinit= 0.0
> dt   = 0.040
> nsteps   = 100
>
> nstlog   = 5000
> nstenergy= 5000
> nstxout-compressed   = 5000
> compressed-x-precision   = 100
>
> cutoff-scheme= Verlet
> nstlist  = 10
> ns_type  = grid
> pbc  = xyz
> verlet-buffer-tolerance  = 0.005
>
> epsilon_r= 15
> coulombtype  = reaction-field
> rcoulomb = 1.1
> vdw_type = cutoff
> vdw-modifier = Potential-shift-verlet
> rvdw = 1.1
>
> tc-grps  = system
> tau_t= 4.0
> ref_t= 310
>
> ; Pressure coupling:
> Pcoupl   = no
>
> ; GENERATE VELOCITIES FOR STARTUP RUN:
> gen_vel  = yes
> gen_temp = 310
> gen_seed = 1521731368
>
>
>
> If I use the same command to submit the job.
> I got the following error. I don't know why.
>
> ---
> Program: gmx mdrun, version 2018.1
> Source file: src/gromacs/taskassignment/resourcedivision.cpp (line 224)
>
> Fatal error:
> When using GPUs, setting the number of OpenMP threads without specifying the
> number of ranks can lead to conflicting demands. Please specify the number
> of
> thread-MPI ranks as well (option -ntmpi).
>
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> ---
>
>
> So I did run simulation with the following command.
>
>
> gmx mdrun -deffnm md -ntmpi 1
>
>
> Now the performance is extremely bad.
> Since yesterday, the log file still reporting the first step's energy.
>
> 2. This is the second question. Why?
>
> Can anyone help?
>
>
> Myunggi Yi
> --
> Gromacs Users mailing list
>
> * Please search the archive at 
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
> mail to gmx-users-requ...@gromacs.org.

[gmx-users] Performance

2018-03-28 Thread Myunggi Yi
Dear users,

I have two questions.


1. I used to run typical simulations with the following command.

gmx mdrun -deffnm md

I had no problem.


Now I am running a simulation with "Dry_Martini" FF with the following
input.


integrator   = sd
tinit= 0.0
dt   = 0.040
nsteps   = 100

nstlog   = 5000
nstenergy= 5000
nstxout-compressed   = 5000
compressed-x-precision   = 100

cutoff-scheme= Verlet
nstlist  = 10
ns_type  = grid
pbc  = xyz
verlet-buffer-tolerance  = 0.005

epsilon_r= 15
coulombtype  = reaction-field
rcoulomb = 1.1
vdw_type = cutoff
vdw-modifier = Potential-shift-verlet
rvdw = 1.1

tc-grps  = system
tau_t= 4.0
ref_t= 310

; Pressure coupling:
Pcoupl   = no

; GENERATE VELOCITIES FOR STARTUP RUN:
gen_vel  = yes
gen_temp = 310
gen_seed = 1521731368



If I use the same command to submit the job.
I got the following error. I don't know why.

---
Program: gmx mdrun, version 2018.1
Source file: src/gromacs/taskassignment/resourcedivision.cpp (line 224)

Fatal error:
When using GPUs, setting the number of OpenMP threads without specifying the
number of ranks can lead to conflicting demands. Please specify the number
of
thread-MPI ranks as well (option -ntmpi).

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
---


So I did run simulation with the following command.


gmx mdrun -deffnm md -ntmpi 1


Now the performance is extremely bad.
Since yesterday, the log file is still reporting the first step's energy.

2. This is the second question. Why?

Can anyone help?


Myunggi Yi


Re: [gmx-users] Performance gains with AVX_512 ?

2017-12-12 Thread Kutzner, Carsten
Hi Szilárd,

> On 12. Dec 2017, at 17:58, Szilárd Páll  wrote:
> 
> Hi Carsten,
> 
> The performance behavior you observe is expected, I have observed it
> myself. Nothing seems unusual in the performance numbers you report.
> 
> The AVX512 clock throttle is additional (10-20% IIRC) to the AVX2 throttle,
> and the only code that really gains significantly from AVX512 is the
> nonbonded kernels. When those are offloaded, the gain from higher clocks
> with AVX2 will translate to better CPU performance (and especially if the
> run is CPU-bound, that will make a significant difference).
> 
> BTW, on the low- and mid-range CPUs ("Bronze"/"Silver" and "cut-down" i9s)
> AVX512 is even less likely to ever be worth it.
So using AVX2 on GPU nodes seems generally to be the fastest option. 
Thanks a lot for the info! 

Best,
  Carsten

> 
> Cheers,
> 
> --
> Szilárd
> 
> On Tue, Dec 12, 2017 at 3:07 PM, Kutzner, Carsten  wrote:
> 
>> Hi,
>> 
>> what are the expected performance benefits of AVX_512 SIMD instructions
>> on Intel Skylake processors, compared to AVX2_256? In many cases, I see
>> a significantly (15 %) higher GROMACS 2016 / 2018b2 performance when using
>> AVX2_256 instead of AVX_512. I would have guessed that AVX_512 is at least
>> not slower than inferior instruction sets.
>> 
>> Some quick benchmarks results:
>> Node with 2x12 core (48 threads) Xeon Gold 6146 plus 2x GTX 1080Ti
>> 80k atoms membrane benchmark system, 2 fs time step, pme on cpu
>> 
>> GROMACS v.  SIMD      ns/d
>> 2016        AVX_512   102.3
>> 2016        AVX2_256  119.3
>> 2018b2      AVX_512   107.9
>> 2018b2      AVX2_256  123.2
>> 
>> I realize that AVX_512 turbo frequencies are significantly lower
>> compared to AVX2_256 if all cores are in use, and for a serial run,
>> AVX_512 is indeed by about 6% faster than AVX2_256.
>> 
> 
> By "serial" you mean single-threaded runs? Single-core turbo on this 165W
> CPU will be pretty high (>=4.2 GHz) and it is not likely to reflect the
> relative difference at the base clock.
> 
> Gromacs 2018b2, -nb cpu
>> thread-MPI   ns/day    ns/day     improvement
>> threads      AVX_512   AVX2_256   over AVX2
>>  1            2.880     2.702     1.065
>>  2            5.451     5.209     1.046
>>  4            9.617     9.332     1.031
>>  8           17.469    17.276     1.011
>> 12           21.852    24.245      .901
>> 16           28.579    31.691      .902
>> 24           39.731    41.576      .956
>> 48           41.831    39.336     1.063
>> 
> 
> Does this mean that for all but rows 5, 7, and 8 you left the socket(s)
> partially empty?
> 
> 
> Cheers,
> --
> Szilárd
> 
> 
>> Can anyone comment on whether that is the expected behavior and why?
>> 
>> Thanks!
>>  Carsten
>> 
>> 
>> 
>> --
>> Gromacs Users mailing list
>> 
>> * Please search the archive at http://www.gromacs.org/
>> Support/Mailing_Lists/GMX-Users_List before posting!
>> 
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> 
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> send a mail to gmx-users-requ...@gromacs.org.
>> 
> -- 
> Gromacs Users mailing list
> 
> * Please search the archive at 
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
> 
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> 
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
> mail to gmx-users-requ...@gromacs.org.


Re: [gmx-users] Performance gains with AVX_512 ?

2017-12-12 Thread Szilárd Páll
Hi Carsten,

The performance behavior you observe is expected, I have observed it
myself. Nothing seems unusual in the performance numbers you report.

The AVX512 clock throttle is additional (10-20% IIRC) to the AVX2 throttle,
and the only code that really gains significantly from AVX512 is the
nonbonded kernels. When those are offloaded, the gain from higher clocks
with AVX2 will translate to better CPU performance (and especially if the
run is CPU-bound, that will make a significant difference).

BTW, on the low- and mid-range CPUs ("Bronze"/"Silver" and "cut-down" i9s)
AVX512 is even less likely to ever be worth it.

Cheers,

--
Szilárd
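
For anyone who wants to reproduce such a comparison, a sketch of building both SIMD flavours side by side; GMX_SIMD is the CMake option that fixes the SIMD kernels at compile time, while the build directories, install prefixes and -j value below are placeholders:

# Run from two empty build directories inside the GROMACS source tree.
mkdir build-avx2 && cd build-avx2
cmake .. -DGMX_SIMD=AVX2_256 -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-avx2_256
make -j 16 install
cd ..
mkdir build-avx512 && cd build-avx512
cmake .. -DGMX_SIMD=AVX_512 -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-avx_512
make -j 16 install

Running the same .tpr with each install then isolates the SIMD effect from everything else in the setup.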

On Tue, Dec 12, 2017 at 3:07 PM, Kutzner, Carsten  wrote:

> Hi,
>
> what are the expected performance benefits of AVX_512 SIMD instructions
> on Intel Skylake processors, compared to AVX2_256? In many cases, I see
> a significantly (15 %) higher GROMACS 2016 / 2018b2 performance when using
> AVX2_256 instead of AVX_512. I would have guessed that AVX_512 is at least
> not slower than inferior instruction sets.
>
> Some quick benchmarks results:
> Node with 2x12 core (48 threads) Xeon Gold 6146 plus 2x GTX 1080Ti
> 80k atoms membrane benchmark system, 2 fs time step, pme on cpu
>
> GROMACS v.  SIMD      ns/d
> 2016        AVX_512   102.3
> 2016        AVX2_256  119.3
> 2018b2      AVX_512   107.9
> 2018b2      AVX2_256  123.2
>
> I realize that AVX_512 turbo frequencies are significantly lower
> compared to AVX2_256 if all cores are in use, and for a serial run,
> AVX_512 is indeed by about 6% faster than AVX2_256.
>

By "serial" you mean single-threaded runs? Single-core turbo on this 165W
CPU will be pretty high (>=4.2 GHz) and it is not likely to reflect the
relative difference at the base clock.

Gromacs 2018b2, -nb cpu
> thread-MPI   ns/day    ns/day     improvement
> threads      AVX_512   AVX2_256   over AVX2
>  1            2.880     2.702     1.065
>  2            5.451     5.209     1.046
>  4            9.617     9.332     1.031
>  8           17.469    17.276     1.011
> 12           21.852    24.245      .901
> 16           28.579    31.691      .902
> 24           39.731    41.576      .956
> 48           41.831    39.336     1.063
>

Does this mean that for all but rows 5, 7, and 8 you left the socket(s)
partially empty?


Cheers,
--
Szilárd


> Can anyone comment on whether that is the expected behavior and why?
>
> Thanks!
>   Carsten
>
>
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
>

[gmx-users] Performance gains with AVX_512 ?

2017-12-12 Thread Kutzner, Carsten
Hi,

what are the expected performance benefits of AVX_512 SIMD instructions
on Intel Skylake processors, compared to AVX2_256? In many cases, I see
a significantly (15 %) higher GROMACS 2016 / 2018b2 performance when using
AVX2_256 instead of AVX_512. I would have guessed that AVX_512 is at least
not slower than inferior instruction sets.

Some quick benchmark results:
Node with 2x 12-core (48 threads) Xeon Gold 6146 plus 2x GTX 1080Ti
80k atoms membrane benchmark system, 2 fs time step, PME on CPU

GROMACS v.  SIMD      ns/d
2016        AVX_512   102.3
2016        AVX2_256  119.3
2018b2      AVX_512   107.9
2018b2      AVX2_256  123.2

I realize that AVX_512 turbo frequencies are significantly lower
compared to AVX2_256 if all cores are in use, and for a serial run,
AVX_512 is indeed by about 6% faster than AVX2_256.

Gromacs 2018b2, -nb cpu
thread-MPI   ns/day    ns/day     improvement
threads      AVX_512   AVX2_256   over AVX2
 1            2.880     2.702     1.065
 2            5.451     5.209     1.046
 4            9.617     9.332     1.031
 8           17.469    17.276     1.011
12           21.852    24.245      .901
16           28.579    31.691      .902
24           39.731    41.576      .956
48           41.831    39.336     1.063

Can anyone comment on whether that is the expected behavior and why?

Thanks!
  Carsten





Re: [gmx-users] Performance test

2017-11-27 Thread Javier Luque Di Salvo
Dear community,
I share the results of the scaling/performance test. I used this command and
checked the core usage with the help of the htop tool (http://hisham.hm/htop/):
gmx mdrun -ntmpi 1 -ntomp N -pin on -deffnm  &

Here N is the number of (logical) cores, and the hardware is an Intel(R) Core(TM)
i7-6700 @ 3.40 GHz with 16 GB RAM and no GPU. I tested two polymer chains of
different size (psu10 = 552 atoms; psu36 = 1956 atoms) in 1 ns NPT simulations
of the previously equilibrated system; the MD settings were Berendsen
thermostat and barostat, V-rescale, a 1 fs time step, a 1.0 nm cut-off, and PME
for the Coulomb computation. The figures are at this link:
https://goo.gl/bVZKcU

And the table with the values, in case of problems opening the figures:
PSU10 (552 atoms)
N   wall-time  ns/day
1   1057.166   81.7025
2   631.117  136.908
3   461.265  187.448
4   352.821  244.886
5   440.070  196.393
6   386.782  223.346
7   348.273  248.083
8   389.243  255.187
--
PSU36 (1956 atoms)
N   wall-time  ns/day
1   2259.990  38.231
2   1254.619  68.870
3   875.394   99.267
4   672.042   128.570
5   822.385   105.056
6   712.061   121.338
7   628.172   137.551
8   576.145   149.963

Kind regards,
Javi
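
A sketch of how a scan like the one above could be scripted end to end (topol.tpr, the step count and the bench_omp* names are placeholders; the ns/day figure is read from the "Performance:" line that mdrun writes at the end of its log):

for n in 1 2 3 4 5 6 7 8; do
    gmx mdrun -s topol.tpr -deffnm bench_omp$n \
              -ntmpi 1 -ntomp $n -pin on \
              -nsteps 50000 -resethway      # reset timers halfway so startup cost is excluded
    grep "^Performance:" bench_omp$n.log    # ns/day and hours/ns for this thread count
done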

2017-11-21 13:50 GMT+01:00 Javier E :

> Dear users,
>
> I'm doing a performance analysis following this link
> http://manual.gromacs.org/documentation/5.1/user-guide/
> mdrun-performance.html and wanted to ask:
>
> Is there a "standard" procedure to test performance in gromacs (on single
> nodes, one multi-processor CPU)? Following there are some results, the
> system is a small polymeric chain of 542 atoms with no water and NPT 100 ps
> runs (if more information about md settings are needed please ask):
>
> Running on 1 node with total 4 cores, 8 logical cores
> Hardware detected:
>   CPU info:
> Vendor: GenuineIntel
> Brand:  Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
> SIMD instructions most likely to fit this hardware: AVX2_256
> SIMD instructions selected at GROMACS compile time: AVX2_256
>
> GROMACS version: VERSION 5.1.4
> Precision:single
> Memory model:64 bit
> MPI library:  thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:   disabled
> OpenCL support:  disabled
>
>
> gmx mdrun -ntmpi 1 -ntomp # -v -deffnm out.tpr
>
>
> -------------------------------------------------------------------
> -ntomp | MPI/OpenMP     | Wall time (s) | ns/day  | % CPU | Note?*
> -------------------------------------------------------------------
>    1   | 1/1            |     1075.764  |  80.315 | 100.0 |  No
>    2   | 1/2            |      619.679  | 139.427 | 200.0 |  Yes
>    3   | 1/3            |      458.721  | 188.350 | 299.9 |  Yes
>    4   | 1/4            |      356.906  | 242.081 | 399.8 |  Yes
>    5   | 1/5            |      433.572  | 199.275 | 499.0 |  Yes
>    6   | 1/6            |      378.951  | 227.998 | 598.0 |  Yes
>    7   | 1/7            |      355.785  | 242.844 | 693.1 |  Yes
>    8   | 1/8 (default)  |      328.520  | 262.081 | 779.0 |  No
> -------------------------------------------------------------------
>
> *NOTE: The number of threads is not equal to the number of (logical) cores
>   and the -pin option is set to auto: will not pin thread to cores.
>
>
> If (MPI-Threads)*(OpenMP-Threads) = number of threads, does mdrun uses
> number of cores= number of threads, and this can be seen in the %CPU usage?
>
> For example, as I installed GROMACS in default, the GMX_OpenMP_MAX_THREAD
> is set at 32 threads, but this will never happen with this hardware (4
> cores, 8 logical), is this correct? By now I'm re-running the exact same
> tests to have at least one replica, and extending the system size and the
> and run time. Any suggestions on how to deep further in this kind of tests
> are welcome,
>
> Best regards
> --
>
> 
>
> *Javier Luque Di Salvo*
>
> Dipartamento di Ingegneria Chimica
>
> Universitá Degli Studi di Palermo
> *Viale delle Scienze, Ed. 6*
> *90128 PALERMO (PA)*
> *+39.09123867503 <+39%20091%202386%207503>*
>



-- 



*Javier Luque Di Salvo*

Dipartamento di Ingegneria Chimica

Universitá Degli Studi di Palermo
*Viale delle Scienze, Ed. 6*
*90128 PALERMO (PA)*
*+39.09123867503*

[gmx-users] Performance test

2017-11-21 Thread Javier E
Dear users,

I'm doing a performance analysis following this link
http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-performance.html
and wanted to ask:

Is there a "standard" procedure to test performance in gromacs (on single
nodes, one multi-processor CPU)? Following there are some results, the
system is a small polymeric chain of 542 atoms with no water and NPT 100 ps
runs (if more information about md settings are needed please ask):

Running on 1 node with total 4 cores, 8 logical cores
Hardware detected:
  CPU info:
Vendor: GenuineIntel
Brand:  Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX2_256

GROMACS version:    VERSION 5.1.4
Precision:          single
Memory model:       64 bit
MPI library:        thread_mpi
OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:        disabled
OpenCL support:     disabled


gmx mdrun -ntmpi 1 -ntomp # -v -deffnm out.tpr


-------------------------------------------------------------------
-ntomp | MPI/OpenMP     | Wall time (s) | ns/day  | % CPU | Note?*
-------------------------------------------------------------------
   1   | 1/1            |     1075.764  |  80.315 | 100.0 |  No
   2   | 1/2            |      619.679  | 139.427 | 200.0 |  Yes
   3   | 1/3            |      458.721  | 188.350 | 299.9 |  Yes
   4   | 1/4            |      356.906  | 242.081 | 399.8 |  Yes
   5   | 1/5            |      433.572  | 199.275 | 499.0 |  Yes
   6   | 1/6            |      378.951  | 227.998 | 598.0 |  Yes
   7   | 1/7            |      355.785  | 242.844 | 693.1 |  Yes
   8   | 1/8 (default)  |      328.520  | 262.081 | 779.0 |  No
-------------------------------------------------------------------

*NOTE: The number of threads is not equal to the number of (logical) cores
  and the -pin option is set to auto: will not pin thread to cores.
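
(As an aside: instead of relying on the auto setting, pinning can be
requested explicitly, e.g. something like
  gmx mdrun -ntmpi 1 -ntomp 4 -pin on -v -deffnm out
where the thread counts are just an example.)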

If (MPI threads) * (OpenMP threads) = number of threads, does mdrun use a
number of cores equal to the number of threads, and can this be seen in the
%CPU usage?
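
(E.g. -ntmpi 2 -ntomp 4 would start 2 x 4 = 8 threads, which at full
utilisation should show up as roughly 800 % CPU in top/htop.)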

For example, as I installed GROMACS with the defaults, GMX_OPENMP_MAX_THREADS
is set to 32 threads, but this will never be reached with this hardware (4
cores, 8 logical), is this correct? For now I'm re-running the exact same
tests to have at least one replica, and extending the system size and the
run time. Any suggestions on how to dig deeper into this kind of test are
welcome,

Best regards
-- 



*Javier Luque Di Salvo*

Dipartamento di Ingegneria Chimica

Universitá Degli Studi di Palermo
*Viale delle Scienze, Ed. 6*
*90128 PALERMO (PA)*
*+39.09123867503*

Re: [gmx-users] Performance difference when using Gromacs 5.0 with different vector instructions

2017-10-19 Thread Mark Abraham
Yes, thus "always"

Mark

On Fri, 20 Oct 2017 00:19 MING HA  wrote:

> Dear Mark,
>
>
> Thanks for the quick response. So, in general, running using AVX
> instructions will yield
> better performance than SSE4.1?
>
>
> Sincerely,
> Ming
>
> On Thu, Oct 19, 2017 at 5:18 PM, Mark Abraham 
> wrote:
>
> > Hi,
> >
> > In short, yes. See
> > http://manual.gromacs.org/documentation/2016.4/install-
> > guide/index.html#simd-support.
> > You should generally always use a GROMACS binary compiled for the highest
> > SIMD level supported by your hardware. Your mdrun .log file will advise
> you
> > when it observes that you are not.
> >
> > Mark
> >
> > On Thu, Oct 19, 2017 at 7:02 PM MING HA  >
> > wrote:
> >
> > > Hi all,
> > >
> > >
> > > I am running several resources using Gromacs 5.0 to run my simulations.
> > > On some resources, Gromacs is compiled using SSE4.1 SIMD instructions,
> > > while on others AVX_256 or AVX2_256 is used. While I don't find much
> of a
> > > performance difference between AVX_256 and AVX2_256 instructions, there
> > > is a large performance difference between resources that use SSE4.1 and
> > > AVX instructions. Specifically, resources using SSE4.1 are about 2-3x
> > > slower
> > > than those that use AVX.
> > >
> > > I'm kind of new to the SIMD instructions used by Gromacs, so I was
> > > wondering
> > > whether the instruction set is causing the large performance
> difference.
> > >
> > >
> > > Sincerely,
> > > Ming
> > > --
> > > Gromacs Users mailing list
> > >
> > > * Please search the archive at
> > > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > > posting!
> > >
> > > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> > >
> > > * For (un)subscribe requests visit
> > > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > > send a mail to gmx-users-requ...@gromacs.org.
> > >
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at http://www.gromacs.org/
> > Support/Mailing_Lists/GMX-Users_List before posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-requ...@gromacs.org.
> >
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
>


Re: [gmx-users] Performance difference when using Gromacs 5.0 with different vector instructions

2017-10-19 Thread MING HA
Dear Mark,


Thanks for the quick response. So, in general, running using AVX
instructions will yield
better performance than SSE4.1?


Sincerely,
Ming

On Thu, Oct 19, 2017 at 5:18 PM, Mark Abraham 
wrote:

> Hi,
>
> In short, yes. See
> http://manual.gromacs.org/documentation/2016.4/install-
> guide/index.html#simd-support.
> You should generally always use a GROMACS binary compiled for the highest
> SIMD level supported by your hardware. Your mdrun .log file will advise you
> when it observes that you are not.
>
> Mark
>
> On Thu, Oct 19, 2017 at 7:02 PM MING HA 
> wrote:
>
> > Hi all,
> >
> >
> > I am running several resources using Gromacs 5.0 to run my simulations.
> > On some resources, Gromacs is compiled using SSE4.1 SIMD instructions,
> > while on others AVX_256 or AVX2_256 is used. While I don't find much of a
> > performance difference between AVX_256 and AVX2_256 instructions, there
> > is a large performance difference between resources that use SSE4.1 and
> > AVX instructions. Specifically, resources using SSE4.1 are about 2-3x
> > slower
> > than those that use AVX.
> >
> > I'm kind of new to the SIMD instructions used by Gromacs, so I was
> > wondering
> > whether the instruction set is causing the large performance difference.
> >
> >
> > Sincerely,
> > Ming
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> > posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> > send a mail to gmx-users-requ...@gromacs.org.
> >
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
>


Re: [gmx-users] Performance difference when using Gromacs 5.0 with different vector instructions

2017-10-19 Thread Mark Abraham
Hi,

In short, yes. See
http://manual.gromacs.org/documentation/2016.4/install-guide/index.html#simd-support.
You should generally always use a GROMACS binary compiled for the highest
SIMD level supported by your hardware. Your mdrun .log file will advise you
when it observes that you are not.
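
A quick way to check is to grep the log header, e.g. (a sketch; md.log
stands for your own log file):

  grep "SIMD instructions" md.log

which prints lines like "SIMD instructions most likely to fit this hardware:
AVX2_256" and "SIMD instructions selected at GROMACS compile time: SSE4.1";
if the two differ, you are leaving performance on the table.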

Mark

On Thu, Oct 19, 2017 at 7:02 PM MING HA 
wrote:

> Hi all,
>
>
> I am running several resources using Gromacs 5.0 to run my simulations.
> On some resources, Gromacs is compiled using SSE4.1 SIMD instructions,
> while on others AVX_256 or AVX2_256 is used. While I don't find much of a
> performance difference between AVX_256 and AVX2_256 instructions, there
> is a large performance difference between resources that use SSE4.1 and
> AVX instructions. Specifically, resources using SSE4.1 are about 2-3x
> slower
> than those that use AVX.
>
> I'm kind of new to the SIMD instructions used by Gromacs, so I was
> wondering
> whether the instruction set is causing the large performance difference.
>
>
> Sincerely,
> Ming
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
>


[gmx-users] Performance difference when using Gromacs 5.0 with different vector instructions

2017-10-19 Thread MING HA
Hi all,


I am running my simulations with Gromacs 5.0 on several computing resources.
On some resources, Gromacs is compiled using SSE4.1 SIMD instructions,
while on others AVX_256 or AVX2_256 is used. While I don't find much of a
performance difference between AVX_256 and AVX2_256 instructions, there
is a large performance difference between resources that use SSE4.1 and
AVX instructions. Specifically, resources using SSE4.1 are about 2-3x slower
than those that use AVX.

I'm kind of new to the SIMD instructions used by Gromacs, so I was
wondering
whether the instruction set is causing the large performance difference.


Sincerely,
Ming


Re: [gmx-users] performance

2017-09-21 Thread gromacs query
Hi Szilárd,

Thanks a lot for your time; see my replies below. Overall they are very
useful, and I hope this long-running discussion thread will be useful to
future users. (Also, could you please see my other email pointing out
errors(?)/repetitions in the web documentation about performance?)

'multi/multidir' is not very helpful in my case, as my simulations sometimes
crash, and restarting them would be a pain because there are many (many!)
simulations. Also, one can never be sure whether other users on shared-node
clusters will use the -multi/-multidir option or not. I have read your other
email suggestions [tagged: the importance of process/thread affinity,
especially in node-sharing setups], where node sharing among different users
can be an issue that ultimately depends on the job scheduler.

My replies are inserted here:


On Thu, Sep 21, 2017 at 4:54 PM, Szilárd Páll 
wrote:

> Hi,
>
> A few remarks in no particular order:
>
> 1. Avoid domain-decomposition unless necessary (especially in
> CPU-bound runs, and especially with PME), it has a non-negligible
> overhead (greatest when going from no DD to using DD). Running
> multi-threading only typically has better performance. There are
> exceptions (e.g. your case of reaction-field runs could be such a
> case, but I'm doubtful as the DD cost is signiificant). Hence, I
> suggest trying 1, 2, 4... ranks per simulation, i.e.
> mpirun -np 1 gmx mdrun -ntomp N (single-run)
> mpirun -np 2 gmx mdrun -ntomp N/2 (single-run)
> mpirun -np 4 gmx mdrun -ntomp N/4 (single-run)
> [...]
> The multi-run equivalents of the above would simply use M ranks where
> M=Nmulti * Nranks_per_run.


You mean -dlb no? I did not modify it, so it should be in auto mode; I can
try it though. And yes, indeed I have tried many other cases where I vary
-np gradually. I just shared one of the glitchy performance cases [I have a
wealth of such cases :)], which I now suspect is a slurm scheduler issue. I
need to ask the admin whether jobs are given core affinities.


> 2. If you're aiming for best throughput place two or more
> _independent_ runs on the same GPU, e.g. assuming 4 GPUs + 40 cores
> (and that no DD turns out to be best) to run 2 sim/GPU you can do:
> mpirun -np 8 -multi 8 gmx mdrun [-ntomp 5] [-gpu_id 00112233]
> The last two args can be omitted, but you should make sure that's what
> you get, i.e. that sim #0/#1 use GPU #0, sim #2/#3 use GPU#1, etc.
>

I am avoiding the -multi option as explained above, but this is useful.


> 3. 2a,b are clearly off, my hypothesis is still that they get pinned
> to the wrong cores. I suspect 6a,b are just lucky and happen to not be
> placed too badly. Plus 6 use 4 GPUs vs 7 only 2 GPUs, so that's not a
> fair comparison (and probably explains the 350 vs 300 ns/day).
>

Ah, sorry! Yes, my fault. I just checked: the 7th case uses 2 GPUs; I forgot
to change the GPU numbers.


>
> 4. -pin on is faster than letting the scheduler place jobs (e.g. 3ab
> vs 4b) which is in line with what I would expect.
>


> 5. The strange asymmetry in 8a vs 8b is due to 8b having failed to pin
> and running where it should not be (empty socket -> core turbo-ing?).
> The 4a / 4b mismatch is strange; are those using the very same system
> (tpr?) -- one of them reports higher load imbalance!
>
>
>
Yes, all these jobs (cases 1 to 8) use the same tpr.



> Overall, I suggest starting over and determining performance first by
> deciding: What DD setup is best and how to lay out jobs in a node to
> get best throughput. Start with run configs testing settings with
> -multi to avoid pinning headaches and fill at least half a node (or a
> full node) with #concurrent simulations >= #GPUs.
>

I will see if I can get a node free; I need to wait.

Thanks for all responses.

-J


> Cheers,
> --
> Szilárd
>
>
> On Mon, Sep 18, 2017 at 9:25 PM, gromacs query 
> wrote:
> > Hi Szilárd,
> >
> > {I had to trim the message as my message is put on hold because only 50kb
> > allowed and this message has reached 58 KB! Not due to files attached as
> > they are shared via dropbox}; Sorry seamless reading might be compromised
> > for future readers.
> >
> > Thanks for your replies. I have shared log files here:
> >
> > https://www.dropbox.com/s/m9mqqans0jci873/test_logs.zip?dl=0
> >
> > Two self-describing name folders have all the test logs. The test_*.log
> > file serial numbers correspond to my simulations briefly described here
> > [with folder names].
> >
> > For quick look one can: grep Performance *.log
> >
> > Folder 2gpu_4np:
> > Sr. no.  Remarks  performance (ns/day)
> > 1.  only one job  345 ns/day
> > 2a,b.  two same jobs together (without pin on)  16.1 and 15.9
> > 3a,b.  two same jobs together (without pin on, with -multidir)  270 and
> 276
> > 4a,b.  two same jobs together (pin on, pinoffset at 0 and 5)  160 and 301
> >
> >
> >
> > Folder:4gpu_16np
> >
> >
> >
> >
> > Remarks  performance (ns/day)
> > 5.  only one job  694 ns/day
> > 6a,b.  two same 

Re: [gmx-users] performance

2017-09-21 Thread Szilárd Páll
--
Szilárd


On Thu, Sep 21, 2017 at 5:54 PM, Szilárd Páll  wrote:
> Hi,
>
> A few remarks in no particular order:
>
> 1. Avoid domain-decomposition unless necessary (especially in
> CPU-bound runs, and especially with PME), it has a non-negligible
> overhead (greatest when going from no DD to using DD). Running
> multi-threading only typically has better performance. There are
> exceptions (e.g. your case of reaction-field runs could be such a
> case, but I'm doubtful as the DD cost is signiificant). Hence, I
> suggest trying 1, 2, 4... ranks per simulation, i.e.
> mpirun -np 1 gmx mdrun -ntomp N (single-run)
> mpirun -np 2 gmx mdrun -ntomp N/2 (single-run)
> mpirun -np 4 gmx mdrun -ntomp N/4 (single-run)
> [...]
> The multi-run equivalents of the above would simply use M ranks where
> M=Nmulti * Nranks_per_run.
>
> 2. If you're aiming for best throughput place two or more
> _independent_ runs on the same GPU, e.g. assuming 4 GPUs + 40 cores
> (and that no DD turns out to be best) to run 2 sim/GPU you can do:
> mpirun -np 8 -multi 8 gmx mdrun [-ntomp 5] [-gpu_id 00112233]
> The last two args can be omitted, but you should make sure that's what
> you get, i.e. that sim #0/#1 use GPU #0, sim #2/#3 use GPU#1, etc.

See Fig 5 of http://arxiv.org/abs/1507.00898 if you're not convinced.

> 3. 2a,b are clearly off, my hypothesis is still that they get pinned
> to the wrong cores. I suspect 6a,b are just lucky and happen to not be
> placed too badly. Plus 6 use 4 GPUs vs 7 only 2 GPUs, so that's not a
> fair comparison (and probably explains the 350 vs 300 ns/day).
>
> 4. -pin on is faster than letting the scheduler place jobs (e.g. 3ab
> vs 4b) which is in line with what I would expect.
>
> 5. The strange asymmetry in 8a vs 8b is due to 8b having failed to pin
> and running where it should not be (empty socket -> core turbo-ing?).
> The 4a / 4b mismatch is strange; are those using the very same system
> (tpr?) -- one of them reports higher load imbalance!
>
>
> Overall, I suggest starting over and determining performance first by
> deciding: What DD setup is best and how to lay out jobs in a node to
> get best throughput. Start with run configs testing settings with
> -multi to avoid pinning headaches and fill at least half a node (or a
> full node) with #concurrent simulations >= #GPUs.
>
> Cheers,
> --
> Szilárd
>
>
> On Mon, Sep 18, 2017 at 9:25 PM, gromacs query  wrote:
>> Hi Szilárd,
>>
>> {I had to trim the message as my message is put on hold because only 50kb
>> allowed and this message has reached 58 KB! Not due to files attached as
>> they are shared via dropbox}; Sorry seamless reading might be compromised
>> for future readers.
>>
>> Thanks for your replies. I have shared log files here:
>>
>> https://www.dropbox.com/s/m9mqqans0jci873/test_logs.zip?dl=0
>>
>> Two self-describing name folders have all the test logs. The test_*.log
>> file serial numbers correspond to my simulations briefly described here
>> [with folder names].
>>
>> For quick look one can: grep Performance *.log
>>
>> Folder 2gpu_4np:
>> Sr. no.  Remarks  performance (ns/day)
>> 1.  only one job  345 ns/day
>> 2a,b.  two same jobs together (without pin on)  16.1 and 15.9
>> 3a,b.  two same jobs together (without pin on, with -multidir)  270 and 276
>> 4a,b.  two same jobs together (pin on, pinoffset at 0 and 5)  160 and 301
>>
>>
>>
>> Folder:4gpu_16np
>>
>>
>>
>>
>> Remarks  performance (ns/day)
>> 5.  only one job  694 ns/day
>> 6a,b.  two same jobs together (without pin on)  340 and 350
>> 7a,b.  two same jobs together (without pin on, with -multidir)  302 and 304
>> 8a,b.  two same jobs together (pin on, pinoffset at 0 and 17)  204 and 546
>> --
>> Gromacs Users mailing list
>>
>> * Please search the archive at 
>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>>
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
>> mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] performance

2017-09-21 Thread Szilárd Páll
Hi,

A few remarks in no particular order:

1. Avoid domain-decomposition unless necessary (especially in
CPU-bound runs, and especially with PME), it has a non-negligible
overhead (greatest when going from no DD to using DD). Running
multi-threading only typically has better performance. There are
exceptions (e.g. your case of reaction-field runs could be such a
case, but I'm doubtful as the DD cost is significant). Hence, I
suggest trying 1, 2, 4... ranks per simulation, i.e.
mpirun -np 1 gmx mdrun -ntomp N (single-run)
mpirun -np 2 gmx mdrun -ntomp N/2 (single-run)
mpirun -np 4 gmx mdrun -ntomp N/4 (single-run)
[...]
The multi-run equivalents of the above would simply use M ranks where
M=Nmulti * Nranks_per_run.

2. If you're aiming for best throughput place two or more
_independent_ runs on the same GPU, e.g. assuming 4 GPUs + 40 cores
(and that no DD turns out to be best) to run 2 sim/GPU you can do:
mpirun -np 8 -multi 8 gmx mdrun [-ntomp 5] [-gpu_id 00112233]
The last two args can be omitted, but you should make sure that's what
you get, i.e. that sim #0/#1 use GPU #0, sim #2/#3 use GPU#1, etc.

3. 2a,b are clearly off, my hypothesis is still that they get pinned
to the wrong cores. I suspect 6a,b are just lucky and happen to not be
placed too badly. Plus 6 use 4 GPUs vs 7 only 2 GPUs, so that's not a
fair comparison (and probably explains the 350 vs 300 ns/day).

4. -pin on is faster than letting the scheduler place jobs (e.g. 3ab
vs 4b) which is in line with what I would expect.

5. The strange asymmetry in 8a vs 8b is due to 8b having failed to pin
and running where it should not be (empty socket -> core turbo-ing?).
The 4a / 4b mismatch is strange; are those using the very same system
(tpr?) -- one of them reports higher load imbalance!


Overall, I suggest starting over and determining performance first by
deciding: What DD setup is best and how to lay out jobs in a node to
get best throughput. Start with run configs testing settings with
-multi to avoid pinning headaches and fill at least half a node (or a
full node) with #concurrent simulations >= #GPUs.

Cheers,
--
Szilárd


On Mon, Sep 18, 2017 at 9:25 PM, gromacs query  wrote:
> Hi Szilárd,
>
> {I had to trim the message as my message is put on hold because only 50kb
> allowed and this message has reached 58 KB! Not due to files attached as
> they are shared via dropbox}; Sorry seamless reading might be compromised
> for future readers.
>
> Thanks for your replies. I have shared log files here:
>
> https://www.dropbox.com/s/m9mqqans0jci873/test_logs.zip?dl=0
>
> Two self-describing name folders have all the test logs. The test_*.log
> file serial numbers correspond to my simulations briefly described here
> [with folder names].
>
> For quick look one can: grep Performance *.log
>
> Folder 2gpu_4np:
> Sr. no.  Remarks  performance (ns/day)
> 1.  only one job  345 ns/day
> 2a,b.  two same jobs together (without pin on)  16.1 and 15.9
> 3a,b.  two same jobs together (without pin on, with -multidir)  270 and 276
> 4a,b.  two same jobs together (pin on, pinoffset at 0 and 5)  160 and 301
>
>
>
> Folder:4gpu_16np
>
>
>
>
> Remarks  performance (ns/day)
> 5.  only one job  694 ns/day
> 6a,b.  two same jobs together (without pin on)  340 and 350
> 7a,b.  two same jobs together (without pin on, with -multidir)  302 and 304
> 8a,b.  two same jobs together (pin on, pinoffset at 0 and 17)  204 and 546
> --
> Gromacs Users mailing list
>
> * Please search the archive at 
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
> mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] performance issue of GROMACS

2017-09-19 Thread Tomek Stępniewski
Hi Szilárd,
thank you for your response,
I attach the log file.

2017-09-19 15:19 GMT+02:00 Szilárd Páll :

> PS: A bit of extrapolation from my standard historical benchmark data
> shows that regular cut-off kernels should run at ~3.0 ms/step, so
> force shift will be ~3.5-4 ms/step (with nstlist=20 and 2 fs step);
> assuming 70% CPU-GPU overlap that's 5-5.5 ms/step which corresponds to
> ~35 ns/day (with 2 fs).
>
> That's just a rough estimate, though, and it assumes that you have
> enough CPU cores for a balanced run.
>
> --
> Szilárd
>
>
> On Tue, Sep 19, 2017 at 3:16 PM, Szilárd Páll 
> wrote:
> > On Tue, Sep 19, 2017 at 2:20 PM, Tomek Stępniewski
> >  wrote:
> >> Hi everybody,
> >> I am running gromacs 5.1.4 on a system that uses NVIDIA Tesla K40m,
> >> surprisingly I get a speed of only 15 ns a day when carrying out nvt
> >> simulations, my colleagues say that on a new GPU like this with my
> system
> >> size it should be around 60 ns a day,
> >> are there any apparent errors in my input files that might hhinder the
> >> simulation?
> >
> > 15 ns/day seems a bit low, but I can't say for sure if it's far too
> > low. Can you share logs?
> >
> >> input file:
> >> integrator  = md
> >> dt  = 0.002
> >> nsteps  = 1
> >> nstlog  = 1
> >> nstxout = 5
> >> nstvout = 5
> >> nstfout = 5
> >> nstcalcenergy   = 100
> >> nstenergy   = 1000
> >> ;
> >> cutoff-scheme   = Verlet
> >> nstlist = 20
> >> rlist   = 1.2
> >> coulombtype = pme
> >> rcoulomb= 1.2
> >> vdwtype = Cut-off
> >> vdw-modifier= Force-switch
> >> rvdw_switch = 1.0
> >> rvdw= 1.2
> >> ;
> >> tcoupl  = Nose-Hoover
> >> tc_grps = PROT   MEMB   SOL_ION
> >> tau_t   = 1.01.01.0
> >> ref_t   = 310 310 310
> >> ;
> >> constraints = h-bonds
> >> constraint_algorithm= LINCS
> >> continuation= yes
> >> ;
> >> nstcomm = 100
> >> comm_mode   = linear
> >> comm_grps   = PROT   MEMB   SOL_ION
> >> ;
> >> refcoord_scaling= com
> >>
> >> the system has around 70,000 atoms,
> >>
> >> can this issue depend on the CUDA drivers?:
> >
> > A bit, but not to a factor of 4.
> >
> >> CUDA compiler:  /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda
> compiler
> >> driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> >> Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0,
> V8.0.61
> >> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=
> >> compute_30,code=sm_30;-gencode;arch=compute_35,code=
> >> sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=
> >> compute_50,code=sm_50;-gencode;arch=compute_52,code=
> >> sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=
> >> compute_61,code=sm_61;-gencode;arch=compute_60,code=
> >> compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
> >> ;-march=core-avx2;-Wextra;-Wno-missing-field-
> initializers;-Wpointer-arith;-
> >> Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-
> >> fexcess-precision=fast;-Wno-array-bounds;
> >> CUDA driver:8.0
> >> CUDA runtime:   8.0
> >> GPU info:
> >> Number of GPUs detected: 1
> >> #0: NVIDIA Tesla K40m, compute cap.: 3.5, ECC: yes, stat: compatible
> >>
> >> NOTE: GROMACS was configured without NVML support hence it can not
> exploit
> >>   application clocks of the detected Tesla K40m GPU to improve
> >> performance.
> >>   Recompile with the NVML library (compatible with the driver used)
> or
> >> set application clocks manually.
> >>
> >>
> >> Using GPU 8x8 non-bonded kernels
> >>
> >> I will be extremely grateful for any help,
> >> best
> >>
> >> --
> >> Tomasz M Stepniewski
> >> Research Group on Biomedical Informatics (GRIB)
> >> Hospital del Mar Medical Research Institute (IMIM)
> >> --
> >> Gromacs Users mailing list
> >>
> >> * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
> >>
> >> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >>
> >> * For (un)subscribe requests visit
> >> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
>



-- 
Tomasz M Stepniewski
Research Group on 

Re: [gmx-users] performance issue of GROMACS

2017-09-19 Thread Szilárd Páll
PS: A bit of extrapolation from my standard historical benchmark data
shows that regular cut-off kernels should run at ~3.0 ms/step, so
force shift will be ~3.5-4 ms/step (with nstlist=20 and 2 fs step);
assuming 70% CPU-GPU overlap that's 5-5.5 ms/step which corresponds to
~35 ns/day (with 2 fs).
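
(For reference, the conversion used there is just ns/day = 86.4 * dt[fs] /
t_step[ms], i.e. 86.4 * 2 / 5 ≈ 35 and 86.4 * 2 / 5.5 ≈ 31.)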

That's just a rough estimate, though, and it assumes that you have
enough CPU cores for a balanced run.

--
Szilárd


On Tue, Sep 19, 2017 at 3:16 PM, Szilárd Páll  wrote:
> On Tue, Sep 19, 2017 at 2:20 PM, Tomek Stępniewski
>  wrote:
>> Hi everybody,
>> I am running gromacs 5.1.4 on a system that uses NVIDIA Tesla K40m,
>> surprisingly I get a speed of only 15 ns a day when carrying out nvt
>> simulations, my colleagues say that on a new GPU like this with my system
>> size it should be around 60 ns a day,
>> are there any apparent errors in my input files that might hhinder the
>> simulation?
>
> 15 ns/day seems a bit low, but I can't say for sure if it's far too
> low. Can you share logs?
>
>> input file:
>> integrator  = md
>> dt  = 0.002
>> nsteps  = 1
>> nstlog  = 1
>> nstxout = 5
>> nstvout = 5
>> nstfout = 5
>> nstcalcenergy   = 100
>> nstenergy   = 1000
>> ;
>> cutoff-scheme   = Verlet
>> nstlist = 20
>> rlist   = 1.2
>> coulombtype = pme
>> rcoulomb= 1.2
>> vdwtype = Cut-off
>> vdw-modifier= Force-switch
>> rvdw_switch = 1.0
>> rvdw= 1.2
>> ;
>> tcoupl  = Nose-Hoover
>> tc_grps = PROT   MEMB   SOL_ION
>> tau_t   = 1.01.01.0
>> ref_t   = 310 310 310
>> ;
>> constraints = h-bonds
>> constraint_algorithm= LINCS
>> continuation= yes
>> ;
>> nstcomm = 100
>> comm_mode   = linear
>> comm_grps   = PROT   MEMB   SOL_ION
>> ;
>> refcoord_scaling= com
>>
>> the system has around 70,000 atoms,
>>
>> can this issue depend on the CUDA drivers?:
>
> A bit, but not to a factor of 4.
>
>> CUDA compiler:  /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
>> driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
>> Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
>> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=
>> compute_30,code=sm_30;-gencode;arch=compute_35,code=
>> sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=
>> compute_50,code=sm_50;-gencode;arch=compute_52,code=
>> sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=
>> compute_61,code=sm_61;-gencode;arch=compute_60,code=
>> compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
>> ;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-
>> Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-
>> fexcess-precision=fast;-Wno-array-bounds;
>> CUDA driver:8.0
>> CUDA runtime:   8.0
>> GPU info:
>> Number of GPUs detected: 1
>> #0: NVIDIA Tesla K40m, compute cap.: 3.5, ECC: yes, stat: compatible
>>
>> NOTE: GROMACS was configured without NVML support hence it can not exploit
>>   application clocks of the detected Tesla K40m GPU to improve
>> performance.
>>   Recompile with the NVML library (compatible with the driver used) or
>> set application clocks manually.
>>
>>
>> Using GPU 8x8 non-bonded kernels
>>
>> I will be extremely grateful for any help,
>> best
>>
>> --
>> Tomasz M Stepniewski
>> Research Group on Biomedical Informatics (GRIB)
>> Hospital del Mar Medical Research Institute (IMIM)
>> --
>> Gromacs Users mailing list
>>
>> * Please search the archive at 
>> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>>
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
>> mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] performance issue of GROMACS

2017-09-19 Thread Szilárd Páll
On Tue, Sep 19, 2017 at 2:20 PM, Tomek Stępniewski
 wrote:
> Hi everybody,
> I am running gromacs 5.1.4 on a system that uses NVIDIA Tesla K40m,
> surprisingly I get a speed of only 15 ns a day when carrying out nvt
> simulations, my colleagues say that on a new GPU like this with my system
> size it should be around 60 ns a day,
> are there any apparent errors in my input files that might hhinder the
> simulation?

15 ns/day seems a bit low, but I can't say for sure if it's far too
low. Can you share logs?

> input file:
> integrator  = md
> dt  = 0.002
> nsteps  = 1
> nstlog  = 1
> nstxout = 5
> nstvout = 5
> nstfout = 5
> nstcalcenergy   = 100
> nstenergy   = 1000
> ;
> cutoff-scheme   = Verlet
> nstlist = 20
> rlist   = 1.2
> coulombtype = pme
> rcoulomb= 1.2
> vdwtype = Cut-off
> vdw-modifier= Force-switch
> rvdw_switch = 1.0
> rvdw= 1.2
> ;
> tcoupl  = Nose-Hoover
> tc_grps = PROT   MEMB   SOL_ION
> tau_t   = 1.01.01.0
> ref_t   = 310 310 310
> ;
> constraints = h-bonds
> constraint_algorithm= LINCS
> continuation= yes
> ;
> nstcomm = 100
> comm_mode   = linear
> comm_grps   = PROT   MEMB   SOL_ION
> ;
> refcoord_scaling= com
>
> the system has around 70,000 atoms,
>
> can this issue depend on the CUDA drivers?:

A bit, but not to a factor of 4.

> CUDA compiler:  /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
> driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
> CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=
> compute_30,code=sm_30;-gencode;arch=compute_35,code=
> sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=
> compute_50,code=sm_50;-gencode;arch=compute_52,code=
> sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=
> compute_61,code=sm_61;-gencode;arch=compute_60,code=
> compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
> ;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-
> Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-
> fexcess-precision=fast;-Wno-array-bounds;
> CUDA driver:8.0
> CUDA runtime:   8.0
> GPU info:
> Number of GPUs detected: 1
> #0: NVIDIA Tesla K40m, compute cap.: 3.5, ECC: yes, stat: compatible
>
> NOTE: GROMACS was configured without NVML support hence it can not exploit
>   application clocks of the detected Tesla K40m GPU to improve
> performance.
>   Recompile with the NVML library (compatible with the driver used) or
> set application clocks manually.
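
(Side note on that NVML message: application clocks can also be raised by
hand with nvidia-smi, e.g. something like

  nvidia-smi -q -d SUPPORTED_CLOCKS    # list the clocks the K40m supports
  nvidia-smi -ac <memclock>,<smclock>  # needs root/admin rights

but that is usually worth a few percent at most, so it will not explain a
4x gap.)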
>
>
> Using GPU 8x8 non-bonded kernels
>
> I will be extremely grateful for any help,
> best
>
> --
> Tomasz M Stepniewski
> Research Group on Biomedical Informatics (GRIB)
> Hospital del Mar Medical Research Institute (IMIM)
> --
> Gromacs Users mailing list
>
> * Please search the archive at 
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
> mail to gmx-users-requ...@gromacs.org.

[gmx-users] performance issue of GROMACS

2017-09-19 Thread Tomek Stępniewski
Hi everybody,
I am running GROMACS 5.1.4 on a system that uses an NVIDIA Tesla K40m.
Surprisingly, I get a speed of only 15 ns per day when carrying out NVT
simulations; my colleagues say that on a recent GPU like this, with my
system size, it should be around 60 ns per day.
Are there any apparent errors in my input files that might hinder the
simulation?
input file:
integrator  = md
dt  = 0.002
nsteps  = 1
nstlog  = 1
nstxout = 5
nstvout = 5
nstfout = 5
nstcalcenergy   = 100
nstenergy   = 1000
;
cutoff-scheme   = Verlet
nstlist = 20
rlist   = 1.2
coulombtype = pme
rcoulomb= 1.2
vdwtype = Cut-off
vdw-modifier= Force-switch
rvdw_switch = 1.0
rvdw= 1.2
;
tcoupl  = Nose-Hoover
tc_grps = PROT   MEMB   SOL_ION
tau_t   = 1.01.01.0
ref_t   = 310 310 310
;
constraints = h-bonds
constraint_algorithm= LINCS
continuation= yes
;
nstcomm = 100
comm_mode   = linear
comm_grps   = PROT   MEMB   SOL_ION
;
refcoord_scaling= com

the system has around 70,000 atoms,

can this issue depend on the CUDA drivers?:
CUDA compiler:  /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=
compute_30,code=sm_30;-gencode;arch=compute_35,code=
sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=
compute_50,code=sm_50;-gencode;arch=compute_52,code=
sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=
compute_61,code=sm_61;-gencode;arch=compute_60,code=
compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-
Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-
fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:8.0
CUDA runtime:   8.0
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla K40m, compute cap.: 3.5, ECC: yes, stat: compatible

NOTE: GROMACS was configured without NVML support hence it can not exploit
  application clocks of the detected Tesla K40m GPU to improve
performance.
  Recompile with the NVML library (compatible with the driver used) or
set application clocks manually.


Using GPU 8x8 non-bonded kernels

I will be extremely grateful for any help,
best

-- 
Tomasz M Stepniewski
Research Group on Biomedical Informatics (GRIB)
Hospital del Mar Medical Research Institute (IMIM)


[gmx-users] Performance of GROMACS 5.1.4

2017-09-19 Thread Tomek Stępniewski
Hi everybody,
I am running GROMACS 5.1.4 on a system that uses an NVIDIA Tesla K40m.
Surprisingly, I get a speed of only 15 ns per day when carrying out NVT
simulations; my colleagues say that on a recent GPU like this, with my
system size, it should be around 60 ns per day.
Are there any apparent errors in my input files that might hinder the
simulation?
input file:
integrator  = md
dt  = 0.002
nsteps  = 1
nstlog  = 1
nstxout = 5
nstvout = 5
nstfout = 5
nstcalcenergy   = 100
nstenergy   = 1000
;
cutoff-scheme   = Verlet
nstlist = 20
rlist   = 1.2
coulombtype = pme
rcoulomb= 1.2
vdwtype = Cut-off
vdw-modifier= Force-switch
rvdw_switch = 1.0
rvdw= 1.2
;
tcoupl  = Nose-Hoover
tc_grps = PROT   MEMB   SOL_ION
tau_t   = 1.01.01.0
ref_t   = 310 310 310
;
constraints = h-bonds
constraint_algorithm= LINCS
continuation= yes
;
nstcomm = 100
comm_mode   = linear
comm_grps   = PROT   MEMB   SOL_ION
;
refcoord_scaling= com

the system has around 70,000 atoms,

can this issue depend on the CUDA drivers?:
CUDA compiler:  /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler
driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
Tue_Jan_10_13:22:03_CST_2017;Cuda compilation tools, release 8.0, V8.0.61
CUDA compiler
flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;
;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
CUDA driver:8.0
CUDA runtime:   8.0
GPU info:
Number of GPUs detected: 1
#0: NVIDIA Tesla K40m, compute cap.: 3.5, ECC: yes, stat: compatible

NOTE: GROMACS was configured without NVML support hence it can not exploit
  application clocks of the detected Tesla K40m GPU to improve
performance.
  Recompile with the NVML library (compatible with the driver used) or
set application clocks manually.


Using GPU 8x8 non-bonded kernels

I will be extremely grateful for any help,
best
T

-- 
Tomasz M Stepniewski
Research Group on Biomedical Informatics (GRIB)
Hospital del Mar Medical Research Institute (IMIM)


Re: [gmx-users] performance

2017-09-18 Thread gromacs query
Hi Szilárd,

{I had to trim the message, as it was put on hold because only 50 KB is
allowed and this message had reached 58 KB! This is not due to attached
files, as they are shared via Dropbox.} Sorry, seamless reading might be
compromised for future readers.

Thanks for your replies. I have shared log files here:

https://www.dropbox.com/s/m9mqqans0jci873/test_logs.zip?dl=0

Two folders with self-describing names contain all the test logs. The
test_*.log file serial numbers correspond to my simulations, briefly
described below [with folder names].

For a quick look one can: grep Performance *.log

Folder 2gpu_4np:
Sr. no.  Remarks                                                   performance (ns/day)
1.       only one job                                              345
2a,b.    two same jobs together (without pin on)                   16.1 and 15.9
3a,b.    two same jobs together (without pin on, with -multidir)   270 and 276
4a,b.    two same jobs together (pin on, pinoffset at 0 and 5)     160 and 301

Folder 4gpu_16np:
Sr. no.  Remarks                                                   performance (ns/day)
5.       only one job                                              694
6a,b.    two same jobs together (without pin on)                   340 and 350
7a,b.    two same jobs together (without pin on, with -multidir)   302 and 304
8a,b.    two same jobs together (pin on, pinoffset at 0 and 17)    204 and 546

Re: [gmx-users] performance

2017-09-18 Thread Szilárd Páll
On Fri, Sep 15, 2017 at 1:06 AM, gromacs query  wrote:
> Hi Szilárd,
>
> Sorry this discussion is going long.
> Finally I got one node empty and did some serious tests specially
> considering your first point (discrepancies in benchmarking comparing jobs
> running on empty node vs occupied node). I tested in both ways.
>
> I ran following cases (single job vs two jobs for 2GPU+4 procs and also for
> 4GPU+16 procs). Happy to send log files.

Please do share them, it's hard to assess what's going on without those.

> Pinoffset results are surprising (4th and 8th test case below) though I get
> in log file a WARNING: Requested offset too large for available cores for
> the case 8; [should not be an issue as the first job binds the cores]

That means the offsets are not set correctly.
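
As a rule of thumb the offset of a job should equal the number of hardware
threads already claimed by the jobs launched before it; e.g. for two
16-thread jobs side by side something like this (just a sketch, thread
counts and offsets are examples only):

  gmx mdrun -ntmpi 8 -ntomp 2 -pin on -pinoffset 0  ... &
  gmx mdrun -ntmpi 8 -ntomp 2 -pin on -pinoffset 16 ... &

(possibly together with an explicit -pinstride, plus the usual per-job
-gpu_id mapping).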

> As suggested defining affinity should help with pinoffset set 'manually'
> (in practice with script) but these results are quite variable. Am bit lost
> now, what should be the best practice in case nodes are shared among
> different users and multidir can be tricky in such case (if other gromacs
> users are not using multidir option!).

I suggest fixing the above issue first. I don't fully understand what
the below descriptions mean, please be more specific about the details
or share logs.

>
> Sr. no. each job 2GPU; 4 procs performance (ns/day)
> 1 only one job 345
> 2 two same jobs together (without pin on) 16.1 and 15.9
> 3 two same jobs together (without pin on, with -multidir) 178 and 191
> 4 two same jobs together (pin on, pinoffset at 0 and 5) 160 and 301
> each job 4GPU; 16 procs performance (ns/day)
> 5 only one job 694
> 6 two same jobs together (without pin on) 340 and 350
> 7 two same jobs together (without pin on, with -multidir) 346 and 344
> 8 two same jobs together (pin on, pinoffset at 0 and 17) 204 and 546
>
>
> On Thu, Sep 14, 2017 at 12:02 PM, gromacs query 
> wrote:
>
>> Hi Szilárd,
>>
>> Here are my replies:
>>
>> >> Did you run the "fast" single job on an otherwise empty node? That
>> might explain it as, when most of the CPU cores are left empty, modern CPUs
>> increase clocks (tubo boost) on the used cores higher than they could with
>> all cores busy.
>>
>> Yes the "fast" single job was on empty node. Sorry I don't get it when you
>> say 'modern CPUs increase clocks', you mean the ns/day I get is pseudo in
>> that case?
>>
>> >> and if you post an actual log I can certainly give more informed
>> comments
>>
>> Sure, if its ok can I post it off-mailing list to you?
>>
>> >> However, note that if you are sharing a node with others, if their jobs
>> are not correctly affinitized, those processes will affect the performance
>> of your job.
>>
>> Yes exactly. In this case I would need to manually set pinoffset but this
>> can be but frustrating if other Gromacs users are not binding :)
>> Would it be possible to fix this in the default algorithm, though am
>> unaware of other issues it might cause? Also mutidir is not convenient
>> sometimes when job crashes in the middle and automatic restart from cpt
>> file would be difficult.
>>
>> -J
>>
>>
>> On Thu, Sep 14, 2017 at 11:26 AM, Szilárd Páll 
>> wrote:
>>
>>> On Wed, Sep 13, 2017 at 11:14 PM, gromacs query 
>>> wrote:
>>> > Hi Szilárd,
>>> >
>>> > Thanks again. I tried now with -multidir like this:
>>> >
>>> > mpirun -np 16 gmx_mpi mdrun -s test -ntomp 2 -maxh 0.1 -multidir t1 t2
>>> t3 t4
>>> >
>>> > So this runs 4 jobs on same node so for each job np is = 16/4, and each
>>> job
>>> > using 2 GPU. I get now quite improved performance and equal performance
>>> for
>>> > each job (~ 220 ns) though still slightly less than single independent
>>> job
>>> > (where I get 300 ns). I can live with that but -
>>>
>>> That is not normal and it is more likely to be a benchmarking
>>> discrepancy: you are likely not comparing apples to apples. Did you
>>> run the "fast" single job on an otherwise empty node? That might
>>> explain it as, when most of the CPU cores are left empty, modern CPUs
>>> increase clocks (tubo boost) on the used cores higher than they could
>>> with all cores busy.
>>>
>>> > Surprised: There are maximum 40 cores and 8 GPUs per node and thus my 4
>>> > jobs should consume 8 GPUS.
>>>
>>> Note that even if those are 40 real cores (rather than 20 core with
>>> HyperThreading), the current GROMACS release will be unlikely to run
>>> efficiently with at least 6-8 cores per GPU. This will likely change
>>> with the next release.
>>>
>>> > So I am bit surprised with the fact the same
>>> > node on which my four jobs were running was already occupied with jobs
>>> by
>>> > some other user, which I think should not happen (may be slurm.config
>>> admin
>>> > issue?). Either my some jobs should have gone in queue or run on other
>>> node
>>> > if free.
>>>
>>> Sounds like a job scheduler issue (you can always check in the log the
>>> detected hardware) -- 

Re: [gmx-users] performance

2017-09-18 Thread Szilárd Páll
On Thu, Sep 14, 2017 at 1:02 PM, gromacs query  wrote:
> Hi Szilárd,
>
> Here are my replies:
>
>>> Did you run the "fast" single job on an otherwise empty node? That might
> explain it as, when most of the CPU cores are left empty, modern CPUs
> increase clocks (tubo boost) on the used cores higher than they could with
> all cores busy.
>
> Yes the "fast" single job was on empty node. Sorry I don't get it when you
> say 'modern CPUs increase clocks', you mean the ns/day I get is pseudo in
> that case?

It's called DVFS or Turbo Boost on Intel. Here are some pointers:
https://en.wikipedia.org/wiki/Dynamic_frequency_scaling
https://en.wikipedia.org/wiki/Intel_Turbo_Boost
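
(If you want to observe this on a node, the per-core clocks can be watched
while a run is active with something like

  watch -n 1 'grep "cpu MHz" /proc/cpuinfo'

— exact field names vary by kernel/distro, but the difference between
single-core and all-core turbo is usually easy to see.)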

>>> and if you post an actual log I can certainly give more informed comments
>
> Sure, if its ok can I post it off-mailing list to you?

Please use an online file sharing service of your liking so everyone
has access to the information referred to here.

>>> However, note that if you are sharing a node with others, if their jobs
> are not correctly affinitized, those processes will affect the performance
> of your job.
>
> Yes exactly. In this case I would need to manually set pinoffset but this
> can be but frustrating if other Gromacs users are not binding :)
> Would it be possible to fix this in the default algorithm, though am
> unaware of other issues it might cause?

No, there is no issue on the GROMACS side to fix. This is an issue
that the job scheduler / you as a user need to deal with to avoid the
pitfalls and performance cliffs inherent to node sharing.

> Also mutidir is not convenient
> sometimes when job crashes in the middle and automatic restart from cpt
> file would be difficult.

Let me answer that separately to emphasize a few technical issues.

Cheers,
--
Szilárd

> -J
>
>
> On Thu, Sep 14, 2017 at 11:26 AM, Szilárd Páll 
> wrote:
>
>> On Wed, Sep 13, 2017 at 11:14 PM, gromacs query 
>> wrote:
>> > Hi Szilárd,
>> >
>> > Thanks again. I tried now with -multidir like this:
>> >
>> > mpirun -np 16 gmx_mpi mdrun -s test -ntomp 2 -maxh 0.1 -multidir t1 t2
>> t3 t4
>> >
>> > So this runs 4 jobs on same node so for each job np is = 16/4, and each
>> job
>> > using 2 GPU. I get now quite improved performance and equal performance
>> for
>> > each job (~ 220 ns) though still slightly less than single independent
>> job
>> > (where I get 300 ns). I can live with that but -
>>
>> That is not normal and it is more likely to be a benchmarking
>> discrepancy: you are likely not comparing apples to apples. Did you
>> run the "fast" single job on an otherwise empty node? That might
>> explain it as, when most of the CPU cores are left empty, modern CPUs
>> increase clocks (tubo boost) on the used cores higher than they could
>> with all cores busy.
>>
>> > Surprised: There are maximum 40 cores and 8 GPUs per node and thus my 4
>> > jobs should consume 8 GPUS.
>>
>> Note that even if those are 40 real cores (rather than 20 core with
>> HyperThreading), the current GROMACS release will be unlikely to run
>> efficiently with at least 6-8 cores per GPU. This will likely change
>> with the next release.
>>
>> > So I am bit surprised with the fact the same
>> > node on which my four jobs were running was already occupied with jobs by
>> > some other user, which I think should not happen (may be slurm.config
>> admin
>> > issue?). Either my some jobs should have gone in queue or run on other
>> node
>> > if free.
>>
>> Sounds like a job scheduler issue (you can always check in the log the
>> detected hardware) -- and if you post an actual log I can certainly
>> give more informed comments.
>>
>> > What to do: Importantly though as an individual user I can submit
>> -multidir
>> > job but lets say, which is normally the case, there will be many other
>> > unknown users who submit one or two jobs in that case performance will be
>> > an issue (which is equivalent to my case when I submit many jobs without
>> > -multi/multidir).
>>
>> Not sure I follow: if you always have a number of similar runs to do,
>> submit them together and benefit from not having to manual hardware
>> assignment. Otherwise, if your cluster relies on node sharing, you
>> will have to make sure that you specify correctly the affinity/binding
>> arguments to your job scheduler (or work around it with manual offset
>> calculation). However, note that if you are sharing a node with
>> others, if their jobs are not correctly affinitized, those processes
>> will affect the performance of your job.
>>
>> > I think still they will need -pinoffset. Could you
>> > please suggest what best can be done in such case?
>>
>> See above.
>>
>> Cheers,
>> --
>> Szilárd
>>
>> >
>> > -Jiom
>> >
>> >
>> >
>> >
>> > On Wed, Sep 13, 2017 at 9:15 PM, Szilárd Páll 
>> > wrote:
>> >
>> >> Hi,
>> >>
>> >> First off, have you considered options 2) using multi-sim? That would
>> >> allow you to not have to 

Re: [gmx-users] performance

2017-09-14 Thread gromacs query
Hi Szilárd,

Sorry this discussion is getting long.
Finally I got one node empty and did some serious tests, especially
considering your first point (discrepancies in benchmarking when comparing
jobs running on an empty node vs an occupied node). I tested both ways.

I ran the following cases (single job vs two jobs, for 2 GPUs + 4 procs and
also for 4 GPUs + 16 procs). Happy to send log files.

The pinoffset results are surprising (4th and 8th test cases below), though
for case 8 I get a WARNING in the log file: "Requested offset too large for
available cores" [this should not be an issue, as the first job binds the cores].

As suggested, defining affinity should help, with the pinoffset set 'manually'
(in practice with a script), but these results are quite variable. I am a bit
lost now as to what the best practice should be when nodes are shared among
different users; multidir can be tricky in such a case (if other GROMACS
users are not using the multidir option!).


Sr. no.  each job: 2 GPUs, 4 procs                                performance (ns/day)
  1      only one job                                             345
  2      two same jobs together (without pin on)                  16.1 and 15.9
  3      two same jobs together (without pin on, with -multidir)  178 and 191
  4      two same jobs together (pin on, pinoffset at 0 and 5)    160 and 301

Sr. no.  each job: 4 GPUs, 16 procs                               performance (ns/day)
  5      only one job                                             694
  6      two same jobs together (without pin on)                  340 and 350
  7      two same jobs together (without pin on, with -multidir)  346 and 344
  8      two same jobs together (pin on, pinoffset at 0 and 17)   204 and 546


On Thu, Sep 14, 2017 at 12:02 PM, gromacs query 
wrote:

> Hi Szilárd,
>
> Here are my replies:
>
> >> Did you run the "fast" single job on an otherwise empty node? That
> might explain it as, when most of the CPU cores are left empty, modern CPUs
> increase clocks (tubo boost) on the used cores higher than they could with
> all cores busy.
>
> Yes the "fast" single job was on empty node. Sorry I don't get it when you
> say 'modern CPUs increase clocks', you mean the ns/day I get is pseudo in
> that case?
>
> >> and if you post an actual log I can certainly give more informed
> comments
>
> Sure, if its ok can I post it off-mailing list to you?
>
> >> However, note that if you are sharing a node with others, if their jobs
> are not correctly affinitized, those processes will affect the performance
> of your job.
>
> Yes exactly. In this case I would need to manually set pinoffset but this
> can be but frustrating if other Gromacs users are not binding :)
> Would it be possible to fix this in the default algorithm, though am
> unaware of other issues it might cause? Also mutidir is not convenient
> sometimes when job crashes in the middle and automatic restart from cpt
> file would be difficult.
>
> -J
>
>
> On Thu, Sep 14, 2017 at 11:26 AM, Szilárd Páll 
> wrote:
>
>> On Wed, Sep 13, 2017 at 11:14 PM, gromacs query 
>> wrote:
>> > Hi Szilárd,
>> >
>> > Thanks again. I tried now with -multidir like this:
>> >
>> > mpirun -np 16 gmx_mpi mdrun -s test -ntomp 2 -maxh 0.1 -multidir t1 t2
>> t3 t4
>> >
>> > So this runs 4 jobs on same node so for each job np is = 16/4, and each
>> job
>> > using 2 GPU. I get now quite improved performance and equal performance
>> for
>> > each job (~ 220 ns) though still slightly less than single independent
>> job
>> > (where I get 300 ns). I can live with that but -
>>
>> That is not normal and it is more likely to be a benchmarking
>> discrepancy: you are likely not comparing apples to apples. Did you
>> run the "fast" single job on an otherwise empty node? That might
>> explain it as, when most of the CPU cores are left empty, modern CPUs
>> increase clocks (tubo boost) on the used cores higher than they could
>> with all cores busy.
>>
>> > Surprised: There are maximum 40 cores and 8 GPUs per node and thus my 4
>> > jobs should consume 8 GPUS.
>>
>> Note that even if those are 40 real cores (rather than 20 core with
>> HyperThreading), the current GROMACS release will be unlikely to run
>> efficiently with at least 6-8 cores per GPU. This will likely change
>> with the next release.
>>
>> > So I am bit surprised with the fact the same
>> > node on which my four jobs were running was already occupied with jobs
>> by
>> > some other user, which I think should not happen (may be slurm.config
>> admin
>> > issue?). Either my some jobs should have gone in queue or run on other
>> node
>> > if free.
>>
>> Sounds like a job scheduler issue (you can always check in the log the
>> detected hardware) -- and if you post an actual log I can certainly
>> give more informed comments.
>>
>> > What to do: Importantly though as an individual user I can submit
>> -multidir
>> > job but lets say, which is normally the case, there will be many other
>> > unknown users who submit one or two jobs in that case performance will
>> be
>> > an issue (which is equivalent to my case when I submit many jobs without
>> > -multi/multidir).
>>
>> Not sure I follow: if you always have a number of similar runs 

Re: [gmx-users] performance

2017-09-14 Thread gromacs query
Hi Szilárd,

Here are my replies:

>> Did you run the "fast" single job on an otherwise empty node? That might
explain it as, when most of the CPU cores are left empty, modern CPUs
increase clocks (tubo boost) on the used cores higher than they could with
all cores busy.

Yes the "fast" single job was on empty node. Sorry I don't get it when you
say 'modern CPUs increase clocks', you mean the ns/day I get is pseudo in
that case?

>> and if you post an actual log I can certainly give more informed comments

Sure, if it's OK, can I post it to you off the mailing list?

>> However, note that if you are sharing a node with others, if their jobs
are not correctly affinitized, those processes will affect the performance
of your job.

Yes, exactly. In this case I would need to manually set a pinoffset, but this
can be a bit frustrating if other GROMACS users are not binding :)
Would it be possible to fix this in the default algorithm, though I am
unaware of other issues it might cause? Also, multidir is sometimes not
convenient when a job crashes in the middle and automatic restart from the
cpt file would be difficult.

-J


On Thu, Sep 14, 2017 at 11:26 AM, Szilárd Páll 
wrote:

> On Wed, Sep 13, 2017 at 11:14 PM, gromacs query 
> wrote:
> > Hi Szilárd,
> >
> > Thanks again. I tried now with -multidir like this:
> >
> > mpirun -np 16 gmx_mpi mdrun -s test -ntomp 2 -maxh 0.1 -multidir t1 t2
> t3 t4
> >
> > So this runs 4 jobs on same node so for each job np is = 16/4, and each
> job
> > using 2 GPU. I get now quite improved performance and equal performance
> for
> > each job (~ 220 ns) though still slightly less than single independent
> job
> > (where I get 300 ns). I can live with that but -
>
> That is not normal and it is more likely to be a benchmarking
> discrepancy: you are likely not comparing apples to apples. Did you
> run the "fast" single job on an otherwise empty node? That might
> explain it as, when most of the CPU cores are left empty, modern CPUs
> increase clocks (tubo boost) on the used cores higher than they could
> with all cores busy.
>
> > Surprised: There are maximum 40 cores and 8 GPUs per node and thus my 4
> > jobs should consume 8 GPUS.
>
> Note that even if those are 40 real cores (rather than 20 core with
> HyperThreading), the current GROMACS release will be unlikely to run
> efficiently with at least 6-8 cores per GPU. This will likely change
> with the next release.
>
> > So I am bit surprised with the fact the same
> > node on which my four jobs were running was already occupied with jobs by
> > some other user, which I think should not happen (may be slurm.config
> admin
> > issue?). Either my some jobs should have gone in queue or run on other
> node
> > if free.
>
> Sounds like a job scheduler issue (you can always check in the log the
> detected hardware) -- and if you post an actual log I can certainly
> give more informed comments.
>
> > What to do: Importantly though as an individual user I can submit
> -multidir
> > job but lets say, which is normally the case, there will be many other
> > unknown users who submit one or two jobs in that case performance will be
> > an issue (which is equivalent to my case when I submit many jobs without
> > -multi/multidir).
>
> Not sure I follow: if you always have a number of similar runs to do,
> submit them together and benefit from not having to manual hardware
> assignment. Otherwise, if your cluster relies on node sharing, you
> will have to make sure that you specify correctly the affinity/binding
> arguments to your job scheduler (or work around it with manual offset
> calculation). However, note that if you are sharing a node with
> others, if their jobs are not correctly affinitized, those processes
> will affect the performance of your job.
>
> > I think still they will need -pinoffset. Could you
> > please suggest what best can be done in such case?
>
> See above.
>
> Cheers,
> --
> Szilárd
>
> >
> > -Jiom
> >
> >
> >
> >
> > On Wed, Sep 13, 2017 at 9:15 PM, Szilárd Páll 
> > wrote:
> >
> >> Hi,
> >>
> >> First off, have you considered options 2) using multi-sim? That would
> >> allow you to not have to bother manually set offsets. Can you not
> >> submit your jobs such that you fill at least a node?
> >>
> >> How many threads/cores does you node have? Can you share log files?
> >>
> >> Cheers,
> >> --
> >> Szilárd
> >>
> >>
> >> On Wed, Sep 13, 2017 at 9:14 PM, gromacs query 
> >> wrote:
> >> > Hi Szilárd,
> >> >
> >> > Sorry I was bit quick to say its working with pinoffset. I just
> submitted
> >> > four same jobs (2 gpus, 4 nprocs) on the same node with -pin on and
> >> > different -pinoffset to 0, 5, 10, 15 (numbers should be fine as there
> are
> >> > 40 cores on node). Still I don't get same performance (all variably
> less
> >> > than 50%) as expected from a single independent job. Now am wondering
> if
> >> > its still related to overlap of 

Re: [gmx-users] performance

2017-09-14 Thread Szilárd Páll
On Wed, Sep 13, 2017 at 11:14 PM, gromacs query  wrote:
> Hi Szilárd,
>
> Thanks again. I tried now with -multidir like this:
>
> mpirun -np 16 gmx_mpi mdrun -s test -ntomp 2 -maxh 0.1 -multidir t1 t2 t3 t4
>
> So this runs 4 jobs on same node so for each job np is = 16/4, and each job
> using 2 GPU. I get now quite improved performance and equal performance for
> each job (~ 220 ns) though still slightly less than single independent job
> (where I get 300 ns). I can live with that but -

That is not normal and it is more likely to be a benchmarking
discrepancy: you are likely not comparing apples to apples. Did you
run the "fast" single job on an otherwise empty node? That might
explain it: when most of the CPU cores are left empty, modern CPUs
increase clocks (turbo boost) on the used cores higher than they could
with all cores busy.

> Surprised: There are maximum 40 cores and 8 GPUs per node and thus my 4
> jobs should consume 8 GPUS.

Note that even if those are 40 real cores (rather than 20 cores with
HyperThreading), the current GROMACS release is unlikely to run
efficiently unless it has at least 6-8 cores per GPU. This will likely
change with the next release.

> So I am bit surprised with the fact the same
> node on which my four jobs were running was already occupied with jobs by
> some other user, which I think should not happen (may be slurm.config admin
> issue?). Either my some jobs should have gone in queue or run on other node
> if free.

Sounds like a job scheduler issue (you can always check the detected
hardware in the log) -- and if you post an actual log I can certainly
give more informed comments.

> What to do: Importantly though as an individual user I can submit -multidir
> job but lets say, which is normally the case, there will be many other
> unknown users who submit one or two jobs in that case performance will be
> an issue (which is equivalent to my case when I submit many jobs without
> -multi/multidir).

Not sure I follow: if you always have a number of similar runs to do,
submit them together and benefit from not having to do manual hardware
assignment. Otherwise, if your cluster relies on node sharing, you
will have to make sure that you specify the affinity/binding
arguments to your job scheduler correctly (or work around it with manual
offset calculation). However, note that if you are sharing a node with
others and their jobs are not correctly affinitized, those processes
will affect the performance of your job.

> I think still they will need -pinoffset. Could you
> please suggest what best can be done in such case?

See above.

Cheers,
--
Szilárd

>
> -Jiom
>
>
>
>
> On Wed, Sep 13, 2017 at 9:15 PM, Szilárd Páll 
> wrote:
>
>> Hi,
>>
>> First off, have you considered options 2) using multi-sim? That would
>> allow you to not have to bother manually set offsets. Can you not
>> submit your jobs such that you fill at least a node?
>>
>> How many threads/cores does you node have? Can you share log files?
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Wed, Sep 13, 2017 at 9:14 PM, gromacs query 
>> wrote:
>> > Hi Szilárd,
>> >
>> > Sorry I was bit quick to say its working with pinoffset. I just submitted
>> > four same jobs (2 gpus, 4 nprocs) on the same node with -pin on and
>> > different -pinoffset to 0, 5, 10, 15 (numbers should be fine as there are
>> > 40 cores on node). Still I don't get same performance (all variably less
>> > than 50%) as expected from a single independent job. Now am wondering if
>> > its still related to overlap of cores as pin on should lock the cores for
>> > the same job.
>> >
>> > -J
>> >
>> > On Wed, Sep 13, 2017 at 7:33 PM, gromacs query 
>> > wrote:
>> >
>> >> Hi Szilárd,
>> >>
>> >> Thanks, option 3 was in my mind but I need to figure out now how :)
>> >> Manually fixing pinoffset as of now seems working with some quick test.
>> >> I think option 1 would require to ask the admin but I can try option 3
>> >> myself. As there are other users from different places who may not
>> bother
>> >> using option 3. I think I would need to ask the admin to force option 1
>> but
>> >> before that I will try option 3.
>> >>
>> >> JIom
>> >>
>> >> On Wed, Sep 13, 2017 at 7:10 PM, Szilárd Páll 
>> >> wrote:
>> >>
>> >>> J,
>> >>>
>> >>> You have a few options:
>> >>>
>> >>> * Use SLURM to assign not only the set of GPUs, but also the correct
>> >>> set of CPU cores to each mdrun process. If you do so, mdrun will
>> >>> respect the affinity mask it will inherit and your two mdrun jobs
>> >>> should be running on the right set of cores. This has the drawback
>> >>> that (AFAIK) SLURM/aprun (or srun) will not allow you to bind each
>> >>> application thread to a core/hardware thread (which is what mdrun
>> >>> does), only a process to a group of cores/hw threads which can
>> >>> sometimes lead to performance loss. (You might be able to 

Re: [gmx-users] performance

2017-09-13 Thread gromacs query
Hi Szilárd,

Thanks again. I tried now with -multidir like this:

mpirun -np 16 gmx_mpi mdrun -s test -ntomp 2 -maxh 0.1 -multidir t1 t2 t3 t4

So this runs 4 jobs on the same node, so each job gets np = 16/4 = 4 ranks
and uses 2 GPUs. I now get quite improved and equal performance for each
job (~220 ns/day), though still slightly less than a single independent job
(where I get 300 ns/day). I can live with that but -

Surprised: there are at most 40 cores and 8 GPUs per node, and thus my 4
jobs should consume all 8 GPUs. So I am a bit surprised that the same
node on which my four jobs were running was already occupied with jobs by
some other user, which I think should not happen (maybe a slurm.conf admin
issue?). Either some of my jobs should have gone into the queue or run on
another node if one was free.

What to do: importantly, though, as an individual user I can submit a
-multidir job, but let's say (as is normally the case) there are many other
unknown users who submit one or two jobs; in that case performance will be
an issue (which is equivalent to my case when I submit many jobs without
-multi/-multidir). I think they will still need -pinoffset. Could you
please suggest what best can be done in such a case?


-Jiom




On Wed, Sep 13, 2017 at 9:15 PM, Szilárd Páll 
wrote:

> Hi,
>
> First off, have you considered options 2) using multi-sim? That would
> allow you to not have to bother manually set offsets. Can you not
> submit your jobs such that you fill at least a node?
>
> How many threads/cores does you node have? Can you share log files?
>
> Cheers,
> --
> Szilárd
>
>
> On Wed, Sep 13, 2017 at 9:14 PM, gromacs query 
> wrote:
> > Hi Szilárd,
> >
> > Sorry I was bit quick to say its working with pinoffset. I just submitted
> > four same jobs (2 gpus, 4 nprocs) on the same node with -pin on and
> > different -pinoffset to 0, 5, 10, 15 (numbers should be fine as there are
> > 40 cores on node). Still I don't get same performance (all variably less
> > than 50%) as expected from a single independent job. Now am wondering if
> > its still related to overlap of cores as pin on should lock the cores for
> > the same job.
> >
> > -J
> >
> > On Wed, Sep 13, 2017 at 7:33 PM, gromacs query 
> > wrote:
> >
> >> Hi Szilárd,
> >>
> >> Thanks, option 3 was in my mind but I need to figure out now how :)
> >> Manually fixing pinoffset as of now seems working with some quick test.
> >> I think option 1 would require to ask the admin but I can try option 3
> >> myself. As there are other users from different places who may not
> bother
> >> using option 3. I think I would need to ask the admin to force option 1
> but
> >> before that I will try option 3.
> >>
> >> JIom
> >>
> >> On Wed, Sep 13, 2017 at 7:10 PM, Szilárd Páll 
> >> wrote:
> >>
> >>> J,
> >>>
> >>> You have a few options:
> >>>
> >>> * Use SLURM to assign not only the set of GPUs, but also the correct
> >>> set of CPU cores to each mdrun process. If you do so, mdrun will
> >>> respect the affinity mask it will inherit and your two mdrun jobs
> >>> should be running on the right set of cores. This has the drawback
> >>> that (AFAIK) SLURM/aprun (or srun) will not allow you to bind each
> >>> application thread to a core/hardware thread (which is what mdrun
> >>> does), only a process to a group of cores/hw threads which can
> >>> sometimes lead to performance loss. (You might be able to compensate
> >>> using some OpenMP library environment variables, though.)
> >>>
> >>> * Run multiple jobs with mdrun "-multi"/"-multidir"  (either two per
> >>> node or mulitple across nodes) and benefit from the rank/thread to
> >>> core/hw thread assignment that's supported also across multiple
> >>> simulations part of a multi-run; e.g.:
> >>> mpirun -np 4 gmx mdrun -multi 4 -ntomp N -multidir
> my_input_dir{1,2,3,4}
> >>> will launch 4 ranks and start 4 simulations in each of the four
> >>> directories passed.
> >>>
> >>> * Write a wrapper script around gmx mdrun which will be what you
> >>> launch with SLURM; you can then inspect the node and decide what
> >>> pinoffset value to pass to your mdrun launch command.
> >>>
> >>>
> >>> I hope one of these will deliver the desired results :)
> >>>
> >>> Cheers,
> >>> --
> >>> Szilárd
> >>>
> >>>
> >>> On Wed, Sep 13, 2017 at 7:47 PM, gromacs query  >
> >>> wrote:
> >>> > Hi Szilárd,
> >>> >
> >>> > Thanks for your reply. This is useful but now am thinking because the
> >>> slurm
> >>> > launches job in an automated way it is not really in my control to
> >>> choose
> >>> > the node. So following things can happen; say for two mdrun jobs I
> set
> >>> > -pinoffset 0 and -pinoffset 4;
> >>> >
> >>> > - if they are running on the same node this is good
> >>> > - if jobs run on different nodes (partially occupied or free) whether
> >>> these
> >>> > chosen pinoffsets will make sense or not as I don't know what
> 

Re: [gmx-users] performance

2017-09-13 Thread Szilárd Páll
Hi,

First off, have you considered option 2), using multi-sim? That would
allow you to not have to bother with manually setting offsets. Can you not
submit your jobs such that you fill at least a node?

How many threads/cores does your node have? Can you share log files?

Cheers,
--
Szilárd


On Wed, Sep 13, 2017 at 9:14 PM, gromacs query  wrote:
> Hi Szilárd,
>
> Sorry I was bit quick to say its working with pinoffset. I just submitted
> four same jobs (2 gpus, 4 nprocs) on the same node with -pin on and
> different -pinoffset to 0, 5, 10, 15 (numbers should be fine as there are
> 40 cores on node). Still I don't get same performance (all variably less
> than 50%) as expected from a single independent job. Now am wondering if
> its still related to overlap of cores as pin on should lock the cores for
> the same job.
>
> -J
>
> On Wed, Sep 13, 2017 at 7:33 PM, gromacs query 
> wrote:
>
>> Hi Szilárd,
>>
>> Thanks, option 3 was in my mind but I need to figure out now how :)
>> Manually fixing pinoffset as of now seems working with some quick test.
>> I think option 1 would require to ask the admin but I can try option 3
>> myself. As there are other users from different places who may not bother
>> using option 3. I think I would need to ask the admin to force option 1 but
>> before that I will try option 3.
>>
>> JIom
>>
>> On Wed, Sep 13, 2017 at 7:10 PM, Szilárd Páll 
>> wrote:
>>
>>> J,
>>>
>>> You have a few options:
>>>
>>> * Use SLURM to assign not only the set of GPUs, but also the correct
>>> set of CPU cores to each mdrun process. If you do so, mdrun will
>>> respect the affinity mask it will inherit and your two mdrun jobs
>>> should be running on the right set of cores. This has the drawback
>>> that (AFAIK) SLURM/aprun (or srun) will not allow you to bind each
>>> application thread to a core/hardware thread (which is what mdrun
>>> does), only a process to a group of cores/hw threads which can
>>> sometimes lead to performance loss. (You might be able to compensate
>>> using some OpenMP library environment variables, though.)
>>>
>>> * Run multiple jobs with mdrun "-multi"/"-multidir"  (either two per
>>> node or mulitple across nodes) and benefit from the rank/thread to
>>> core/hw thread assignment that's supported also across multiple
>>> simulations part of a multi-run; e.g.:
>>> mpirun -np 4 gmx mdrun -multi 4 -ntomp N -multidir my_input_dir{1,2,3,4}
>>> will launch 4 ranks and start 4 simulations in each of the four
>>> directories passed.
>>>
>>> * Write a wrapper script around gmx mdrun which will be what you
>>> launch with SLURM; you can then inspect the node and decide what
>>> pinoffset value to pass to your mdrun launch command.
>>>
>>>
>>> I hope one of these will deliver the desired results :)
>>>
>>> Cheers,
>>> --
>>> Szilárd
>>>
>>>
>>> On Wed, Sep 13, 2017 at 7:47 PM, gromacs query 
>>> wrote:
>>> > Hi Szilárd,
>>> >
>>> > Thanks for your reply. This is useful but now am thinking because the
>>> slurm
>>> > launches job in an automated way it is not really in my control to
>>> choose
>>> > the node. So following things can happen; say for two mdrun jobs I set
>>> > -pinoffset 0 and -pinoffset 4;
>>> >
>>> > - if they are running on the same node this is good
>>> > - if jobs run on different nodes (partially occupied or free) whether
>>> these
>>> > chosen pinoffsets will make sense or not as I don't know what pinoffset
>>> I
>>> > would need to set
>>> > - if I have to submit many jobs together and slurm chooses
>>> different/same
>>> > node itself then I think it is difficult to define pinoffset.
>>> >
>>> > -
>>> > J
>>> >
>>> > On Wed, Sep 13, 2017 at 6:14 PM, Szilárd Páll 
>>> > wrote:
>>> >
>>> >> My guess is that the two jobs are using the same cores -- either all
>>> >> cores/threads or only half of them, but the same set.
>>> >>
>>> >> You should use -pinoffset; see:
>>> >>
>>> >> - Docs and example:
>>> >> http://manual.gromacs.org/documentation/2016/user-guide/
>>> >> mdrun-performance.html
>>> >>
>>> >> - More explanation on the thread pinning behavior on the old website:
>>> >> http://www.gromacs.org/Documentation/Acceleration_
>>> >> and_parallelization#Pinning_threads_to_physical_cores
>>> >>
>>> >> Cheers,
>>> >> --
>>> >> Szilárd
>>> >>
>>> >>
>>> >> On Wed, Sep 13, 2017 at 6:35 PM, gromacs query >> >
>>> >> wrote:
>>> >> > Sorry forgot to add; we thought the two jobs are using same GPU ids
>>> but
>>> >> > cuda visible devices show both jobs are using different ids (0,1 and
>>> 2,3)
>>> >> >
>>> >> > -
>>> >> > J
>>> >> >
>>> >> > On Wed, Sep 13, 2017 at 5:33 PM, gromacs query <
>>> gromacsqu...@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> >> Hi All,
>>> >> >>
>>> >> >> I have some issues with gromacs performance. There are many nodes
>>> and
>>> >> each
>>> >> >> node has number of gpus and the batch process is controlled by

Re: [gmx-users] performance

2017-09-13 Thread gromacs query
Hi Szilárd,

Sorry, I was a bit quick to say it's working with pinoffset. I just submitted
four identical jobs (2 GPUs, 4 procs each) on the same node with -pin on and
different -pinoffset values of 0, 5, 10, 15 (the numbers should be fine as
there are 40 cores on the node). Still I don't get the same performance as
expected from a single independent job (all variably less than 50% of it).
Now I am wondering if it is still related to overlap of cores, as -pin on
should lock the cores for the same job.

-J

On Wed, Sep 13, 2017 at 7:33 PM, gromacs query 
wrote:

> Hi Szilárd,
>
> Thanks, option 3 was in my mind but I need to figure out now how :)
> Manually fixing pinoffset as of now seems working with some quick test.
> I think option 1 would require to ask the admin but I can try option 3
> myself. As there are other users from different places who may not bother
> using option 3. I think I would need to ask the admin to force option 1 but
> before that I will try option 3.
>
> JIom
>
> On Wed, Sep 13, 2017 at 7:10 PM, Szilárd Páll 
> wrote:
>
>> J,
>>
>> You have a few options:
>>
>> * Use SLURM to assign not only the set of GPUs, but also the correct
>> set of CPU cores to each mdrun process. If you do so, mdrun will
>> respect the affinity mask it will inherit and your two mdrun jobs
>> should be running on the right set of cores. This has the drawback
>> that (AFAIK) SLURM/aprun (or srun) will not allow you to bind each
>> application thread to a core/hardware thread (which is what mdrun
>> does), only a process to a group of cores/hw threads which can
>> sometimes lead to performance loss. (You might be able to compensate
>> using some OpenMP library environment variables, though.)
>>
>> * Run multiple jobs with mdrun "-multi"/"-multidir"  (either two per
>> node or mulitple across nodes) and benefit from the rank/thread to
>> core/hw thread assignment that's supported also across multiple
>> simulations part of a multi-run; e.g.:
>> mpirun -np 4 gmx mdrun -multi 4 -ntomp N -multidir my_input_dir{1,2,3,4}
>> will launch 4 ranks and start 4 simulations in each of the four
>> directories passed.
>>
>> * Write a wrapper script around gmx mdrun which will be what you
>> launch with SLURM; you can then inspect the node and decide what
>> pinoffset value to pass to your mdrun launch command.
>>
>>
>> I hope one of these will deliver the desired results :)
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Wed, Sep 13, 2017 at 7:47 PM, gromacs query 
>> wrote:
>> > Hi Szilárd,
>> >
>> > Thanks for your reply. This is useful but now am thinking because the
>> slurm
>> > launches job in an automated way it is not really in my control to
>> choose
>> > the node. So following things can happen; say for two mdrun jobs I set
>> > -pinoffset 0 and -pinoffset 4;
>> >
>> > - if they are running on the same node this is good
>> > - if jobs run on different nodes (partially occupied or free) whether
>> these
>> > chosen pinoffsets will make sense or not as I don't know what pinoffset
>> I
>> > would need to set
>> > - if I have to submit many jobs together and slurm chooses
>> different/same
>> > node itself then I think it is difficult to define pinoffset.
>> >
>> > -
>> > J
>> >
>> > On Wed, Sep 13, 2017 at 6:14 PM, Szilárd Páll 
>> > wrote:
>> >
>> >> My guess is that the two jobs are using the same cores -- either all
>> >> cores/threads or only half of them, but the same set.
>> >>
>> >> You should use -pinoffset; see:
>> >>
>> >> - Docs and example:
>> >> http://manual.gromacs.org/documentation/2016/user-guide/
>> >> mdrun-performance.html
>> >>
>> >> - More explanation on the thread pinning behavior on the old website:
>> >> http://www.gromacs.org/Documentation/Acceleration_
>> >> and_parallelization#Pinning_threads_to_physical_cores
>> >>
>> >> Cheers,
>> >> --
>> >> Szilárd
>> >>
>> >>
>> >> On Wed, Sep 13, 2017 at 6:35 PM, gromacs query > >
>> >> wrote:
>> >> > Sorry forgot to add; we thought the two jobs are using same GPU ids
>> but
>> >> > cuda visible devices show both jobs are using different ids (0,1 and
>> 2,3)
>> >> >
>> >> > -
>> >> > J
>> >> >
>> >> > On Wed, Sep 13, 2017 at 5:33 PM, gromacs query <
>> gromacsqu...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hi All,
>> >> >>
>> >> >> I have some issues with gromacs performance. There are many nodes
>> and
>> >> each
>> >> >> node has number of gpus and the batch process is controlled by
>> slurm.
>> >> >> Although I get good performance with some settings of number of
>> gpus and
>> >> >> nprocs but when I submit same job twice on the same node then the
>> >> >> performance is reduced drastically. e.g
>> >> >>
>> >> >> For 2 GPUs I get 300 ns per day when there is no other job running
>> on
>> >> the
>> >> >> node. When I submit same job twice on the same node & at the same
>> time,
>> >> I
>> >> >> get only 17 ns/day for both the jobs. I am using this:
>> >> >>
>> >> >> mpirun 

Re: [gmx-users] performance

2017-09-13 Thread gromacs query
Hi Szilárd,

Thanks, option 3 was on my mind but I now need to figure out how :)
Manually fixing the pinoffset seems to be working for now in some quick tests.
I think option 1 would require asking the admin, but I can try option 3
myself. As there are other users from different places who may not bother
using option 3, I think I would need to ask the admin to force option 1,
but before that I will try option 3.

JIom

On Wed, Sep 13, 2017 at 7:10 PM, Szilárd Páll 
wrote:

> J,
>
> You have a few options:
>
> * Use SLURM to assign not only the set of GPUs, but also the correct
> set of CPU cores to each mdrun process. If you do so, mdrun will
> respect the affinity mask it will inherit and your two mdrun jobs
> should be running on the right set of cores. This has the drawback
> that (AFAIK) SLURM/aprun (or srun) will not allow you to bind each
> application thread to a core/hardware thread (which is what mdrun
> does), only a process to a group of cores/hw threads which can
> sometimes lead to performance loss. (You might be able to compensate
> using some OpenMP library environment variables, though.)
>
> * Run multiple jobs with mdrun "-multi"/"-multidir"  (either two per
> node or mulitple across nodes) and benefit from the rank/thread to
> core/hw thread assignment that's supported also across multiple
> simulations part of a multi-run; e.g.:
> mpirun -np 4 gmx mdrun -multi 4 -ntomp N -multidir my_input_dir{1,2,3,4}
> will launch 4 ranks and start 4 simulations in each of the four
> directories passed.
>
> * Write a wrapper script around gmx mdrun which will be what you
> launch with SLURM; you can then inspect the node and decide what
> pinoffset value to pass to your mdrun launch command.
>
>
> I hope one of these will deliver the desired results :)
>
> Cheers,
> --
> Szilárd
>
>
> On Wed, Sep 13, 2017 at 7:47 PM, gromacs query 
> wrote:
> > Hi Szilárd,
> >
> > Thanks for your reply. This is useful but now am thinking because the
> slurm
> > launches job in an automated way it is not really in my control to choose
> > the node. So following things can happen; say for two mdrun jobs I set
> > -pinoffset 0 and -pinoffset 4;
> >
> > - if they are running on the same node this is good
> > - if jobs run on different nodes (partially occupied or free) whether
> these
> > chosen pinoffsets will make sense or not as I don't know what pinoffset I
> > would need to set
> > - if I have to submit many jobs together and slurm chooses different/same
> > node itself then I think it is difficult to define pinoffset.
> >
> > -
> > J
> >
> > On Wed, Sep 13, 2017 at 6:14 PM, Szilárd Páll 
> > wrote:
> >
> >> My guess is that the two jobs are using the same cores -- either all
> >> cores/threads or only half of them, but the same set.
> >>
> >> You should use -pinoffset; see:
> >>
> >> - Docs and example:
> >> http://manual.gromacs.org/documentation/2016/user-guide/
> >> mdrun-performance.html
> >>
> >> - More explanation on the thread pinning behavior on the old website:
> >> http://www.gromacs.org/Documentation/Acceleration_
> >> and_parallelization#Pinning_threads_to_physical_cores
> >>
> >> Cheers,
> >> --
> >> Szilárd
> >>
> >>
> >> On Wed, Sep 13, 2017 at 6:35 PM, gromacs query 
> >> wrote:
> >> > Sorry forgot to add; we thought the two jobs are using same GPU ids
> but
> >> > cuda visible devices show both jobs are using different ids (0,1 and
> 2,3)
> >> >
> >> > -
> >> > J
> >> >
> >> > On Wed, Sep 13, 2017 at 5:33 PM, gromacs query <
> gromacsqu...@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi All,
> >> >>
> >> >> I have some issues with gromacs performance. There are many nodes and
> >> each
> >> >> node has number of gpus and the batch process is controlled by slurm.
> >> >> Although I get good performance with some settings of number of gpus
> and
> >> >> nprocs but when I submit same job twice on the same node then the
> >> >> performance is reduced drastically. e.g
> >> >>
> >> >> For 2 GPUs I get 300 ns per day when there is no other job running on
> >> the
> >> >> node. When I submit same job twice on the same node & at the same
> time,
> >> I
> >> >> get only 17 ns/day for both the jobs. I am using this:
> >> >>
> >> >> mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -maxh 0.12
> >> >>
> >> >> Any suggestions highly appreciated.
> >> >>
> >> >> Thanks
> >> >>
> >> >> Jiom
> >> >>
> >> > --
> >> > Gromacs Users mailing list
> >> >
> >> > * Please search the archive at http://www.gromacs.org/
> >> Support/Mailing_Lists/GMX-Users_List before posting!
> >> >
> >> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >> >
> >> > * For (un)subscribe requests visit
> >> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> >> send a mail to gmx-users-requ...@gromacs.org.
> >> --
> >> Gromacs Users mailing list
> >>
> >> * Please search the archive at http://www.gromacs.org/
> >> 

Re: [gmx-users] performance

2017-09-13 Thread Szilárd Páll
J,

You have a few options:

* Use SLURM to assign not only the set of GPUs, but also the correct
set of CPU cores to each mdrun process. If you do so, mdrun will
respect the affinity mask it will inherit and your two mdrun jobs
should be running on the right set of cores. This has the drawback
that (AFAIK) SLURM/aprun (or srun) will not allow you to bind each
application thread to a core/hardware thread (which is what mdrun
does), only a process to a group of cores/hw threads which can
sometimes lead to performance loss. (You might be able to compensate
using some OpenMP library environment variables, though.)

* Run multiple jobs with mdrun "-multi"/"-multidir" (either two per
node or multiple across nodes) and benefit from the rank/thread to
core/hw thread assignment that's supported also across multiple
simulations that are part of a multi-run; e.g.:
mpirun -np 4 gmx mdrun -multi 4 -ntomp N -multidir my_input_dir{1,2,3,4}
will launch 4 ranks and start 4 simulations, one in each of the four
directories passed.

* Write a wrapper script around gmx mdrun which will be what you
launch with SLURM; you can then inspect the node and decide what
pinoffset value to pass to your mdrun launch command (a minimal sketch
follows below).

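To make that last option concrete, here is a minimal, untested sketch. It
assumes every GROMACS job on the node is started through this wrapper, that
each job pins NTHREADS consecutive hardware threads, and that a simple
per-node counter file in /tmp is acceptable; the file names, thread counts
and the mdrun command line are placeholders to adapt:

#!/bin/bash
# Hypothetical pinoffset wrapper -- adapt to your cluster setup.
NTHREADS=8                        # hw threads this job will pin (4 ranks x 2 OpenMP)
COUNTER=/tmp/gmx_next_pinoffset   # per-node counter shared by all wrapped jobs

exec 9>"$COUNTER.lock"            # open a lock file on fd 9
flock -x 9                        # serialize concurrent job starts on this node
OFFSET=$(cat "$COUNTER" 2>/dev/null)
OFFSET=${OFFSET:-0}
echo $((OFFSET + NTHREADS)) > "$COUNTER"
flock -u 9
exec 9>&-                         # close the lock file descriptor

mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 \
       -pin on -pinstride 1 -pinoffset "$OFFSET"

(Resetting the counter when the node drains, and checking that OFFSET plus
NTHREADS does not exceed the number of hardware threads, are left out for
brevity.)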

I hope one of these will deliver the desired results :)

Cheers,
--
Szilárd


On Wed, Sep 13, 2017 at 7:47 PM, gromacs query  wrote:
> Hi Szilárd,
>
> Thanks for your reply. This is useful but now am thinking because the slurm
> launches job in an automated way it is not really in my control to choose
> the node. So following things can happen; say for two mdrun jobs I set
> -pinoffset 0 and -pinoffset 4;
>
> - if they are running on the same node this is good
> - if jobs run on different nodes (partially occupied or free) whether these
> chosen pinoffsets will make sense or not as I don't know what pinoffset I
> would need to set
> - if I have to submit many jobs together and slurm chooses different/same
> node itself then I think it is difficult to define pinoffset.
>
> -
> J
>
> On Wed, Sep 13, 2017 at 6:14 PM, Szilárd Páll 
> wrote:
>
>> My guess is that the two jobs are using the same cores -- either all
>> cores/threads or only half of them, but the same set.
>>
>> You should use -pinoffset; see:
>>
>> - Docs and example:
>> http://manual.gromacs.org/documentation/2016/user-guide/
>> mdrun-performance.html
>>
>> - More explanation on the thread pinning behavior on the old website:
>> http://www.gromacs.org/Documentation/Acceleration_
>> and_parallelization#Pinning_threads_to_physical_cores
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Wed, Sep 13, 2017 at 6:35 PM, gromacs query 
>> wrote:
>> > Sorry forgot to add; we thought the two jobs are using same GPU ids but
>> > cuda visible devices show both jobs are using different ids (0,1 and 2,3)
>> >
>> > -
>> > J
>> >
>> > On Wed, Sep 13, 2017 at 5:33 PM, gromacs query 
>> > wrote:
>> >
>> >> Hi All,
>> >>
>> >> I have some issues with gromacs performance. There are many nodes and
>> each
>> >> node has number of gpus and the batch process is controlled by slurm.
>> >> Although I get good performance with some settings of number of gpus and
>> >> nprocs but when I submit same job twice on the same node then the
>> >> performance is reduced drastically. e.g
>> >>
>> >> For 2 GPUs I get 300 ns per day when there is no other job running on
>> the
>> >> node. When I submit same job twice on the same node & at the same time,
>> I
>> >> get only 17 ns/day for both the jobs. I am using this:
>> >>
>> >> mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -maxh 0.12
>> >>
>> >> Any suggestions highly appreciated.
>> >>
>> >> Thanks
>> >>
>> >> Jiom
>> >>
>> > --
>> > Gromacs Users mailing list
>> >
>> > * Please search the archive at http://www.gromacs.org/
>> Support/Mailing_Lists/GMX-Users_List before posting!
>> >
>> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>> >
>> > * For (un)subscribe requests visit
>> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> send a mail to gmx-users-requ...@gromacs.org.
>> --
>> Gromacs Users mailing list
>>
>> * Please search the archive at http://www.gromacs.org/
>> Support/Mailing_Lists/GMX-Users_List before posting!
>>
>> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>
>> * For (un)subscribe requests visit
>> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
>> send a mail to gmx-users-requ...@gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at 
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
> mail to gmx-users-requ...@gromacs.org.
-- 
Gromacs Users mailing list

* Please search the archive at 

Re: [gmx-users] performance

2017-09-13 Thread gromacs query
Hi Szilárd,

Thanks for your reply. This is useful, but now I am thinking: because slurm
launches jobs in an automated way, it is not really in my control to choose
the node. So the following things can happen; say for two mdrun jobs I set
-pinoffset 0 and -pinoffset 4:

- if they are running on the same node, this is good
- if jobs run on different nodes (partially occupied or free), these chosen
pinoffsets may or may not make sense, as I don't know what pinoffset I
would need to set
- if I have to submit many jobs together and slurm chooses a different/same
node itself, then I think it is difficult to define the pinoffset.

-
J

On Wed, Sep 13, 2017 at 6:14 PM, Szilárd Páll 
wrote:

> My guess is that the two jobs are using the same cores -- either all
> cores/threads or only half of them, but the same set.
>
> You should use -pinoffset; see:
>
> - Docs and example:
> http://manual.gromacs.org/documentation/2016/user-guide/
> mdrun-performance.html
>
> - More explanation on the thread pinning behavior on the old website:
> http://www.gromacs.org/Documentation/Acceleration_
> and_parallelization#Pinning_threads_to_physical_cores
>
> Cheers,
> --
> Szilárd
>
>
> On Wed, Sep 13, 2017 at 6:35 PM, gromacs query 
> wrote:
> > Sorry forgot to add; we thought the two jobs are using same GPU ids but
> > cuda visible devices show both jobs are using different ids (0,1 and 2,3)
> >
> > -
> > J
> >
> > On Wed, Sep 13, 2017 at 5:33 PM, gromacs query 
> > wrote:
> >
> >> Hi All,
> >>
> >> I have some issues with gromacs performance. There are many nodes and
> each
> >> node has number of gpus and the batch process is controlled by slurm.
> >> Although I get good performance with some settings of number of gpus and
> >> nprocs but when I submit same job twice on the same node then the
> >> performance is reduced drastically. e.g
> >>
> >> For 2 GPUs I get 300 ns per day when there is no other job running on
> the
> >> node. When I submit same job twice on the same node & at the same time,
> I
> >> get only 17 ns/day for both the jobs. I am using this:
> >>
> >> mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -maxh 0.12
> >>
> >> Any suggestions highly appreciated.
> >>
> >> Thanks
> >>
> >> Jiom
> >>
> > --
> > Gromacs Users mailing list
> >
> > * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
> >
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >
> > * For (un)subscribe requests visit
> > https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/
> Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] performance

2017-09-13 Thread Szilárd Páll
My guess is that the two jobs are using the same cores -- either all
cores/threads or only half of them, but the same set.

You should use -pinoffset; see:

- Docs and example:
http://manual.gromacs.org/documentation/2016/user-guide/mdrun-performance.html

- More explanation on the thread pinning behavior on the old website:
http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Pinning_threads_to_physical_cores

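As a concrete (hypothetical) illustration for two otherwise identical
4-rank x 2-thread jobs sharing one node, where each job should occupy 8
consecutive hardware threads:

mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -pin on -pinstride 1 -pinoffset 0
mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -pin on -pinstride 1 -pinoffset 8

The second job then starts pinning at hardware thread 8, so the two jobs do
not overlap.
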
Cheers,
--
Szilárd


On Wed, Sep 13, 2017 at 6:35 PM, gromacs query  wrote:
> Sorry forgot to add; we thought the two jobs are using same GPU ids but
> cuda visible devices show both jobs are using different ids (0,1 and 2,3)
>
> -
> J
>
> On Wed, Sep 13, 2017 at 5:33 PM, gromacs query 
> wrote:
>
>> Hi All,
>>
>> I have some issues with gromacs performance. There are many nodes and each
>> node has number of gpus and the batch process is controlled by slurm.
>> Although I get good performance with some settings of number of gpus and
>> nprocs but when I submit same job twice on the same node then the
>> performance is reduced drastically. e.g
>>
>> For 2 GPUs I get 300 ns per day when there is no other job running on the
>> node. When I submit same job twice on the same node & at the same time, I
>> get only 17 ns/day for both the jobs. I am using this:
>>
>> mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -maxh 0.12
>>
>> Any suggestions highly appreciated.
>>
>> Thanks
>>
>> Jiom
>>
> --
> Gromacs Users mailing list
>
> * Please search the archive at 
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
> mail to gmx-users-requ...@gromacs.org.
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.

Re: [gmx-users] performance

2017-09-13 Thread gromacs query
Sorry, forgot to add: we thought the two jobs were using the same GPU ids, but
CUDA_VISIBLE_DEVICES shows the two jobs are using different ids (0,1 and 2,3)
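
A quick way to double-check from inside each job, assuming the NVIDIA tools
are available on the node:

echo $CUDA_VISIBLE_DEVICES
nvidia-smi

The process table at the bottom of the nvidia-smi output lists which PIDs
are running on which GPU.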

-
J

On Wed, Sep 13, 2017 at 5:33 PM, gromacs query 
wrote:

> Hi All,
>
> I have some issues with gromacs performance. There are many nodes and each
> node has number of gpus and the batch process is controlled by slurm.
> Although I get good performance with some settings of number of gpus and
> nprocs but when I submit same job twice on the same node then the
> performance is reduced drastically. e.g
>
> For 2 GPUs I get 300 ns per day when there is no other job running on the
> node. When I submit same job twice on the same node & at the same time, I
> get only 17 ns/day for both the jobs. I am using this:
>
> mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -maxh 0.12
>
> Any suggestions highly appreciated.
>
> Thanks
>
> Jiom
>
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


[gmx-users] performance

2017-09-13 Thread gromacs query
Hi All,

I have some issues with GROMACS performance. There are many nodes, each
node has a number of GPUs, and the batch system is controlled by SLURM.
Although I get good performance with certain combinations of the number of
GPUs and nprocs, when I submit the same job twice on the same node the
performance is reduced drastically, e.g.:

For 2 GPUs I get 300 ns per day when there is no other job running on the
node. When I submit the same job twice on the same node at the same time, I
get only 17 ns/day for both jobs. I am using this:

mpirun -np 4 gmx_mpi mdrun -deffnm test -ntomp 2 -maxh 0.12

Any suggestions highly appreciated.

Thanks

Jiom
-- 
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.


Re: [gmx-users] Performance values

2017-08-28 Thread Szilárd Páll
On Mon, Aug 7, 2017 at 6:10 PM, Maureen Chew <maureen.c...@oracle.com> wrote:
> Szilárd,
> Thank you so very much for the reply!You mention
> that time/step is important if trying to do an apples-to-apples
> comparison for any given simulation.
>
> I have a few questions - For specific example, use the RNAse reference here:
> (http://www.gromacs.org/gpu)
> Influence of box geometry and virtual interaction sites
> This is a simulation of the protein RNAse, which contains roughly 24,000 
> atoms in a cubic box.
> The image rnase.png <http://www.gromacs.org/@api/deki/files/224/=rnase.png>, 
> shows a 6 core baseline to be, roughly 50ns/day
>
> First,  assuming this is from rnase_cubic @ 
> ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz 
> which contains these files:
> rnase_cubic/rf_verlet.mdp
> rnase_cubic/conf.gro
> rnase_cubic/topol.top
> rnase_cubic/pme_verlet.mdp
> 3 questions:
> - What gmx grompp command was used to generate the tpr file for the result in 
> rnase.png?

Those are two sets of runs, the first four use the cubic RNAse setup
without virtual sites (i.e. the inputs you list above) and the "PME"
mdp settings file, so you need to do
gmx grompp -f pme_verlet

The latter four bars use the other input for which you'll need a
command line similar to the above to set up, but using the appropriate
input files from the "rnase_dodec_vsites" tarball.
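
Spelled out for the cubic case (using the files listed above; the -o output
name here is arbitrary):

gmx grompp -f pme_verlet.mdp -c conf.gro -p topol.top -o rnase_cubic.tpr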

> Apologies if its intuitively obvious which mdp  was used
> - Aside from -ntmpi and -ntomp parms, what gmx mdrun command was used to 
> obtain
> the 6 core result?

In all single-node (and single-GPU) runs OpenMP-only parallelization
is used, i.e. 1 MPI rank and as many OpenMP threads as hardware
threads, i.e. for the 6-core run "mdrun -ntmpi 1 -ntomp 12" (which is
also the default).

Note that the data you're looking at is several years old by now and
was collected on an Intel Sandy Bridge-E CPU (i7-3930K), so it is not
highly representative of the current performance of GROMACS.

> - From that 6 core run, what is the time/step that you refer to?

Have a look in the "dt" field in the mdp or log file; in this case
dt=0.002, i.e 2 fs. For the vsites case it is 5 fs.

> Is that
> real cycle  and time accounting for neighbor search, force, PME mesh 
> etc,
> and  time/step that you refer to is the wall time/call count?

Not sure if I understand the question, but I'll try to answer it:
that is simulated time divided by wall time, i.e. (number of time steps * dt)
/ (wall time in days).
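
As a worked example (using the 3.964 ns/day figure from your sample and
assuming dt = 2 fs, as in the cubic RNAse input):

3.964 ns/day / 0.002 ns per step = 1.982e6 steps per day
86400 s per day / 1.982e6 steps per day ~= 0.044 s per step

It is this per-step wall time that is directly comparable between runs of
the same system and settings.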


Cheers,
--
Szilárd


> Thanks in advance!
> —maureen
>
>
> Date: Mon, 7 Aug 2017 16:01:16 +0200
> From: Szilárd Páll <pall.szil...@gmail.com>
> To: Discussion list for GROMACS users <gmx-us...@gromacs.org>
> Subject: Re: [gmx-users] Performance values
>
>
> Indeed, "Wall t" is real application wall-time, nanoseconds/day is the
> typical molecular dynamics performance unit that corresponds to the
> effective amount of simulation throughput (note that this however
> depends on the time-step and without that specified it is not useful
> to compare to other runs), so often it is useful to use it convert it
> to time/step.
> --
> Szilárd
>
>
> On Fri, Jul 28, 2017 at 10:20 AM, Maureen Chew <maureen.c...@oracle.com> wrote:
>> You might find this reference handy - it has a really nice explanation for 
>> how to look
>> at a log file
>> Topology preparation, "What's in a log file", basic performance 
>> improvements: Mark Abraham, Session 1A 
>> <http://www.gromacs.org/Documentation/Tutorials/GROMACS_USA_Workshop_and_Conference_2013/Topology_preparation,_%22What's_in_a_log_file%22,_basic_performance_improvements:_Mark_Abraham,_Session_1A>
>>
>> The “Performance:” values are a throughput measure where both values 
>> represent
>> the same thing in different terms.  In your sample below, 3.964 is the
>> number of nanoseconds that can be simulated in 24 hours while it takes
>> 6.054 hours to simulate 1 ns
>>
>> HTH
>>
>>
>> On Jul 27, 2017, at 10:15 AM, Maureen Chew <maureen.c...@oracle.com> wrote:
>>> Where is it documented how the mdrun performance metrics are calculated?

Re: [gmx-users] Performance values

2017-08-07 Thread Maureen Chew
Szilárd,
Thank you so very much for the reply! You mention
that time/step is important if trying to do an apples-to-apples
comparison for any given simulation.

I have a few questions - For specific example, use the RNAse reference here:
(http://www.gromacs.org/gpu)
Influence of box geometry and virtual interaction sites
This is a simulation of the protein RNAse, which contains roughly 24,000 atoms 
in a cubic box.
The image rnase.png <http://www.gromacs.org/@api/deki/files/224/=rnase.png>
shows a 6-core baseline of roughly 50 ns/day.

First,  assuming this is from rnase_cubic @ 
ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz 
which contains these files:
rnase_cubic/rf_verlet.mdp
rnase_cubic/conf.gro
rnase_cubic/topol.top
rnase_cubic/pme_verlet.mdp

3 questions:
- What gmx grompp command was used to generate the tpr file for the result in 
rnase.png?
Apologies if it's intuitively obvious which mdp was used
- Aside from -ntmpi and -ntomp parms, what gmx mdrun command was used to obtain
the 6 core result?
- From that 6 core run, what is the time/step that you refer to?  Is that from
the "Real cycle and time accounting" table (neighbor search, force, PME mesh,
etc.), and is the time/step you refer to the wall time / call count?

Thanks in advance!
—maureen


Date: Mon, 7 Aug 2017 16:01:16 +0200
From: Szilárd Páll <pall.szil...@gmail.com>
To: Discussion list for GROMACS users <gmx-us...@gromacs.org>
Subject: Re: [gmx-users] Performance values


Indeed, "Wall t" is real application wall-time, nanoseconds/day is the
typical molecular dynamics performance unit that corresponds to the
effective amount of simulation throughput (note that this however
depends on the time-step and without that specified it is not useful
to compare to other runs), so often it is useful to use it convert it
to time/step.
--
Szilárd


On Fri, Jul 28, 2017 at 10:20 AM, Maureen Chew <maureen.c...@oracle.com> wrote:
> You might find this reference handy - it has a really nice explanation for 
> how to look
> at a log file
> Topology preparation, "What's in a log file", basic performance improvements: 
> Mark Abraham, Session 1A 
> <http://www.gromacs.org/Documentation/Tutorials/GROMACS_USA_Workshop_and_Conference_2013/Topology_preparation,_%22What's_in_a_log_file%22,_basic_performance_improvements:_Mark_Abraham,_Session_1A>
> 
> The “Performance:” values are a throughput measure where both values represent
> the same thing in different terms.  In your sample below, 3.964 is the
> number of nanoseconds that can be simulated in 24 hours while it takes
> 6.054 hours to simulate 1 ns
> 
> HTH
> 
> 
> On Jul 27, 2017, at 10:15 AM, Maureen Chew <maureen.c...@oracle.com> wrote:
>> Where is it documented how the mdrun performance metrics are calculated? I've
>> looked here
>> http://manual.gromacs.org/documentation/2016/user-guide/mdrun-performance.html
>> and here
>> http://manual.gromacs.org/documentation/2016.3/manual-2016.3.pdf 
>> 
>> but seem to have missed  explanation.
>> 
>> Are the sample mdrun times below user time or real time?  Generally, wall is 
>> real time
>> I understand that “Performance:” is not a linear scale but what is the scale
>> in the 2016.3 sample below?
>> 
>>                Core t (s)   Wall t (s)      (%)
>>        Time:    69761.050      272.504  25600.0
>>                  (ns/day)    (hour/ns)
>> Performance:        3.964        6.054

Re: [gmx-users] Performance values

2017-08-07 Thread Szilárd Páll
Indeed, "Wall t" is real application wall-time, nanoseconds/day is the
typical molecular dynamics performance unit that corresponds to the
effective amount of simulation throughput (note that this however
depends on the time-step and without that specified it is not useful
to compare to other runs), so often it is useful to use it convert it
to time/step.
--
Szilárd


On Fri, Jul 28, 2017 at 10:20 AM, Maureen Chew  wrote:
> You might find this reference handy - it has a really nice explanation for 
> how to look
> at a log file
> Topology preparation, "What's in a log file", basic performance improvements: 
> Mark Abraham, Session 1A 
> 
>
> The “Performance:” values are a throughput measure where both values represent
> the same thing in different terms.  In your sample below, 3.964 is the
> number of nanoseconds that can be simulated in 24 hours while it takes
> 6.054 hours to simulate 1 ns
>
> HTH
>
>
> On Jul 27, 2017, at 10:15 AM, Maureen Chew  wrote:
>> Where is it documented how the mdrun performance metrics are calculated? I've
>> looked here
>> http://manual.gromacs.org/documentation/2016/user-guide/mdrun-performance.html
>> and here
>> http://manual.gromacs.org/documentation/2016.3/manual-2016.3.pdf
>> but seem to have missed the explanation.
>>
>> Are the sample mdrun times below user time or real time? Generally, wall is
>> real time.
>> I understand that "Performance:" is not a linear scale, but what is the scale
>> in the 2016.3 sample below?
>>
>>                Core t (s)   Wall t (s)        (%)
>>        Time:    69761.050      272.504    25600.0
>>                  (ns/day)    (hour/ns)
>> Performance:        3.964        6.054
>>
>

Re: [gmx-users] Performance values

2017-07-28 Thread Maureen Chew
You might find this reference handy - it has a really nice explanation for how 
to look 
at a log file
Topology preparation, "What's in a log file", basic performance improvements: 
Mark Abraham, Session 1A 


The “Performance:” values are a throughput measure where both values represent
the same thing in different terms.  In your sample below, 3.964 is the
number of nanoseconds that can be simulated in 24 hours while it takes
6.054 hours to simulate 1 ns

HTH


On Jul 27, 2017, at 10:15 AM, Maureen Chew  wrote:
> Where is it documented how the mdrun performance metrics are calculated? I've
> looked here
> http://manual.gromacs.org/documentation/2016/user-guide/mdrun-performance.html
> and here
> http://manual.gromacs.org/documentation/2016.3/manual-2016.3.pdf
> but seem to have missed the explanation.
> 
> Are the sample mdrun times below user time or real time? Generally, wall is
> real time.
> I understand that "Performance:" is not a linear scale, but what is the scale
> in the 2016.3 sample below?
> 
>                Core t (s)   Wall t (s)        (%)
>        Time:    69761.050      272.504    25600.0
>                  (ns/day)    (hour/ns)
> Performance:        3.964        6.054
> 


[gmx-users] Performance values

2017-07-27 Thread Maureen Chew
Where is it documented how the mdrun performance metrics are calculated? I’ve
looked here
http://manual.gromacs.org/documentation/2016/user-guide/mdrun-performance.html
and here
http://manual.gromacs.org/documentation/2016.3/manual-2016.3.pdf
but seem to have missed the explanation.

Are the sample mdrun times below user time or real time? Generally, wall is
real time.
I understand that “Performance:” is not a linear scale, but what is the scale
in the 2016.3 sample below?

               Core t (s)   Wall t (s)        (%)
       Time:    69761.050      272.504    25600.0
                 (ns/day)    (hour/ns)
Performance:        3.964        6.054

 

TIA
—maureen

[gmx-users] performance issue with many short MD runs

2017-03-28 Thread Michael Brunsteiner

Thanks Peter and Mark!
I'll try running on single cores ...
However, comparing the timings, I believe the bottleneck might be the time
spent in I/O (reading/writing to disk), and here running several jobs on a
single node with multiple cores might make things even worse.
Also funny: in the log files Gromacs reports Wall times for both machines
that are comparable: 0.613 (old machine) vs 0.525 (new machine), but the UNIX
time command tells a different story:
real    0m0.798s (old machine)
real    0m1.543s (new machine)
I wonder where the missing time goes ... ;)

Anyway, thanks again!
Regards,
Michael
=== Why be happy when you could be normal?

Re: [gmx-users] performance issue with many short MD runs

2017-03-27 Thread Mark Abraham
Hi,

As Peter notes, there are cases where the GPU won't be used for the rerun
(specifically, when you request more than one energy group, for which it
would likely be prohibitively slow, even if we'd write and run such a
kernel on the GPU; but that is not the case here). The reason things take a
long time is that a rerun has a wildly different execution profile from
normal mdrun. Each "step" has to get the positions from some cold part of
memory/disk, do a fresh neighbor search (since mdrun can't rely on the
usual assumption that you can re-use the last one quite a few times),
launch a GPU kernel, launch CPU OpenMP regions, compute forces that often
won't even be used for output, and write whatever should be output. Most of
that code is run very rarely in a normal production simulation, so isn't
heavily optimized. But your rerun is spending most of its time there. Since
you note that your compute load is a single small molecule, it would not be
at all surprising for the mdrun performance breakdown in the log file to
show that all the overheads take very much more time than the GPU kernel
that computes the energy that you want. Those can take wildly different
amounts of time on different machines for all sorts of reasons, including
CUDA API overhead (as Peter noted), Linux kernel configuration, OS version,
hard disk performance, machine load, whether the sysadmin showered lately,
the phase of the moon, etc. :-)

Compare the final sections of the log files to see what I mean. Try gmx
mdrun -rerun -nb cpu, as it might be faster to waste the GPU. If you really
are doing many machine-hours of such jobs and care about turn-around time,
invest human time in writing a script to break up your trajectory into
pieces, and give each piece to a single mdrun that you place on e.g. a
different single core (e.g. with tools like numactl or taskset) and run a
different gmx mdrun -rerun -nb cpu -ntmpi 1 -ntomp 1 on each single core.
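
For instance, a minimal sketch of that idea (the file names, the four-way
split and the -b/-e windows in ps are placeholders to adapt to your
trajectory; also shift the windows so boundary frames are not counted twice):

  #!/bin/bash
  # split the trajectory into four time windows and rerun each on its own core
  for i in 0 1 2 3; do
      # group 0 (System) is written out; -b/-e select a time window in ps
      echo 0 | gmx trjconv -f traj.xtc -s topol.tpr \
          -b $((i*250)) -e $(((i+1)*250)) -o piece$i.xtc
      # pin one single-threaded, CPU-only rerun to core $i
      taskset -c $i gmx mdrun -rerun piece$i.xtc -s topol.tpr \
          -nb cpu -ntmpi 1 -ntomp 1 -deffnm rerun$i &
  done
  wait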

Mark

On Mon, Mar 27, 2017 at 4:24 PM Peter Kroon  wrote:

> Hi,
>
>
> On the new machine your CUDA runtime and driver versions are lower than
> on the old machine. Maybe that could explain it? (is the GPU even used
> with -rerun?) You would need to recompile gromacs.
>
>
> Peter
>
>
> On 27-03-17 15:51, Michael Brunsteiner wrote:
> > Hi, I have to run a lot (many thousands) of very short MD reruns with
> > gmx. Using gmx-2016.3 it works without problems; however, what I see is
> > that the overall performance (in terms of REAL execution time as measured
> > with the unix time command) which I get on a relatively new computer is
> > poorer than what I get with a much older machine
> > (by a factor of about 2 - this in spite of gmx reporting a better
> > performance of the new machine in the log file).
> >
> > Both machines run linux (debian); the old has eight intel cores, the
> > newer one 12.
> > On the newer machine gmx uses a supposedly faster SIMD instruction set;
> > otherwise hardware (including hard drives) is comparable.
> >
> > Below is the output of a typical job (gmx mdrun -rerun with a trajectory
> > containing not more than a couple of thousand conformations of a single
> > small molecule) on both machines (mdp file content below).
> >
> > old machine:
> > prompt> time gmx mdrun ...
> > in the log file:
> >                Core t (s)   Wall t (s)        (%)
> >        Time:        4.527        0.566      800.0
> >                  (ns/day)    (hour/ns)
> > Performance:        1.527       15.719
> > on the command line:
> > real    2m45.562s  <
> > user    15m40.901s
> > sys     0m33.319s
> >
> > new machine:
> > prompt> time gmx mdrun ...
> > in the log file:
> >                Core t (s)   Wall t (s)        (%)
> >        Time:        6.030        0.502     1200.0
> >                  (ns/day)    (hour/ns)
> > Performance:        1.719       13.958
> >
> > on the command line:
> > real    5m30.962s  <
> > user    20m2.208s
> > sys     3m28.676s
> >
> > The specs of the two gmx installations are given below. I'd be grateful
> > if anyone could suggest ways to improve performance on the newer machine!
> > Cheers, Michael
> >
> >
> > the older machine (here the jobs run faster):  gmx --version
> >
> > GROMACS version:2016.3
> > Precision:  single
> > Memory model:   64 bit
> > MPI library:thread_mpi
> > OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> > GPU support:CUDA
> > SIMD instructions:  SSE4.1
> > FFT library:fftw-3.3.5-sse2
> > RDTSCP usage:   enabled
> > TNG support:enabled
> > Hwloc support:  hwloc-1.8.0
> > Tracing support:disabled
> > Built on:   Tue Mar 21 11:24:42 CET 2017
> > Built by:   root@rcpetemp1 [CMAKE]
> > Build OS/arch:  Linux 3.13.0-79-generic x86_64
> > Build CPU vendor:   Intel
> > Build CPU brand:Intel(R) Core(TM) i7 CPU 960  @ 3.20GHz
> > Build CPU family:   6   Model: 26   Stepping: 5
> > Build CPU features: apic 

Re: [gmx-users] performance issue with many short MD runs

2017-03-27 Thread Peter Kroon
Hi,


On the new machine your CUDA runtime and driver versions are lower than
on the old machine. Maybe that could explain it? (is the GPU even used
with -rerun?) You would need to recompile gromacs.


Peter


On 27-03-17 15:51, Michael Brunsteiner wrote:
> Hi, I have to run a lot (many thousands) of very short MD reruns with
> gmx. Using gmx-2016.3 it works without problems; however, what I see is
> that the overall performance (in terms of REAL execution time as measured
> with the unix time command) which I get on a relatively new computer is
> poorer than what I get with a much older machine
> (by a factor of about 2 - this in spite of gmx reporting a better
> performance of the new machine in the log file).
>
> Both machines run linux (debian); the old has eight intel cores, the
> newer one 12.
> On the newer machine gmx uses a supposedly faster SIMD instruction set;
> otherwise hardware (including hard drives) is comparable.
>
> Below is the output of a typical job (gmx mdrun -rerun with a trajectory
> containing not more than a couple of thousand conformations of a single
> small molecule) on both machines (mdp file content below).
>
> old machine:
> prompt> time gmx mdrun ...
> in the log file:
>                Core t (s)   Wall t (s)        (%)
>        Time:        4.527        0.566      800.0
>                  (ns/day)    (hour/ns)
> Performance:        1.527       15.719
> on the command line:
> real    2m45.562s  <
> user    15m40.901s
> sys     0m33.319s
>
> new machine:
> prompt> time gmx mdrun ...
> in the log file:
>                Core t (s)   Wall t (s)        (%)
>        Time:        6.030        0.502     1200.0
>                  (ns/day)    (hour/ns)
> Performance:        1.719       13.958
>
> on the command line:
> real    5m30.962s  <
> user    20m2.208s
> sys     3m28.676s
>
> The specs of the two gmx installations are given below. I'd be grateful if
> anyone could suggest ways to improve performance on the newer machine!
> Cheers, Michael
>
>
> the older machine (here the jobs run faster):  gmx --version
>
> GROMACS version:2016.3
> Precision:  single
> Memory model:   64 bit
> MPI library:thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:CUDA
> SIMD instructions:  SSE4.1
> FFT library:fftw-3.3.5-sse2
> RDTSCP usage:   enabled
> TNG support:enabled
> Hwloc support:  hwloc-1.8.0
> Tracing support:disabled
> Built on:   Tue Mar 21 11:24:42 CET 2017
> Built by:   root@rcpetemp1 [CMAKE]
> Build OS/arch:  Linux 3.13.0-79-generic x86_64
> Build CPU vendor:   Intel
> Build CPU brand:Intel(R) Core(TM) i7 CPU 960  @ 3.20GHz
> Build CPU family:   6   Model: 26   Stepping: 5
> Build CPU features: apic clfsh cmov cx8 cx16 htt lahf mmx msr nonstop_tsc 
> pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
> C compiler: /usr/bin/cc GNU 4.8.4
> C compiler flags:-msse4.1 -O3 -DNDEBUG -funroll-all-loops 
> -fexcess-precision=fast  
> C++ compiler:   /usr/bin/c++ GNU 4.8.4
> C++ compiler flags:  -msse4.1-std=c++0x   -O3 -DNDEBUG -funroll-all-loops 
> -fexcess-precision=fast  
> CUDA compiler:  /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler 
> driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on 
> Tue_Aug_11_14:27:32_CDT_2015;Cuda compilation tools, release 7.5, V7.5.17
> CUDA compiler 
> flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_52,code=compute_52;-use_fast_math;;;-Xcompiler;,-msse4.1,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
>  
> CUDA driver:7.50
> CUDA runtime:   7.50
>
>
>
> the newer machine (here execution is slower by a factor 2):  gmx --version
>
> GROMACS version:2016.3
> Precision:  single
> Memory model:   64 bit
> MPI library:thread_mpi
> OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
> GPU support:CUDA
> SIMD instructions:  AVX_256
> FFT library:fftw-3.3.5
> RDTSCP usage:   enabled
> TNG support:enabled
> Hwloc support:  hwloc-1.10.0
> Tracing support:disabled
> Built on:   Fri Mar 24 11:18:29 CET 2017
> Built by:   root@rcpe-sbd-node01 [CMAKE]
> Build OS/arch:  Linux 3.14-2-amd64 x86_64
> Build CPU vendor:   Intel
> Build CPU brand:Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
> Build CPU family:   6   Model: 62   Stepping: 4
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf mmx msr 
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 
> sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /usr/bin/cc GNU 4.9.2
> C compiler flags:-mavx -O3 -DNDEBUG -funroll-all-loops 
> 

[gmx-users] performance issue with many short MD runs

2017-03-27 Thread Michael Brunsteiner

Hi, I have to run a lot (many thousands) of very short MD reruns with gmx.
Using gmx-2016.3 it works without problems; however, what I see is that the
overall performance (in terms of REAL execution time as measured with the
unix time command) which I get on a relatively new computer is poorer than
what I get with a much older machine
(by a factor of about 2 - this in spite of gmx reporting a better performance
of the new machine in the log file).

Both machines run linux (debian); the old has eight intel cores, the newer
one 12.
On the newer machine gmx uses a supposedly faster SIMD instruction set;
otherwise hardware (including hard drives) is comparable.

Below is the output of a typical job (gmx mdrun -rerun with a trajectory
containing not more than a couple of thousand conformations of a single small
molecule) on both machines (mdp file content below).

old machine:
prompt> time gmx mdrun ...
in the log file:
               Core t (s)   Wall t (s)        (%)
       Time:        4.527        0.566      800.0
                 (ns/day)    (hour/ns)
Performance:        1.527       15.719
on the command line:
real    2m45.562s  <
user    15m40.901s
sys     0m33.319s

new machine:
prompt> time gmx mdrun ...
in the log file:
               Core t (s)   Wall t (s)        (%)
       Time:        6.030        0.502     1200.0
                 (ns/day)    (hour/ns)
Performance:        1.719       13.958

on the command line:
real    5m30.962s  <
user    20m2.208s
sys     3m28.676s

The specs of the two gmx installations are given below. I'd be grateful if
anyone could suggest ways to improve performance on the newer machine!
Cheers, Michael


the older machine (here the jobs run faster):  gmx --version

GROMACS version:    2016.3
Precision:  single
Memory model:   64 bit
MPI library:    thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:    CUDA
SIMD instructions:  SSE4.1
FFT library:    fftw-3.3.5-sse2
RDTSCP usage:   enabled
TNG support:    enabled
Hwloc support:  hwloc-1.8.0
Tracing support:    disabled
Built on:   Tue Mar 21 11:24:42 CET 2017
Built by:   root@rcpetemp1 [CMAKE]
Build OS/arch:  Linux 3.13.0-79-generic x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Core(TM) i7 CPU 960  @ 3.20GHz
Build CPU family:   6   Model: 26   Stepping: 5
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf mmx msr nonstop_tsc pdcm 
popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/cc GNU 4.8.4
C compiler flags:    -msse4.1 -O3 -DNDEBUG -funroll-all-loops 
-fexcess-precision=fast  
C++ compiler:   /usr/bin/c++ GNU 4.8.4
C++ compiler flags:  -msse4.1    -std=c++0x   -O3 -DNDEBUG -funroll-all-loops 
-fexcess-precision=fast  
CUDA compiler:  /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler 
driver;Copyright (c) 2005-2015 NVIDIA Corporation;Built on 
Tue_Aug_11_14:27:32_CDT_2015;Cuda compilation tools, release 7.5, V7.5.17
CUDA compiler 
flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_52,code=compute_52;-use_fast_math;;;-Xcompiler;,-msse4.1,,;-Xcompiler;-O3,-DNDEBUG,-funroll-all-loops,-fexcess-precision=fast,,;
 
CUDA driver:    7.50
CUDA runtime:   7.50



the newer machine (here execution is slower by a factor 2):  gmx --version

GROMACS version:    2016.3
Precision:  single
Memory model:   64 bit
MPI library:    thread_mpi
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32)
GPU support:    CUDA
SIMD instructions:  AVX_256
FFT library:    fftw-3.3.5
RDTSCP usage:   enabled
TNG support:    enabled
Hwloc support:  hwloc-1.10.0
Tracing support:    disabled
Built on:   Fri Mar 24 11:18:29 CET 2017
Built by:   root@rcpe-sbd-node01 [CMAKE]
Build OS/arch:  Linux 3.14-2-amd64 x86_64
Build CPU vendor:   Intel
Build CPU brand:    Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
Build CPU family:   6   Model: 62   Stepping: 4
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf mmx msr 
nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 
sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/cc GNU 4.9.2
C compiler flags:    -mavx -O3 -DNDEBUG -funroll-all-loops 
-fexcess-precision=fast  
C++ compiler:   /usr/bin/c++ GNU 4.9.2
C++ compiler flags:  -mavx    -std=c++0x   -O3 -DNDEBUG -funroll-all-loops 
-fexcess-precision=fast  
CUDA compiler:  /usr/bin/nvcc nvcc: NVIDIA (R) Cuda compiler 
driver;Copyright (c) 2005-2013 NVIDIA Corporation;Built on 
Wed_Jul_17_18:36:13_PDT_2013;Cuda compilation tools, release 5.5, V5.5.0
CUDA compiler 

Re: [gmx-users] Performance advice for newest Pascal architecture

2017-03-16 Thread Szilárd Páll
Hi,

I'd recommend the Quadro P6000 or Tesla P40. These are the large chips
equivalent with the TITAN X Pascal with high single precision
throughput, slightly faster than the P100.

The P5000 is not bad, but it's slower. It has a chip similar to the
1080 (not sure about the clocks, those might be lower; FYI the 1080 is
~30-35% slower than the TITAN X-P IIRC).

I'd rule out K40s real quick, they're 2-gen older than Pascal, power
hogs with at least 2-3x lower performance than either of the Pascals.

Cheers,
--
Szilárd

PS: I assume you have considered buying 1080 or 1080Ti cards. You can
get 2-3 of them for the price of a P5000.


On Thu, Mar 9, 2017 at 11:58 AM, Téletchéa Stéphane
 wrote:
> Dear colleagues,
>
> We are willing to invest in nodes for GROMACS-specific calculations, and are
> trying to get the best for our bucks (as everyone).
>
> For now our decision comes close to nodes using the following
> configuration:
>
> 2 * Xeon E5-2630 v4
> 1 P100 or 2 * P5000 or 2 * K40
> Cluster node interconnection: Intel OmniPath
>
> Our systems will range from 50k to 200k atoms most of the time, using
> AMBER-99SB-ILDN, GROMACS 2016.1 and above.
>
> I am aware of various benchmarks and recommendations like "Best Bang for
> your Buck", but is there any reference (maybe internal) for the latest
> Pascal architecture, or any general advice for/against it?
>
> Thanks a lot in advance for the feedback; if we are able to benchmark on our
> systems using the different setups above, we'll share the results as far as
> the upstream vendor permits.
>
> Stéphane
>
> --
> Assistant Professor in BioInformatics, UFIP, UMR 6286 CNRS, Team Protein
> Design In Silico
> UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 Nantes
> cedex 03, France
> Tél : +33 251 125 636 / Fax : +33 251 125 632
> http://www.ufip.univ-nantes.fr/ - http://www.steletch.org

Re: [gmx-users] Performance advice for newest Pascal architecture

2017-03-14 Thread Mark Abraham
Hi,

There's not a great deal of change that is specific to Pascal. It's more
and better and faster, and if running PME you'd want to improve your ratio
of CPU cores to GPUs (e.g. put fewer Pascal GPUs in a node than earlier
generation). We are working on PME running on GPUs, hopefully for this
year's release, but it won't yet be such that you'd get away with a cheap
CPU.

Mark

On Thu, Mar 9, 2017 at 11:59 AM Téletchéa Stéphane <
stephane.teletc...@univ-nantes.fr> wrote:

> Dear colleagues,
>
> We are willing to invest in nodes for GROMACS-specific calculations, and are
> trying to get the best for our bucks (as everyone).
>
> For now our decision comes close to nodes using the following
> configuration:
>
> 2 * Xeon E5-2630 v4
> 1 P100 or 2 * P5000 or 2 * K40
> Cluster node interconnection: Intel OmniPath
>
> Our systems will range from 50k to 200k atoms most of the time,
> using AMBER-99SB-ILDN, GROMACS 2016.1 and above.
>
> I am aware of various benchmarks and recommendations like "Best Bang for
> your Buck", but is there any reference (maybe internal) for the latest
> Pascal architecture, or any general advice for/against it?
>
> Thanks a lot in advance for the feedback; if we are able to benchmark on
> our systems using the different setups above, we'll share the results as
> far as the upstream vendor permits.
>
> Stéphane
>
> --
> Assistant Professor in BioInformatics, UFIP, UMR 6286 CNRS, Team Protein
> Design In Silico
> UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322
> Nantes cedex 03, France
> Tél : +33 251 125 636 / Fax : +33 251 125 632
> http://www.ufip.univ-nantes.fr/ - http://www.steletch.org

[gmx-users] Performance advice for newest Pascal architecture

2017-03-09 Thread Téletchéa Stéphane

Dear colleagues,

We are willing to invest in nodes for GROMACS-specific calculations, and are
trying to get the best for our bucks (as everyone).

For now our decision comes close to nodes using the following
configuration:

2 * Xeon E5-2630 v4
1 P100 or 2 * P5000 or 2 * K40
Cluster node interconnection: Intel OmniPath

Our systems will range from 50k to 200k atoms most of the time,
using AMBER-99SB-ILDN, GROMACS 2016.1 and above.

I am aware of various benchmarks and recommendations like "Best Bang for
your Buck", but is there any reference (maybe internal) for the latest
Pascal architecture, or any general advice for/against it?

Thanks a lot in advance for the feedback; if we are able to benchmark on
our systems using the different setups above, we'll share the results as
far as the upstream vendor permits.


Stéphane

--
Assistant Professor in BioInformatics, UFIP, UMR 6286 CNRS, Team Protein 
Design In Silico
UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 
Nantes cedex 03, France

Tél : +33 251 125 636 / Fax : +33 251 125 632
http://www.ufip.univ-nantes.fr/ - http://www.steletch.org

Re: [gmx-users] Performance loss during pulling simulation

2016-06-27 Thread Mark Abraham
Hi,

That sounds strange. The diagnostic information is all in the .log files,
so a side-by-side diff is often instructive. If you can't find anything
that points to a problem, please upload some log files to a file-sharing
service and share some links to them (the list can't take attachments).
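
For example (file names are placeholders), comparing just the timing
breakdowns at the end of the two logs is often enough to spot the difference:

  diff -y <(tail -n 60 md_300K.log) <(tail -n 60 md_400K.log) | less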

Mark

On Mon, Jun 27, 2016 at 12:23 PM Jan Meyer  wrote:

>
> Dear gromacs user,
>
> I am running a system consisting of two silica clusters (~3000 atoms)
> which are surrounded by 85 polymer chains. Each polymer chain contains
> 200 monomer units. The overall number of atoms is roughly 10^5. During
> the simulation I fix one of the clusters; the other one is pulled towards
> the fixed one. When I run this pulling simulation at T=300 K I get a
> performance of ~6.5 ns/day. Increasing the temperature to 400 K leads to
> a performance of ~3.0 ns/day. This effect seems to be systematic for all
> the simulations I am doing on these and similar systems.
>
> I wonder where this performance loss comes from. So far I think it has
> to do with the pulling, because I don't see this difference in
> performance in equilibrium simulations.
>
> I would be grateful for any explanations or suggestions to fix this.
>
> Best regards,
> Jan
>


[gmx-users] Performance loss during pulling simulation

2016-06-27 Thread Jan Meyer


Dear gromacs users,

I am running a system consisting of two silica clusters (~3000 atoms)
which are surrounded by 85 polymer chains. Each polymer chain contains
200 monomer units. The overall number of atoms is roughly 10^5. During
the simulation I fix one of the clusters; the other one is pulled towards
the fixed one. When I run this pulling simulation at T=300 K I get a
performance of ~6.5 ns/day. Increasing the temperature to 400 K leads to
a performance of ~3.0 ns/day. This effect seems to be systematic for all
the simulations I am doing on these and similar systems.

I wonder where this performance loss comes from. So far I think it has
to do with the pulling, because I don't see this difference in
performance in equilibrium simulations.

I would be grateful for any explanations or suggestions to fix this.

Best regards,
Jan



Re: [gmx-users] Performance on multiple GPUs per node

2015-12-11 Thread Szilárd Páll
Hi,

Without details of your benchmarks it's hard to comment on why you do not
see a performance improvement with multiple GPUs per node. Sharing some logs
would be helpful.

Are you comparing performance with N cores and a varying number of GPUs? The
balance of hardware resources is a key factor in scaling, and my guess is
that your runs are essentially CPU-bound, hence adding more GPUs does not
help.

Have a look at these papers:
https://doi.org/10.1002/jcc.24030
https://doi.org/10.1007/978-3-319-15976-8_1
Especially the former covers the topic quite well, and both show scaling of
a <100k-atom protein system to 32-64 nodes (dual socket/dual GPU).
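
As a minimal sketch of how the rank/thread/GPU mapping is typically
controlled (the counts below assume 4 K80 boards = 8 GPU devices and 24 cores
per node, which may not match your system):

  gmx mdrun -s topol.tpr -ntmpi 8 -ntomp 3 -gpu_id 01234567 -pin on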

Cheers,

--
Szilárd

On Fri, Dec 11, 2015 at 11:54 AM, Jens Krüger <
krue...@informatik.uni-tuebingen.de> wrote:

> Dear all,
>
> we are currently planning a new cluster at our university's compute
> centre. The big question on our side is how many and which GPUs we should
> put into the nodes.
>
> We have access to a test system with four Tesla K80s per node. Using one
> GPU on a node we can reach something like 23 ns/day for the ADH system (PME,
> cubic), which is pretty much in line with e.g.
> http://exxactcorp.com/index.php/solution/solu_list/84
>
> When trying to use 2 or more GPUs on one node, the performance plunges to
> below 10 ns/day no matter how we split the MPI/OMP threads. Does anybody of
> you have access to a comparable hardware setup? We would be interested in
> benchmark data answering the question: does GROMACS-5.1 scale on more than
> one GPU per node?
>
> Thanks and best wishes,
>
> Jens
>
>

[gmx-users] Performance on multiple GPUs per node

2015-12-11 Thread Jens Krüger

Dear all,

we are currently planning a new cluster at our university's compute
centre. The big question on our side is how many and which GPUs we
should put into the nodes.


We have access to a test system with four Tesla K80s per node. Using one
GPU on a node we can reach something like 23 ns/day for the ADH system (PME,
cubic), which is pretty much in line with e.g.
http://exxactcorp.com/index.php/solution/solu_list/84


When trying to use 2 or more GPUs on one node, the performance plunges
to below 10 ns/day no matter how we split the MPI/OMP threads. Does
anybody of you have access to a comparable hardware setup? We would be
interested in benchmark data answering the question: does GROMACS-5.1
scale on more than one GPU per node?


Thanks and best wishes,

Jens




Re: [gmx-users] performance of e5 2630 CPU with gtx-titan GPU

2015-07-29 Thread Kutzner, Carsten
Hi Netaly,

In this study, http://arxiv.org/abs/1507.00898, there are GROMACS performance
evaluations for many CPU/GPU combinations. Although your combination
is not among them, you could try to estimate its performance from
similar setups.

There is for example an E5-1620 CPU with a TITAN GPU. Although the E5-1620
has only 4 cores instead of the 6 of the E5-2630 (factor 1.5 difference), 
it is also clocked higher (3.6 GHz as compared to 2.3 GHz, about the same
factor).

Although there are other differences between the two CPUs, for an estimate
it should be OK. 
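
As a quick sanity check of that estimate (base clocks only, ignoring turbo
and microarchitectural differences):

  awk 'BEGIN { printf "E5-1620: %.1f core*GHz  E5-2630: %.1f core*GHz\n",
               4*3.6, 6*2.3 }'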

Best,
  Carsten


 On 28 Jul 2015, at 20:31, Netaly Khazanov neta...@gmail.com wrote:
 
 Hello All,
 Does anybody know what is the performance of this combination of CPU and
 GPU?
 
 Thanks in advance.
 
 
 
 -- 
 Netaly

--
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics
Am Fassberg 11, 37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
http://www.mpibpc.mpg.de/grubmueller/kutzner
http://www.mpibpc.mpg.de/grubmueller/sppexa



Re: [gmx-users] performance of e5 2630 CPU with gtx-titan GPU

2015-07-29 Thread Netaly Khazanov
Thanks a lot for your answer.
I will definitely take a look at this study.
Regards,
Netaly

On Wed, Jul 29, 2015 at 11:22 AM, Kutzner, Carsten ckut...@gwdg.de wrote:

 Hi Netaly,

 in this study http://arxiv.org/abs/1507.00898 are GROMACS performance
 evaluations for many CPU/GPU combinations. Although your combination
 is not among them, you could try to estimate its performance from
 similar setups.

 There is for example an E5-1620 CPU with a TITAN GPU. Although the E5-1620
 has only 4 cores instead of the 6 of the E5-2630 (factor 1.5 difference),
 it is also clocked higher (3.6 GHz as compared to 2.3 GHz, about the same
 factor).

 Although there are other differences between the two CPUs, for an estimate
 it should be OK.

 Best,
   Carsten


  On 28 Jul 2015, at 20:31, Netaly Khazanov neta...@gmail.com wrote:
 
  Hello All,
  Does anybody know what is the performance of this combination of CPU and
  GPU?
 
  Thanks in advance.
 
 
 
  --
  Netaly

 --
 Dr. Carsten Kutzner
 Max Planck Institute for Biophysical Chemistry
 Theoretical and Computational Biophysics
 Am Fassberg 11, 37077 Goettingen, Germany
 Tel. +49-551-2012313, Fax: +49-551-2012302
 http://www.mpibpc.mpg.de/grubmueller/kutzner
 http://www.mpibpc.mpg.de/grubmueller/sppexa





-- 
Netaly


[gmx-users] performance of e5 2630 CPU with gtx-titan GPU

2015-07-28 Thread Netaly Khazanov
Hello All,
Does anybody know what is the performance of this combination of CPU and
GPU?

Thanks in advance.



-- 
Netaly


[gmx-users] performance of e5 2630 CPU with gtx-titan GPU

2015-07-27 Thread Netaly Khazanov
Hello All,
Does anybody know what is the performance of this combination of CPU and
GPU?

Thanks in advance.

Netaly Khazanov
-- 
Netaly


Re: [gmx-users] Performance of NVIDIA GTX980 in PCI-e 3.0 x8 or x16 slots ?

2015-06-15 Thread Mirco Wahab

On 11.06.2015 14:07, David McGiven wrote:

Your 1-3% claim is based on the webpage you linked ?

Is it reliable to compare GPU performances for gromacs with those of 3D
videogames ?


OK, you got me on this. As much as I'd wish I cannot
really back up my claim of comparability. I have been
out of office for one week but did some tests here
for myself today.

The system is Haswell/E (i7-5820K), GPU is single GTX-980
(the normal Gigabyte Model), the test run is ADH-cubic-vsites
(reaction field) from the bottom of the Gromacs acceleration page
(http://www.gromacs.org/GPU_acceleration).

I can explicitly set the PCIe-x16 slots to 1.0, 2.0, and 3.0
(which I did). Theoretically (and practically), PCIe-x16 2.0
should be relatively  close in bandwidth to PCIe-x8 3.0, so this
should give some hints as what to expect.

adh-cubic-vsites/rf - ns/day:
  PCIE-x16/1.0  54.46   
  PCIE-x16/2.0  61.81   
  PCIE-x16/3.0  64.52   

percentage related to PCIE-x16/3.0
  PCIE-x16/1.0  84.4
  PCIE-x16/2.0  95.8
  PCIE-x16/3.0  100

(each value = avg. of three runs)

Therefore one could support the hypothesis
that using one card in x16 and one in x8
would probably show a performance penalty
of around  5% on the x8 card.

Regards

M.



Re: [gmx-users] Performance of NVIDIA GTX980 in PCI-e 3.0 x8 or x16 slots ?

2015-06-15 Thread Szilárd Páll
Good data Mirco, but let me emphasize that your measurements only
reflect the case of heavily GPU-bound workloads!

5-6% performance improvement with PCI-E 3.0 vs 2.0 is about the
maximum you'll see when, as in RF runs, there is not enough
CPU work to fully overlap with the GPU computation (indicated by more
than a few % in the Wait for GPU counter). However, if there is PME
CPU work that balances well with the GPU work, while you will still get
shorter CPU-GPU transfer times, the impact of this on the total
runtime will be smaller.

Cheers,
--
Szilárd


On Mon, Jun 15, 2015 at 4:10 PM, Mirco Wahab
mirco.wa...@chemie.tu-freiberg.de wrote:
 On 11.06.2015 14:07, David McGiven wrote:

 Your 1-3% claim is based on the webpage you linked ?

 Is it reliable to compare GPU performances for gromacs with those of 3D
 videogames ?


 OK, you got me on this. As much as I'd wish I cannot
 really back up my claim of comparability. I have been
 out of office for one week but did some tests here
 for myself today.

 The system is Haswell/E (i7-5820K), GPU is single GTX-980
 (the normal Gigabyte Model), the test run is ADH-cubic-vsites
 (reaction field) from the bottom of the Gromacs acceleration page
 (http://www.gromacs.org/GPU_acceleration).

 I can explicitly set the PCIe-x16 slots to 1.0, 2.0, and 3.0
 (which I did). Theoretically (and practically), PCIe-x16 2.0
 should be relatively  close in bandwidth to PCIe-x8 3.0, so this
 should give some hints as what to expect.

 adh-cubic-vsites/rf - ns/day:
   PCIE-x16/1.0  54.46
   PCIE-x16/2.0  61.81
   PCIE-x16/3.0  64.52

 percentage related to PCIE-x16/3.0
   PCIE-x16/1.0  84.4
   PCIE-x16/2.0  95.8
   PCIE-x16/3.0  100

 (each value = avg. of three runs)

 Therefore one could support the hypothesis
 that using one card in x16 and one in x8
 would probably show a performance penalty
 of around  5% on the x8 card.

 Regards

 M.



[gmx-users] Performance of NVIDIA GTX980 in PCI-e 3.0 x8 or x16 slots ?

2015-06-11 Thread David McGiven
Dear Gromacs Users,

We're finally buying some Intel E5-2650 servers + NVIDIA GTX980 cards.

However, some servers come with only PCI-e 3.0 x8 slots and
others with x16 slots.

Do you think this is relevant for gromacs performance? And if so, how
relevant?

Thanks in advance.

Cheers,
David.


Re: [gmx-users] Performance of NVIDIA GTX980 in PCI-e 3.0 x8 or x16 slots ?

2015-06-11 Thread Mirco Wahab

On 11.06.2015 13:08, David McGiven wrote:

We're finally buying some Intel E52650 servers + NVIDIA GTX980 cards.
However, there's some servers that come with only PCI-e 3.0 x8 slots and
others with x16 slots.
Do you think this is relevant for gromacs performance ? And if so, how much
relevant ?


It's more important to select PCIe 3.0 mode. Then, the difference
between 16x and 8x is imho very low (1-3%).
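
A quick way to check which link a card has actually negotiated (assuming a
reasonably recent nvidia-smi):

  nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv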

M.


P.S.: http://www.techpowerup.com/reviews/NVIDIA/GTX_980_PCI-Express_Scaling/



Re: [gmx-users] Performance of NVIDIA GTX980 in PCI-e 3.0 x8 or x16 slots ?

2015-06-11 Thread David McGiven
Hey Mirco,

Your 1-3% claim is based on the webpage you linked ?

Is it reliable to compare GPU performances for gromacs with those of 3D
videogames ?

Thanks!

2015-06-11 13:21 GMT+02:00 Mirco Wahab mirco.wa...@chemie.tu-freiberg.de:

 On 11.06.2015 13:08, David McGiven wrote:

 We're finally buying some Intel E52650 servers + NVIDIA GTX980 cards.
 However, there's some servers that come with only PCI-e 3.0 x8 slots and
 others with x16 slots.
 Do you think this is relevant for gromacs performance ? And if so, how
 much
 relevant ?


 It's more important to select PCIe 3.0 mode. Then, the difference
 between 16x and 8x is imho very low (1-3%).

 M.


 P.S.:
 http://www.techpowerup.com/reviews/NVIDIA/GTX_980_PCI-Express_Scaling/



Re: [gmx-users] Performance drops when simulating protein with small ligands

2015-03-20 Thread Justin Lemkul



On 3/20/15 1:13 PM, Yunlong Liu wrote:

Hi,

I am running my protein with two ligands. Both ligands are small molecules like
ATP. However, my simulation performance drops a lot when adding these two ligands
with the same set of other parameters.

Previously, without the ligands, I got 30 ns/day with 64 CPUs and 4 GPUs. But now
I can only get 17 ns/day with the same settings. I want to know whether this is a
common phenomenon or whether I am doing something stupid.



Probably some other process is using resources and degrading your performance, 
or you're using different run settings (the .log file is definitive here).  The 
mere addition of ligands does not degrade performance.


-Justin

--
==

Justin A. Lemkul, Ph.D.
Ruth L. Kirschstein NRSA Postdoctoral Fellow

Department of Pharmaceutical Sciences
School of Pharmacy
Health Sciences Facility II, Room 629
University of Maryland, Baltimore
20 Penn St.
Baltimore, MD 21201

jalem...@outerbanks.umaryland.edu | (410) 706-7441
http://mackerell.umaryland.edu/~jalemkul

==

