Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-31 Thread Szilárd Páll
On Fri, Jul 19, 2013 at 6:59 PM, gigo  wrote:
> Hi!
>
>
> On 2013-07-17 21:08, Mark Abraham wrote:
>>
>> You tried ppn3 (with and without --loadbalance)?
>
>
> I was testing with an 8-replica simulation.
>
> 1) Without --loadbalance and -np 8.
> Excerpts from the script:
> #PBS -l nodes=8:ppn=3
> setenv OMP_NUM_THREADS 4
> mpiexec mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on
>
> Excerpts from logs:
> Using 3 MPI processes
> Using 4 OpenMP threads per MPI process
> (...)
> Overriding thread affinity set outside mdrun_mpi
>
> Pinning threads with an auto-selected logical core stride of 1
>
> WARNING: In MPI process #0: Affinity setting for 1/4 threads failed.
>  This can cause performance degradation! If you think your setting
>  are correct, contact the GROMACS developers.
>
>
> WARNING: In MPI process #2: Affinity setting for 4/4 threads failed.
>
> Load: The job was allocated 24 cores (3 cores on 8 different nodes). Each
> OpenMP thread uses ~1/3 of a CPU core on average.
> Conclusions: MPI starts as many processes as cores were requested
> (nnodes*ppn=24) and ignores the OMP_NUM_THREADS env variable ==> this is wrong
> and is not a Gromacs issue. Each MPI process forks into 4 threads as requested.
> The 24-core limit granted by Torque is not violated.
>
> 2) The same script, but with -np 8, to limit the number of MPI processes to
> the number of replicas
>
> Logs:
> Using 1 MPI process
> Using 4 OpenMP threads
> (...)
>
> Replicas 0,3 and 6: WARNING: Affinity setting for 1/4 threads failed.
> Replicas 1,2,4,5,7: WARNING: Affinity setting for 4/4 threads failed.
>
>
> Load: The job was allocated 24 cores on 8 nodes, but mpiexec ran only on the
> first 3 nodes. Each OpenMP thread uses ~20% of a CPU core.
>
> 3) -np 8 --loadbalance
> Excerpts from logs:
>
> Using 1 MPI process
> Using 4 OpenMP threads
> (...)
> Each replica says: WARNING: Affinity setting for 3/4 threads failed.
>
> Load: MPI processes spread evenly on all 8 nodes. Each OpenMP thread uses
> ~50% of a CPU core.
>
> 4) -np 8 --loadbalance, #PBS -l nodes=8:ppn=4 <== this worked ~OK with
> gromacs 4.6.2
> Logs:
> WARNING: Affinity setting for 2/4 threads failed.
>
> Load: 32 cores allocated on 8 nodes. MPI processes spread evenly, each
> OpenMP thread uses ~70% of a CPU core.
> With 144 replicas the simulation did not produce any results, just got
> stuck.
>
>
> Some thoughts: the main problem is most probably in the way MPI interprets
> the information from Torque; it is not Gromacs-related. MPI ignores
> OMP_NUM_THREADS. The environment is just broken. Since gromacs-4.6.2 behaved
> better than 4.6.3 there, I am going back to it.

FYI: the mdrun-internal affinity setting has a bug in 4.6.2, so unless you
are setting thread affinities manually or through the job scheduler, you are
advised to use 4.6.3 (and the "better" behavior of 4.6.2 may actually be
caused by its non-functional affinity setting).
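
One way to see which affinity mask each rank actually inherits before mdrun
touches it is a small wrapper around mdrun_mpi. This is only a sketch: the
wrapper name report_affinity.csh is made up here, and it assumes OpenMPI
exports OMPI_COMM_WORLD_RANK and that taskset (util-linux) is available on
the compute nodes.

#!/bin/tcsh -f
# report_affinity.csh: print host, MPI rank and the CPU affinity mask this
# process inherited from Torque/OpenMPI, then hand over to mdrun_mpi.
set mask = `taskset -p $$ | awk '{print $NF}'`
echo "`hostname` rank $OMPI_COMM_WORLD_RANK mask $mask"
exec mdrun_mpi $argv:q

Launched as, e.g., "mpiexec -np 8 ./report_affinity.csh -v -multi 8 -ntomp 4
-replex 2500 -cpi", it shows whether the scheduler/MPI layer has already
restricted each rank to a subset of cores, which is exactly the case described
above where mdrun's internal affinity setting is not needed.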

> Best,
>
> G
>
Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-19 Thread gigo

Hi!

On 2013-07-17 21:08, Mark Abraham wrote:

You tried ppn3 (with and without --loadbalance)?


I was testing with an 8-replica simulation.

1) Without --loadbalance and -np 8.
Excerpts from the script:
#PBS -l nodes=8:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on


Excerpts from logs:
Using 3 MPI processes
Using 4 OpenMP threads per MPI process
(...)
Overriding thread affinity set outside mdrun_mpi

Pinning threads with an auto-selected logical core stride of 1

WARNING: In MPI process #0: Affinity setting for 1/4 threads failed.
 This can cause performance degradation! If you think your setting are
 correct, contact the GROMACS developers.


WARNING: In MPI process #2: Affinity setting for 4/4 threads failed.

Load: The job was allocated 24 cores (3 cores on 8 different nodes). 
Each OpenMP thread uses ~1/3 of a CPU core on average.
Conclusions: MPI starts as many processes as cores were requested
(nnodes*ppn=24) and ignores the OMP_NUM_THREADS env variable ==> this is wrong
and is not a Gromacs issue. Each MPI process forks into 4 threads as requested.
The 24-core limit granted by Torque is not violated.
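
If part of the problem is only that OMP_NUM_THREADS never reaches the remote
ranks, OpenMPI's launcher can be told to forward it explicitly. A sketch
(assuming OpenMPI's -np and -x options; -ntomp 4 already fixes the thread
count on the mdrun side regardless):

mpiexec -np 8 -x OMP_NUM_THREADS mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on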


2) The same script, but with -np 8, to limit the number of MPI 
processes to the number of replicas

Logs:
Using 1 MPI process
Using 4 OpenMP threads
(...)

Replicas 0,3 and 6: WARNING: Affinity setting for 1/4 threads failed.
Replicas 1,2,4,5,7: WARNING: Affinity setting for 4/4 threads failed.


Load: The job was allocated 24 cores on 8 nodes, but mpiexec ran only on the
first 3 nodes. Each OpenMP thread uses ~20% of a CPU core.
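
Packing the ranks onto the first nodes is OpenMPI's default by-slot placement;
a round-robin placement can be requested explicitly instead (a sketch,
assuming the installed OpenMPI 1.6 still accepts --bynode), which with 8 ranks
on 8 nodes should give one replica per node:

mpiexec -np 8 --bynode -x OMP_NUM_THREADS mdrun_mpi -v -cpt 20 -multi 8 -ntomp 4 -replex 2500 -cpi -pin on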


3) -np 8 --loadbalance
Excerpts from logs:
Using 1 MPI process
Using 4 OpenMP threads
(...)
Each replica says: WARNING: Affinity setting for 3/4 threads failed.

Load: MPI processes spread evenly on all 8 nodes. Each OpenMP thread 
uses ~50% of a CPU core.


4) -np 8 --loadbalance, #PBS -l nodes=8:ppn=4 <== this worked ~OK with 
gromacs 4.6.2

Logs:
WARNING: Affinity setting for 2/4 threads failed.

Load: 32 cores allocated on 8 nodes. MPI processes spread evenly, each 
OpenMP thread uses ~70% of a CPU core.
With 144 replicas the simulation did not produce any results, just got 
stuck.



Some thoughts: the main problem is most probably in the way MPI
interprets the information from Torque; it is not Gromacs-related. MPI
ignores OMP_NUM_THREADS. The environment is just broken. Since
gromacs-4.6.2 behaved better than 4.6.3 there, I am going back to it.
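
One way to take the guesswork away from the MPI layer entirely is to build an
explicit hostfile from what Torque actually granted and hand it to mpiexec. A
sketch for the 144-replica production case, assuming OpenMPI's -hostfile and
-x options and Torque's $PBS_NODEFILE (one line per allocated core):

# 48 nodes, 3 replicas per node, 4 OpenMP threads per replica
sort -u $PBS_NODEFILE | awk '{print $0 " slots=3"}' > hosts.$PBS_JOBID
mpiexec -np 144 -hostfile hosts.$PBS_JOBID -x OMP_NUM_THREADS \
    mdrun_mpi -v -cpt 20 -multi 144 -ntomp 4 -replex 2000 -cpi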

Best,
G



Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-17 Thread Mark Abraham
You tried ppn3 (with and without --loadbalance)?

Mark

Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-17 Thread gigo

On 2013-07-13 11:10, Mark Abraham wrote:

On Sat, Jul 13, 2013 at 1:24 AM, gigo  wrote:

I think that Torque is very similar to all PBS-like resource managers in
this regard. It actually does what I tell it to do. There are 12-core nodes,
I ask for 48 of them - I get them (a simple #PBS -l ncpus=576 does not work),
end of story. Now, the program that I run is responsible for populating
the resources that I got.


No, that's not the end of the story. The scheduler and the MPI system
typically cooperate to populate the MPI processes on the hardware, set
OMP_NUM_THREADS, set affinities, etc. mdrun honours those if they are
set.


I was able to run what I wanted flawlessly on another cluster with
PBS-Pro. The Torque cluster seems to work like I said ("the end of story"
behaviour). REMD runs well on Torque when I give a whole physical node
to one replica. Otherwise the simulation does not go, or the pinning
fails (sometimes partially). I have run out of options; I did not find any
working example/documentation on running hybrid MPI/OpenMP jobs in
Torque. It seems that I stumbled upon limitations of this resource
manager, and it is not really a Gromacs issue.
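
For the record, the whole-node-per-replica layout that does work here can be
written down explicitly. A sketch only, assuming OpenMPI's -npernode and -x
options and that 144 twelve-core nodes can actually be allocated:

#PBS -l nodes=144:ppn=12
setenv OMP_NUM_THREADS 12
mpiexec -np 144 -npernode 1 -x OMP_NUM_THREADS \
    mdrun_mpi -v -cpt 20 -multi 144 -ntomp 12 -replex 2000 -cpi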

Best Regards,
Grzegorz



Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-13 Thread Mark Abraham
On Sat, Jul 13, 2013 at 1:24 AM, gigo  wrote:
> On 2013-07-12 20:00, Mark Abraham wrote:
>>
>> On Fri, Jul 12, 2013 at 4:27 PM, gigo  wrote:
>>>
>>> Hi!
>>>
>>> On 2013-07-12 11:15, Mark Abraham wrote:


 What does --loadbalance do?
>>>
>>>
>>>
>>> It balances the total number of processes across all allocated nodes.
>>
>>
>> OK, but using it means you are hostage to its assumptions about balance.
>
>
> That's true, but as long as I do not try to use more resources than
> Torque gives me, everything is OK. The question is, what is a proper way of
> running multiple simulations in parallel with MPI that are further
> parallelized with OpenMP, when pinning fails? I could not find any other way.

I think pinning fails because you are double-crossing yourself. You do
not want 12 MPI processes per node, and that is likely what ppn is
setting. AFAIK your setup should work, but I haven't tested it.

>>
>>> The
>>> thing is that mpiexec does not know that I want each replica to fork to 4
>>> OpenMP threads. Thus, without this option and without affinities (in a
>>> sec
>>> about it) mpiexec starts too many replicas on some nodes - gromacs
>>> complains
>>> about the overload then - while some cores on other nodes are not used.
>>> It
>>> is possible to run my simulation like that:
>>>
>>> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
>>> --loadbalance for mpiexec and without -ntomp for mdrun)
>>>
>>> Then each replica runs on 4 MPI processes (I allocate 4 times more cores
>>> than replicas and mdrun sees it). The problem is that it is much slower
>>> than
>>> using OpenMP for each replica. I did not find any other way than
>>> --loadbalance in mpiexec and then -multi 144 -ntomp 4 in mdrun to use MPI
>>> and OpenMP at the same time on the torque-controlled cluster.
>>
>>
>> That seems highly surprising. I have not yet encountered a job
>> scheduler that was completely lacking a "do what I tell you" layout
>> scheme. More importantly, why are you using #PBS -l nodes=48:ppn=12?
>
>
> I think that Torque is very similar to all PBS-like resource managers in
> this regard. It actually does what I tell it to do. There are 12-core nodes,
> I ask for 48 of them - I get them (a simple #PBS -l ncpus=576 does not work),
> end of story. Now, the program that I run is responsible for populating
> the resources that I got.

No, that's not the end of the story. The scheduler and the MPI system
typically cooperate to populate the MPI processes on the hardware, set
OMP_NUM_THREADS, set affinities, etc. mdrun honours those if they are
set.

You seem to be using 12 because you know there are 12 cores per node.
The scheduler should know that already. ppn should be a command about
what to do with the hardware, not a description of what it is. More to
the point, you should read the docs and be sure what it does.

>> Surely you want 3 MPI processes per 12-core node?
>
>
> Yes - I want each node to run 3 MPI processes. Preferably, I would like to
> run each MPI process on a separate node (spread over 12 cores with OpenMP),
> but I will not get that many resources. But again, without the --loadbalance
> hack I would not be able to properly populate the nodes...

So try ppn 3!
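
Spelled out, that suggestion would look something like the lines below. This
is only a sketch, not something tested in this thread; it presupposes that
ppn is treated as MPI ranks per node while the full 12-core nodes stay
available to the job, and it uses OpenMPI's -x option to forward the
environment:

#PBS -l nodes=48:ppn=3
setenv OMP_NUM_THREADS 4
mpiexec -np 144 -x OMP_NUM_THREADS \
    mdrun_mpi -v -cpt 20 -multi 144 -ntomp 4 -replex 2000 -cpi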

>>
 What do the .log files say about
 OMP_NUM_THREADS, thread affinities, pinning, etc?
>>>
>>>
>>>
>>> Each replica logs:
>>> "Using 1 MPI process
>>> Using 4 OpenMP threads",
>>> That is correct. As I said, the threads are forked, but 3 out of 4 don't
>>> do anything, and the simulation does not go at all.
>>>
>>> About affinities Gromacs says:
>>> "Can not set thread affinities on the current platform. On NUMA systems
>>> this
>>> can cause performance degradation. If you think your platform should
>>> support
>>> setting affinities, contact the GROMACS developers."
>>>
>>> Well, the "current platform" is a normal x86_64 cluster, but the whole
>>> information about resources is passed by Torque to OpenMPI-linked
>>> Gromacs.
>>> Can it be that mdrun sees the resources allocated by torque as a big pool
>>> of
>>> cpus and misses the information about node topology?
>>
>>
>> mdrun gets its processor topology from the MPI layer, so that is where
>> you need to focus. The error message confirms that GROMACS sees things
>> that seem wrong.
>
>
> Thank you, I will take a look. But the first thing I want to do is to find
> the reason why Gromacs 4.6.3 is not able to run on my (slightly weird, I
> admit) setup, while 4.6.2 does it very well.

4.6.2 had a bug that inhibited any MPI-based mdrun from attempting to
set affinities. It's still not clear why ppn 12 worked at all.
Apparently mdrun was able to float some processes around to get
something that worked. The good news is that when you get it working
in 4.6.3, you will see a performance boost.
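
Once it does run, a quick way to check whether the internal pinning actually
took effect is to grep the per-replica logs for the pinning and affinity
messages quoted elsewhere in this thread (the md0.log ... md143.log names
assume the usual -multi numbering):

grep -l "Pinning threads" md*.log | wc -l
grep -h "Affinity setting" md*.log | sort | uniq -c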

Mark

Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-12 Thread gigo

On 2013-07-12 20:00, Mark Abraham wrote:

On Fri, Jul 12, 2013 at 4:27 PM, gigo  wrote:

Hi!

On 2013-07-12 11:15, Mark Abraham wrote:


What does --loadbalance do?



It balances the total number of processes across all allocated nodes.


OK, but using it means you are hostage to its assumptions about 
balance.


That's true, but as long as I do not try to use more resources than
Torque gives me, everything is OK. The question is, what is a proper way
of running multiple simulations in parallel with MPI that are further
parallelized with OpenMP, when pinning fails? I could not find any
other way.





The thing is that mpiexec does not know that I want each replica to fork
to 4 OpenMP threads. Thus, without this option and without affinities (in
a sec about it) mpiexec starts too many replicas on some nodes - gromacs
complains about the overload then - while some cores on other nodes are
not used. It is possible to run my simulation like that:

mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
--loadbalance for mpiexec and without -ntomp for mdrun)

Then each replica runs on 4 MPI processes (I allocate 4 times more cores
than replicas and mdrun sees it). The problem is that it is much slower than

using OpenMP for each replica. I did not find any other way than
--loadbalance in mpiexec and then -multi 144 -ntomp 4 in mdrun to use 
MPI

and OpenMP at the same time on the torque-controlled cluster.


That seems highly surprising. I have not yet encountered a job
scheduler that was completely lacking a "do what I tell you" layout
scheme. More importantly, why are you using #PBS -l nodes=48:ppn=12?


I think that Torque is very similar to all PBS-like resource managers
in this regard. It actually does what I tell it to do. There are 12-core
nodes, I ask for 48 of them - I get them (a simple #PBS -l ncpus=576 does
not work), end of story. Now, the program that I run is responsible for
populating the resources that I got.



Surely you want 3 MPI processes per 12-core node?


Yes - I want each node to run 3 MPI processes. Preferably, I would like
to run each MPI process on a separate node (spread over 12 cores with
OpenMP), but I will not get that many resources. But again, without the
--loadbalance hack I would not be able to properly populate the nodes...





What do the .log files say about
OMP_NUM_THREADS, thread affinities, pinning, etc?



Each replica logs:
"Using 1 MPI process
Using 4 OpenMP threads",
That is correct. As I said, the threads are forked, but 3 out of 4 don't
do anything, and the simulation does not go at all.

About affinities Gromacs says:
"Can not set thread affinities on the current platform. On NUMA 
systems this
can cause performance degradation. If you think your platform should 
support

setting affinities, contact the GROMACS developers."

Well, the "current platform" is normal x86_64 cluster, but the whole
information about resources is passed by Torque to OpenMPI-linked 
Gromacs.
Can it be that mdrun sees the resources allocated by torque as a big 
pool of

cpus and misses the information about nodes topology?


mdrun gets its processor topology from the MPI layer, so that is where
you need to focus. The error message confirms that GROMACS sees things
that seem wrong.


Thank you, I will take a look. But the first thing I want to do is
to find the reason why Gromacs 4.6.3 is not able to run on my (slightly
weird, I admit) setup, while 4.6.2 does it very well.

Best,

Grzegorz


Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-12 Thread Mark Abraham
On Fri, Jul 12, 2013 at 4:27 PM, gigo  wrote:
> Hi!
>
> On 2013-07-12 11:15, Mark Abraham wrote:
>>
>> What does --loadbalance do?
>
>
> It balances the total number of processes across all allocated nodes.

OK, but using it means you are hostage to its assumptions about balance.

> The
> thing is that mpiexec does not know that I want each replica to fork to 4
> OpenMP threads. Thus, without this option and without affinities (in a sec
> about it) mpiexec starts too many replicas on some nodes - gromacs complains
> about the overload then - while some cores on other nodes are not used. It
> is possible to run my simulation like that:
>
> mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without
> --loadbalance for mpiexec and without -ntomp for mdrun)
>
> Then each replica runs on 4 MPI processes (I allocate 4 times more cores
> than replicas and mdrun sees it). The problem is that it is much slower than
> using OpenMP for each replica. I did not find any other way than
> --loadbalance in mpiexec and then -multi 144 -ntomp 4 in mdrun to use MPI
> and OpenMP at the same time on the torque-controlled cluster.

That seems highly surprising. I have not yet encountered a job
scheduler that was completely lacking a "do what I tell you" layout
scheme. More importantly, why are you using #PBS -l nodes=48:ppn=12?
Surely you want 3 MPI processes per 12-core node?

>> What do the .log files say about
>> OMP_NUM_THREADS, thread affinities, pinning, etc?
>
>
> Each replica logs:
> "Using 1 MPI process
> Using 4 OpenMP threads",
> That is correct. As I said, the threads are forked, but 3 out of 4 don't
> do anything, and the simulation does not go at all.
>
> About affinities Gromacs says:
> "Can not set thread affinities on the current platform. On NUMA systems this
> can cause performance degradation. If you think your platform should support
> setting affinities, contact the GROMACS developers."
>
> Well, the "current platform" is a normal x86_64 cluster, but the whole
> information about resources is passed by Torque to OpenMPI-linked Gromacs.
> Can it be that mdrun sees the resources allocated by torque as a big pool of
> cpus and misses the information about node topology?

mdrun gets its processor topology from the MPI layer, so that is where
you need to focus. The error message confirms that GROMACS sees things
that seem wrong.
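
A cheap way to focus on that layer is to run the same mpiexec line with a
trivial payload first and count how many ranks land on each node (hostname in
place of mdrun_mpi; sort and uniq just tally the output):

mpiexec -np 144 --loadbalance hostname | sort | uniq -c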

Mark

>
> If you have any suggestions how to debug or trace this issue, I would be
> glad to participate.
> Best,
>
> G

Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-12 Thread gigo

Hi!

On 2013-07-12 11:15, Mark Abraham wrote:

What does --loadbalance do?


It balances the total number of processes across all allocated nodes. 
The thing is that mpiexec does not know that I want each replica to fork 
to 4 OpenMP threads. Thus, without this option and without affinities 
(in a sec about it) mpiexec starts too many replicas on some nodes - 
gromacs complains about the overload then - while some cores on other 
nodes are not used. It is possible to run my simulation like that:


mpiexec mdrun_mpi -v -cpt 20 -multi 144 -replex 2000 -cpi (without 
--loadbalance for mpiexec and without -ntomp for mdrun)


Then each replica runs on 4 MPI processes (I allocate 4 times more 
cores than replicas and mdrun sees it). The problem is that it is much 
slower than using OpenMP for each replica. I did not find any other way 
than --loadbalance in mpiexec and then -multi 144 -ntomp 4 in mdrun to 
use MPI and OpenMP at the same time on the torque-controlled cluster.
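
An alternative to the --loadbalance workaround, if the cluster's OpenMPI
accepts it, is to request the per-node rank count directly. A sketch,
assuming OpenMPI's -npernode and -x options:

mpiexec -np 144 -npernode 3 -x OMP_NUM_THREADS \
    mdrun_mpi -v -cpt 20 -multi 144 -ntomp 4 -replex 2000 -cpi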



What do the .log files say about
OMP_NUM_THREADS, thread affinities, pinning, etc?


Each replica logs:
"Using 1 MPI process
Using 4 OpenMP threads",
That is correct. As I said, the threads are forked, but 3 out of 4 
don't do anything, and the simulation does not go at all.


About affinities Gromacs says:
"Can not set thread affinities on the current platform. On NUMA systems 
this
can cause performance degradation. If you think your platform should 
support

setting affinities, contact the GROMACS developers."

Well, the "current platform" is a normal x86_64 cluster, but the whole
information about resources is passed by Torque to OpenMPI-linked
Gromacs. Can it be that mdrun sees the resources allocated by Torque as
a big pool of cpus and misses the information about node topology?
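
What Torque actually handed over, and what a node looks like locally, can be
checked directly on a compute node; nothing GROMACS-specific is assumed here,
only the standard $PBS_NODEFILE and /proc:

# slots granted per node (the nodefile lists one line per allocated core)
uniq -c $PBS_NODEFILE
# logical CPUs the operating system reports on this node
grep -c processor /proc/cpuinfo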


If you have any suggestions how to debug or trace this issue, I would 
be glad to participate.

Best,
G










Re: [gmx-users] Problems with REMD in Gromacs 4.6.3

2013-07-12 Thread Mark Abraham
What does --loadbalance do? What do the .log files say about
OMP_NUM_THREADS, thread affinities, pinning, etc?

Mark

On Fri, Jul 12, 2013 at 3:46 AM, gigo  wrote:
> Dear GMXers,
> With Gromacs 4.6.2 I was running REMD with 144 replicas. Replicas were
> separate MPI jobs of course (OpenMPI 1.6.4). Each replica I ran on 4 cores
> with OpenMP. Torque is installed on the cluster, which is built of 12-core
> nodes, so I used the following script:
>
> #!/bin/tcsh -f
> #PBS -S /bin/tcsh
> #PBS -N test
> #PBS -l nodes=48:ppn=12
> #PBS -l walltime=300:00:00
> #PBS -l mem=288Gb
> #PBS -r n
> cd $PBS_O_WORKDIR
> mpiexec -np 144 --loadbalance mdrun_mpi -v -cpt 20 -multi 144 -ntomp 4
> -replex 2000
>
> It was working just great with 4.6.2. It does not work with 4.6.3. The new
> version was compiled with the same options in the same environment. Mpiexec
> spreads the replicas evenly over the cluster. Each replica forks 4 threads,
> but only one of them uses any cpu. The logs end at the citations printed at
> startup. Some empty energy and trajectory files are created, but nothing is
> written to them. Please let me know if you have any immediate suggestion on
> how to make it work (maybe based on some differences between versions), or
> if I should file a bug report with all the technical details.
> Best Regards,
>
> Grzegorz Wieczorek
>