Re: Re: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-24 Thread Wojtyczka , André

> I'm not sure that PD has any advantage here. From memory it has to
> create a 128x1x1 grid, and you can direct that with DD also.
>
> See mdrun -h -hidden for -dd.
>
> Mark
>
> The contents of your .log file will be far more helpful than stdout in
> diagnosing what condition led to the problem.
>
> Mark
>
> So the only difference is the number of cores I am using.


I tried -dd, but then my system is split into only 4 or slightly more domains,
which gives me almost no advantage over -pd. The minimum size of a domain
is tied to the largest bond length, which in my case is half the box
size or more.
I will post my .log file, but probably not until next year.

So Merry Christmas and a jolly time.
André



Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the Dueren District Court, No. HR B 3498
Chairman of the Supervisory Board: MinDirig Dr. Karl Eugen Huthmacher
Board of Directors: Prof. Dr. Achim Bachem (Chairman),
Dr. Ulrich Krafft (Deputy Chairman), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt


--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists


Re: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-24 Thread Mark Abraham

On 24/12/2010 9:59 PM, Wojtyczka, André wrote:


> I used -dd but then my system consists only of 4 or slightly more domains
> which gives me almost no advantage over -pd. The minimum size of a domain
> is connected to the largest bond length which in my case is half of the box
> size or more.


If it were more than half the box size, then since that restricts the 
minimum diameter of the DD cell, surely DD would produce a single 
domain. Either way, it sounds like the ratio of system size to bond 
length is too small to permit efficient GROMACS-style parallelism. Not 
all systems are worth parallelising, even if you have a good algorithm 
for the case at hand... and both DD and PD are targeted at the usual 
situation in MD where the box size is many times larger than the typical 
bond length.
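
The argument above can be put as a back-of-the-envelope calculation. The following Python sketch is only an illustration, not GROMACS code: it assumes that the number of DD cells along one box dimension is bounded by floor(box length / minimum cell size). The function name and the example lengths are made up, since the thread gives no box dimensions.

```python
import math


def max_dd_cells(box_length_nm: float, min_cell_nm: float) -> int:
    """Upper bound on DD cells along one dimension (simplified assumption)."""
    if box_length_nm <= 0 or min_cell_nm <= 0:
        raise ValueError("lengths must be positive")
    # At least one cell always exists, even if the minimum size exceeds the box.
    return max(1, math.floor(box_length_nm / min_cell_nm))


# Hypothetical 10 nm box edge; the minimum cell size stands in for the
# longest bonded interaction that must fit inside one cell.
for min_cell in (5.5, 5.0, 2.4):
    print(min_cell, max_dd_cells(10.0, min_cell))
```

Under this assumption, a minimum cell size larger than half the box edge leaves room for only one cell along that dimension, which is the point made above.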


Mark




[gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Wojtyczka , André
Dear Gromacs Enthusiasts.

I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.

Problem:
This runs fine:
mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

This produces a segmentation fault:
mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

So the only difference is the number of cores I am using.

mdrun_mpi was compiled with the Intel compiler 11.1.072 against my own FFTW3
installation.

No errors came up during configure, make mdrun, or make install-mdrun.

Is there some issue with threading or mpi?

If someone has a clue please give me a hint.


integrator           = md
dt                   = 0.004
nsteps               = 2500
nstxout              = 0
nstvout              = 0
nstlog               = 25
nstenergy            = 25
nstxtcout            = 12500
xtc_grps             = protein
energygrps           = protein non-protein
nstlist              = 2
ns_type              = grid
rlist                = 0.9
coulombtype          = PME
rcoulomb             = 0.9
fourierspacing       = 0.12
pme_order            = 4
ewald_rtol           = 1e-5
rvdw                 = 0.9
pbc                  = xyz
periodic_molecules   = yes
tcoupl               = nose-hoover
nsttcouple           = 1
tc-grps              = protein non-protein
tau_t                = 0.1 0.1
ref_t                = 310 310
Pcoupl               = no
gen_vel              = yes
gen_temp             = 310
gen_seed             = 173529
constraints          = all-bonds



Error:
Getting Loaded...
Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
Loaded with Money


NOTE: The load imbalance in PME FFT and solve is 48%.
  For optimal PME load balancing
  PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
  and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)


Step 0, time 0 (ps)
PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 96 exited on signal 6: Aborted
...

P.S. For now I don't care about the imbalanced PME load, as long as it is
independent of my problem.
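
The PME note in the output above comes down to integer divisibility. As a rough check, here is a Python sketch that is not part of GROMACS: the helper name is made up, and it assumes that with -pd every rank acts as a PME node, as the #PME_nodes_x (128) in the note suggests. It compares the 72-rank and 128-rank runs:

```python
def split_grid(grid_points: int, pme_nodes: int) -> tuple[int, int]:
    """Return (grid lines per node, leftover lines) for an even split."""
    return divmod(grid_points, pme_nodes)


# 144 is the PME grid size reported in the note; 72 and 128 are the two
# rank counts from the mpiexec commands.
for nodes in (72, 128):
    per_node, leftover = split_grid(144, nodes)
    status = "divisible" if leftover == 0 else "not divisible"
    print(f"144 grid points over {nodes} nodes: {status}")
```

144 divides evenly by 72 but not by 128, which is consistent with the note printed for the 128-rank run. Whether this has anything to do with the segmentation fault is not established here.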

Cheers
André





Re: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Mark Abraham

On 23/12/2010 10:01 PM, Wojtyczka, André wrote:

> Dear Gromacs Enthusiasts.
>
> I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.
>
> Problem:
> This runs fine:
> mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>
> This produces a segmentation fault:
> mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr


Unless you know you need it, don't use -pd. DD will be faster and is 
probably better bug-tested too.


Mark




AW: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Wojtyczka , André
> On 23/12/2010 10:01 PM, Wojtyczka, André wrote:
>> Dear Gromacs Enthusiasts.
>>
>> I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.
>>
>> Problem:
>> This runs fine:
>> mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>>
>> This produces a segmentation fault:
>> mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>
> Unless you know you need it, don't use -pd. DD will be faster and is
> probably better bug-tested too.
>
> Mark

Hi Mark,

thanks for the push in that direction, but I am in the unfortunate situation
where I really need -pd: I have long bonds, which is why my large system can
be decomposed into only a small number of domains.





Re: AW: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Mark Abraham

On 24/12/2010 3:28 AM, Wojtyczka, André wrote:


> Hi Mark
>
> thanks for the push into that direction, but I am in the unfortunate
> situation where I really need -pd because I have long bonds which is the
> reason why my large system is decomposable just into a little number of
> domains.


I'm not sure that PD has any advantage here. From memory it has to 
create a 128x1x1 grid, and you can direct that with DD also.


The contents of your .log file will be far more helpful than stdout in 
diagnosing what condition led to the problem.


Mark




Re: AW: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Mark Abraham

On 24/12/2010 8:34 AM, Mark Abraham wrote:


> I'm not sure that PD has any advantage here. From memory it has to
> create a 128x1x1 grid, and you can direct that with DD also.


See mdrun -h -hidden for -dd.

Mark
