Re: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Wojtyczka, André
On 23/12/2010 10:01 PM, Wojtyczka, André wrote:
 Dear Gromacs Enthusiasts.

 I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.

 Problem:
 This runs fine:
 mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

 This produces a segmentation fault:
 mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

Unless you know you need it, don't use -pd. DD will be faster and is
probably better bug-tested too.

Mark

Hi Mark

thanks for the push in that direction, but I am in the unfortunate situation where I really need -pd: I have long bonds, which means my large system can only be decomposed into a small number of domains.
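
(A possibly relevant knob here is mdrun's -rdd option, which sets the maximum distance for bonded interactions under domain decomposition; a sketch, with the 2.0 nm value purely illustrative:

mpiexec -np 128 /../mdrun_mpi -rdd 2.0 -s full031K_mdrun_ions.tpr

Whether a workable cell size remains at this rank count depends on how long the bonds actually are.)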



 So the only difference is the number of cores I am using.

 mdrun_mpi was compiled with the Intel compiler 11.1.072 against my own fftw3 installation.

 Configure, make mdrun, and make install-mdrun all completed without errors.

 Is there some issue with threading or MPI?

 If someone has a clue please give me a hint.


 integrator           = md
 dt                   = 0.004
 nsteps               = 2500
 nstxout              = 0
 nstvout              = 0
 nstlog               = 25
 nstenergy            = 25
 nstxtcout            = 12500
 xtc_grps             = protein
 energygrps           = protein non-protein
 nstlist              = 2
 ns_type              = grid
 rlist                = 0.9
 coulombtype          = PME
 rcoulomb             = 0.9
 fourierspacing       = 0.12
 pme_order            = 4
 ewald_rtol           = 1e-5
 rvdw                 = 0.9
 pbc                  = xyz
 periodic_molecules   = yes
 tcoupl               = nose-hoover
 nsttcouple           = 1
 tc-grps              = protein non-protein
 tau_t                = 0.1 0.1
 ref_t                = 310 310
 Pcoupl               = no
 gen_vel              = yes
 gen_temp             = 310
 gen_seed             = 173529
 constraints          = all-bonds
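
(For reference, a .tpr built from settings like these would come from a grompp call along the following lines; a sketch in which the coordinate and topology file names are placeholders:

grompp -f md.mdp -c conf.gro -p topol.top -o full031K_mdrun_ions.tpr)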



 Error:
 Getting Loaded...
 Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
 Loaded with Money


 NOTE: The load imbalance in PME FFT and solve is 48%.
       For optimal PME load balancing
       PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
       and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)


 Step 0, time 0 (ps)
 PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
 PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
 PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
 PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
 PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
 PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
 PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
 PSIlogger: Child with rank 96 exited on signal 6: Aborted
 ...
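
(Generic segfault triage, independent of GROMACS: if the cluster permits core files, enable them before the run and inspect a backtrace afterwards; a sketch:

ulimit -c unlimited
mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
gdb /../mdrun_mpi core    # at the gdb prompt: bt

Batch systems and MPI launchers sometimes suppress core dumps, so this may need site-specific settings.)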

 PS: for now I don't care about the imbalanced PME load, unless it turns out to be related to my problem.
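
(On the PME note itself: 144 = 2^4 * 3^2, so the divisibility condition is met by PME node counts that divide 144, e.g. 8, 12, 16, 24, 36, 48 or 72. mdrun's -npme option can request such a count explicitly; a sketch, with 16 chosen purely for illustration:

mpiexec -np 128 /../mdrun_mpi -npme 16 -s full031K_mdrun_ions.tpr)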

 Cheers
 André




Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the Dueren Local Court, No. HR B 3498
Chairman of the Supervisory Board: MinDirig Dr. Karl Eugen Huthmacher
Management Board: Prof. Dr. Achim Bachem (Chairman),
Dr. Ulrich Krafft (Deputy Chairman), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt




Re: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Mark Abraham

On 24/12/2010 3:28 AM, Wojtyczka, André wrote:

[...]

Hi Mark

thanks for the push in that direction, but I am in the unfortunate situation where I really need -pd: I have long bonds, which means my large system can only be decomposed into a small number of domains.


I'm not sure that PD has any advantage here. From memory it has to 
create a 128x1x1 grid, and you can direct that with DD also.
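
A sketch of the equivalent DD invocation (the -dd triplet must multiply out to the number of real-space ranks; 128x1x1 mirrors the PD layout):

mpiexec -np 128 /../mdrun_mpi -dd 128 1 1 -s full031K_mdrun_ions.tpr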


The contents of your .log file will be far more helpful than stdout in 
diagnosing what condition led to the problem.
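
For example (a sketch; md.log is mdrun's default log file name, so adjust if -g or -deffnm was used):

tail -n 50 md.log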


Mark


[...]


Re: [gmx-users] mdrun mpi segmentation fault in high load situation

2010-12-23 Thread Mark Abraham

On 24/12/2010 8:34 AM, Mark Abraham wrote:

[...]

I'm not sure that PD has any advantage here. From memory it has to 
create a 128x1x1 grid, and you can direct that with DD also.


See mdrun -h -hidden for -dd.

Mark

[...]
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists