Re: Re: [gmx-users] mdrun mpi segmentation fault in high load situation
> I'm not sure that PD has any advantage here. From memory it has to create
> a 128x1x1 grid, and you can direct that with DD also. See mdrun -h -hidden
> for -dd.
>
> Mark
>
> The contents of your .log file will be far more helpful than stdout in
> diagnosing what condition led to the problem.
>
> Mark

So the only difference is the number of cores I am using. I used -dd, but then my system consists of only 4 or slightly more domains, which gives me almost no advantage over -pd. The minimum size of a domain is tied to the largest bond length, which in my case is half of the box size or more.

I will post my .log file, but it will probably be next year. So merry Christmas and a jolly time.

André

Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender), Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt, Prof. Dr. Sebastian M. Schmidt

--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
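The -dd suggestion above can be sketched as a concrete invocation. This is an untested illustration, not a command from the thread: the 128x1x1 grid and rank count simply mirror the -pd run under discussion.

```shell
# sketch, untested: request an explicit 128x1x1 domain decomposition grid
# instead of particle decomposition (-pd); see mdrun -h -hidden for -dd
# in GROMACS 4.5. Paths are abbreviated as in the original posts.
mpiexec -np 128 mdrun_mpi -dd 128 1 1 -s full031K_mdrun_ions.tpr
```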
Re: [gmx-users] mdrun mpi segmentation fault in high load situation
On 24/12/2010 9:59 PM, Wojtyczka, André wrote:
> > I'm not sure that PD has any advantage here. From memory it has to create
> > a 128x1x1 grid, and you can direct that with DD also. See mdrun -h
> > -hidden for -dd.
> >
> > Mark
>
> So the only difference is the number of cores I am using. I used -dd, but
> then my system consists of only 4 or slightly more domains, which gives me
> almost no advantage over -pd. The minimum size of a domain is tied to the
> largest bond length, which in my case is half of the box size or more.

If it were more than half the box size, then since that restricts the minimum diameter of the DD cell, surely DD would produce a single domain. Either way, it sounds like the ratio of system size to bond length is too small to permit efficient GROMACS-style parallelism. Not all systems are worth parallelising, even if you have a good algorithm for the case at hand... and both DD and PD are targeted at the usual situation in MD, where the box size is many times larger than the typical bond length.

Mark
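The ratio argument above can be made concrete with a quick back-of-the-envelope check: each DD cell must be at least as wide as the longest bond, so the cell count per dimension is bounded by box length over bond length. The box and bond values below are assumed for illustration only, chosen to match the "half of the box size or more" description.

```shell
# rough DD feasibility check: each DD cell must be at least as wide as the
# longest bond, so cells per dimension <= box_length / bond_length.
# box and bond values are hypothetical, not from the original poster's system.
box=9.0     # box edge length (nm), assumed
bond=4.5    # longest bond (nm), "half of the box size or more"
awk -v L="$box" -v b="$bond" 'BEGIN {
    n = int(L / b)            # max DD cells along one dimension
    printf "max cells per dimension: %d\n", n
    printf "max domains on a cubic grid: %d\n", n * n * n
}'
```

With a bond that long, at most a handful of domains fit in the box, which is consistent with the "4 or slightly more domains" reported in this thread.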
[gmx-users] mdrun mpi segmentation fault in high load situation
Dear Gromacs Enthusiasts,

I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.

Problem:

This runs fine:
  mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

This produces a segmentation fault:
  mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

So the only difference is the number of cores I am using.

mdrun_mpi was compiled with the Intel compiler 11.1.072 against my own fftw3 installation. No errors came up while configuring or during make mdrun / make install-mdrun.

Is there some issue with threading or MPI? If someone has a clue, please give me a hint.

integrator          = md
dt                  = 0.004
nsteps              = 2500
nstxout             = 0
nstvout             = 0
nstlog              = 25
nstenergy           = 25
nstxtcout           = 12500
xtc_grps            = protein
energygrps          = protein non-protein
nstlist             = 2
ns_type             = grid
rlist               = 0.9
coulombtype         = PME
rcoulomb            = 0.9
fourierspacing      = 0.12
pme_order           = 4
ewald_rtol          = 1e-5
rvdw                = 0.9
pbc                 = xyz
periodic_molecules  = yes
tcoupl              = nose-hoover
nsttcouple          = 1
tc-grps             = protein non-protein
tau_t               = 0.1 0.1
ref_t               = 310 310
Pcoupl              = no
gen_vel             = yes
gen_temp            = 310
gen_seed            = 173529
constraints         = all-bonds

Error:

Getting Loaded...
Reading file full031K_mdrun_ions.tpr, VERSION 4.5.3 (single precision)
Loaded with Money

NOTE: The load imbalance in PME FFT and solve is 48%.
For optimal PME load balancing
PME grid_x (144) and grid_y (144) should be divisible by #PME_nodes_x (128)
and PME grid_y (144) and grid_z (144) should be divisible by #PME_nodes_y (1)

Step 0, time 0 (ps)

PSIlogger: Child with rank 82 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 79 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 2 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 1 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 100 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 97 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 98 exited on signal 11: Segmentation fault
PSIlogger: Child with rank 96 exited on signal 6: Aborted
...

PS: For now I don't care about the imbalanced PME load, unless it is related to my problem.

Cheers
André
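The divisibility condition in that NOTE is easy to check by hand: a 144-point PME grid divides evenly over 72 ranks but not over 128, which matches the imbalance warning appearing only in the 128-core run. This concerns load balance only and does not by itself explain the segfault; a minimal check:

```shell
# each PME grid dimension should divide evenly over the number of PME ranks
# in that direction, or some ranks receive extra grid lines (load imbalance).
# grid size and rank counts are taken from the runs quoted in this thread.
grid=144
for ranks in 72 128; do
    echo "grid $grid over $ranks ranks: remainder $(( grid % ranks ))"
done
```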
Re: [gmx-users] mdrun mpi segmentation fault in high load situation
On 23/12/2010 10:01 PM, Wojtyczka, André wrote:
> Dear Gromacs Enthusiasts.
>
> I am experiencing problems with mdrun_mpi (4.5.3) on a Nehalem cluster.
>
> Problem:
>
> This runs fine:
>   mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>
> This produces a segmentation fault:
>   mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr

Unless you know you need it, don't use -pd. DD will be faster and is probably better bug-tested too.

Mark
AW: [gmx-users] mdrun mpi segmentation fault in high load situation
> On 23/12/2010 10:01 PM, Wojtyczka, André wrote:
> > This runs fine:
> >   mpiexec -np 72 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
> >
> > This produces a segmentation fault:
> >   mpiexec -np 128 /../mdrun_mpi -pd -s full031K_mdrun_ions.tpr
>
> Unless you know you need it, don't use -pd. DD will be faster and is
> probably better bug-tested too.
>
> Mark

Hi Mark,

thanks for the push in that direction, but I am in the unfortunate situation where I really need -pd: I have long bonds, which is why my large system can be decomposed into only a small number of domains.

André
Re: AW: [gmx-users] mdrun mpi segmentation fault in high load situation
On 24/12/2010 3:28 AM, Wojtyczka, André wrote:
> > Unless you know you need it, don't use -pd. DD will be faster and is
> > probably better bug-tested too.
> >
> > Mark
>
> Hi Mark,
>
> thanks for the push in that direction, but I am in the unfortunate
> situation where I really need -pd: I have long bonds, which is why my
> large system can be decomposed into only a small number of domains.

I'm not sure that PD has any advantage here. From memory it has to create a 128x1x1 grid, and you can direct that with DD also.

The contents of your .log file will be far more helpful than stdout in diagnosing what condition led to the problem.

Mark
Re: AW: [gmx-users] mdrun mpi segmentation fault in high load situation
On 24/12/2010 8:34 AM, Mark Abraham wrote:
> I'm not sure that PD has any advantage here. From memory it has to create
> a 128x1x1 grid, and you can direct that with DD also.

See mdrun -h -hidden for -dd.

Mark

> The contents of your .log file will be far more helpful than stdout in
> diagnosing what condition led to the problem.
>
> Mark