Hi,

There is a certain version of MPI that caused a lot of headaches until we realized that it was buggy. I'm not entirely sure which version it was, but I suspect it was the 1.4.3 shipped as the default on Ubuntu 12.04 server.
I suggest that you try:
- using a different MPI version;
- using a single rank/no MPI to continue;
- using thread-MPI to continue.

Cheers,
--
Szilárd

On Thu, Jul 24, 2014 at 5:29 PM, David de Sancho <daviddesan...@gmail.com> wrote:
> Dear all
>
> I am having some trouble continuing some runs with Gromacs 4.5.5 on our
> local cluster. Surprisingly, the same simulations previously ran smoothly
> on the same number of nodes and cores on the same system. And even more
> surprisingly, if I reduce the number of nodes to 1, with its 12
> processors, they run again.
>
> The script I am using to run the simulations looks something like this:
>
>> # Set some Torque options: class name and max time for the job. Torque
>> # developed from a program called OpenPBS, hence all the PBS references
>> # in this file
>> #PBS -l nodes=4:ppn=12,walltime=24:00:00
>>
>> source /home/dd363/src/gromacs-4.5.5/bin/GMXRC.bash
>> application="/home/user/src/gromacs-4.5.5/bin/mdrun_openmpi_intel"
>> options="-s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename"
>>
>> #! change the working directory (default is home directory)
>> cd $PBS_O_WORKDIR
>> echo Running on host `hostname`
>> echo Time is `date`
>> echo Directory is `pwd`
>> echo PBS job ID is $PBS_JOBID
>> echo This job runs on the following machines:
>> echo `cat $PBS_NODEFILE | uniq`
>>
>> #! Run the parallel MPI executable
>> #!export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/lib64:/usr/lib64"
>> echo "Running mpiexec $application $options"
>> mpiexec $application $options
>
> And the error messages I am getting look something like this:
>
>> [compute-0-11:09645] *** Process received signal ***
>> [compute-0-11:09645] Signal: Segmentation fault (11)
>> [compute-0-11:09645] Signal code: Address not mapped (1)
>> [compute-0-11:09645] Failing at address: 0x10
>> [compute-0-11:09643] *** Process received signal ***
>> [compute-0-11:09643] Signal: Segmentation fault (11)
>> [compute-0-11:09643] Signal code: Address not mapped (1)
>> [compute-0-11:09643] Failing at address: 0xd0
>> [compute-0-11:09645] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09645] [ 1] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2af2091443f9]
>> [compute-0-11:09645] [ 2] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2af209142963]
>> [compute-0-11:09645] [ 3] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so [0x2af20996e33c]
>> [compute-0-11:09645] [ 4] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87) [0x2af20572cfa7]
>> [compute-0-11:09645] [ 5] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0 [0x2af205219636]
>> [compute-0-11:09645] [ 6] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2af20aa2259b]
>> [compute-0-11:09645] [ 7] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2af20aa2a04b]
>> [compute-0-11:09645] [ 8] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2af20aa22da9]
>> [compute-0-11:09645] [ 9] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc) [0x2af205204dcc]
>> [compute-0-11:09645] [10] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c) [0x2af205236f0c]
>> [compute-0-11:09645] [11] /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b) [0x2af204b8ba6b]
>> [compute-0-11:09645] [12] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c) [0x415aac]
>> [compute-0-11:09645] [13] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928) [0x41d968]
>> [compute-0-11:09645] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x38d281d994]
>> [compute-0-11:09643] [ 0] /lib64/libpthread.so.0 [0x38d300e7c0]
>> [compute-0-11:09643] [ 1] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2b56aca403f9]
>> [compute-0-11:09643] [ 2] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_pml_ob1.so [0x2b56aca3e963]
>> [compute-0-11:09643] [ 3] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_btl_sm.so [0x2b56ad26a33c]
>> [compute-0-11:09643] [ 4] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libopen-pal.so.0(opal_progress+0x87) [0x2b56a9028fa7]
>> [compute-0-11:09643] [ 5] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0 [0x2b56a8b15636]
>> [compute-0-11:09643] [ 6] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2b56ae31e59b]
>> [compute-0-11:09643] [ 7] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2b56ae32604b]
>> [compute-0-11:09643] [ 8] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/openmpi/mca_coll_tuned.so [0x2b56ae31eda9]
>> [compute-0-11:09643] [ 9] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(ompi_comm_split+0xcc) [0x2b56a8b00dcc]
>> [compute-0-11:09643] [10] /usr/local/shared/redhat-5.4/x86_64/openmpi-1.4.3-intel/lib/libmpi.so.0(MPI_Comm_split+0x3c) [0x2b56a8b32f0c]
>> [compute-0-11:09643] [11] /home/dd363/src/gromacs-4.5.5/lib/libgmx_mpi.so.6(gmx_setup_nodecomm+0x14b) [0x2b56a8487a6b]
>> [compute-0-11:09643] [12] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(mdrunner+0x86c) [0x415aac]
>> [compute-0-11:09643] [13] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(main+0x1928) [0x41d968]
>> [compute-0-11:09643] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x38d281d994]
>> [compute-0-11:09643] [15] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(do_cg+0x189) [0x407449]
>> [compute-0-11:09643] *** End of error message ***
>> [compute-0-11:09645] [15] /home/dd363/src/gromacs-4.5.5/bin/mdrun_openmpi_intel(do_cg+0x189) [0x407449]
>> [compute-0-11:09645] *** End of error message ***
>> [compute-0-13.local][[30524,1],19][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking] recv(15) failed: Connection reset by peer (104)
>> [compute-0-13.local][[30524,1],17][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking] recv(15) failed: Connection reset by peer (104)
>> [compute-0-12.local][[30524,1],29][btl_tcp_endpoint.c:456:mca_btl_tcp_endpoint_recv_blocking] recv(15) failed: Connection reset by peer (104)
>
> A number of checks have been carried out. The continuation runs crash
> right away. The segfaults have occurred on two different nodes, so bad
> compute nodes can probably be ruled out. The MPI library works fine on a
> number of test programs, and there are no signs of system problems. On
> the other hand, signal 11 means the process tried to access memory it
> should not have access to.
>
> Any ideas on what may be going wrong?
>
> Thanks
>
> David
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a
> mail to gmx-users-requ...@gromacs.org.
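[Editor's note] The single-node/thread-MPI fallback suggested above could look something like the following sketch of the Torque script. This assumes a thread-MPI (non-MPI) build of mdrun 4.5.5 is available; the binary path is illustrative, and the -s/-deffnm/-cpi options are taken from the original script.

```shell
# Sketch of a single-node fallback using GROMACS's built-in thread-MPI,
# which avoids the external Open MPI library entirely.
#PBS -l nodes=1:ppn=12,walltime=24:00:00

# Hypothetical path to a non-MPI (thread-MPI) build of mdrun.
application="/home/user/src/gromacs-4.5.5/bin/mdrun"
options="-s data/tpr/filename.tpr -deffnm data/filename -cpi data/filename"

cd $PBS_O_WORKDIR
# -nt 12 starts 12 thread-MPI ranks within the one node; no mpiexec needed.
$application -nt 12 $options
```

This is slower than a 4-node run, of course, but it lets the continuation proceed while the MPI installation is sorted out.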
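[Editor's note] Since both backtraces die inside MPI_Comm_split (called from gmx_setup_nodecomm()), a minimal stand-alone reproducer might help decide whether the Open MPI 1.4.3 build itself is at fault, independent of GROMACS. The sketch below is hypothetical test code, not part of GROMACS; build it with mpicc and launch it with the same mpiexec, node count, and rank layout as the failing job.

```c
/* Minimal MPI_Comm_split exerciser. If this also segfaults under
 * mpiexec -n 48 across 4 nodes, the MPI library is the likely culprit. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split into groups of 12 ranks, roughly mimicking the per-node
     * communicator setup GROMACS performs in gmx_setup_nodecomm(). */
    int color = rank / 12;
    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    printf("world rank %d of %d -> group %d, local rank %d\n",
           rank, size, color, node_rank);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

If this passes but mdrun still crashes, the interaction between GROMACS's communicator setup and this particular Open MPI build would need a closer look.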