Dear GROMACS users,

I ran a 20 ns REMD simulation with GROMACS 4.5.3, with free energy enabled and a different init_lambda value for each replica.
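For context, the free-energy part of the mdp files looks roughly like this (a reconstruction from the settings described in this mail; the complete files are linked at the end):

    ; free-energy settings, identical in all 10 mdp files except for init_lambda
    free_energy  = yes
    init_lambda  = 0.000000   ; replica 0; the other replicas use the values listed below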
I ran the simulation on a cluster managed by the Torque queueing system.

1) I used the following command in the submission script:

    mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -v >& log.rest2_TrpCage

The run went fine and terminated correctly after 36 hours, before reaching 20 ns, and wrote every output file.

2) I then extended the simulation with:

    mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -cpi -v >& log.resume.rest2_TrpCage

This time the program crashed with the error:

    [[28079,1],72][/caspur/shared/src/openmpi/openmpi-1.4.3/ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc] from neo085 to: neo098 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 427787264 opcode 36099 vendor error 129 qp_idx 0
    --------------------------------------------------------------------------
    The InfiniBand retry count between two MPI processes has been
    exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
    (section 12.7.38):

        The total number of times that the sender wishes the receiver to
        retry timeout, packet sequence, etc. errors before posting a
        completion error.

    This error typically means that there is something awry within the
    InfiniBand fabric itself.  You should note the hosts on which this
    error has occurred; it has been observed that rebooting or removing a
    particular host from the job can sometimes resolve this issue.

    Two MCA parameters can be used to control Open MPI's behavior with
    respect to the retry count:

    * btl_openib_ib_retry_count - The number of times the sender will
      attempt to retry (defaulted to 7, the maximum value).
    * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
      to 10).  The actual timeout value used is calculated as:

         4.096 microseconds * (2^btl_openib_ib_timeout)

      See the InfiniBand spec 1.2 (section 12.7.34) for more details.

    Below is some information about the host that raised the error and the
    peer to which it was connected:

      Local host:   neo085
      Local device: mthca0
      Peer host:    neo098

    You may need to consult with your system administrator to get this
    problem fixed.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun has exited due to process rank 72 with PID 2083 on
    node neo085 exiting without calling "finalize". This may
    have caused other processes in the application to be
    terminated by signals sent by mpirun (as reported here).
    --------------------------------------------------------------------------
    mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

The last simulation time written in md0.log was 12.0766 ns.

3) I assumed this was a network error impairing communication among the nodes. I get this error frequently and can usually restart the simulation without any trouble. Hence I restarted the simulation again:

    mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -cpi -v >& log.resume1.rest2_TrpCage

This run went fine, reaching 20 ns without any complaint from GROMACS. When I started the data analysis, however, I noticed that all 10 trajectory files stop at nearly 12.07 ns, while the energy files are 20 ns long.
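For reference, this is how I measured the lengths (a minimal sketch using gmxcheck from 4.5.x; the loop and file names simply match my run):

    # report frame counts and time ranges for every replica;
    # gmxcheck also warns if successive timesteps do not match
    for i in $(seq 0 9); do
        echo "=== replica $i ==="
        gmxcheck -f traj${i}.trr     # trajectory: last frame should be at 20000 ps, but stops near 12076 ps
        gmxcheck -e ener${i}.edr     # energy file: reaches 20 ns
    done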
If I check the last modification time with ls -l, it shows that the files were modified nearly simultaneously:

    [oteri@matrix2 REST2]$ ls -lrt *.trr *.edr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj8.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj3.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj2.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj1.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj7.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj9.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj4.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj6.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj0.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj5.trr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener9.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener8.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener7.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener6.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener5.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener4.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener3.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener2.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener1.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener0.edr

So GROMACS did in fact access both the trajectory and the energy files. I have three questions:

1) Is this a known bug, and has it been corrected in GROMACS 4.5.5?
2) How can I check whether the trajectories are correct? I mean, how can I check whether spurious frames have been inserted?
3) If they are correct, how can I restart from 12 ns? (See the P.S. below for the kind of commands I have in mind for questions 2 and 3.)

You can download the log and mdp files from http://160.80.35.105/download/problem/

The other nine mdp files differ only in the init_lambda value:

    rest2_0.mdp:init_lambda=-0.000000
    rest2_1.mdp:init_lambda=0.143679
    rest2_2.mdp:init_lambda=0.274297
    rest2_3.mdp:init_lambda=0.388587
    rest2_4.mdp:init_lambda=0.501717
    rest2_5.mdp:init_lambda=0.611494
    rest2_6.mdp:init_lambda=0.716387
    rest2_7.mdp:init_lambda=0.818048
    rest2_8.mdp:init_lambda=0.910347
    rest2_9.mdp:init_lambda=1.000000

Thank you for your help,
Francesco
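P.S. To make questions 2 and 3 concrete, this is the kind of check and restart I have in mind. These are only sketches, so please correct me if the tools are wrong.

For question 2, gmxcheck should flag duplicated or irregular frames, since it warns when successive timesteps do not match:

    # check every replica's trajectory for inconsistent timesteps
    for i in $(seq 0 9); do
        gmxcheck -f traj${i}.trr
    done

For question 3, assuming the frames up to 12.0766 ns are good, I imagine truncating the output files at that time and rebuilding each run input from the last good frame with tpbconv (if I understand the options correctly, -time selects the continuation frame and -until extends the run back out to 20 ns):

    t=12076.6   # ps, the last time reported in md0.log
    for i in $(seq 0 9); do
        trjconv -f traj${i}.trr -e $t -o traj${i}_trunc.trr    # truncate trajectory
        eneconv -f ener${i}.edr -e $t -o ener${i}_trunc.edr    # truncate energies
        tpbconv -s rest2_${i}.tpr -f traj${i}.trr -e ener${i}.edr \
                -time $t -until 20000 -o rest2_cont_${i}.tpr   # new run input per replica
    done

I named the new run inputs rest2_cont_0.tpr ... rest2_cont_9.tpr so that, if I am not mistaken, mdrun -multi 10 -s rest2_cont_.tpr would pick them up as before.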