Dear gromacs users,
I ran a 20 ns REMD simulation with free energy enabled, using a
different init_lambda value for each replica, with GROMACS 4.5.3.

I ran the simulation on a cluster that uses the Torque queue manager.

1) I used the following command in the submission script:

mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -v >& log.rest2_TrpCage

The run went fine and terminated correctly after 36 hours, before reaching
20 ns, writing all output files.

2) Then I extended the simulation using the command:
mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -cpi -v >& log.resume.rest2_TrpCage

This time, the program crashed with the error:

from neo085 to: neo098 error polling LP CQ with status RETRY EXCEEDED ERROR
status number 12 for wr_id 427787264 opcode 36099  vendor error 129 qp_idx 0
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:
     4.096 microseconds * (2^btl_openib_ib_timeout)
  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some information about the host that raised the error and the
peer to which it was connected:
  Local host:   neo085
  Local device: mthca0
  Peer host:    neo098
You may need to consult with your system administrator to get this
problem fixed.
mpirun has exited due to process rank 72 with PID 2083 on
node neo085 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
mpirun: abort is already in progress...hit ctrl-c again to forcibly
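
To put the timeout formula quoted in the error message into concrete
numbers, here is a minimal sketch that evaluates
4.096 microseconds * 2^btl_openib_ib_timeout for a given setting. The
function name ib_timeout_us is hypothetical, and the value 20 is only an
illustration, not a recommendation.

```shell
# Effective InfiniBand local ACK timeout, in microseconds, for a given
# btl_openib_ib_timeout value, using the formula quoted in the error message:
#   4.096 microseconds * 2^btl_openib_ib_timeout
# ib_timeout_us is a hypothetical helper name, not part of Open MPI.
ib_timeout_us() {
    awk -v t="$1" 'BEGIN { printf "%.3f\n", 4.096 * 2^t }'
}

ib_timeout_us 10   # the default quoted in the message: about 4.19 ms
ib_timeout_us 20   # an illustrative larger value: about 4.29 s
```

Open MPI accepts MCA parameters on the command line in the form
mpirun --mca btl_openib_ib_timeout <value> ..., but whether a larger
timeout would help on this fabric is an assumption that only the system
administrators can confirm.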

The simulation time reached, as written in the md0.log file, was 12.0766 ns.

3) I assumed it was a network error impairing communication among the
nodes. I get this error frequently, and I usually restart the simulation
without any trouble. Hence I restarted the simulation again:

mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2
-maxh 36 -cpi -v >& log.resume1.rest2_TrpCage

The simulation went fine, reaching 20 ns without any complaints.

When I started the data analysis, I noticed that all 10 trajectory
files are only about 12.07 ns long, while the energy files are 20 ns long.
If I check the last modification time with ls -l, it shows that the files
were modified nearly simultaneously:

[oteri@matrix2 REST2]$ ls -lrt *.trr *.edr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj8.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj3.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj2.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj1.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj7.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj9.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj4.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj6.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj0.trr
-rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj5.trr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener9.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener8.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener7.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener6.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener5.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener4.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener3.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener2.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener1.edr
-rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener0.edr

So GROMACS did actually access both the trajectory and the energy files.

I have three questions:

1) Is this a known bug, and has it been fixed in GROMACS 4.5.5?

2) How can I check whether the trajectories are correct? I mean, how can I
check whether spurious frames have been inserted?

3) If they are correct, how can I restart from 12 ns?
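
For question 2, one possible starting point is to run gmxcheck over
every replica and compare what it reports for the trajectory and energy
files. This is only a sketch: it assumes the default -multi output names
shown in the listing above (traj0.trr ... traj9.trr, ener0.edr ...
ener9.edr) and that gmxcheck from GROMACS 4.5.x is on the PATH. The
function name check_replicas is hypothetical.

```shell
# Sketch: collect the gmxcheck report for each replica's trajectory and
# energy file. check_replicas is a hypothetical helper name; the file names
# are assumed to be the default -multi outputs (trajN.trr, enerN.edr).
check_replicas() {
    n_replicas=$1
    i=0
    while [ "$i" -lt "$n_replicas" ]; do
        echo "== replica $i: trajectory =="
        gmxcheck -f "traj${i}.trr" 2>&1   # merge stderr so the report is captured too
        echo "== replica $i: energies =="
        gmxcheck -e "ener${i}.edr" 2>&1
        i=$((i + 1))
    done
}
```

Calling check_replicas 10 > gmxcheck_report.txt would gather everything
in one file. I would expect the frame counts and time steps reported
there to show where each trajectory ends and whether frames are
duplicated or irregularly spaced, but that expectation is an assumption
and should be checked against the actual output.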

You can download log and mdp files from

The other 9 files differ only in the init_lambda value:


Thank you for your help.