Dear GROMACS users,

I ran a 20 ns REMD simulation with GROMACS 4.5.3, with free energy enabled and a different init_lambda value for each replica.
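For context, the free-energy part of the mdp files looks roughly like this (a reconstruction from the settings described in this mail; the complete files are linked at the end):

    ; free-energy settings, identical in all 10 mdp files except for init_lambda
    free_energy  = yes
    init_lambda  = 0.000000   ; replica 0; the other replicas use the values listed below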
I ran the simulation on a cluster managed by the Torque queueing system.

1) I used the following command in the submission script:

    mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -v >& log.rest2_TrpCage

The run went fine and terminated correctly after 36 hours, before reaching 20 ns, and wrote every output file.

2) I then extended the simulation with:

    mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -cpi -v >& log.resume.rest2_TrpCage

This time the program crashed with the error:

    [[28079,1],72][/caspur/shared/src/openmpi/openmpi-1.4.3/ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc] from neo085 to: neo098 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 427787264 opcode 36099 vendor error 129 qp_idx 0
    --------------------------------------------------------------------------
    The InfiniBand retry count between two MPI processes has been
    exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
    (section 12.7.38):

        The total number of times that the sender wishes the receiver to
        retry timeout, packet sequence, etc. errors before posting a
        completion error.

    This error typically means that there is something awry within the
    InfiniBand fabric itself.  You should note the hosts on which this
    error has occurred; it has been observed that rebooting or removing a
    particular host from the job can sometimes resolve this issue.

    Two MCA parameters can be used to control Open MPI's behavior with
    respect to the retry count:

    * btl_openib_ib_retry_count - The number of times the sender will
      attempt to retry (defaulted to 7, the maximum value).
    * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
      to 10).  The actual timeout value used is calculated as:

         4.096 microseconds * (2^btl_openib_ib_timeout)

      See the InfiniBand spec 1.2 (section 12.7.34) for more details.

    Below is some information about the host that raised the error and the
    peer to which it was connected:

      Local host:   neo085
      Local device: mthca0
      Peer host:    neo098

    You may need to consult with your system administrator to get this
    problem fixed.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun has exited due to process rank 72 with PID 2083 on
    node neo085 exiting without calling "finalize". This may
    have caused other processes in the application to be
    terminated by signals sent by mpirun (as reported here).
    --------------------------------------------------------------------------
    mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

The last simulation time written in md0.log was 12.0766 ns.

3) I assumed this was a network error impairing communication among the nodes. I get this error frequently and can usually restart the simulation without any trouble. Hence I restarted the simulation again:

    mpirun -np $NP mdrun_mpi_gcc -s rest2_.tpr -multi 10 -replex 1000 -dd 2 2 2 -maxh 36 -cpi -v >& log.resume1.rest2_TrpCage

This run went fine, reaching 20 ns without any complaint from GROMACS. When I started the data analysis, however, I noticed that all 10 trajectory files stop at nearly 12.07 ns, while the energy files are 20 ns long.
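For reference, this is how I measured the lengths (a minimal sketch using gmxcheck from 4.5.x; the loop and file names simply match my run):

    # report frame counts and time ranges for every replica;
    # gmxcheck also warns if successive timesteps do not match
    for i in $(seq 0 9); do
        echo "=== replica $i ==="
        gmxcheck -f traj${i}.trr     # trajectory: last frame should be at 20000 ps, but stops near 12076 ps
        gmxcheck -e ener${i}.edr     # energy file: reaches 20 ns
    done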
If I check the last modification time with ls -l, it shows that the files were modified nearly simultaneously:

    [oteri@matrix2 REST2]$ ls -lrt *.trr *.edr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj8.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj3.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj2.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj1.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj7.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj9.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj4.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj6.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj0.trr
    -rw-r--r-- 1 oteri be7 1175659492 avr 19 15:52 traj5.trr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener9.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener8.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener7.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener6.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener5.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener4.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener3.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener2.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener1.edr
    -rw-r--r-- 1 oteri be7    1201324 avr 19 15:52 ener0.edr

So GROMACS did in fact access both the trajectory and the energy files. I have three questions:

1) Is this a known bug, and has it been corrected in GROMACS 4.5.5?
2) How can I check whether the trajectories are correct? I mean, how can I check whether spurious frames have been inserted?
3) If they are correct, how can I restart from 12 ns? (See the P.S. below for the kind of commands I have in mind for questions 2 and 3.)

You can download the log and mdp files from http://160.80.35.105/download/problem/

The other nine mdp files differ only in the init_lambda value:

    rest2_0.mdp:init_lambda=-0.000000
    rest2_1.mdp:init_lambda=0.143679
    rest2_2.mdp:init_lambda=0.274297
    rest2_3.mdp:init_lambda=0.388587
    rest2_4.mdp:init_lambda=0.501717
    rest2_5.mdp:init_lambda=0.611494
    rest2_6.mdp:init_lambda=0.716387
    rest2_7.mdp:init_lambda=0.818048
    rest2_8.mdp:init_lambda=0.910347
    rest2_9.mdp:init_lambda=1.000000

Thank you for your help,
Francesco
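P.S. To make questions 2 and 3 concrete, this is the kind of check and restart I have in mind. These are only sketches, so please correct me if the tools are wrong.

For question 2, gmxcheck should flag duplicated or irregular frames, since it warns when successive timesteps do not match:

    # check every replica's trajectory for inconsistent timesteps
    for i in $(seq 0 9); do
        gmxcheck -f traj${i}.trr
    done

For question 3, assuming the frames up to 12.0766 ns are good, I imagine truncating the output files at that time and rebuilding each run input from the last good frame with tpbconv (if I understand the options correctly, -time selects the continuation frame and -until extends the run back out to 20 ns):

    t=12076.6   # ps, the last time reported in md0.log
    for i in $(seq 0 9); do
        trjconv -f traj${i}.trr -e $t -o traj${i}_trunc.trr    # truncate trajectory
        eneconv -f ener${i}.edr -e $t -o ener${i}_trunc.edr    # truncate energies
        tpbconv -s rest2_${i}.tpr -f traj${i}.trr -e ener${i}.edr \
                -time $t -until 20000 -o rest2_cont_${i}.tpr   # new run input per replica
    done

I named the new run inputs rest2_cont_0.tpr ... rest2_cont_9.tpr so that, if I am not mistaken, mdrun -multi 10 -s rest2_cont_.tpr would pick them up as before.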