Thank you very much. I didn't notice it until now considering all those numbers look so similar. Great eye for detail!
João

On Mon, Apr 8, 2013 at 3:17 PM, Mark Abraham <mark.j.abra...@gmail.com> wrote:

> On Apr 8, 2013 8:53 AM, "João Henriques" <joao.henriques.32...@gmail.com> wrote:
> >
> > Dear all,
> >
> > Due to cluster wall-time limitations, I was forced to restart two REMD
> > simulations. It ran absolutely fine until hitting the wall-time. To restart
> > I used the following command:
> >
> > mpirun -np 64 -output-filename MPIoutput $GromDir/mdrun_mpi -s H5_.tpr
> > -multi 64 -replex 1000 -deffnm H5_ -cpi -noappend
> >
> > (I'm using GMX-4.0.7 and yes I know it's old but I have my own reasons for
> > using it.)
> >
> > Here is a random replica (#1) MPI output:
> >
> > ######START#######
> > NNODES=64, MYRANK=1, HOSTNAME=an091
> > NODEID=1 argc=11
> > Checkpoint file is from part 1, new output files will be suffixed part0002.
> > Reading file H5_1.tpr, VERSION 4.0.7 (single precision)
> >
> > Reading checkpoint file H5_1.cpt generated: Wed Apr 3 17:13:14 2013
> >
> > -------------------------------------------------------
> > Program mdrun_mpi, VERSION 4.0.7
> > Source code file: main.c, line: 116
> >
> > Fatal error:
> > The 64 subsystems are not compatible
> >
> > -------------------------------------------------------
> >
> > Error on node 1, will try to stop all the nodes
> > Halting parallel program mdrun_mpi on CPU 1 out of 64
> > ######END#######
> >
> > It's reading from the correct cpt and tpr files, so it must be something
> > else.
> >
> > Here is a tail of the respective log file:
> >
> > ######START#######
> > Initializing Replica Exchange
> > Repl There are 64 replicas:
> > Multi-checking the number of atoms ... OK
> > Multi-checking the integrator ... OK
> > Multi-checking init_step+nsteps ... OK
> > Multi-checking first exchange step: init_step/-replex ...
> > first exchange step: init_step/-replex is not equal for all subsystems
> > subsystem 0: 3062
> > subsystem 1: 3062
> > subsystem 2: 3062
> > subsystem 3: 3062
> > subsystem 4: 3062
> > subsystem 5: 3062
> > subsystem 6: 3062
> > subsystem 7: 3062
> > subsystem 8: 3062
> > subsystem 9: 3062
> > subsystem 10: 3062
> > subsystem 11: 3062
> > subsystem 12: 3062
> > subsystem 13: 3062
> > subsystem 14: 3062
> > subsystem 15: 3062
> > subsystem 16: 3062
> > subsystem 17: 3062
> > subsystem 18: 3062
> > subsystem 19: 3062
> > subsystem 20: 3062
> > subsystem 21: 3062
> > subsystem 22: 3062
> > subsystem 23: 3062
> > subsystem 24: 3062
> > subsystem 25: 3062
> > subsystem 26: 3062
> > subsystem 27: 3062
> > subsystem 28: 3062
> > subsystem 29: 3062
> > subsystem 30: 3062
> > subsystem 31: 3062
> > subsystem 32: 3062
> > subsystem 33: 3062
> > subsystem 34: 3062
> > subsystem 35: 3062
> > subsystem 36: 3062
> > subsystem 37: 3062
> > subsystem 38: 3062
> > subsystem 39: 3066
>
> Seems system 39 got its IO done faster. Its state_prev.cpt will be 3062.
> Back up your files. Use gmxcheck to see what's in files. Rename as suitable
> so your set of files is consistent.
>
> Mark
>
> > subsystem 40: 3062
> > subsystem 41: 3062
> > subsystem 42: 3062
> > subsystem 43: 3062
> > subsystem 44: 3062
> > subsystem 45: 3062
> > subsystem 46: 3062
> > subsystem 47: 3062
> > subsystem 48: 3062
> > subsystem 49: 3062
> > subsystem 50: 3062
> > subsystem 51: 3062
> > subsystem 52: 3062
> > subsystem 53: 3062
> > subsystem 54: 3062
> > subsystem 55: 3062
> > subsystem 56: 3062
> > subsystem 57: 3062
> > subsystem 58: 3062
> > subsystem 59: 3062
> > subsystem 60: 3062
> > subsystem 61: 3062
> > subsystem 62: 3062
> > subsystem 63: 3062
> >
> > -------------------------------------------------------
> > Program mdrun_mpi, VERSION 4.0.7
> > Source code file: main.c, line: 116
> >
> > Fatal error:
> > The 64 subsystems are not compatible
> >
> > -------------------------------------------------------
> > ######END#######
> >
> > It's clear that "init_step/-replex is not equal for all subsystems" is the
> > problem, but does anyone know why this is happening and how to solve it?
> >
> > Thank you for your patience,
> > Best regards,
> >
> > João Henriques
> > --
> > gmx-users mailing list gmx-users@gromacs.org
> > http://lists.gromacs.org/mailman/listinfo/gmx-users
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
> > * Please don't post (un)subscribe requests to the list. Use the
> > www interface or send it to gmx-users-requ...@gromacs.org.
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

--
João Henriques
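
For anyone who hits the same error: the single mismatched value in the quoted log is easy to miss by eye. Below is a minimal Python sketch, not part of GROMACS, that scans the "subsystem N: value" lines of an mdrun log and reports which replicas differ from the majority. The function name `find_odd_subsystems` and the sample text are illustrative assumptions; the only thing taken from the thread is the line format.

```python
import re
from collections import Counter

# Matches entries such as "subsystem 39: 3066"; any "> >" quote prefix is ignored.
SUBSYSTEM_RE = re.compile(r"subsystem\s+(\d+):\s+(-?\d+)")


def find_odd_subsystems(log_text):
    """Return (majority_value, {subsystem_index: value}) for entries that differ.

    Returns (None, {}) when the text contains no "subsystem N: value" entries.
    """
    values = {int(idx): int(val) for idx, val in SUBSYSTEM_RE.findall(log_text)}
    if not values:
        return None, {}
    # The most common value is treated as the one the replicas should share.
    majority, _count = Counter(values.values()).most_common(1)[0]
    odd = {idx: val for idx, val in values.items() if val != majority}
    return majority, odd


if __name__ == "__main__":
    # Hypothetical excerpt built in the same format as the quoted log.
    sample = "\n".join(
        f"> > subsystem {i}: {3066 if i == 39 else 3062}" for i in range(64)
    )
    majority, odd = find_odd_subsystems(sample)
    print("majority value:", majority)
    for idx, val in sorted(odd.items()):
        print(f"subsystem {idx} differs: {val}")
```

Run against the log quoted above, this should single out subsystem 39, which is the replica whose checkpoint files Mark suggests inspecting with gmxcheck before renaming anything.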