Dear Chris,

While it's always possible that GROMACS can be improved (or debugged), this smells more like a system-level problem. The corrupt checkpoint files are precisely 1 MiB or 2 MiB, which strongly suggests either 1) GROMACS was in the middle of a buffer flush when it was killed (the filesystem did everything right; it was simply handed incomplete data), or 2) the filesystem itself wrote a truncated file (GROMACS wrote the checkpoint successfully, the data was buffered, and GROMACS went on its merry way).
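For reference, the arithmetic from the ls listings quoted below: 1048576 bytes = 2^20 = exactly 1 MiB, and 2097152 bytes = 2^21 = exactly 2 MiB. The intact checkpoints, at 1963536 and 2209508 bytes, are nothing like round numbers.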
#1 could happen, for example, if GROMACS was killed with SIGKILL while copying .cpt to _prev.cpt (if GROMACS even copies, rather than renames, its checkpoint files). #2 could happen in any number of ways, depending on precisely how your disks, filesystems, and network filesystems are configured: for example, if a RAID array goes down hard with per-drive writeback caches enabled, or if your NFS mount is soft and either client or server goes down. Given that the truncated checkpoint files have such conveniently round sizes, my money is on #2.

Have you contacted your sysadmins to report this? They may be able to take steps to prevent it, and (if this is indeed a system problem) doing so would give all their users an extra measure of safety for their data.

Cheers,
MZ

On Tue, Mar 26, 2013 at 10:04 PM, Christopher Neale <chris.ne...@mail.utoronto.ca> wrote:

> Dear Users:
>
> A cluster that I use went down today with a chiller failure. I lost all 16
> jobs (running GROMACS 4.6.1). For 13 of these jobs, not only is the .cpt
> file truncated, but the _prev.cpt file is truncated as well, meaning that I
> will have to go back through the files, extract a frame, make a new .tpr
> file (using a new, custom .mdp file to get the timestamp right), restart
> the runs, and later join the trajectory fragments.
>
> I have experienced this a number of times over the years with different
> versions of GROMACS (see, for example, http://redmine.gromacs.org/issues/790)
> and wonder if anybody else has experienced this?
>
> Also, does anybody have advice on how to handle this? For now, my idea is
> to run a script in the background that periodically checks the .cpt file
> and makes a copy if it is not corrupted/truncated, so that I can always
> restart (a concrete sketch is in the P.S. below).
>
> If it is useful information: both the .cpt and the _prev.cpt files have
> the same size and timestamp, but are smaller than non-corrupted .cpt
> files. E.g.:
>
> $ ls -ltr --full-time *cpt
> -rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d_prev.cpt
> -rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d.cpt
> -rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:02.000000000 -0700 md3.cpt
> -rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:03.000000000 -0700 md3_prev.cpt
>
> Above, md2d.cpt is from the last stage of my equilibration and md3.cpt is
> from my production run.
>
> Here is another example from a different run with corruption:
>
> $ ls -ltr --full-time *cpt
> -rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d_prev.cpt
> -rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d.cpt
> -rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3_prev.cpt
> -rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3.cpt
>
> I detect corruption/truncation in the .cpt file like this:
>
> $ gmxcheck -f md3.cpt
> Fatal error:
> Checkpoint file corrupted/truncated, or maybe you are out of disk space?
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
>
> I also confirmed the problem by trying to run mdrun:
>
> $ mdrun -nt 1 -deffnm md3 -cpi md3.cpt -nsteps 5000000000
> Fatal error:
> Checkpoint file corrupted/truncated, or maybe you are out of disk space?
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
>
> (and I get the same thing using md3_prev.cpt)
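> Since 13 of my 16 jobs are affected, a one-line loop can flag every damaged
> checkpoint at once. This is a sketch that assumes gmxcheck exits with a
> nonzero status when it hits that fatal error:
>
> $ for f in *.cpt; do gmxcheck -f "$f" > /dev/null 2>&1 || echo "corrupted: $f"; done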
> I am not out of disk space, though probably some condition like that
> existed when the chiller failed and the system went down:
>
> $ df -h .
> Filesystem      Size  Used Avail Use% Mounted on
>                 342T   57T  281T  17% /global/scratch
>
> Nor am I out of quota (although I have no command to show that here).
>
> There is no corruption of the .edr, .trr, or .xtc files.
>
> The .log files end like this:
>
> Writing checkpoint, step 194008250 at Tue Mar 26 10:49:31 2013
> Writing checkpoint, step 194932330 at Tue Mar 26 11:49:31 2013
>
>            Step           Time         Lambda
>       195757661   391515.32200        0.00000
>
> Writing checkpoint, step 195757661 at Tue Mar 26 12:46:02 2013
>
> I am motivated to help solve this problem, but I have no idea how to stop
> GROMACS from copying corrupted/truncated checkpoint files to _prev.cpt. I
> presume one could write a magic number at the end of the .cpt file and test
> that it exists before moving .cpt to _prev.cpt, but perhaps I misunderstand
> the problem. If need be, perhaps mdrun could call gmxcheck, since that tool
> does detect the corruption/truncation. If it is done every 30 minutes, it
> shouldn't affect performance.
>
> Thank you for any advice,
> Chris.
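> P.S. To make the background-script idea concrete, here is the kind of
> watchdog I have in mind. It is an untested sketch: the file names are
> placeholders for a single run, and it assumes gmxcheck exits with a
> nonzero status on a corrupted/truncated checkpoint.
>
> #!/bin/bash
> CPT=md3.cpt              # live checkpoint that mdrun rewrites
> SNAP=md3_snapshot.cpt    # scratch copy (kept as .cpt so gmxcheck reads it)
> GOOD=md3_verified.cpt    # last known-good checkpoint, for restarts
>
> while true; do
>     # Copy first and validate the copy, not the live file: mdrun may
>     # rewrite $CPT between a check and a copy.
>     cp "$CPT" "$SNAP"
>     if gmxcheck -f "$SNAP" > /dev/null 2>&1; then
>         # mv is atomic within a filesystem, so $GOOD is never itself
>         # left half-written.
>         mv "$SNAP" "$GOOD"
>     else
>         rm -f "$SNAP"
>     fi
>     sleep 1800           # every 30 minutes, as suggested above
> done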