Hello,

Just a quick update after a few short tests my colleague and I ran. First, the suggestion
"*You can emulate this yourself by calling "sleep 10s" before mdrun and see if that's long enough to solve the latency issue in your case.*" doesn't work for a few reasons, mainly because it doesn't seem to be a latency issue, but also because the load on a node is not affected by "sleep". However, you can reproduce the behavior I have observed pretty easily. It seems to be related to the values of the pointers to the *xtc, *trr, *edr, etc files written at the end of the checkpoint file after abrupt crashes AND to the frequency of access (opening) to those files. How to test: 1. In your input *mdp file put a high frequency of saving coordinates to, say, the *xtc (10, for example) and a low frequency for the *trr file (10,000, for example). 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run) 3. Kill abruptly the run shortly after that (say, after 10-100 steps). 4. You should have a few frames written in the *xtc file, and the only one (the first) in the *trr file. The *cpt file should have different from zero values for "file_offset_low" for all of these files (the pointers have been updated). 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run). 6. Kill abruptly the run shortly after that (say, after 10-100 steps). Pay attention that the frequency for accessing/writing the *trr has not been reached. 7. You should have a few additional frames written in the *xtc file, while the *trr will still have only 1 frame (the first). The *cpt file now has updated all pointer values "file_offset_low", BUT the pointer to the *trr has acquired a value of 0. Obviously, we already now what will happen if we restart again from this last *cpt file. 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run). 9. Kill it. 10. File *trr has size zero. Therefore, if a run is killed before the files are accessed for writing (depending on the chosen frequency), the file offset values reported in the *cpt file doesn't seem to be accordingly updated, and hence a new restart inevitably leads to overwritten output files. Do you think this is fixable? Thanks, Dimitar On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <rol...@utk.edu> wrote: > Two comments about the discussion: > > 1) I agree that buffered output (Kernel buffers - not application buffers) > should not affect I/O. If it does it should be filed as bug to the OS. Maybe > someone can write a short test application which tries to reproduce this > idea. Thus writing to a file from one node and immediate after one test > program is killed on one node writing to it from some other node. > > 2) We lock files but only the log file. The idea is that we only need > to guarantee that the set of files is only accessed by one application. This > seems safe but in case someone sees a way of how the trajectory is opened > without the log file being opened, please file a bug. > > Roland > > On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <mark.abra...@anu.edu.au>wrote: > >> On 5/06/2011 11:08 PM, Francesco Oteri wrote: >> >> Dear Dimitar, >> I'm following the debate regarding: >> >> >> The point was not "why" I was getting the restarts, but the fact >> itself that I was getting restarts close in time, as I stated in my first >> post. I actually also don't know whether jobs are deleted or suspended. I've >> thought that a job returned back to the queue will basically start from the >> beginning when later moved to an empty slot ... so don't understand the >> difference from that perspective. 
Do you think this is fixable?

Thanks,
Dimitar


On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <rol...@utk.edu> wrote:

> Two comments about the discussion:
>
> 1) I agree that buffered output (kernel buffers - not application
> buffers) should not affect I/O. If it does, it should be filed as a bug
> against the OS. Maybe someone can write a short test application which
> tries to reproduce this: write to a file from one node and, immediately
> after the test program is killed on that node, write to it from some
> other node.
>
> 2) We lock files, but only the log file. The idea is that we only need
> to guarantee that the set of files is accessed by only one application.
> This seems safe, but in case someone sees a way the trajectory could be
> opened without the log file being opened, please file a bug.
>
> Roland
>
> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <mark.abra...@anu.edu.au> wrote:
>
>> On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>
>> Dear Dimitar,
>> I'm following the debate regarding:
>>
>> The point was not "why" I was getting the restarts, but the fact
>> itself that I was getting restarts close in time, as I stated in my
>> first post. I actually also don't know whether jobs are deleted or
>> suspended. I had thought that a job returned to the queue would
>> basically start from the beginning when later moved to an empty slot
>> ... so I don't understand the difference from that perspective.
>>
>> In the second mail you say:
>>
>> Submitted by:
>> ========================
>> ii=1
>> ifmpi="mpirun -np $NSLOTS"
>> --------
>> if [ ! -f run${ii}-i.tpr ]; then
>>     cp run${ii}.tpr run${ii}-i.tpr
>>     tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>> fi
>>
>> k=`ls md-${ii}*.out | wc -l`
>> outfile="md-${ii}-$k.out"
>> if [[ -f run${ii}.cpt ]]; then
>>     $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v \
>>         -deffnm run${ii} -npme 0 > $outfile 2>&1
>> fi
>> ========================
>>
>> If I understand well, you are submitting the SERIAL mdrun. This means
>> that multiple instances of mdrun are running at the same time. Each
>> instance of mdrun is an INDEPENDENT instance. Therefore checkpoint
>> files, one for each instance (i.e. one for each CPU), are written at
>> the same time.
>>
>> Good thought, but Dimitar's stdout excerpts from early in the thread
>> do indicate the presence of multiple execution threads. Dynamic load
>> balancing gets turned on, and the DD is 4x2x1 for his 8 processors.
>> Conventionally, and by default in the installation process, the
>> MPI-enabled binaries get an "_mpi" suffix, but it isn't enforced - or
>> enforceable :-)
>>
>> Mark
>
> --
> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> 865-241-1537, ORNL PO BOX 2008 MS6309

--
=====================================================
Dimitar V Pachov
PhD Physics
Postdoctoral Fellow
HHMI & Biochemistry Department     Phone: (781) 736-2326
Brandeis University, MS 057        Email: dpac...@brandeis.edu
=====================================================
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists