Hello,

On Wed, Jun 8, 2011 at 4:21 AM, Sander Pronk <pr...@cbr.su.se> wrote:
> Hi Dimitar,
>
> Thanks for the bug report. Would you mind trying the test program I
> attached on the same file system that you get the truncated files on?
>
> compile it with gcc testje.c -o testio
>

Yes, I ran it, but it shows no problem:

====
[dpachov@login-0-0 NEWTEST]$ ./testio
TEST PASSED: ftell gives: 46
====

As for the other questions:

HPC OS version:
====
[dpachov@login-0-0 NEWTEST]$ uname -a
Linux login-0-0.local 2.6.18-194.17.1.el5xen #1 SMP Mon Sep 20 07:20:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[dpachov@login-0-0 NEWTEST]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
====

GROMACS 4.5.4 built:
====
module purge
module load INTEL/intel-12.0
module load OPENMPI/1.4.3_INTEL_12.0
module load FFTW/2.1.5-INTEL_12.0 # not needed

#####
# GROMACS settings

export CC=mpicc
export F77=mpif77
export CXX=mpic++
export FC=mpif90
export F90=mpif90

make distclean

echo "XXXXXXX building single prec XXXXXX"

./configure --prefix=/home/dpachov/mymodules/GROMACS/EXEC/4.5.4-INTEL_12.0/SINGLE \
 --enable-mpi \
 --enable-shared \
 --program-prefix="" --program-suffix="" \
 --enable-float --disable-fortran \
 --with-fft=mkl \
 --with-external-blas \
 --with-external-lapack \
 --with-gsl \
 --without-x \
 CFLAGS="-O3 -funroll-all-loops" \
 FFLAGS="-O3 -funroll-all-loops" \
 CPPFLAGS="-I${MPI_INCLUDE} -I${MKL_INCLUDE} " \
 LDFLAGS="-L${MPI_LIB} -L${MKL_LIB} -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 "

make -j 8 && make install
====

I just did the same test on Hopper 2:
http://www.nersc.gov/users/computational-systems/hopper/

with their build of GROMACS 4.5.3 (gromacs/4.5.3(default)), and the result
was the same as reported earlier. You could do the test there as well, if
you have access, and see what you would get.

Hope that helps a bit.

Thanks,
Dimitar

>
> Sander
>
> On Jun 7, 2011, at 23:21 , Dimitar Pachov wrote:
>
> Hello,
>
> Just a quick update after a few short tests we (my colleague and I)
> quickly did.
> First, using
>
> "*You can emulate this yourself by calling "sleep 10s" before mdrun and
> see if that's long enough to solve the latency issue in your case.*"
>
> doesn't work for a few reasons, mainly because it doesn't seem to be a
> latency issue, but also because the load on a node is not affected by
> "sleep".
>
> However, you can reproduce the behavior I have observed pretty easily. It
> seems to be related to the values of the pointers to the *xtc, *trr, *edr,
> etc. files written at the end of the checkpoint file after abrupt crashes AND
> to the frequency of access (opening) to those files. How to test:
>
> 1. In your input *mdp file, set a high frequency of saving coordinates to,
> say, the *xtc (every 10 steps, for example) and a low frequency for the
> *trr file (every 10,000 steps, for example).
> 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 3. Kill the run abruptly shortly after that (say, after 10-100 steps).
> 4. You should have a few frames written in the *xtc file, and only one
> (the first) in the *trr file. The *cpt file should have nonzero
> values of "file_offset_low" for all of these files (the pointers have been
> updated).
>
> 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 6. Kill the run abruptly shortly after that (say, after 10-100 steps). Note
> that the write interval for the *trr has not been reached.
> 7. You should have a few additional frames written in the *xtc file, while
> the *trr will still have only 1 frame (the first). The *cpt file now has
> updated all pointer values "file_offset_low", BUT the pointer to the *trr
> has acquired a value of 0. Obviously, we already know what will happen if we
> restart again from this last *cpt file.
>
> 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 9. Kill it.
> 10. File *trr has size zero.
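
The steps above can be imitated without GROMACS. What follows is only a toy
shell model, not GROMACS code. It rests on two assumptions drawn from the
observations in the steps, not from the GROMACS source: that a restart cuts
each output file back to the offset stored in the checkpoint, and that a file
which was not written during a run gets offset 0 stored for it. The function
names (file_size, write_checkpoint, restart_from_checkpoint) and the file
names toy.trr and toy.cpt are made up for the illustration.

```shell
#!/bin/sh
# Toy model of the behavior described in the steps above; not GROMACS code.
# ASSUMPTIONS (not taken from the GROMACS source): a restart cuts each
# output file back to the offset stored in the checkpoint, and a file
# that was not written during a run gets offset 0 stored for it.
set -eu

workdir=$(mktemp -d)
trr="$workdir/toy.trr"   # hypothetical stand-in for the *trr file
cpt="$workdir/toy.cpt"   # hypothetical stand-in for the *cpt file

# Print the size of a file in bytes.
file_size() { wc -c < "$1" | tr -d ' '; }

# Store the offset for the next restart.
# $1 = 1 if the *trr was written during this run, 0 otherwise.
write_checkpoint() {
    if [ "$1" -eq 1 ]; then
        file_size "$trr" > "$cpt"
    else
        echo 0 > "$cpt"   # assumed faulty case: offset stored as 0
    fi
}

# Restart: truncate the *trr to the stored offset (truncate is GNU coreutils).
restart_from_checkpoint() {
    truncate -s "$(cat "$cpt")" "$trr"
}

# Run 1: the first frame reaches the *trr and a checkpoint is stored.
printf 'frame-1\n' > "$trr"
write_checkpoint 1
size1=$(file_size "$trr"); off1=$(cat "$cpt")

# Run 2: restart, killed before the *trr write interval is reached.
restart_from_checkpoint
write_checkpoint 0
size2=$(file_size "$trr"); off2=$(cat "$cpt")

# Run 3: restart from the checkpoint that stores offset 0.
restart_from_checkpoint
size3=$(file_size "$trr")

echo "run 1: trr size=$size1 stored offset=$off1"
echo "run 2: trr size=$size2 stored offset=$off2"
echo "run 3: trr size=$size3"
rm -rf "$workdir"
```

In this model the file survives the second run intact and is emptied only by
the third restart, which matches the order of events in steps 7 to 10. Whether
mdrun really stores and applies the offsets this way is exactly the open
question.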
>
> Therefore, if a run is killed before the files are accessed for writing
> (depending on the chosen frequency), the file offset values reported in the
> *cpt file don't seem to be updated accordingly, and hence a new restart
> inevitably leads to overwritten output files.
>
> Do you think this is fixable?
>
> Thanks,
> Dimitar
>
> On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <rol...@utk.edu> wrote:
>
>> Two comments about the discussion:
>>
>> 1) I agree that buffered output (kernel buffers, not application buffers)
>> should not affect I/O. If it does, it should be filed as a bug against the
>> OS. Maybe someone can write a short test application which tries to
>> reproduce this idea. That is: write to a file from one node, kill that test
>> program, and immediately afterwards write to the same file from another
>> node.
>>
>> 2) We lock files, but only the log file. The idea is that we only need
>> to guarantee that the set of files is accessed by only one application. This
>> seems safe, but if someone sees a way in which the trajectory can be opened
>> without the log file being opened, please file a bug.
>>
>> Roland
>>
>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <mark.abra...@anu.edu.au> wrote:
>>
>>> On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>>
>>> Dear Dimitar,
>>> I'm following the debate regarding:
>>>
>>> The point was not "why" I was getting the restarts, but the fact
>>> itself that I was getting restarts close in time, as I stated in my first
>>> post. I actually also don't know whether jobs are deleted or suspended. I
>>> had thought that a job returned to the queue will basically start from the
>>> beginning when later moved to an empty slot ... so I don't understand the
>>> difference from that perspective.
>>>
>>> In the second mail you say:
>>>
>>> Submitted by:
>>> ========================
>>> ii=1
>>> ifmpi="mpirun -np $NSLOTS"
>>> --------
>>> if [ ! -f run${ii}-i.tpr ];then
>>> cp run${ii}.tpr run${ii}-i.tpr
>>> tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>>> fi
>>>
>>> k=`ls md-${ii}*.out | wc -l`
>>> outfile="md-${ii}-$k.out"
>>> if [[ -f run${ii}.cpt ]]; then
>>>
>>> * $ifmpi `which mdrun` *-s run${ii}.tpr -cpi run${ii}.cpt -v
>>> -deffnm run${ii} -npme 0 > $outfile 2>&1
>>>
>>> fi
>>> =========================
>>>
>>> If I understand correctly, you are submitting the SERIAL mdrun. This means
>>> that multiple instances of mdrun are running at the same time.
>>> Each instance of mdrun is an INDEPENDENT instance. Therefore checkpoint
>>> files, one for each instance (i.e. one for each CPU), are written at the
>>> same time.
>>>
>>> Good thought, but Dimitar's stdout excerpts from early in the thread do
>>> indicate the presence of multiple execution threads. Dynamic load balancing
>>> gets turned on, and the DD is 4x2x1 for his 8 processors. Conventionally,
>>> and by default in the installation process, the MPI-enabled binaries get an
>>> "_mpi" suffix, but it isn't enforced - or enforceable :-)
>>>
>>> Mark
>>>
>>> --
>>> gmx-users mailing list gmx-users@gromacs.org
>>> http://lists.gromacs.org/mailman/listinfo/gmx-users
>>> Please search the archive at
>>> http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
>>> Please don't post (un)subscribe requests to the list. Use the
>>> www interface or send it to gmx-users-requ...@gromacs.org.
>>> Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>>>
>>
>> --
>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>> 865-241-1537, ORNL PO BOX 2008 MS6309
>>
>
> --
> =====================================================
> *Dimitar V Pachov*
>
> PhD Physics
> Postdoctoral Fellow
> HHMI & Biochemistry Department Phone: (781) 736-2326
> Brandeis University, MS 057 Email: dpac...@brandeis.edu
> =====================================================