Hello,

On Wed, Jun 8, 2011 at 4:21 AM, Sander Pronk <pr...@cbr.su.se> wrote:

> Hi Dimitar,
>
> Thanks for the bug report. Would you mind trying the test program I
> attached on the same file system that you get the truncated files on?
>
> compile it with gcc testje.c -o testio
>

Yes; I ran it, and it reports no problem:

====
[dpachov@login-0-0 NEWTEST]$ ./testio
TEST PASSED: ftell gives: 46
====
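
One could additionally trace the underlying syscalls on the same filesystem
(a suggested extra check, not part of the attached test program), since
seek/write behavior can differ on NFS or Lustre mounts:
====
gcc testje.c -o testio
strace -e trace=lseek,write,ftruncate ./testio
====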

As for the other questions:

HPC OS version:
====
[dpachov@login-0-0 NEWTEST]$ uname -a
Linux login-0-0.local 2.6.18-194.17.1.el5xen #1 SMP Mon Sep 20 07:20:39 EDT
2010 x86_64 x86_64 x86_64 GNU/Linux
[dpachov@login-0-0 NEWTEST]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
====

GROMACS 4.5.4 built:
====
module purge
module load INTEL/intel-12.0
module load OPENMPI/1.4.3_INTEL_12.0
module load FFTW/2.1.5-INTEL_12.0 # not needed

#####
# GROMACS settings

export CC=mpicc
export F77=mpif77
export CXX=mpic++
export FC=mpif90
export F90=mpif90

make distclean

echo "XXXXXXX building single prec XXXXXX"

./configure \
--prefix=/home/dpachov/mymodules/GROMACS/EXEC/4.5.4-INTEL_12.0/SINGLE \
--enable-mpi \
--enable-shared \
--program-prefix="" --program-suffix="" \
--enable-float --disable-fortran \
--with-fft=mkl \
--with-external-blas \
--with-external-lapack \
--with-gsl \
--without-x \
CFLAGS="-O3 -funroll-all-loops" \
FFLAGS="-O3 -funroll-all-loops" \
CPPFLAGS="-I${MPI_INCLUDE} -I${MKL_INCLUDE}" \
LDFLAGS="-L${MPI_LIB} -L${MKL_LIB} -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5"

make -j 8 && make install
====
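
As a sanity check that the installed mdrun really linked against MKL and
Open MPI (a suggested follow-up, assuming the install prefix above):
====
ldd /home/dpachov/mymodules/GROMACS/EXEC/4.5.4-INTEL_12.0/SINGLE/bin/mdrun \
    | grep -iE 'mkl|mpi'
====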

Just did the same test on Hopper 2:
http://www.nersc.gov/users/computational-systems/hopper/

with their GROMACS 4.5.3 build (module gromacs/4.5.3(default)), and the result
was the same as reported earlier. You could run the test there as well, if you
have access, and see what you get.

Hope that helps a bit.

Thanks,
Dimitar





>
> Sander
>
>
>
>
>
> On Jun 7, 2011, at 23:21, Dimitar Pachov wrote:
>
> Hello,
>
> Just a quick update after a few short tests my colleague and I quickly ran.
> First, using
>
> "*You can emulate this yourself by calling "sleep 10s" before mdrun and
> see if that's long enough to solve the latency issue in your case.*"
>
> doesn't work, for a few reasons: mainly because it doesn't seem to be a
> latency issue, but also because the load on a node is not affected by
> "sleep".
>
> However, you can reproduce the behavior I have observed pretty easily. It
> seems to be related to the values of the pointers to the *xtc, *trr, *edr,
> etc. files written at the end of the checkpoint file after abrupt crashes,
> AND to how frequently those files are accessed (opened). How to test:
>
> 1. In your input *mdp file, set a high frequency for saving coordinates to
> the *xtc file (every 10 steps, for example) and a low frequency for the
> *trr file (every 10,000 steps, for example).
> 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run)
> 3. Abruptly kill the run shortly after that (say, after 10-100 steps).
> 4. You should have a few frames written in the *xtc file, and only one
> (the first) in the *trr file. The *cpt file should have non-zero
> "file_offset_low" values for all of these files (the pointers have been
> updated).
>
> 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 6. Abruptly kill the run shortly after that (say, after 10-100 steps),
> making sure the write interval for the *trr file has not been reached.
> 7. You should have a few additional frames written in the *xtc file, while
> the *trr will still have only 1 frame (the first). The *cpt file has now
> updated all "file_offset_low" values, BUT the pointer to the *trr has
> acquired a value of 0. Obviously, we already know what will happen if we
> restart again from this last *cpt file.
>
> 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
> 9. Kill it.
> 10. File *trr has size zero.
>
>
> Therefore, if a run is killed before a file is accessed for writing
> (depending on the chosen frequency), the file offset value recorded for it
> in the *cpt file doesn't seem to be updated accordingly, and hence a new
> restart inevitably leads to overwritten output files.
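>
> A scripted sketch of steps 1-4 (the sleep length and file names are
> illustrative, and the *mdp write intervals are those from step 1):
>
> ====
> # run.mdp: nstxtcout = 10 (frequent xtc), nstxout = 10000 (rare trr)
> mdrun -s run.tpr -v -cpi -deffnm run > md.out 2>&1 &
> sleep 60; kill -9 $!                     # simulate an abrupt crash
> gmxcheck -f run.trr                      # expect only the first frame
> gmxdump -cp run.cpt | grep file_offset   # inspect the stored offsets
> ====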
>
> Do you think this is fixable?
>
> Thanks,
> Dimitar
>
>
>
>
>
>
> On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <rol...@utk.edu> wrote:
>
>> Two comments about the discussion:
>>
>> 1) I agree that buffered output (kernel buffers, not application buffers)
>> should not affect I/O. If it does, it should be filed as a bug against the
>> OS. Maybe someone can write a short test application that tries to
>> reproduce this: write to a file from one node and, immediately after the
>> test program is killed on that node, write to it from another node.
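>>
>> A sketch of such a test in shell (node names and the shared path are
>> hypothetical):
>>
>> # on node1: append in the background, then kill the writer abruptly
>> ssh node1 'seq 1000000 >> /shared/test.dat & sleep 1; kill -9 $!'
>> # immediately afterwards, write from a second node and inspect the tail
>> ssh node2 'echo MARKER >> /shared/test.dat; tail -c 32 /shared/test.dat'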
>>
>> 2) We do lock files, but only the log file. The idea is that we only need
>> to guarantee that the set of files is accessed by one application at a
>> time. This seems safe, but if someone sees a way the trajectory could be
>> opened without the log file being opened, please file a bug.
>>
>> Roland
>>
>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <mark.abra...@anu.edu.au> wrote:
>>
>>>  On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>>
>>> Dear Dimitar,
>>> I'm following the debate regarding:
>>>
>>>
>>>    The point was not "why" I was getting the restarts, but the fact
>>> itself that I was getting restarts close together in time, as I stated in
>>> my first post. I actually also don't know whether jobs are deleted or
>>> suspended. I had thought that a job returned to the queue basically starts
>>> from the beginning when later moved to an empty slot ... so I don't
>>> understand the difference from that perspective.
>>>
>>>
>>> In the second mail you say:
>>>
>>>  Submitted by:
>>> ========================
>>> ii=1
>>> ifmpi="mpirun -np $NSLOTS"
>>> --------
>>> if [ ! -f run${ii}-i.tpr ]; then
>>>     cp run${ii}.tpr run${ii}-i.tpr
>>>     tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>>> fi
>>>
>>> k=`ls md-${ii}*.out | wc -l`
>>> outfile="md-${ii}-$k.out"
>>> if [[ -f run${ii}.cpt ]]; then
>>>     $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v \
>>>         -deffnm run${ii} -npme 0 > $outfile 2>&1
>>> fi
>>> =========================
>>>
>>>
>>> If I understand correctly, you are submitting the SERIAL mdrun. This
>>> means that multiple instances of mdrun are running at the same time.
>>> Each instance of mdrun is an INDEPENDENT instance. Therefore checkpoint
>>> files, one for each instance (i.e. one for each CPU), are written at the
>>> same time.
>>>
>>>
>>> Good thought, but Dimitar's stdout excerpts from early in the thread do
>>> indicate the presence of multiple execution threads. Dynamic load balancing
>>> gets turned on, and the DD is 4x2x1 for his 8 processors. Conventionally,
>>> and by default in the installation process, the MPI-enabled binaries get an
>>> "_mpi" suffix, but it isn't enforced - or enforceable :-)
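>>>
>>> (One quick check of whether a given mdrun is MPI-enabled, assuming a
>>> dynamically linked build: "ldd `which mdrun` | grep -i mpi" lists the MPI
>>> library for an MPI binary and nothing for a serial one.)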
>>>
>>> Mark
>>>
>>>
>>
>>
>>
>> --
>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
>> 865-241-1537, ORNL PO BOX 2008 MS6309
>>
>>
>
>
>
> --
> =====================================================
> *Dimitar V Pachov*
>
> PhD Physics
> Postdoctoral Fellow
> HHMI & Biochemistry Department        Phone: (781) 736-2326
> Brandeis University, MS 057                Email: dpac...@brandeis.edu
> =====================================================
>
>
>
>
>



-- 
=====================================================
*Dimitar V Pachov*

PhD Physics
Postdoctoral Fellow
HHMI & Biochemistry Department        Phone: (781) 736-2326
Brandeis University, MS 057                Email: dpac...@brandeis.edu
=====================================================