On 5/06/2011 5:42 PM, Dimitar Pachov wrote:
On Sun, Jun 5, 2011 at 2:14 AM, Mark Abraham <mark.abra...@anu.edu.au
<mailto:mark.abra...@anu.edu.au>> wrote:
On 5/06/2011 12:31 PM, Dimitar Pachov wrote:
As I said, the queue is like this: you submit the job, it finds
an empty node, it goes there, however seconds later another user
with higher privileges on that particular node submits a job, his
job kicks out my job, mine goes on the queue again, it finds
another empty node, goes there, then another user with
high privileges on that node submits a job, which consequently
kicks out my job again, and the cycle repeats itself ...
theoretically, it could continue forever, depending on how many
and where the empty nodes are, if any.
You've said that *now* - but previously you've said nothing about
why you were getting lots of restarts. In my experience, PBS
queues suspend jobs rather than deleting them, in order that
resources are not wasted. Apparently other places do things this
way. I think that this information is highly relevant to
explaining your observations.
The point was not "why" I was getting the restarts, but the fact
itself that I was getting restarts close in time, as I stated in my
first post. I actually also don't know whether jobs are deleted or
suspended. I've thought that a job returned to the queue basically starts
from the beginning when it is later moved to an empty slot
... so I don't understand the difference from that perspective.
It's the difference between a process being killed, and a process being
allowed to survive but temporarily without access to the CPU. Operating
systems routinely share the CPU over multiple execution threads. Job
suspension just adapts that idea.
Also, different UNIX signals are interpreted differently by the GROMACS
signal handler. It respects hard kills, but it cooperates with gentler
kills by updating the checkpoint file at the next neighbour-search step,
IIRC. Perhaps your PBS is making excessive use of hard kills - if it
didn't, you would still get to make some progress even when you only got
a minute of CPU time...
These many restarts suggest that the queue was full of
relatively short jobs run by users with high privileges.
Technically, I cannot see why the same processes should be
running simultaneously, because at any instant my job runs on
only one node, or it stays in the queue.
I/O can be buffered such that the termination of the process and
the completion of its I/O are asynchronous. Perhaps it *shouldn't*
be that way, but this is a problem for the administrators of your
cluster to address. They know how the file system works. If the
next job executes before the old one has finished output, then I
think the symptoms you observe might be possible.
Yes, this is true, and I believe the timing of when the buffer is
fully flushed is crucial to explaining the observed behavior. However,
this bottleneck has been known for a long time, so I expected people
had thought about it before confidently making -append the default.
That's all.
Judging by the frequency with which people report problems, most people
don't encounter the kind of "file system latency leading to race
condition" problem I think you're seeing. Some might see it and just
work around it, as you say. Or other people simply don't have the
combination of file system and compute resource management that you
have to work with.
Note that there is nothing GROMACS can do about that, unless
somehow GROMACS can apply a lock in the first mdrun that is
respected by your file system such that a subsequent mdrun cannot
open the same file until all pending I/O has completed. I'd expect
proper HPC file systems do that automatically, but I don't really
know.
I am not an expert, nor do I know the GROMACS code, but could one
have an option to specify a period after its initial start during
which GROMACS is prohibited from writing any output files, i.e.
some kind of suspension and/or waiting period?
One could delay some/all output initialization until the first write,
but it would probably make the code rather messy. GROMACS does check
that the state of the output files makes sense, by computing checksums
and comparing them with those stored in the checkpoint file. One has to
draw a line somewhere. If the contents of those files might be changed
by another process, then efficient MD is simply impossible. Also, people
would complain that they spent 15 minutes on their 1024-processor
simulation before it died because the missing write permission for the
checkpoint filename was only noticed at the first write. Perhaps not
that exact scenario, but something similar could arise.
You can emulate this yourself by calling "sleep 10s" before mdrun and
see if that's long enough to solve the latency issue in your case.
It seems to me that this kind of file locking ought to be the
responsibility of the file system. Allowing a new process to access a
file while buffered output is still pending seems wrong. It just invites
this kind of race condition. (Assuming my theory is sound...)
I am also wondering about the checkpoint timing - the default is 15
min, but what would be the minimum? Since I have not tested it, what
would happen if I specify 0.001 min, for example?
I/O takes time, and checkpointing requires global communication to
prepare for it. Doing it more often than one needs to do it is wasteful.
Your situation sounds so volatile that checkpointing every 30s (e.g.
mdrun -cpt 0.5, since -cpt takes minutes) is probably sound. On a
BlueGene, about the only reason to checkpoint is a power outage. One
size can't fit all.
Words are open to interpretation. Communicating well requires that
you consider the impact of your words on your reader. You want
people who can address the problem to want to help. You don't want
them to feel defensive about the situation - whether you think
that would be an over-reaction or not.
I got your point(s). However, I respectfully disagree with some of
them. First, I believe the information one's sentences convey matters
much more than how exactly they are phrased.
The content is very important. Terse and informative is often much
better than waffling vagueness. However, given a range of presentations
with the same content, why not choose a presentation that improves the
chance of achieving the objective?
Mark
--
gmx-users mailing list gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists