On 5/06/2011 5:42 PM, Dimitar Pachov wrote:
On Sun, Jun 5, 2011 at 2:14 AM, Mark Abraham <mark.abra...@anu.edu.au
<mailto:mark.abra...@anu.edu.au>> wrote:
On 5/06/2011 12:31 PM, Dimitar Pachov wrote:
As I said, the queue is like this: you submit the job, it finds
an empty node, it goes there, however seconds later another user
with higher privileges on that particular node submits a job, his
job kicks out my job, mine goes on the queue again, it finds
another empty node, goes there, then another user with
high privileges on that node submits a job, which consequently
kicks out my job again, and the cycle repeats itself ...
theoretically, it could continue forever, depending on how many
and where the empty nodes are, if any.
You've said that *now* - but previously you've said nothing about
why you were getting lots of restarts. In my experience, PBS
queues suspend jobs rather than deleting them, in order that
resources are not wasted. Apparently other places do things this
way. I think that this information is highly relevant to
explaining your observations.
The point was not "why" I was getting the restarts, but the fact
itself that I was getting restarts close in time, as I stated in my
first post. I actually also don't know whether jobs are deleted or
suspended. I've thought that a job returned to the queue basically starts
from the beginning when it is later moved to an empty slot
... so I don't understand the difference from that perspective.
It's the difference between a process being killed, and a process being
allowed to survive but temporarily without access to the CPU. Operating
systems routinely share the CPU over multiple execution threads. Job
suspension just adapts that idea.
Also, different UNIX signals are interpreted differently by the GROMACS
signal handler. It respects hard kills, but it cooperates with gentler
kills by updating the checkpoint file at the next neighbour-search step,
IIRC. Perhaps your PBS is making excessive use of hard kills - if it
didn't, you would still get to make some progress even when you only got
a minute of CPU time...
These many restarts suggest that the queue was full of
relatively short jobs run by users with high privileges.
Technically, I cannot see why the same processes should be
running simultaneously, because at any instant my job runs on
only one node, or it stays in the queue.
I/O can be buffered such that the termination of the process and
the completion of its I/O are asynchronous. Perhaps it *shouldn't*
be that way, but this is a problem for the administrators of your
cluster to address. They know how the file system works. If the
next job executes before the old one has finished output, then I
think the symptoms you observe might be possible.
Yes, this is true, and I believe the timing of when the buffer is
fully flushed is crucial to explaining the observed behavior. However,
this bottleneck has been known for a long time, so I expected people
had thought about it before confidently making -append the default.
That's all.
Judging by the frequency with which people report problems, most people
don't encounter the kind of "file system latency leading to race
condition" problem I think you're seeing. Some might see it and just
work around it, as you say. Or other people simply don't have the
combination of file system and compute resource management that you
have to work with.
Note that there is nothing GROMACS can do about that, unless
somehow GROMACS can apply a lock in the first mdrun that is
respected by your file system such that a subsequent mdrun cannot
open the same file until all pending I/O has completed. I'd expect
proper HPC file systems do that automatically, but I don't really
know.
I am not an expert, nor do I know the GROMACS code, but could one
have an option to specify a period after its initial start during
which GROMACS is prohibited from writing any output files, i.e.
some kind of suspension and/or waiting period?
One could delay some/all output initialization until the first write,
but it would probably make the code rather messy. GROMACS does check
that the state of the output files makes sense, by computing checksums
and comparing them with those stored in the checkpoint file. One has to
draw a line somewhere. If the contents of those files might be changed
by another process, then efficient MD is simply impossible. Also, people
would complain that they spent 15 minutes on their 1024-processor
simulation before it died because the missing write permission for the
checkpoint filename was only noticed at the first write. Perhaps not
that exact scenario, but something similar could arise.
You can emulate this yourself by calling "sleep 10s" before mdrun and
see if that's long enough to solve the latency issue in your case.
It seems to me that this kind of file locking ought to be the
responsibility of the file system. Allowing a new process to access a
file while buffered output is still pending seems wrong. It just invites
this kind of race condition. (Assuming my theory is sound...)
I am also wondering about the checkpoint timing - the default is 15
min, but what would be the minimum? Since I have not tested it, what
would happen if I specify 0.001 min, for example?
I/O takes time, and checkpointing requires global communication to
prepare for it. Doing it more often than one needs to do it is wasteful.
Your situation sounds so volatile that checkpointing every 30s (e.g.
mdrun -cpt 0.5, since -cpt takes minutes) is probably sound. On a
BlueGene, about the only reason to checkpoint is a power outage. One
size can't fit all.
Words are open to interpretation. Communicating well requires that
you consider the impact of your words on your reader. You want
people who can address the problem to want to help. You don't want
them to feel defensive about the situation - whether you think
that would be an over-reaction or not.
I got your point(s). However, I respectfully disagree with some of
them. First, I believe the information one's sentences convey matters
much more than how exactly they are phrased.
The content is very important. Terse and informative is often much
better than waffling vagueness. However, given a range of presentations
with the same content, why not choose a presentation that improves the
chance of achieving the objective?
Mark
--
gmx-users mailing list gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists