On 5/06/2011 5:42 PM, Dimitar Pachov wrote:


On Sun, Jun 5, 2011 at 2:14 AM, Mark Abraham <mark.abra...@anu.edu.au> wrote:

    On 5/06/2011 12:31 PM, Dimitar Pachov wrote:
    As I said, the queue works like this: you submit a job, it finds
    an empty node, and it starts there. Seconds later another user
    with higher privileges on that particular node submits a job, his
    job kicks mine out, mine goes back into the queue, it finds
    another empty node, starts there, then another user with high
    privileges on that node submits a job, which kicks my job out
    again, and the cycle repeats itself ... theoretically, it could
    continue forever, depending on how many empty nodes there are,
    if any, and where they are.

    You've said that *now* - but previously you said nothing about
    why you were getting lots of restarts. In my experience, PBS
    queues suspend jobs rather than deleting them, so that resources
    are not wasted; apparently some sites do things differently. I
    think this information is highly relevant to explaining your
    observations.



The point was not "why" I was getting the restarts, but the fact itself that I was getting restarts close in time, as I stated in my first post. I actually also don't know whether jobs are deleted or suspended. I've thought that a job returned back to the queue will basically start from the beginning when later moved to an empty slot ... so don't understand the difference from that perspective.

It's the difference between a process being killed, and a process being allowed to survive but temporarily without access to the CPU. Operating systems routinely share the CPU over multiple execution threads. Job suspension just adapts that idea.
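
To see the difference in miniature, here's a small C sketch (a toy, not anything from an actual batch system) of a process being suspended and resumed while keeping all of its state:

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {          /* child: stand-in for the MD job  */
        for (;;)
            pause();         /* keeps all state, does no work   */
    }
    sleep(1);
    kill(pid, SIGSTOP);      /* "preempted": no CPU, state kept */
    printf("child %d suspended\n", (int)pid);
    sleep(2);
    kill(pid, SIGCONT);      /* resumed exactly where it was    */
    printf("child %d resumed\n", (int)pid);
    kill(pid, SIGTERM);
    waitpid(pid, NULL, 0);
    return 0;
}

A suspended job never re-reads its output files, which is why suspension can't produce the appending symptoms you saw.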

Also, different UNIX signals are interpreted differently by the GROMACS signal handler. It respects hard kills, but it cooperates with gentler kills by updating the checkpoint file at the next neighbour-search step, IIRC. Perhaps your PBS is making excessive use of hard kills - if it weren't, you would still make some progress even when you only get a minute of CPU time...
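
Roughly, the cooperating side looks like this - a sketch with made-up names, not the real mdrun code:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* A "gentle" kill (SIGTERM/SIGINT) can be caught: the handler only
 * sets a flag, and the MD loop acts on it at the next neighbour-search
 * step by writing a checkpoint and exiting cleanly. SIGKILL can never
 * be caught, so a hard kill leaves only the last-written checkpoint. */
static volatile sig_atomic_t stop_requested = 0;

static void request_stop(int sig)
{
    (void)sig;
    stop_requested = 1;   /* async-signal-safe: only set a flag */
}

int main(void)
{
    signal(SIGTERM, request_stop);
    signal(SIGINT, request_stop);

    for (long step = 0;; step++) {
        usleep(1000);                      /* stand-in for one MD step */
        int is_ns_step = (step % 10 == 0); /* stand-in for nstlist     */
        if (is_ns_step && stop_requested) {
            printf("writing checkpoint at step %ld, then exiting\n", step);
            /* write_checkpoint(step); */
            return 0;
        }
    }
}

Run it and compare the effect of kill -TERM (checkpoint, clean exit) with kill -KILL (no chance to react).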


    These many restarts suggest that the queue was full of
    relatively short jobs run by users with high privileges.
    Technically, I cannot see why the same processes should be
    running simultaneously, because at any instant my job runs
    only on one node, or it sits in the queue.

    I/O can be buffered such that the termination of the process and
    the completion of its I/O are asynchronous. Perhaps it *shouldn't*
    be that way, but this is a problem for the administrators of your
    cluster to address. They know how the file system works. If the
    next job executes before the old one has finished output, then I
    think the symptoms you observe might be possible.


Yes, this is true, and I believe the timing of when the buffer is fully flushed is crucial to any explanation of the observed behavior. However, this bottleneck has been known for a long time, so I expected people had thought about it before confidently making -append the default. That's all.

Judging by the frequency with which people report problems, most people don't encounter the kind of "file system latency leading to a race condition" problem I think you're seeing. Some might see it and just work around it, as you say. Or other people simply don't have the combination of file system and compute resource management that you have to work with.
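
To make the "asynchronous I/O" point concrete, here's a small C sketch (the file name is made up) of the three places "written" data can live:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *fp = fopen("demo.log", "a");   /* made-up file name */
    if (!fp)
        return 1;

    fprintf(fp, "step 1000 done\n"); /* 1. in the stdio buffer only     */
    fflush(fp);                      /* 2. now in the kernel page cache */
    fsync(fileno(fp));               /* 3. now on disk - locally, at    */
                                     /*    least; a network file system */
                                     /*    may still delay visibility   */
                                     /*    to other nodes               */
    fclose(fp);
    return 0;
}

If a process is killed between stages, or another node reads the file before the last stage completes, the reader sees a shorter file than the writer "wrote".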


    Note that there is nothing GROMACS can do about that, unless
    somehow GROMACS can apply a lock in the first mdrun that is
    respected by your file system such that a subsequent mdrun cannot
    open the same file until all pending I/O has completed. I'd expect
    proper HPC file systems do that automatically, but I don't really
    know.


I am not an expert, nor do I know the GROMACS code, but could one have an option to specify a period after its initial start during which GROMACS is prohibited from writing any output files, i.e. some kind of suspension or waiting period?

One could delay some or all output initialization until the first write, but it would probably make the code rather messy. GROMACS does check that the state of the output files makes sense, by computing checksums and comparing them with those stored in the checkpoint file. One has to draw a line somewhere: if the contents of those files can be changed by another process, then efficient MD is simply impossible. Also, people would complain that their 1024-processor simulation ran for 15 minutes before dying because a missing write permission on the checkpoint file was only noticed at the first write. Perhaps not that exact scenario, but similar ones could arise.
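
In outline, that safety check looks something like the sketch below - toy hash and made-up values; the real check compares checksums recorded in the checkpoint file:

#include <stdio.h>
#include <stdint.h>

/* Hash the part of an output file that existed at checkpoint time and
 * compare it with the value the checkpoint recorded. */
static uint32_t checksum_prefix(const char *path, long len)
{
    FILE    *fp = fopen(path, "rb");
    uint32_t sum = 0;
    int      c;

    if (!fp)
        return 0;
    while (len-- > 0 && (c = getc(fp)) != EOF)
        sum = sum * 31u + (uint32_t)c;  /* toy hash, not md5 */
    fclose(fp);
    return sum;
}

int main(void)
{
    /* these values would really come from the checkpoint file */
    const char *path              = "md.log";
    long        len_at_checkpoint = 4096;
    uint32_t    sum_at_checkpoint = 0xdeadbeefu;

    if (checksum_prefix(path, len_at_checkpoint) != sum_at_checkpoint) {
        fprintf(stderr, "%s has changed since the checkpoint was "
                        "written; refusing to append\n", path);
        return 1;
    }
    return 0;
}

Note that this can only detect interference after the fact; it cannot prevent two live mdruns from racing on the same file.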

You can emulate this yourself by calling "sleep 10s" before mdrun to see whether that is long enough to get around the latency issue in your case.

It seems to me that this kind of file locking ought to be the responsibility of the file system. Allowing a new process to access a file while buffered output is still pending seems wrong; it just invites these kinds of race conditions. (Assuming my theory is sound...)
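
For what it's worth, cooperative locking of the kind I mentioned above could look roughly like this POSIX advisory-lock sketch - whether such locks are actually honoured between nodes depends entirely on the file system:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct flock fl = { 0 };
    int fd = open("md.log", O_WRONLY | O_APPEND);  /* made-up target */

    if (fd < 0)
        return 1;

    fl.l_type   = F_WRLCK;   /* exclusive write lock     */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;         /* 0 means "the whole file" */

    if (fcntl(fd, F_SETLK, &fl) == -1) {
        fprintf(stderr, "md.log is in use by another process; "
                        "refusing to start\n");
        close(fd);
        return 1;
    }
    /* ... write output; the lock is released on close/exit ... */
    close(fd);
    return 0;
}

Advisory locks only work if every writer checks them, and NFS/Lustre implementations differ in how reliably they propagate them - which is why I say this belongs to the file system, not the application.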

I am also wondering about the checkpoint interval - the default is 15 min, but what is the minimum? I have not tested it, so what would happen if I specified 0.001 min, for example?

I/O takes time, and checkpointing requires global communication to prepare for it, so doing it more often than necessary is wasteful. Your situation sounds so volatile that checkpointing every 30 s is probably sound. On a BlueGene, about the only reason to checkpoint is a power outage. One size can't fit all.
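
The interval logic is wall-clock-based, roughly like this sketch (made-up names; the real knob is mdrun -cpt, in minutes):

#include <stdio.h>
#include <time.h>

int main(void)
{
    double cpt_interval_s = 30.0;      /* e.g. mdrun -cpt 0.5 */
    time_t last_cpt       = time(NULL);

    for (long step = 0; step < 1000000; step++) {
        /* ... do one MD step ... */
        int is_ns_step = (step % 10 == 0);  /* stand-in for nstlist */
        if (is_ns_step &&
            difftime(time(NULL), last_cpt) >= cpt_interval_s) {
            printf("checkpoint at step %ld\n", step);
            last_cpt = time(NULL);
        }
    }
    return 0;
}

A 0.001 min setting would degenerate to "checkpoint at every eligible step", so the run would spend most of its time on I/O and synchronization rather than MD.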


    Words are open to interpretation. Communicating well requires that
    you consider the impact of your words on your reader. You want
    people who can address the problem to want to help. You don't want
    them to feel defensive about the situation - whether you think
    that would be an over-reaction or not.


I got your point(s). However, I respectfully disagree with some of them. First, I believe the information one's sentences carry matters much more than how they are phrased.

The content is very important. Terse and informative is often much better than waffling vagueness. However, given a range of presentations with the same content, why not choose a presentation that improves the chance of achieving the objective?

Mark