On January 25, 2011 at 3:54 PM Mark Abraham <mark.abra...@anu.edu.au> wrote:

On 01/26/11, TJ Mustard <musta...@onid.orst.edu> wrote:

On January 25, 2011 at 3:24 PM "Justin A. Lemkul" <jalem...@vt.edu> wrote:

>
>
> TJ Mustard wrote:
> >
> >
> > 
> >
> >
> > On January 25, 2011 at 2:08 PM Mark Abraham <mark.abra...@anu.edu.au> wrote:
> >
> >> On 26/01/2011 5:50 AM, TJ Mustard wrote:
> >>>
> >>> Hi all,
> >>>
> >>> 
> >>>
> >>> I am running MD/FEP on a protein-ligand system with GROMACS 4.5.3 and
> >>> FFTW 3.2.2.
> >>>
> >>> 
> >>>
> >>> My iMac will run the job (over 4000 steps, until I killed it) with 4 fs
> >>> steps (I am using heavy H).
> >>>
> >>> 
> >>>
> >>> Once I put this on our group's AMD cluster, the jobs fail even with 2 fs
> >>> steps (with thousands of LINCS errors).
> >>>
> >>> 
> >>>
> >>> We have recompiled the cluster's GROMACS 4.5.3 build, with no change.
> >>> I know the system is the same, since I copied the job from the server
> >>> to my machine to rerun it.
> >>>
> >>> 
> >>>
> >>> What is going on? Why can one machine run a job perfectly while the
> >>> other cannot? I also know there is adequate memory on both machines.
> >>>
> >>
> >> You've posted this before, and I made a number of diagnostic
> >> suggestions. What did you learn?
> >>
> >> Mark
> >
> > Mark and all,
> >
> > 
> >
> > First, thank you for all your help. What you suggested last time helped
> > considerably with our jobs/calculations. I have learned that using the
> > standard mdp settings allows my heavy-H 4 fs jobs to run on my iMac
> > (Intel), and I have made these my new standard for future jobs. We chose
> > the smaller 0.8 nm PME/cutoff based on other papers/tutorials, but now we
> > understand why we need the standard settings. What I now see as our
> > problem is that our machines have some variable we cannot account for.
> > If I am blind to my error, please show me. I just don't understand why
> > one computer works while the other does not. We have recompiled GROMACS
> > 4.5.3 in single precision on our cluster and still have this problem.
> >
>
> I know the feeling all too well.  PowerPC jobs crash instantly on our cluster,
> despite working beautifully on our lab machines.  There's a bug report about
> that one, but I haven't heard anything about AMD failures.  It remains a
> possibility that something beyond your control is going on.  To explore a bit
> further:
>
> 1. Do the systems in question crash immediately (i.e., step zero) or do they run
> for some time?
>

Step 0, every time.
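
To try to catch what happens at that first step, I am rerunning one of the failing inputs with everything written every step, roughly along these lines in the mdp (only for this debugging run, not production):

    nstxout   = 1    ; write coordinates every step
    nstvout   = 1    ; write velocities every step
    nstfout   = 1    ; write forces every step
    nstenergy = 1    ; write energies every step
    nstlog    = 1    ; write the log every step

so there is at least a frame and an energy record to look at before it blows up.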

> 2. If they give you even a little bit of output, you can analyze which energy
> terms, etc. go haywire with the tips listed here:
>

All I have seen from these is LINCS errors and water molecules that could not be settled.

But I will check this out right now, and will email the list if I spot trouble.

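For reference, the check I am starting with is just pulling a term or two out of the partial energy file, something like this (the file names are only what this particular run happens to use, and this assumes it gets far enough to write an .edr at all):

    gmxcheck -e md_fep.edr                               # sanity-check the partial energy file
    echo Potential | g_energy -f md_fep.edr -o pot.xvg   # dump the potential energy vs. time
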
> http://www.gromacs.org/Documentation/Terminology/Blowing_Up#Diagnosing_an_Unstable_System
>
> That would help in tracking down any potential bug or error.

I am doing this as we speak, and I have not seen any terrible contacts yet; some are close, but none overlapping. I will continue with the other advice on that page as well.

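In case it helps, I am also measuring the closest protein-ligand approaches with something along these lines (the index and group names are just whatever my own files happen to contain):

    g_mindist -f em.gro -s topol.tpr -n index.ndx -od mindist.xvg

and then picking the protein and ligand groups when it asks.
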
>
> 3. Is it just the production runs that are crashing, or everything?  If EM isn't
> even working, that smells even buggier.

Good question; we have seen some weird stuff here. Sometimes the cluster will give us segmentation faults, and then a job will fail on our machines, though sometimes not on our iMacs. I know, weird! If EM starts on the cluster, it will finish. Where we have issues is in the position restraint (PR), MD, and MD/FEP runs. It doesn't matter whether FEP is on or off in an MD run (although we are using SD for these MD/FEP runs).

Good. That rules out FEP as the source of the problem, which is what I asked about in your previous thread.

Sorry, I thought I had posted that earlier.
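
In case it is useful, the relevant part of the mdp looks roughly like this (the values below are illustrative rather than our exact production settings):

    integrator           = sd        ; SD for the MD/FEP runs
    dt                   = 0.004     ; 4 fs with heavy hydrogens
    constraints          = all-bonds
    constraint_algorithm = lincs
    coulombtype          = PME
    rlist                = 1.0
    rcoulomb             = 1.0
    rvdw                 = 1.0
    free_energy          = yes       ; set to no for the plain MD tests
    init_lambda          = 0.0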

>
> 4. Are the compilers the same on the iMac vs. AMD cluster?

No. On the iMac I am using GCC 4.4.4 (x86_64-apple-darwin10), and the cluster is using GCC 4.1.2 (x86_64-redhat-linux).

I just did a quick yum search and there doesn't seem to be a newer GCC available. We know you are moving to CMake, but we have yet to get it working successfully on our cluster.
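
For the record, the cluster build is the plain autoconf one, roughly like this (the prefix and FFTW paths below are just examples of what we point it at):

    ./configure --prefix=$HOME/gromacs-4.5.3 CC=/usr/bin/gcc \
                CPPFLAGS=-I$HOME/fftw-3.2.2/include \
                LDFLAGS=-L$HOME/fftw-3.2.2/lib
    make && make install

so swapping in a different CC should be straightforward once we have one.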

There have been doubts about the 4.1.x series of GCC compilers for GROMACS, and IIRC about 4.1.2 in particular (do search the archives yourself). Some time back, Berk solicited actual accounts of problems and nobody presented one, so we no longer have an official warning against using it. However, I'd say this is a candidate for the source of your problems. I would ask your cluster admins to obtain and compile a source-code version of GCC for you to try.
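
Very roughly, and with the caveat that the version and paths below are purely illustrative (and that recent GCC releases also need GMP and MPFR available), that build usually looks something like:

    tar xzf gcc-4.4.5.tar.gz
    mkdir gcc-build && cd gcc-build
    ../gcc-4.4.5/configure --prefix=$HOME/gcc-4.4 --enable-languages=c,c++ --disable-multilib
    make && make install

after which you can point the GROMACS configure at $HOME/gcc-4.4/bin/gcc via CC.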

I will be talking to them shortly. If this is the problem, I will be very surprised.

Thank you again,

TJ Mustard

Mark

TJ Mustard
Email: musta...@onid.orst.edu
