Re: [gmx-users] Re: Segmentation fault, mdrun_mpi

2012-10-10 Thread Justin Lemkul



On 10/10/12 1:33 PM, Ladasky wrote:

Update:


Ladasky wrote


Justin Lemkul wrote

Random segmentation faults are really hard to debug.  Can you resume the run
using a checkpoint file?  If so, that would suggest an MPI problem or
something else external to Gromacs.  Without a reproducible system and a
debugging backtrace, it's going to be hard to figure out where the problem is
coming from.

Thanks for that tip, Justin.  I tried to resume one run which failed at
1.06 million cycles, and it WORKED.  It proceeded all the way to the 2.50
million cycles that I designated.  I now have two separate .trr files, but
I suppose they can be merged.
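
(For reference, the resume and the merge can be done roughly like this; the
file names, the -deffnm prefix, and the five-process mpirun call are just
placeholders matching my setup:

  # resume from the last checkpoint, appending to the existing output files
  mpirun -np 5 mdrun_mpi -s topol.tpr -cpi state.cpt -append -deffnm md

  # or, if the pieces are kept separate, concatenate the trajectories later
  trjcat -f md_part1.trr md_part2.trr -o md_whole.trr
)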

I don't know whether my crashes are random yet.  I will try re-running
that simulation again from time zero, to see whether it segfaults at the
same place.  If it doesn't, then I have a problem which may have nothing
to do with GROMACS.


I just tried exactly that, a re-run of the same structure.  This time, it
ran without stopping, from time zero to 2.50 million cycles!  No crash at
1.06 million cycles this time.

Unless GROMACS is using some random number generator which affects the
outcome of repeated simulations (and I believe random numbers are only needed
when initial velocities are generated, which was done during the earlier
equilibration step), I will conclude that my simulation conditions are indeed
acceptable, and that sometimes the software just behaves badly.



There are plenty of things that can differ between runs (unless you've turned
off optimizations and are using the -reprod option), but for all practical
purposes, they should not lead to random seg faults.
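
If you want to rule those differences out, you can force a reproducible rerun,
e.g. (the tpr and output names here are placeholders):

  mpirun -np 5 mdrun_mpi -s topol.tpr -deffnm md_reprod -reprod

Two such runs from the same tpr on the same hardware should then follow
essentially the same trajectory, so a genuine Gromacs problem ought to show
up at the same step both times.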



Is that a common occurrence?



Based on the fact that very few people post seg fault problems that are not
precipitated by an actual system blow-up (i.e. LINCS warnings), I would say
no.  There is no evidence yet to suggest what the real problem is, but until
such time, Gromacs is innocent until proven guilty ;)



I could write a script which just automatically restarts my simulations
provided that they (1) ran for a decent number of cycles and (2) exited with a
segmentation fault error.  I could then have the script check in after a few
minutes to make sure that they haven't crashed again, and soldier on.
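
Something along these lines is what I have in mind (untested sketch; the file
names are placeholders, and the exit-status test is an assumption -- a bare
shell reports 139 for SIGSEGV, but mpirun may propagate a different code, so
that would need checking first):

  #!/bin/bash
  # Keep resuming from the checkpoint after a segfault, but give up if the
  # run died almost immediately or exited for some other reason.
  TPR=topol.tpr
  MIN_SECONDS=600                 # only restart runs that survived this long

  while true; do
      start=$(date +%s)
      mpirun -np 5 mdrun_mpi -s "$TPR" -cpi state.cpt -append -deffnm md
      status=$?
      elapsed=$(( $(date +%s) - start ))

      if [ $status -eq 0 ]; then
          echo "Run finished normally."
          break
      elif [ $status -eq 139 ] && [ $elapsed -ge $MIN_SECONDS ]; then
          echo "Segfault after ${elapsed}s; restarting from checkpoint."
      else
          echo "Exited with status $status after ${elapsed}s; giving up."
          break
      fi
  done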



That's an option.  If you're running in a queue system, there may be 
notification options if something goes wrong, as well.


-Justin

--


Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin




Re: [gmx-users] Re: Segmentation fault, mdrun_mpi

2012-10-08 Thread Justin Lemkul



On 10/8/12 4:39 AM, Ladasky wrote:

Justin Lemkul wrote

My first guess would be a buggy MPI implementation.  I can't comment on
hardware specs, but usually the random failures seen in mdrun_mpi are a
result of some generic MPI failure.  What MPI are you using?


I am using the OpenMPI package, version 1.4.3.  It's one of three MPI
implementations which are included in the standard repositories of Ubuntu
Linux 11.10.  I can also obtain MPICH2 and gromacs-mpich without jumping
through too many hoops.  It looks like LAM is also available.  However, if
GROMACS needs a special package to interface with LAM, it's not in the
repositories.
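
(If it matters, the installed OpenMPI version can be confirmed with, for
example:

  mpirun --version
  ompi_info | grep "Open MPI:"

and ldd $(which mdrun_mpi) | grep -i mpi shows which MPI library the binary
is actually linked against.)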



This all seems reasonable.  I asked about the MPI implementation because people 
have previously reported that using LAM (which is really outdated) causes random 
seg faults and errors.  I would not necessarily implicate OpenMPI, as I use it 
routinely.  I never use repositories (I always compile from source) as I have 
gotten buggy packages in the past, but I don't know if that's relevant here or 
not.  I'm not trying to implicate the package maintainer in any way, just noting 
that long ago (5-6 years) the Gromacs package had some issues.
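
For reference, a typical from-source build of the MPI-enabled mdrun in the
4.5 series looks roughly like the following; the version number is a
placeholder and the flags are from memory of the old autotools build, so
check them against the installation instructions for your release:

  tar xzf gromacs-4.5.5.tar.gz
  cd gromacs-4.5.5
  ./configure --enable-mpi --program-suffix=_mpi
  make mdrun
  sudo make install-mdrun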


-Justin


Alternatively, I could drop using the external MPI for now and just use the
new multi-threaded GROMACS defaults.  I was trying to prepare for longer
runs on a cluster, however.  If those runs are going to crash, I had better
know about it now.
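
In practice the difference would just be in how the run is launched,
something like the following (file names are placeholders):

  # built-in thread parallelism, no external MPI involved
  mdrun -nt 5 -s topol.tpr -deffnm md

  # versus the external OpenMPI launch I am using now
  mpirun -np 5 mdrun_mpi -s topol.tpr -deffnm md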






--


Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin




Re: [gmx-users] Re: Segmentation fault, mdrun_mpi

2012-10-07 Thread Justin Lemkul



On 10/7/12 2:15 PM, Ladasky wrote:

Justin Lemkul wrote

Random segmentation faults are really hard to debug.  Can you resume the run
using a checkpoint file?  If so, that would suggest an MPI problem or
something else external to Gromacs.  Without a reproducible system and a
debugging backtrace, it's going to be hard to figure out where the problem is
coming from.


Thanks for that tip, Justin.  I tried to resume one run which failed at 1.06
million cycles, and it WORKED.  It proceeded all the way to the 2.50 million
cycles that I designated.  I now have two separate .trr files, but I suppose
they can be merged.

I don't know whether my crashes are random yet.  I will try re-running that
simulation again from time zero, to see whether it segfaults at the same
place.  If it doesn't, then I have a problem which may have nothing to do
with GROMACS.

I looked in on memory usage several times while mdrun_mpi was executing.
Overall, about 3 GB of my computer's 8 GB of RAM were in use.  As I
expected, GROMACS used very little of this.  The mpirun process used a
constant 708K.  I had five mdrun_mpi processes, all of which used slightly
more RAM as they worked, but I didn't notice anything which suggested a
gross memory leak.  The process which used the most RAM was using 14.4 MB
right after it started, rose to 15.9 MB within the first ten minutes or so,
and reached 16.0 MB after four hours.  The process which used the least RAM
started at 10.6 MB and finished at 10.8 MB.  Altogether, GROMACS was using
about 64 MB.

I have a well-cooled CPU; core temperatures stay under 50 degrees C when the
system is running under full load.  My system doesn't lock up or crash on
me.  I think that my hardware is good.




My first guess would be a buggy MPI implementation.  I can't comment on hardware 
specs, but usually the random failures seen in mdrun_mpi are a result of some 
generic MPI failure.  What MPI are you using?


-Justin

--


Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin




Re: [gmx-users] Re: Segmentation fault, mdrun_mpi

2012-10-05 Thread Justin Lemkul



On 10/5/12 3:03 PM, Ladasky wrote:

Bumping this once before the weekend, hoping to get some help.

I am getting segmentation fault errors at 1 to 2 million cycles into my
production MD runs, using GROMACS 4.5.4.  If these errors are a consequence
of a poorly-equilibrated system, I am no longer seeing the kinds of error
messages that would support that conclusion.  I am not getting PME or SETTLE
errors, just a non-descriptive segmentation fault.

I have corrected earlier shortcomings in my equilibration protocols, as
discussed in this earlier thread:

http://gromacs.5086.n6.nabble.com/Re-Water-molecules-cannot-be-settled-why-tp4999302.html

I am now monitoring the macroscopic properties of my simulation.  The
potential energy, pressure, density, and temperature converge and then remain
stable, at least as well as demonstrated in Justin Lemkul's most recent
tutorial:

http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin/gmx-tutorials/lysozyme/index.html
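
(The standard way to pull these out of the energy file is g_energy, as in the
tutorial; the .edr name below is a placeholder for my actual output file:

  g_energy -f md.edr -o potential.xvg     # choose "Potential" at the prompt
  g_energy -f md.edr -o pressure.xvg      # choose "Pressure"
  g_energy -f md.edr -o density.xvg       # choose "Density"
  g_energy -f md.edr -o temperature.xvg   # choose "Temperature"
)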

The trajectories of my simulations do not appear to be radical in any way
that I can discern.  I have a partially-unfolded protein, folding gradually,
in a box of water with counter-ions.


From my previous thread, I have come to appreciate just how far from
equilibrium the initial state of a simulation can be.  Also, I have always
understood that MD simulations are chaotic, and that instabilities can
result simply from the fact that a continuous system is being modeled in
discrete time steps.  (As an aside, one of my first programming puzzles was
about exactly this kind of thing.  When I was a high-school student, I
wanted to simulate the orbits of the Moon about the Earth, and the Earth
about the Sun.  It sounded simple enough, just apply the inverse-square law
for gravity, right?  Yet no matter how I tried, I couldn't achieve a stable
system.  Deeper reading led me to the intuitive and quick "leapfrog" method
of improving differential approximations, which GROMACS apparently uses, and
to the more powerful but slower Runge-Kutta methods, which GROMACS apparently
does not use.)

By correcting my earlier problems, I have extended the time that my
simulations will run by a factor of 10-20, out to several nanoseconds.
That's progress, but I'm never going to get to one microsecond this way.

Any advice is appreciated.  Of course I can post MDP files again, as well as
graphs.



Random segmentation faults are really hard to debug.  Can you resume the run
using a checkpoint file?  If so, that would suggest an MPI problem or
something else external to Gromacs.  Without a reproducible system and a
debugging backtrace, it's going to be hard to figure out where the problem is
coming from.
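
If you do want to chase the backtrace, one generic route (nothing
Gromacs-specific; ideally the binary is built with debug symbols) is to allow
core dumps and then open one in gdb, e.g.:

  ulimit -c unlimited
  mpirun -np 5 mdrun_mpi -s topol.tpr -deffnm md
  gdb $(which mdrun_mpi) core      # then type 'bt' at the gdb prompt

With MPI the core file name and location can vary by system (and may be
written per failing rank), so check where your machine actually puts it.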


-Justin

--


Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

