Re: [gmx-users] Re: Segmentation fault, mdrun_mpi
On 10/10/12 1:33 PM, Ladasky wrote:
> Update:
>
> Ladasky wrote:
>> Justin Lemkul wrote:
>>> Random segmentation faults are really hard to debug. Can you resume
>>> the run using a checkpoint file? That would suggest maybe an MPI
>>> problem or something else external to Gromacs. Without a reproducible
>>> system and a debugging backtrace, it's going to be hard to figure out
>>> where the problem is coming from.
>>
>> Thanks for that tip, Justin. I tried to resume one run which failed at
>> 1.06 million cycles, and it WORKED. It proceeded all the way to the
>> 2.50 million cycles that I designated. I now have two separate .trr
>> files, but I suppose they can be merged.
>>
>> I don't know whether my crashes are random yet. I will try re-running
>> that simulation again from time zero, to see whether it segfaults at
>> the same place. If it doesn't, then I have a problem which may have
>> nothing to do with GROMACS.
>
> I just tried exactly that, a re-run of the same structure. This time,
> it ran without stopping, from time zero to 2.50 million cycles! No
> crash at 1.06 million cycles this time.
>
> Unless GROMACS uses some random number generator that affects the
> outcome of repeated simulations (and I think the only time random
> number generation would be needed is when initial velocities are
> generated, which was done during the earlier equilibration step), I
> will conclude that my simulation conditions are indeed acceptable, and
> that sometimes the software just behaves badly.

There are plenty of things that can differ between runs (unless you've
turned off optimizations and are using the -reprod option), but in all
practical terms they should not lead to random seg faults.

> Is that a common occurrence?

Based on the fact that very few people post seg fault problems that are
not precipitated by actual crashes (i.e. LINCS warnings), I would say no.
There is no evidence yet to suggest what the real problem is, but until
such time, Gromacs is innocent until proven guilty ;)

> I could write a script which just automatically restarts my
> simulations, provided that they (a) ran for a decent number of cycles
> and (b) exited with a segmentation fault error. I could then have the
> script check in after a few minutes to make sure that they haven't
> crashed again, and soldier on.

That's an option. If you're running in a queue system, there may be
notification options if something goes wrong, as well.

-Justin

--
Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin

--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
* Please search the archive at
  http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
* Please don't post (un)subscribe requests to the list. Use the www
  interface or send it to gmx-users-requ...@gromacs.org.
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
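[Editor's note: the restart-on-segfault script discussed above could be
sketched as follows. This is a minimal sketch, not code from the thread:
the mdrun command line, filenames, and retry limit are all placeholder
assumptions, and restarting relies on mdrun's checkpoint flag (-cpi) so
a restarted run resumes rather than starting over.]

```python
#!/usr/bin/env python
"""Restart a simulation command whenever it dies with a segmentation
fault, up to a retry limit. Everything GROMACS-specific here is a
placeholder assumption, not taken from the thread."""
import signal
import subprocess

def run_with_restarts(cmd, max_restarts=5):
    """Run `cmd`; if it is killed by SIGSEGV, restart it.

    On POSIX, subprocess.call() returns the negative signal number when
    the child is killed by a signal, so a seg fault shows up as -11.
    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        rc = subprocess.call(cmd)
        if rc == 0:
            return restarts                    # finished cleanly
        if rc != -signal.SIGSEGV or restarts >= max_restarts:
            raise RuntimeError("run failed with code %d" % rc)
        restarts += 1                          # seg fault: try again

# Hypothetical invocation (requires an MPI/GROMACS install, so it is
# left commented out); -cpi makes mdrun resume from the last checkpoint:
# run_with_restarts(["mpirun", "-np", "4", "mdrun_mpi",
#                    "-deffnm", "production", "-cpi", "production.cpt"])
```

A real version would also implement condition (a) above, e.g. by
checking that the step count in the checkpoint advanced between
restarts, so a run that crashes immediately is not retried forever.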
Re: [gmx-users] Re: Segmentation fault, mdrun_mpi
On 10/8/12 4:39 AM, Ladasky wrote:
> Justin Lemkul wrote:
>> My first guess would be a buggy MPI implementation. I can't comment on
>> hardware specs, but usually the random failures seen in mdrun_mpi are
>> a result of some generic MPI failure. What MPI are you using?
>
> I am using the OpenMPI package, version 1.4.3. It's one of three MPI
> implementations included in the standard repositories of Ubuntu Linux
> 11.10. I can also obtain MPICH2 and gromacs-mpich without jumping
> through too many hoops. It looks like LAM is also available; however,
> if GROMACS needs a special package to interface with LAM, it's not in
> the repositories.

This all seems reasonable. I asked about the MPI implementation because
people have previously reported that using LAM (which is really
outdated) causes random seg faults and errors. I would not necessarily
implicate OpenMPI, as I use it routinely. I never use repositories (I
always compile from source), as I have gotten buggy packages in the
past, but I don't know whether that's relevant here. I'm not trying to
implicate the package maintainer in any way, just noting that long ago
(5-6 years) the Gromacs package had some issues.

-Justin

> Alternately, I could drop the external MPI for now and just use the new
> multi-threaded GROMACS defaults. I was trying to prepare for longer
> runs on a cluster, however. If those runs are going to crash, I had
> better know about it now.
Re: [gmx-users] Re: Segmentation fault, mdrun_mpi
On 10/7/12 2:15 PM, Ladasky wrote:
> Justin Lemkul wrote:
>> Random segmentation faults are really hard to debug. Can you resume
>> the run using a checkpoint file? That would suggest maybe an MPI
>> problem or something else external to Gromacs. Without a reproducible
>> system and a debugging backtrace, it's going to be hard to figure out
>> where the problem is coming from.
>
> Thanks for that tip, Justin. I tried to resume one run which failed at
> 1.06 million cycles, and it WORKED. It proceeded all the way to the
> 2.50 million cycles that I designated. I now have two separate .trr
> files, but I suppose they can be merged.
>
> I don't know whether my crashes are random yet. I will try re-running
> that simulation again from time zero, to see whether it segfaults at
> the same place. If it doesn't, then I have a problem which may have
> nothing to do with GROMACS.
>
> I looked in on memory usage several times while mdrun_mpi was
> executing. Overall, about 3 GB of my computer's 8 GB of RAM was in use.
> As I expected, GROMACS used very little of this. The mpirun process
> used a constant 708K. I had five mdrun_mpi processes, all of which used
> slightly more RAM as they worked, but I didn't notice anything that
> suggested a gross memory leak. The process that used the most RAM was
> at 14.4 MB right after it started, rose to 15.9 MB within the first ten
> minutes or so, and reached 16.0 MB after four hours. The process that
> used the least RAM started at 10.6 MB and finished at 10.8 MB.
> Altogether, GROMACS was using about 64 MB.
>
> I have a well-cooled CPU; core temperatures stay under 50 degrees C
> when the system is running under full load. My system doesn't lock up
> or crash on me. I think that my hardware is good.

My first guess would be a buggy MPI implementation. I can't comment on
hardware specs, but usually the random failures seen in mdrun_mpi are a
result of some generic MPI failure. What MPI are you using?

-Justin
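[Editor's note: the memory-usage check described above can be automated.
A minimal Linux-only sketch follows; it reads resident set sizes from
/proc. The process name "mdrun_mpi" is from the thread, but the helper
names and sampling approach are assumptions, not GROMACS tooling.]

```python
#!/usr/bin/env python
"""Sample the resident memory of running mdrun_mpi processes via /proc.
Linux-specific; a rough stand-in for watching `top` by hand."""
import os

def rss_kb(pid):
    """Resident set size of `pid` in kB, from /proc/<pid>/status."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])    # VmRSS is reported in kB
    return 0

def pids_of(name):
    """All PIDs whose command name matches `name` (a minimal pgrep)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except IOError:                        # process exited meanwhile
            pass
    return pids

# Example: total RSS of all mdrun_mpi ranks, sampled once.
# total_kb = sum(rss_kb(p) for p in pids_of("mdrun_mpi"))
```

Logging that total every few minutes would show whether any rank's
footprint actually grows over a multi-hour run, rather than relying on
spot checks.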
Re: [gmx-users] Re: Segmentation fault, mdrun_mpi
On 10/5/12 3:03 PM, Ladasky wrote:
> Bumping this once before the weekend, hoping to get some help.
>
> I am getting segmentation fault errors at 1 to 2 million cycles into my
> production MD runs, using GROMACS 4.5.4. If these errors are a
> consequence of a poorly-equilibrated system, I am no longer getting the
> right kind of error messages to support that conclusion. I am not
> getting PME or SETTLE errors, just a non-descriptive segmentation
> fault.
>
> I have corrected earlier shortcomings in my equilibration protocols, as
> discussed in this earlier thread:
> http://gromacs.5086.n6.nabble.com/Re-Water-molecules-cannot-be-settled-why-tp4999302.html
>
> I am now monitoring the macroscopic properties of my simulation.
> Potential, pressure, density, and temperature converge and then remain
> stable, at least as well as demonstrated in Justin Lemkul's most recent
> tutorial:
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin/gmx-tutorials/lysozyme/index.html
>
> The trajectories of my simulations do not appear to be radical in any
> way that I can discern. I have a partially-unfolded protein, folding
> gradually, in a box of water with counter-ions.
>
> From my previous thread, I have come to appreciate just how far from
> equilibrium the initial state of a simulation can be. I have also
> always understood that MD simulations are chaotic, and that
> instabilities can result simply from the fact that a continuous system
> is being modeled in discrete time steps.
>
> (As an aside, one of my first programming puzzles was about exactly
> this kind of thing. When I was a high-school student, I wanted to
> simulate the orbits of the Moon about the Earth, and the Earth about
> the Sun. It sounded simple enough: just apply the inverse-square law
> for gravity, right? Yet no matter how I tried, I couldn't achieve a
> stable system. Deeper reading led me to the intuitive and quick
> "leapfrog" method of improving differential approximations, which
> GROMACS apparently uses, and to the more powerful but slower
> Runge-Kutta methods, which GROMACS apparently does not use.)
>
> By correcting my earlier problems, I have extended the time that my
> simulations will run by a factor of 10-20, out to several nanoseconds.
> That's progress, but I'm never going to reach one microsecond this way.
>
> Any advice is appreciated. Of course I can post MDP files again, as
> well as graphs.

Random segmentation faults are really hard to debug. Can you resume the
run using a checkpoint file? That would suggest maybe an MPI problem or
something else external to Gromacs. Without a reproducible system and a
debugging backtrace, it's going to be hard to figure out where the
problem is coming from.

-Justin
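[Editor's note: the orbital-mechanics aside above is easy to reproduce.
The sketch below, which is not GROMACS code, integrates a circular
two-body orbit (GM = 1) with forward Euler and with leapfrog in its
velocity-Verlet form; the step size and step count are arbitrary
choices. Euler steadily gains energy and spirals outward, while
leapfrog's energy error stays bounded, which is exactly why MD codes
use it.]

```python
#!/usr/bin/env python
"""Forward Euler vs. leapfrog (velocity-Verlet form) on a Kepler orbit.
Illustrates the stability difference described in the thread; nothing
here is specific to GROMACS."""
import math

def accel(x, y):
    """Inverse-square gravity toward the origin, with GM = 1."""
    r3 = (x * x + y * y) ** 1.5
    return -x / r3, -y / r3

def energy(x, y, vx, vy):
    """Total energy per unit mass: kinetic + potential (E = -0.5 here)."""
    return 0.5 * (vx * vx + vy * vy) - 1.0 / math.hypot(x, y)

def euler_error(steps, dt=0.01):
    """|energy drift| after integrating with forward Euler."""
    x, y, vx, vy = 1.0, 0.0, 0.0, 1.0      # circular orbit, E = -0.5
    for _ in range(steps):
        ax, ay = accel(x, y)
        x, y = x + vx * dt, y + vy * dt
        vx, vy = vx + ax * dt, vy + ay * dt
    return abs(energy(x, y, vx, vy) + 0.5)

def leapfrog_error(steps, dt=0.01):
    """|energy drift| after integrating with kick-drift-kick leapfrog."""
    x, y, vx, vy = 1.0, 0.0, 0.0, 1.0
    ax, ay = accel(x, y)
    for _ in range(steps):
        vx, vy = vx + 0.5 * dt * ax, vy + 0.5 * dt * ay   # half kick
        x, y = x + dt * vx, y + dt * vy                   # drift
        ax, ay = accel(x, y)
        vx, vy = vx + 0.5 * dt * ax, vy + 0.5 * dt * ay   # half kick
    return abs(energy(x, y, vx, vy) + 0.5)

# After ~16 orbits (10,000 steps), Euler's energy error dwarfs leapfrog's.
```

Leapfrog is symplectic: its energy error oscillates within a fixed band
instead of accumulating, which is why it holds an orbit (or an MD
trajectory) stable over millions of steps where Euler blows up. It does
not, however, prevent seg faults, which come from elsewhere.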