On 02 May 2014, at 14:08, Yosef Zlochower <yo...@astro.rit.edu> wrote:
> Hi
>
> I have been having problems running on Stampede for a long time. I couldn't
> get the latest stable ET to run because during checkpointing, it would die.

OK that's very interesting. Has something changed in the code related to how checkpoint files are written?

> I had to backtrack to the Orsted version (unfortunately, that has a bug in
> the way the grid is set up, causing some of the intermediate levels to span
> both black holes, wasting a lot of memory).

That bug should have been fixed in a backport; are you sure you are checking out the branch and not the tag? In any case, it can be worked around by setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html)

> Even with Orsted, stalling is a real issue. Currently, my "solution" is to
> run for 4 hours at a time. This would have been OK on Lonestar or Ranger,
> because when I chained a bunch a runs, the next in line would start
> almost right away, but on stampede the delay is quite substantial. I believe
> Jim Healy opened a ticket concerning the RIT issues with running ET on
> stampede.

I think this is the ticket: https://trac.einsteintoolkit.org/ticket/1547. I will add my information there. The current queue wait time on stampede is more than a day, so splitting into 3 hour chunks is not feasible, as you say.

I'm starting to think it might be a code problem as well. So the summary is:

- Checkpointing causes jobs to die with code versions after Oersted
- All versions lead to eventual hung jobs after a few hours

Since Stampede is the major "capability" resource in Xsede, we should put some effort into making sure the ET can run properly there.

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users