On 02 May 2014, at 16:57, Yosef Zlochower <yo...@astro.rit.edu> wrote:
> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>
>> On 02 May 2014, at 14:08, Yosef Zlochower <yo...@astro.rit.edu
>> <mailto:yo...@astro.rit.edu>> wrote:
>>
>>> Hi
>>>
>>> I have been having problems running on Stampede for a long time. I
>>> couldn't get the latest stable ET to run, because it would die
>>> during checkpointing.
>>
>> OK, that's very interesting. Has something changed in the code related
>> to how checkpoint files are written?
>>
>>> I had to backtrack to the Oersted version (unfortunately, that has a
>>> bug in the way the grid is set up, causing some of the intermediate
>>> levels to span both black holes, wasting a lot of memory).
>>
>> That bug should have been fixed in a backport; are you sure you are
>> checking out the branch and not the tag? In any case, it can be worked
>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>> same bug I am thinking of
>> (http://cactuscode.org/pipermail/users/2013-January/003290.html).
>
> I was using an old executable, so it wouldn't have had the backport
> fix.
>
>>
>>> Even with Oersted, stalling is a real issue. Currently, my "solution"
>>> is to run for 4 hours at a time. This would have been OK on Lonestar
>>> or Ranger, because when I chained a bunch of runs, the next in line
>>> would start almost right away, but on Stampede the delay is quite
>>> substantial. I believe Jim Healy opened a ticket concerning the RIT
>>> issues with running the ET on Stampede.
>>
>> I think this is the ticket:
>> https://trac.einsteintoolkit.org/ticket/1547. I will add my information
>> there. The current queue wait time on Stampede is more than a day, so
>> splitting into 3-hour chunks is not feasible, as you say.
>>
>> I'm starting to think it might be a code problem as well. So the
>> summary is:
>>
>> - Checkpointing causes jobs to die with code versions after Oersted
>> - All versions lead to eventual hung jobs after a few hours
>>
>> Since Stampede is the major "capability" resource in XSEDE, we should
>> put some effort into making sure the ET can run properly there.
>
> We find issues with runs stalling on our local cluster too. The hardware
> setup is similar to Stampede (Intel Nehalem with QDR IB and OpenMPI on
> top of a proprietary IB library). There's no guarantee that the issues
> are the same, but we can try to run some tests locally (note that we
> have no issues with runs failing to checkpoint).

I resubmitted, and the new job hangs later on. gdb says it is in
CarpetIOScalar while doing output of a maximum reduction. I have disabled
this output and resubmitted.

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
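For reference, a minimal parameter-file sketch of the two changes discussed
above: the CarpetRegrid2 workaround for the Oersted-era regridding bug, and
turning off the scalar output whose maximum reduction the hung job was stuck
in. The min_fraction value is the one quoted in the thread; the IOScalar
parameter names are my best recollection of CarpetIOScalar's interface and
should be checked against the thorn's param.ccl before use.

    # Work around the Oersted-era regridding bug (intermediate levels
    # spanning both black holes) when running an unpatched executable:
    CarpetRegrid2::min_fraction = 1.0

    # Avoid the reduction the hung job was stuck in, either by disabling
    # scalar output entirely ...
    IOScalar::outScalar_every = 0
    # ... or by dropping "maximum" from the reduction list (parameter
    # name assumed, check param.ccl):
    # IOScalar::outScalar_reductions = "minimum norm1 norm2"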