On 02 May 2014, at 14:08, Yosef Zlochower <yo...@astro.rit.edu> wrote:

> Hi
> I have been having problems running on Stampede for a long time. I couldn't
> get the latest stable ET to run because it would die during checkpointing.

OK that's very interesting.  Has something changed in the code related to how 
checkpoint files are written?

> I had to backtrack to the Orsted version (unfortunately, that has a bug in
> the way the grid is set up, causing some of the intermediate levels to span
> both black holes, wasting a lot of memory).

That bug should have been fixed in a backport; are you sure you are checking 
out the branch and not the tag?  In any case, it can be worked around by 
setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am 
thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html)
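
For reference, the workaround would be a single line in the parameter file. This is only a sketch: it assumes CarpetRegrid2 is already active in the run and that the bug is the one described in the linked post.

```
# Workaround for the regridding bug discussed above (assumed to be the same bug)
CarpetRegrid2::min_fraction = 1
```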

> Even with Orsted, stalling is a real issue. Currently, my "solution" is to
> run for 4 hours at a time. This would have been OK on Lonestar or Ranger,
> because when I chained a bunch of runs, the next in line would start almost
> right away, but on Stampede the delay is quite substantial. I believe Jim
> Healy opened a ticket concerning the RIT issues with running ET on Stampede.

I think this is the ticket: https://trac.einsteintoolkit.org/ticket/1547.  I 
will add my information there.  The current queue wait time on Stampede is more 
than a day, so splitting into 4-hour chunks is not feasible, as you say.

I'm starting to think it might be a code problem as well.  So the summary is:

        – Checkpointing causes jobs to die with code versions after Oersted
        – All versions lead to eventual hung jobs after a few hours

Since Stampede is the major "capability" resource in XSEDE, we should put some 
effort into making sure the ET can run properly there.
Ian Hinder
