On 02 May 2014, at 14:08, Yosef Zlochower <yo...@astro.rit.edu> wrote:

> Hi
> 
> I have been having problems running on Stampede for a long time. I couldn't
> get the latest stable ET to run because it would die during checkpointing.

OK that's very interesting.  Has something changed in the code related to how 
checkpoint files are written?
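Just so we are comparing the same thing: I am assuming the standard
CarpetIOHDF5-based checkpointing here, i.e. a parameter-file section roughly
like the following (the directory and interval are only placeholders, not
what you necessarily use):

        ActiveThorns = "CarpetIOHDF5"            # HDF5 checkpoint writer
        IOHDF5::checkpoint   = "yes"             # write checkpoints via HDF5
        IO::checkpoint_dir   = "checkpoints"     # placeholder directory
        IO::checkpoint_every = 1024              # placeholder interval (iterations)
        IO::recover          = "autoprobe"       # recover automatically if files exist

If you are checkpointing through some other mechanism, please say so, since
that changes where to look for the failure.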

> I had to backtrack to
> the Oersted version (unfortunately, that has a bug in the way the grid is set
> up, causing some of the intermediate levels to span both black holes, wasting
> a lot of memory).

That bug should have been fixed in a backport; are you sure you are checking 
out the branch and not the tag?  In any case, it can be worked around by 
setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am 
thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html)
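Concretely, that workaround is a single line in the parameter file
(CarpetRegrid2 is already active in any moving-box run, so nothing else needs
to change); if I remember the semantics correctly, it tells CarpetRegrid2 to
merge refinement boxes only when doing so wastes no points:

        CarpetRegrid2::min_fraction = 1.0   # only merge boxes when no points are wasted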

> Even with
> Oersted, stalling is a real issue. Currently, my "solution" is to run for 4
> hours at a time.
> This would have been OK on Lonestar or Ranger,
> because when I chained a bunch of runs, the next in line would start
> almost right away, but on Stampede the delay is quite substantial. I believe
> Jim Healy opened
> a ticket concerning the RIT issues with running ET on Stampede.

I think this is the ticket: https://trac.einsteintoolkit.org/ticket/1547.  I
will add my information there.  The current queue wait time on Stampede is more
than a day, so splitting into 3-hour chunks is not feasible, as you say.
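Since Stampede uses Slurm, one partial mitigation is to submit the next chunk
with a job dependency, so that it is already sitting in the queue while the
current chunk runs (the job ID and script name below are just placeholders):

        sbatch --dependency=afterany:123456 next_chunk.slurm

That does not shorten the queue wait itself, of course, but it avoids stacking
a fresh submission delay on top of every chunk.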

I'm starting to think it might be a code problem as well.  So the summary is:

        – Checkpointing causes jobs to die with code versions after Oersted
        – All versions lead to eventual hung jobs after a few hours

Since Stampede is the major "capability" resource in XSEDE, we should put some
effort into making sure the ET can run properly there.
-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
