On 02 May 2014, at 16:57, Yosef Zlochower <yo...@astro.rit.edu> wrote:

> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>> 
>> On 02 May 2014, at 14:08, Yosef Zlochower <yo...@astro.rit.edu> wrote:
>> 
>>> Hi
>>> 
>>> I have been having problems running on Stampede for a long time. I
>>> couldn't get the latest stable ET to run because it would die during
>>> checkpointing.
>> 
>> OK, that's very interesting.  Has something changed in the code related
>> to how checkpoint files are written?
>> 
>>> I had to backtrack to the Oersted version (unfortunately, that has a
>>> bug in the way the grid is set up, causing some of the intermediate
>>> levels to span both black holes, which wastes a lot of memory).
>> 
>> That bug should have been fixed in a backport; are you sure you are
>> checking out the branch and not the tag?  In any case, it can be worked
>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>> same bug I am thinking of
>> (http://cactuscode.org/pipermail/users/2013-January/003290.html).
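>> 
>> (In parameter-file form, the workaround is just the single line
>> 
>>   CarpetRegrid2::min_fraction = 1
>> 
>> If I remember the semantics correctly, this makes CarpetRegrid2 combine
>> two refined regions into one box only when doing so wastes no points,
>> which is what stops the intermediate levels from spanning both holes.)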
> 
> I was using an old executable so it wouldn't have had the backport
> fix.
> 
>> 
>>> Even with Oersted, stalling is a real issue. Currently, my "solution"
>>> is to run for 4 hours at a time. This would have been OK on Lonestar
>>> or Ranger, because when I chained a bunch of runs, the next in line
>>> would start almost right away, but on Stampede the delay is quite
>>> substantial. I believe Jim Healy opened a ticket concerning the RIT
>>> issues with running ET on Stampede.
>> 
>> I think this is the ticket:
>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>> there.  The current queue wait time on Stampede is more than a day, so
>> splitting into 3-hour chunks is not feasible, as you say.
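>> 
>> (For what it's worth, the chaining itself is straightforward with
>> Stampede's SLURM scheduler; a minimal sketch, with run.sh standing in
>> for whatever submit script you actually use:
>> 
>>   JOBID=$(sbatch --parsable run.sh)
>>   JOBID=$(sbatch --parsable --dependency=afterany:${JOBID} run.sh)
>>   JOBID=$(sbatch --parsable --dependency=afterany:${JOBID} run.sh)
>> 
>> With day-long queue waits, though, each link in the chain still sits
>> in the queue for a day before starting, which is exactly the problem.)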
>> 
>> I'm starting to think it might be a code problem as well.  So the
>> summary is:
>> 
>> – Checkpointing causes jobs to die with code versions after Oersted
>> – All versions lead to eventual hung jobs after a few hours
>> 
>> Since Stampede is the major "capability" resource in XSEDE, we should
>> put some effort into making sure the ET can run properly there.
> 
> We find issues with runs stalling on our local cluster too. The hardware
> setup is similar to Stampede's (Intel Nehalem with QDR IB and Open MPI on
> top of a proprietary IB library). There's no guarantee that the issues
> are the same, but we can try to run some tests locally (note that we
> have no issues with runs failing to checkpoint).

I resubmitted, and the new job hangs later on.  gdb says it is in 
CarpetIOScalar while doing output of a maximum reduction.  I've disabled this 
and resubmitted.
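
For reference, I attached to one of the hung processes roughly like
this, where <pid> is whatever ps reports for the Cactus executable on
the node:

  gdb -p <pid>
  (gdb) thread apply all bt

By "disabled this" I mean turning off the scalar output in the
parameter file, along the lines of

  IOScalar::outScalar_every = 0

(parameter name from memory; check CarpetIOScalar's param.ccl for the
exact spelling).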

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
