On 28 Jul 2014, at 15:29, Yosef Zlochower <yo...@astro.rit.edu> wrote:

> One thing I'm not sure of is when the "send desc" error is generated. It 
> could be generated when the queue kills the job.
> This was the only job I was running at the time. It uses 16 nodes, with 32 
> MPI processes in total. I don't think it should have been able to
> overload the filesystem.

By the way, I have been using the "mvapich2-x" version of MPI on Stampede after a 
suggestion from their support.  The relevant change in my configuration is:

< MPI_DIR  = /opt/apps/intel13/mvapich2/1.9
---
> MPI_DIR  = /home1/apps/intel13/mvapich2-x/2.0b
> MPI_LIB_DIRS = /home1/apps/intel13/mvapich2-x/2.0b/lib64

It seems to work well.  Maybe you could try that and see if things improve?

> 
> 
> On 07/28/2014 06:14 AM, Ian Hinder wrote:
>> 
>> On 14 Jul 2014, at 22:14, Yosef Zlochower <yo...@astro.rit.edu> wrote:
>> 
>>> I tried a run on Stampede today and it died during checkpointing with the
>>> error
>>> " send desc error
>>> send desc error
>>> [0] Abort: Got completion with error 12, vendor code=81, dest rank=
>>> at line 892 in file ../../ofa_poll.c"
>>> 
>>> Have you been having success running production runs on Stampede?
>> 
>> I have seen errors when several runs checkpoint at the same time, as can 
>> happen if many jobs start simultaneously and dump a checkpoint after 3 
>> hours. According to TACC support, there was nothing unusual in the system 
>> logs.  I thought it would be useful to add a "random" delay to the 
>> checkpoint code.  For example, in addition to telling it to checkpoint every 
>> 3 hours, you could say "checkpoint every 3 hours, plus a random number 
>> between -20 and +20 minutes".
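
(To expand on the "random" delay idea: here is a rough, untested sketch in
plain C of what I have in mind.  The function name and arguments are made up
for illustration; this is not actual Cactus or Carpet code.)

  #include <stdlib.h>

  /* Hypothetical helper: jitter the walltime-based checkpoint interval so
   * that jobs submitted together do not all hit the filesystem at the same
   * moment.  base_hours is the nominal interval (e.g. 3.0), jitter_minutes
   * the maximum offset in either direction (e.g. 20.0).  rand() should be
   * seeded once per run, e.g. from the job ID, so different jobs get
   * different offsets. */
  static double jittered_checkpoint_hours(double base_hours,
                                          double jitter_minutes)
  {
    /* uniform random offset in [-jitter_minutes, +jitter_minutes] */
    double offset_min = ((double)rand() / (double)RAND_MAX)
                        * 2.0 * jitter_minutes - jitter_minutes;
    return base_hours + offset_min / 60.0;
  }

Process 0 could compute this once at startup and broadcast it, so that all
ranks of a given run agree on the same (but per-job random) interval.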
>> 
>> The error message above suggests something to do with communication ("send 
>> desc").  Checkpointing itself shouldn't do any MPI communication, should it? 
>>  Does it perform consistency checks across processes, or otherwise do 
>> communication?  I also saw freezes during scalar reduction output (see 
>> quoted text below).  Maybe some of the processes are taking much longer to 
>> checkpoint than others, and the ones which finish time out while trying to 
>> communicate?  Maybe adding a barrier after checkpointing would make this 
>> clearer?
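
(To be concrete about the barrier suggestion, I mean roughly this kind of
diagnostic, again as an untested sketch; "write_checkpoint" stands in for
whatever routine does the local checkpoint I/O and is not a real function
name.)

  #include <mpi.h>
  #include <stdio.h>

  /* Sketch: time the local checkpoint write, then synchronise.  A process
   * that is stuck in I/O then shows up as a long barrier wait on the other
   * ranks, rather than as a later "send desc" / completion-with-error
   * failure during unrelated communication. */
  static void checkpoint_with_barrier(void (*write_checkpoint)(void))
  {
    int rank;
    double t0, t1;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    write_checkpoint();               /* local checkpoint I/O only */
    t1 = MPI_Wtime();

    printf("rank %d: checkpoint took %.1f s\n", rank, t1 - t0);

    MPI_Barrier(MPI_COMM_WORLD);
  }

If the timings differed wildly between ranks, that would point at filesystem
contention rather than at MPI itself.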
>> 
>> 
>> 
>>> 
>>> 
>>> On 05/02/2014 12:15 PM, Ian Hinder wrote:
>>>> 
>>>> On 02 May 2014, at 16:57, Yosef Zlochower <yo...@astro.rit.edu
>>>> <mailto:yo...@astro.rit.edu>> wrote:
>>>> 
>>>>> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>>>>> 
>>>>>> On 02 May 2014, at 14:08, Yosef Zlochower <yo...@astro.rit.edu
>>>>>> <mailto:yo...@astro.rit.edu>
>>>>>> <mailto:yo...@astro.rit.edu>> wrote:
>>>>>> 
>>>>>>> Hi
>>>>>>> 
>>>>>>> I have been having problems running on Stampede for a long time. I
>>>>>>> couldn't get the latest
>>>>>>> stable ET to run because it would die during checkpointing.
>>>>>> 
>>>>>> OK that's very interesting.  Has something changed in the code related
>>>>>> to how checkpoint files are written?
>>>>>> 
>>>>>>> I had to backtrack to
>>>>>>> the Oersted version (unfortunately, that has a bug in the way the grid
>>>>>>> is set up, causing some of the
>>>>>>> intermediate levels to span both black holes, wasting a lot of memory).
>>>>>> 
>>>>>> That bug should have been fixed in a backport; are you sure you are
>>>>>> checking out the branch and not the tag?  In any case, it can be worked
>>>>>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>>>>>> same bug I am thinking of
>>>>>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
>>>>> 
>>>>> I was using an old executable so it wouldn't have had the backport
>>>>> fix.
>>>>> 
>>>>>> 
>>>>>>> Even with
>>>>>>> Oersted, stalling is a real issue. Currently, my "solution" is to run
>>>>>>> for 4 hours at a time.
>>>>>>> This would have been OK on Lonestar or Ranger,
>>>>>>> because when I chained a bunch of runs, the next in line would start
>>>>>>> almost right away, but on Stampede the delay is quite substantial. I
>>>>>>> believe Jim Healy opened
>>>>>>> a ticket concerning the RIT issues with running the ET on Stampede.
>>>>>> 
>>>>>> I think this is the ticket:
>>>>>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>>>>>> there.  The current queue wait time on Stampede is more than a day, so
>>>>>> splitting runs into 3-hour chunks is not feasible, as you say.
>>>>>> 
>>>>>> I'm starting to think it might be a code problem as well.  So the
>>>>>> summary is:
>>>>>> 
>>>>>> – Checkpointing causes jobs to die with code versions after Oersted
>>>>>> – All versions lead to eventual hung jobs after a few hours
>>>>>> 
>>>>>> Since Stampede is the major "capability" resource in XSEDE, we should
>>>>>> put some effort into making sure the ET can run properly there.
>>>>> 
>>>>> We find issues with runs stalling on our local cluster too. The hardware
>>>>> setup is similar to Stampede's (Intel Nehalem with QDR IB and Open MPI on
>>>>> top of a proprietary IB library). There's no guarantee that the issues
>>>>> are the same, but we can try to run some tests locally (note that we
>>>>> have no issues with runs failing to checkpoint).
>>>> 
>>>> I resubmitted, and the new job hangs later on.  gdb says it is in
>>>> CarpetIOScalar while doing output of a maximum reduction.  I've disabled
>>>> this and resubmitted.
>>>> 
>>>> --
>>>> Ian Hinder
>>>> http://numrel.aei.mpg.de/people/hinder
>>>> 
>>> 
>>> 
>> 
> 
> 
> -- 
> Dr. Yosef Zlochower
> Center for Computational Relativity and Gravitation
> Associate Professor
> School of Mathematical Sciences
> Rochester Institute of Technology
> 85 Lomb Memorial Drive
> Rochester, NY 14623
> 
> Office:74-2067
> Phone: +1 585-475-6103
> 
> yo...@astro.rit.edu
> 

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder

