On 28 Jul 2014, at 15:29, Yosef Zlochower <yo...@astro.rit.edu> wrote:
> One thing I'm not sure of is when the "send desc" error is generated. It
> could be generated when the queue kills the job.
> This was the only job I was running at the time. It uses 16 nodes, with 32
> MPI processes in total. I don't think it should have been able to
> overload the filesystem.

By the way, I have been using the "mvapich2-x" version on Stampede after a
suggestion from their support:

< MPI_DIR = /opt/apps/intel13/mvapich2/1.9
---
> MPI_DIR = /home1/apps/intel13/mvapich2-x/2.0b
> MPI_LIB_DIRS = /home1/apps/intel13/mvapich2-x/2.0b/lib64

It seems to work well. Maybe you could try that and see if things improve?

> On 07/28/2014 06:14 AM, Ian Hinder wrote:
>>
>> On 14 Jul 2014, at 22:14, Yosef Zlochower <yo...@astro.rit.edu> wrote:
>>
>>> I tried a run on Stampede today and it died during checkpoint with the
>>> error
>>> "send desc error
>>> send desc error
>>> [0] Abort: Got completion with error 12, vendor code=81, dest rank=
>>> at line 892 in file ../../ofa_poll.c"
>>>
>>> Have you been having success running production runs on Stampede?
>>
>> I have seen errors when several runs checkpoint at the same time, as can
>> happen if many jobs start simultaneously and dump a checkpoint after 3
>> hours. According to TACC support, there was nothing unusual in the system
>> logs. I thought it would be useful to add a "random" delay to the
>> checkpoint code. For example, in addition to telling it to checkpoint
>> every 3 hours, you could say "checkpoint every 3 hours, plus a random
>> number between -20 and +20 minutes".
>>
>> The error message above suggests something to do with communication
>> ("send desc"). Checkpointing itself shouldn't do any MPI communication,
>> should it? Does it perform consistency checks across processes, or
>> otherwise do communication? I also saw freezes during scalar reduction
>> output (see quoted text below). Maybe some of the processes are taking
>> much longer to checkpoint than others, and the ones which finish time out
>> while trying to communicate? Maybe adding a barrier after checkpointing
>> would make this clearer?
>>
>>>
>>> On 05/02/2014 12:15 PM, Ian Hinder wrote:
>>>>
>>>> On 02 May 2014, at 16:57, Yosef Zlochower <yo...@astro.rit.edu> wrote:
>>>>
>>>>> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>>>>>
>>>>>> On 02 May 2014, at 14:08, Yosef Zlochower <yo...@astro.rit.edu> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I have been having problems running on Stampede for a long time. I
>>>>>>> couldn't get the latest stable ET to run because during
>>>>>>> checkpointing, it would die.
>>>>>>
>>>>>> OK, that's very interesting. Has something changed in the code
>>>>>> related to how checkpoint files are written?
>>>>>>
>>>>>>> I had to backtrack to the Oersted version (unfortunately, that has a
>>>>>>> bug in the way the grid is set up, causing some of the intermediate
>>>>>>> levels to span both black holes, wasting a lot of memory).
>>>>>>
>>>>>> That bug should have been fixed in a backport; are you sure you are
>>>>>> checking out the branch and not the tag? In any case, it can be
>>>>>> worked around by setting CarpetRegrid2::min_fraction = 1, assuming
>>>>>> this is the same bug I am thinking of
>>>>>> (http://cactuscode.org/pipermail/users/2013-January/003290.html).
>>>>>
>>>>> I was using an old executable, so it wouldn't have had the backport
>>>>> fix.
>>>>>
>>>>>>
>>>>>>> Even with Oersted, stalling is a real issue. Currently, my
>>>>>>> "solution" is to run for 4 hours at a time. This would have been OK
>>>>>>> on Lonestar or Ranger, because when I chained a bunch of runs, the
>>>>>>> next in line would start almost right away, but on Stampede the
>>>>>>> delay is quite substantial. I believe Jim Healy opened a ticket
>>>>>>> concerning the RIT issues with running ET on Stampede.
>>>>>>
>>>>>> I think this is the ticket:
>>>>>> https://trac.einsteintoolkit.org/ticket/1547. I will add my
>>>>>> information there. The current queue wait time on Stampede is more
>>>>>> than a day, so splitting into 3-hour chunks is not feasible, as you
>>>>>> say.
>>>>>>
>>>>>> I'm starting to think it might be a code problem as well. So the
>>>>>> summary is:
>>>>>>
>>>>>> - Checkpointing causes jobs to die with code versions after Oersted
>>>>>> - All versions lead to eventual hung jobs after a few hours
>>>>>>
>>>>>> Since Stampede is the major "capability" resource in XSEDE, we should
>>>>>> put some effort into making sure the ET can run properly there.
>>>>>
>>>>> We find issues with runs stalling on our local cluster too. The
>>>>> hardware setup is similar to Stampede (Intel Nehalem with QDR IB and
>>>>> OpenMPI on top of a proprietary IB library). There's no guarantee that
>>>>> the issues are the same, but we can try to run some tests locally
>>>>> (note that we have no issues with runs failing to checkpoint).
>>>>
>>>> I resubmitted, and the new job hangs later on. gdb says it is in
>>>> CarpetIOScalar while doing output of a maximum reduction. I've disabled
>>>> this and resubmitted.
>>>>
>>>> --
>>>> Ian Hinder
>>>> http://numrel.aei.mpg.de/people/hinder
>>>
>>>
>>> --
>>> Dr. Yosef Zlochower
>>> Center for Computational Relativity and Gravitation
>>> Associate Professor
>>> School of Mathematical Sciences
>>> Rochester Institute of Technology
>>> 85 Lomb Memorial Drive
>>> Rochester, NY 14623
>>>
>>> Office: 74-2067
>>> Phone: +1 585-475-6103
>>>
>>> yo...@astro.rit.edu
>
>
> --
> Dr. Yosef Zlochower
> Center for Computational Relativity and Gravitation
> Associate Professor
> School of Mathematical Sciences
> Rochester Institute of Technology
> 85 Lomb Memorial Drive
> Rochester, NY 14623
>
> Office: 74-2067
> Phone: +1 585-475-6103
>
> yo...@astro.rit.edu
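
Coming back to the checkpoint-jitter idea quoted above: below is a very rough
sketch of what I mean, in plain MPI C. None of these names come from the
actual Cactus/IOUtil checkpoint code; they are invented purely for
illustration. The one detail that matters is that all processes of a job must
agree on the offset (e.g. choose it on rank 0, seeded per job, and broadcast
it), otherwise the ranks would try to checkpoint at different times.

/* Sketch only: "checkpoint every 3 hours, plus a random number between
 * -20 and +20 minutes".  Hypothetical code, not from Cactus/IOUtil. */

#include <mpi.h>
#include <stdlib.h>

#define BASE_INTERVAL (3.0 * 3600.0)  /* 3 hours, in seconds */
#define MAX_JITTER    (20.0 * 60.0)   /* +/- 20 minutes      */

/* Wall-time interval until the next checkpoint.  The offset is chosen
 * on rank 0 and broadcast so that all ranks of one job checkpoint
 * together, while different jobs (with different random seeds) do not
 * all hit the filesystem at the same moment. */
static double next_checkpoint_interval(MPI_Comm comm)
{
  int rank;
  double offset = 0.0;

  MPI_Comm_rank(comm, &rank);
  if (rank == 0)
    offset = (2.0 * rand() / (double)RAND_MAX - 1.0) * MAX_JITTER;
  MPI_Bcast(&offset, 1, MPI_DOUBLE, 0, comm);

  return BASE_INTERVAL + offset;
}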
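
Similarly, for the barrier suggestion: something along these lines (again a
hypothetical sketch, with write_checkpoint() standing in for whatever routine
actually does the I/O) would show whether some ranks take much longer to
write their checkpoint than the others:

/* Sketch only: time the checkpoint write on each rank, then wait at a
 * barrier.  write_checkpoint() is a placeholder, not a real Cactus
 * routine. */

#include <mpi.h>
#include <stdio.h>

void write_checkpoint(void);   /* placeholder for the actual I/O call */

void checkpoint_with_barrier(MPI_Comm comm)
{
  int rank;
  double t0, t1, t2;

  MPI_Comm_rank(comm, &rank);

  t0 = MPI_Wtime();
  write_checkpoint();
  t1 = MPI_Wtime();

  /* Everyone waits here until the slowest rank has finished writing. */
  MPI_Barrier(comm);
  t2 = MPI_Wtime();

  printf("rank %d: checkpoint write %.1f s, barrier wait %.1f s\n",
         rank, t1 - t0, t2 - t1);
}

If the barrier wait times differ wildly between ranks, that would support the
idea that the fast ranks are timing out in later communication while the slow
ones are still writing.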
--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder

_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users