Re: [Users] Stampede

2014-08-08 Thread Yosef Zlochower

On 08/06/2014 09:00 AM, Ian Hinder wrote:
>
> On 28 Jul 2014, at 15:29, Yosef Zlochower  wrote:
>
>> One thing I'm not sure of is when the "send desc" error is generated. It 
>> could be generated when the queue kills the job.
>> This was the only job I was running at the time. It uses 16 nodes, with 32 
>> MPI processes in total. I don't think it should have been able to
>> overload the filesystem.
>
> By the way, I have been using the "mvapich2x" version on Stampede after a 
> suggestion from their support.

Thanks. I tried MVAPICH2 and the run died during the final checkpoint.
I asked TACC to look into why my runs died. The only thing they could
tell me was that towards the end, on the node that died, 27 GB of RAM
were in use. This is strange, because I would expect no more than 15
GB of RAM to be used. Perhaps there is a slow memory leak.
A 4-hour run seemed to work fine.
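One cheap way to check for a slow leak would be to log each rank's peak
resident set size once in a while, say at every checkpoint. A minimal sketch
in C, assuming Linux semantics for getrusage() (ru_maxrss in kilobytes); this
is not existing toolkit code, just an illustration:

#include <stdio.h>
#include <sys/resource.h>
#include <mpi.h>

/* Sketch: print this rank's peak resident set size so a slow leak shows
   up as steady growth in the logs.  Assumes Linux, where ru_maxrss is
   reported in kilobytes. */
static void log_peak_rss(const char *where)
{
  struct rusage ru;
  int rank;
  getrusage(RUSAGE_SELF, &ru);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("%s: rank %d peak RSS = %ld MB\n", where, rank, ru.ru_maxrss / 1024);
}

Calling this at each checkpoint would show whether the 27 GB builds up
gradually or appears all at once near the end.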




>
> < MPI_DIR  = /opt/apps/intel13/mvapich2/1.9
> ---
>> MPI_DIR  = /home1/apps/intel13/mvapich2-x/2.0b
>> MPI_LIB_DIRS = /home1/apps/intel13/mvapich2-x/2.0b/lib64
>
> It seems to work well.  Maybe you could try that and see if things improve?
>
>>
>>
>> On 07/28/2014 06:14 AM, Ian Hinder wrote:
>>>
>>> On 14 Jul 2014, at 22:14, Yosef Zlochower  wrote:
>>>
 I tried a run on stampede today and it died during checkpoint with the
 error
 " send desc error
 send desc error
 [0] Abort: Got completion with error 12, vendor code=81, dest rank=
 at line 892 in file ../../ofa_poll.c"

 Have you been having success running production runs on stampede?
>>>
>>> I have seen errors when several runs checkpoint at the same time, as can 
>>> happen if many jobs start simultaneously and dump a checkpoint after 3 
>>> hours. According to TACC support, there was nothing unusual in the system 
>>> logs.  I thought it would be useful to add a "random" delay to the 
>>> checkpoint code.  For example, in addition to telling it to checkpoint 
>>> every 3 hours, you could say "checkpoint every 3 hours, plus a random 
>>> number between -20 and +20 minutes".
>>>
>>> The error message above suggests something to do with communication ("send 
>>> desc").  Checkpointing itself shouldn't do any MPI communication, should 
>>> it?  Does it perform consistency checks across processes, or otherwise do 
>>> communication?  I also saw freezes during scalar reduction output (see 
>>> quoted text below).  Maybe some of the processes are taking much longer to 
>>> checkpoint than others, and the ones which finish time out while trying to 
>>> communicate?  Maybe adding a barrier after checkpointing would make this 
>>> clearer?
>>>
>>>
>>>


 On 05/02/2014 12:15 PM, Ian Hinder wrote:
>
> On 02 May 2014, at 16:57, Yosef Zlochower  > wrote:
>
>> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>>
>>> On 02 May 2014, at 14:08, Yosef Zlochower >> 
>>> > wrote:
>>>
 Hi

 I have been having problems running on Stampede for a long time. I
 couldn't get the latest
 stable ET to run because during checkpointing, it would die.
>>>
>>> OK that's very interesting.  Has something changed in the code related
>>> to how checkpoint files are written?
>>>
 I had to backtrack to
 the Orsted version (unfortunately, that has a bug in the way the grid
 is set up, causing some of the
 intermediate levels to span both black holes, wasting a lot of memory).
>>>
>>> That bug should have been fixed in a backport; are you sure you are
>>> checking out the branch and not the tag?  In any case, it can be worked
>>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>>> same bug I am thinking of
>>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
>>
>> I was using an old executable so it wouldn't have had the backport
>> fix.
>>
>>>
 Even with
 Orsted, stalling is a real issue. Currently, my "solution" is to run
 for 4 hours at a time.
 This would have been OK on Lonestar or Ranger,
 because when I chained a bunch of runs, the next in line would start
 almost right away, but on Stampede the delay is quite substantial. I
 believe Jim Healy opened
 a ticket concerning the RIT issues with running ET on Stampede.
>>>
>>> I think this is the ticket:
>>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>>> there.  The current queue wait time on stampede is more than a day, so
>>> splitting into 3 hour chunks is not feasible, as you say.
>>>
>>> I'm starting to think it might be a code problem as well.  So the
>>> summary is:
>>>
>>> – Checkpointing causes jobs to die with code versions after Oersted
>>> – All versions lead to eventual hung jobs after a few hours

Re: [Users] Stampede

2014-08-06 Thread Ian Hinder

On 28 Jul 2014, at 15:29, Yosef Zlochower  wrote:

> One thing I'm not sure of is when the "send desc" error is generated. It 
> could be generated when the queue kills the job.
> This was the only job I was running at the time. It uses 16 nodes, with 32 
> MPI processes in total. I don't think it should have been able to
> overload the filesystem.

By the way, I have been using the "mvapich2x" version on Stampede after a 
suggestion from their support.  

< MPI_DIR  = /opt/apps/intel13/mvapich2/1.9
---
> MPI_DIR  = /home1/apps/intel13/mvapich2-x/2.0b
> MPI_LIB_DIRS = /home1/apps/intel13/mvapich2-x/2.0b/lib64

It seems to work well.  Maybe you could try that and see if things improve?
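If it is unclear which MPI a given executable actually picked up (and the MPI
library is dynamically linked), something like

ldd exe/cactus_sim | grep -i mpi

with the executable name adjusted to your own configuration should list the
mvapich2-x libraries rather than the stock mvapich2 ones.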

> 
> 
> On 07/28/2014 06:14 AM, Ian Hinder wrote:
>> 
>> On 14 Jul 2014, at 22:14, Yosef Zlochower  wrote:
>> 
>>> I tried a run on stampede today and it died during checkpoint with the
>>> error
>>> " send desc error
>>> send desc error
>>> [0] Abort: Got completion with error 12, vendor code=81, dest rank=
>>> at line 892 in file ../../ofa_poll.c"
>>> 
>>> Have you been having success running production runs on stampede?
>> 
>> I have seen errors when several runs checkpoint at the same time, as can 
>> happen if many jobs start simultaneously and dump a checkpoint after 3 
>> hours. According to TACC support, there was nothing unusual in the system 
>> logs.  I thought it would be useful to add a "random" delay to the 
>> checkpoint code.  For example, in addition to telling it to checkpoint every 
>> 3 hours, you could say "checkpoint every 3 hours, plus a random number 
>> between -20 and +20 minutes".
>> 
>> The error message above suggests something to do with communication ("send 
>> desc").  Checkpointing itself shouldn't do any MPI communication, should it? 
>>  Does it perform consistency checks across processes, or otherwise do 
>> communication?  I also saw freezes during scalar reduction output (see 
>> quoted text below).  Maybe some of the processes are taking much longer to 
>> checkpoint than others, and the ones which finish time out while trying to 
>> communicate?  Maybe adding a barrier after checkpointing would make this 
>> clearer?
>> 
>> 
>> 
>>> 
>>> 
>>> On 05/02/2014 12:15 PM, Ian Hinder wrote:
 
 On 02 May 2014, at 16:57, Yosef Zlochower >>> > wrote:
 
> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>> 
>> On 02 May 2014, at 14:08, Yosef Zlochower > 
>> > wrote:
>> 
>>> Hi
>>> 
>>> I have been having problems running on Stampede for a long time. I
>>> couldn't get the latest
>>> stable ET to run because during checkpointing, it would die.
>> 
>> OK that's very interesting.  Has something changed in the code related
>> to how checkpoint files are written?
>> 
>>> I had to backtrack to
>>> the Orsted version (unfortunately, that has a bug in the way the grid
>>> is set up, causing some of the
>>> intermediate levels to span both black holes, wasting a lot of memory).
>> 
>> That bug should have been fixed in a backport; are you sure you are
>> checking out the branch and not the tag?  In any case, it can be worked
>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>> same bug I am thinking of
>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
> 
> I was using an old executable so it wouldn't have had the backport
> fix.
> 
>> 
>>> Even with
>>> Orsted, stalling is a real issue. Currently, my "solution" is to run
>>> for 4 hours at a time.
>>> This would have been OK on Lonestar or Ranger,
>>> because when I chained a bunch of runs, the next in line would start
>>> almost right away, but on Stampede the delay is quite substantial. I
>>> believe Jim Healy opened
>>> a ticket concerning the RIT issues with running ET on Stampede.
>> 
>> I think this is the ticket:
>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>> there.  The current queue wait time on stampede is more than a day, so
>> splitting into 3 hour chunks is not feasible, as you say.
>> 
>> I'm starting to think it might be a code problem as well.  So the
>> summary is:
>> 
>> – Checkpointing causes jobs to die with code versions after Oersted
>> – All versions lead to eventual hung jobs after a few hours
>> 
>> Since Stampede is the major "capability" resource in Xsede, we should
>> put some effort into making sure the ET can run properly there.
> 
> We find issues with runs stalling on our local cluster too. The hardware
> setup is similar to stampede (Intel Nehalem with QDR IB and openMPI on
> top of a proprietary IB library). There's no guarantee that the issues
> are the same, but we can try to run some tests locally (note that we
> have no issue

Re: [Users] Stampede

2014-07-28 Thread Yosef Zlochower
One thing I'm not sure of is when the "send desc" error is generated. It 
could be generated when the queue kills the job.
This was the only job I was running at the time. It uses 16 nodes, with 
32 MPI processes in total. I don't think it should have been able to
overload the filesystem.


On 07/28/2014 06:14 AM, Ian Hinder wrote:
>
> On 14 Jul 2014, at 22:14, Yosef Zlochower  wrote:
>
>> I tried a run on stampede today and it died during checkpoint with the
>> error
>> " send desc error
>> send desc error
>> [0] Abort: Got completion with error 12, vendor code=81, dest rank=
>> at line 892 in file ../../ofa_poll.c"
>>
>> Have you been having success running production runs on stampede?
>
> I have seen errors when several runs checkpoint at the same time, as can 
> happen if many jobs start simultaneously and dump a checkpoint after 3 hours. 
> According to TACC support, there was nothing unusual in the system logs.  I 
> thought it would be useful to add a "random" delay to the checkpoint code.  
> For example, in addition to telling it to checkpoint every 3 hours, you could 
> say "checkpoint every 3 hours, plus a random number between -20 and +20 
> minutes".
>
> The error message above suggests something to do with communication ("send 
> desc").  Checkpointing itself shouldn't do any MPI communication, should it?  
> Does it perform consistency checks across processes, or otherwise do 
> communication?  I also saw freezes during scalar reduction output (see quoted 
> text below).  Maybe some of the processes are taking much longer to 
> checkpoint than others, and the ones which finish time out while trying to 
> communicate?  Maybe adding a barrier after checkpointing would make this 
> clearer?
>
>
>
>>
>>
>> On 05/02/2014 12:15 PM, Ian Hinder wrote:
>>>
>>> On 02 May 2014, at 16:57, Yosef Zlochower >> > wrote:
>>>
 On 05/02/2014 10:07 AM, Ian Hinder wrote:
>
> On 02 May 2014, at 14:08, Yosef Zlochower  
> > wrote:
>
>> Hi
>>
>> I have been having problems running on Stampede for a long time. I
>> couldn't get the latest
>> stable ET to run because during checkpointing, it would die.
>
> OK that's very interesting.  Has something changed in the code related
> to how checkpoint files are written?
>
>> I had to backtrack to
>> the Orsted version (unfortunately, that has a bug in the way the grid
>> is set up, causing some of the
>> intermediate levels to span both black holes, wasting a lot of memory).
>
> That bug should have been fixed in a backport; are you sure you are
> checking out the branch and not the tag?  In any case, it can be worked
> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
> same bug I am thinking of
> (http://cactuscode.org/pipermail/users/2013-January/003290.html)

 I was using an old executable so it wouldn't have had the backport
 fix.

>
>> Even with
>> Orsted, stalling is a real issue. Currently, my "solution" is to run
>> for 4 hours at a time.
>> This would have been OK on Lonestar or Ranger,
>> because when I chained a bunch of runs, the next in line would start
>> almost right away, but on Stampede the delay is quite substantial. I
>> believe Jim Healy opened
>> a ticket concerning the RIT issues with running ET on Stampede.
>
> I think this is the ticket:
> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
> there.  The current queue wait time on stampede is more than a day, so
> splitting into 3 hour chunks is not feasible, as you say.
>
> I'm starting to think it might be a code problem as well.  So the
> summary is:
>
> – Checkpointing causes jobs to die with code versions after Oersted
> – All versions lead to eventual hung jobs after a few hours
>
> Since Stampede is the major "capability" resource in Xsede, we should
> put some effort into making sure the ET can run properly there.

 We find issues with runs stalling on our local cluster too. The hardware
 setup is similar to stampede (Intel Nehalem with QDR IB and openMPI on
 top of a proprietary IB library). There's no guarantee that the issues
 are the same, but we can try to run some tests locally (note that we
 have no issues with runs failing to checkpoint).
>>>
>>> I resubmitted, and the new job hangs later on.  gdb says it is in
>>> CarpetIOScalar while doing output of a maximum reduction.  I've disabled
>>> this and resubmitted.
>>>
>>> --
>>> Ian Hinder
>>> http://numrel.aei.mpg.de/people/hinder
>>>
>>
>>
>> --
>> Dr. Yosef Zlochower
>> Center for Computational Relativity and Gravitation
>> Associate Professor
>> School of Mathematical Sciences
>> Rochester Institute of Technology
>> 85 Lomb Memorial Drive
>> Rochester, NY 14623
>>
>> Office

Re: [Users] Stampede

2014-07-28 Thread Ian Hinder

On 14 Jul 2014, at 22:14, Yosef Zlochower  wrote:

> I tried a run on stampede today and it died during checkpoint with the
> error
> " send desc error
> send desc error
> [0] Abort: Got completion with error 12, vendor code=81, dest rank=
> at line 892 in file ../../ofa_poll.c"
> 
> Have you been having success running production runs on stampede?

I have seen errors when several runs checkpoint at the same time, as can happen 
if many jobs start simultaneously and dump a checkpoint after 3 hours. 
According to TACC support, there was nothing unusual in the system logs.  I 
thought it would be useful to add a "random" delay to the checkpoint code.  For 
example, in addition to telling it to checkpoint every 3 hours, you could say 
"checkpoint every 3 hours, plus a random number between -20 and +20 minutes".

The error message above suggests something to do with communication ("send 
desc").  Checkpointing itself shouldn't do any MPI communication, should it?  
Does it perform consistency checks across processes, or otherwise do 
communication?  I also saw freezes during scalar reduction output (see quoted 
text below).  Maybe some of the processes are taking much longer to checkpoint 
than others, and the ones which finish time out while trying to communicate?  
Maybe adding a barrier after checkpointing would make this clearer?
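In code the barrier idea amounts to no more than this (the checkpoint call
below is a placeholder for whatever the I/O thorns actually do, not the real
routine):

#include <mpi.h>

/* placeholder for the real per-rank checkpoint write */
extern void write_checkpoint_files(void);

static void checkpoint_with_barrier(void)
{
  write_checkpoint_files();      /* each rank writes its own files */
  MPI_Barrier(MPI_COMM_WORLD);   /* wait here for the slowest writer, so a
                                    slow rank shows up as time spent in the
                                    barrier rather than as a communication
                                    timeout somewhere else */
}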



> 
> 
> On 05/02/2014 12:15 PM, Ian Hinder wrote:
>> 
>> On 02 May 2014, at 16:57, Yosef Zlochower > > wrote:
>> 
>>> On 05/02/2014 10:07 AM, Ian Hinder wrote:
 
 On 02 May 2014, at 14:08, Yosef Zlochower >>> 
 > wrote:
 
> Hi
> 
> I have been having problems running on Stampede for a long time. I
> couldn't get the latest
> stable ET to run because during checkpointing, it would die.
 
 OK that's very interesting.  Has something changed in the code related
 to how checkpoint files are written?
 
> I had to backtrack to
> the Orsted version (unfortunately, that has a bug in the way the grid
> is set up, causing some of the
> intermediate levels to span both black holes, wasting a lot of memory).
 
 That bug should have been fixed in a backport; are you sure you are
 checking out the branch and not the tag?  In any case, it can be worked
 around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
 same bug I am thinking of
 (http://cactuscode.org/pipermail/users/2013-January/003290.html)
>>> 
>>> I was using an old executable so it wouldn't have had the backport
>>> fix.
>>> 
 
> Even with
> Orsted, stalling is a real issue. Currently, my "solution" is to run
> for 4 hours at a time.
> This would have been OK on Lonestar or Ranger,
> because when I chained a bunch of runs, the next in line would start
> almost right away, but on Stampede the delay is quite substantial. I
> believe Jim Healy opened
> a ticket concerning the RIT issues with running ET on Stampede.
 
 I think this is the ticket:
 https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
 there.  The current queue wait time on stampede is more than a day, so
 splitting into 3 hour chunks is not feasible, as you say.
 
 I'm starting to think it might be a code problem as well.  So the
 summary is:
 
 – Checkpointing causes jobs to die with code versions after Oersted
 – All versions lead to eventual hung jobs after a few hours
 
 Since Stampede is the major "capability" resource in Xsede, we should
 put some effort into making sure the ET can run properly there.
>>> 
>>> We find issues with runs stalling on our local cluster too. The hardware
>>> setup is similar to stampede (Intel Nehalem with QDR IB and openMPI on
>>> top of a proprietary IB library). There's no guarantee that the issues
>>> are the same, but we can try to run some tests locally (note that we
>>> have no issues with runs failing to checkpoint).
>> 
>> I resubmitted, and the new job hangs later on.  gdb says it is in
>> CarpetIOScalar while doing output of a maximum reduction.  I've disabled
>> this and resubmitted.
>> 
>> --
>> Ian Hinder
>> http://numrel.aei.mpg.de/people/hinder
>> 
> 
> 
> -- 
> Dr. Yosef Zlochower
> Center for Computational Relativity and Gravitation
> Associate Professor
> School of Mathematical Sciences
> Rochester Institute of Technology
> 85 Lomb Memorial Drive
> Rochester, NY 14623
> 
> Office:74-2067
> Phone: +1 585-475-6103
> 
> yo...@astro.rit.edu
> 
> CONFIDENTIALITY NOTE: The information transmitted, including
> attachments, is intended only for the person(s) or entity to which it
> is addressed and may contain confidential and/or privileged material.
> Any review, retransmission, dissemination or other use of, or taking
> of any action in reliance upon this information by persons or entities
> other than the intended recipien

Re: [Users] Stampede

2014-07-16 Thread Yosef Zlochower
I also tried using Roland's OpenMPI library. The errors were different,
but the run still died during checkpointing, as before.


On 07/14/2014 04:14 PM, Yosef Zlochower wrote:
> I tried a run on stampede today and it died during checkpoint with the
> error
> " send desc error
> send desc error
> [0] Abort: Got completion with error 12, vendor code=81, dest rank=
>at line 892 in file ../../ofa_poll.c"
>
> Have you been having success running production runs on stampede?
>
>
> On 05/02/2014 12:15 PM, Ian Hinder wrote:
>>
>> On 02 May 2014, at 16:57, Yosef Zlochower > > wrote:
>>
>>> On 05/02/2014 10:07 AM, Ian Hinder wrote:

 On 02 May 2014, at 14:08, Yosef Zlochower >>> 
 > wrote:

> Hi
>
> I have been having problems running on Stampede for a long time. I
> couldn't get the latest
> stable ET to run because during checkpointing, it would die.

 OK that's very interesting.  Has something changed in the code related
 to how checkpoint files are written?

> I had to backtrack to
> the Orsted version (unfortunately, that has a bug in the way the grid
> is set up, causing some of the
> intermediate levels to span both black holes, wasting a lot of memory).

 That bug should have been fixed in a backport; are you sure you are
 checking out the branch and not the tag?  In any case, it can be worked
 around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
 same bug I am thinking of
 (http://cactuscode.org/pipermail/users/2013-January/003290.html)
>>>
>>> I was using an old executable so it wouldn't have had the backport
>>> fix.
>>>

> Even with
> Orsted, stalling is a real issue. Currently, my "solution" is to run
> for 4 hours at a time.
> This would have been OK on Lonestar or Ranger,
> because when I chained a bunch of runs, the next in line would start
> almost right away, but on Stampede the delay is quite substantial. I
> believe Jim Healy opened
> a ticket concerning the RIT issues with running ET on Stampede.

 I think this is the ticket:
 https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
 there.  The current queue wait time on stampede is more than a day, so
 splitting into 3 hour chunks is not feasible, as you say.

 I'm starting to think it might be a code problem as well.  So the
 summary is:

 – Checkpointing causes jobs to die with code versions after Oersted
 – All versions lead to eventual hung jobs after a few hours

 Since Stampede is the major "capability" resource in Xsede, we should
 put some effort into making sure the ET can run properly there.
>>>
>>> We find issues with runs stalling on our local cluster too. The hardware
>>> setup is similar to stampede (Intel Nehalem with QDR IB and openMPI on
>>> top of a proprietary IB library). There's no guarantee that the issues
>>> are the same, but we can try to run some tests locally (note that we
>>> have no issues with runs failing to checkpoint).
>>
>> I resubmitted, and the new job hangs later on.  gdb says it is in
>> CarpetIOScalar while doing output of a maximum reduction.  I've disabled
>> this and resubmitted.
>>
>> --
>> Ian Hinder
>> http://numrel.aei.mpg.de/people/hinder
>>
>
>


-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office:74-2067
Phone: +1 585-475-6103

yo...@astro.rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including
attachments, is intended only for the person(s) or entity to which it
is addressed and may contain confidential and/or privileged material.
Any review, retransmission, dissemination or other use of, or taking
of any action in reliance upon this information by persons or entities
other than the intended recipient is prohibited. If you received this
in error, please contact the sender and destroy any copies of this
information.
___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-07-14 Thread Yosef Zlochower
I tried a run on stampede today and it died during checkpoint with the
error
" send desc error
send desc error
[0] Abort: Got completion with error 12, vendor code=81, dest rank=
  at line 892 in file ../../ofa_poll.c"

Have you been having success running production runs on stampede?


On 05/02/2014 12:15 PM, Ian Hinder wrote:
>
> On 02 May 2014, at 16:57, Yosef Zlochower  > wrote:
>
>> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>>>
>>> On 02 May 2014, at 14:08, Yosef Zlochower >> 
>>> > wrote:
>>>
 Hi

 I have been having problems running on Stampede for a long time. I
 couldn't get the latest
 stable ET to run because during checkpointing, it would die.
>>>
>>> OK that's very interesting.  Has something changed in the code related
>>> to how checkpoint files are written?
>>>
 I had to backtrack to
 the Orsted version (unfortunately, that has a bug in the way the grid
 is set up, causing some of the
 intermediate levels to span both black holes, wasting a lot of memory).
>>>
>>> That bug should have been fixed in a backport; are you sure you are
>>> checking out the branch and not the tag?  In any case, it can be worked
>>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>>> same bug I am thinking of
>>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
>>
>> I was using an old executable so it wouldn't have had the backport
>> fix.
>>
>>>
 Even with
 Orsted, stalling is a real issue. Currently, my "solution" is to run
 for 4 hours at a time.
 This would have been OK on Lonestar or Ranger,
 because when I chained a bunch of runs, the next in line would start
 almost right away, but on Stampede the delay is quite substantial. I
 believe Jim Healy opened
 a ticket concerning the RIT issues with running ET on Stampede.
>>>
>>> I think this is the ticket:
>>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>>> there.  The current queue wait time on stampede is more than a day, so
>>> splitting into 3 hour chunks is not feasible, as you say.
>>>
>>> I'm starting to think it might be a code problem as well.  So the
>>> summary is:
>>>
>>> – Checkpointing causes jobs to die with code versions after Oersted
>>> – All versions lead to eventual hung jobs after a few hours
>>>
>>> Since Stampede is the major "capability" resource in Xsede, we should
>>> put some effort into making sure the ET can run properly there.
>>
>> We find issues with runs stalling on our local cluster too. The hardware
>> setup is similar to stampede (Intel Nehalem with QDR IB and openMPI on
>> top of a proprietary IB library). There's no guarantee that the issues
>> are the same, but we can try to run some tests locally (note that we
>> have no issues with runs failing to checkpoint).
>
> I resubmitted, and the new job hangs later on.  gdb says it is in
> CarpetIOScalar while doing output of a maximum reduction.  I've disabled
> this and resubmitted.
>
> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
>


-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office:74-2067
Phone: +1 585-475-6103

yo...@astro.rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including
attachments, is intended only for the person(s) or entity to which it
is addressed and may contain confidential and/or privileged material.
Any review, retransmission, dissemination or other use of, or taking
of any action in reliance upon this information by persons or entities
other than the intended recipient is prohibited. If you received this
in error, please contact the sender and destroy any copies of this
information.
___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-05-02 Thread Philipp Moesta
Hi Ian, everyone,

I had those hanging jobs at random points quite often last year (Oct-Dec), and 
their support couldn't really help. They eventually went away, but Stampede does 
not seem to be the most reliable machine when it comes to these random errors.

Philipp

> On May 2, 2014, at 9:51, Roland Haas  wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Hello Ian, Yosef, all,
> 
>> I resubmitted, and the new job hangs later on.  gdb says it is in 
>> CarpetIOScalar while doing output of a maximum reduction.  I've 
>> disabled this and resubmitted.
> If you are feeling desperate (we were for SpEC) then you can also
> compile a copy of OpenMPI and use that. If you want to give it a try,
> there is a copy in /work/00945/rhaas/software/openmpi-1.8 which provides
> its own ibrun replacement in
> /work/00945/rhaas/software/openmpi-1.8/bin/ibrun. To test you can do
> something like
> 
> /work/00945/rhaas/software/openmpi-1.8/bin/ibrun -n 32
> /work/00945/rhaas/software/packages/mpihello/a.out
> 
> Yours,
> Roland
> 
> - -- 
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://keys.gnupg.net.
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1
> Comment: Using GnuPG with Icedove - http://www.enigmail.net/
> 
> iEYEARECAAYFAlNjzQQACgkQTiFSTN7SboX2XgCg0WD0JvfhAiVHHDc5jDoB8SdB
> tKcAoLynhEE3K8+1/6U3KqZYK35Euhff
> =MI0p
> -END PGP SIGNATURE-
> ___
> Users mailing list
> Users@einsteintoolkit.org
> http://lists.einsteintoolkit.org/mailman/listinfo/users
___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-05-02 Thread Roland Haas
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello Ian, Yosef, all,

> I resubmitted, and the new job hangs later on.  gdb says it is in 
> CarpetIOScalar while doing output of a maximum reduction.  I've 
> disabled this and resubmitted.
If you are feeling desperate (we were for SpEC) then you can also
compile a copy of OpenMPI and use that. If you want to give it a try,
there is a copy in /work/00945/rhaas/software/openmpi-1.8 which provides
its own ibrun replacement in
/work/00945/rhaas/software/openmpi-1.8/bin/ibrun. To test you can do
something like

/work/00945/rhaas/software/openmpi-1.8/bin/ibrun -n 32
/work/00945/rhaas/software/packages/mpihello/a.out
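To use it for Cactus itself, rather than just the mpihello test, the
configuration would presumably have to be rebuilt against the same prefix, by
analogy with the MPI_DIR lines quoted elsewhere in this thread; the lib
subdirectory and the MPI_LIBS value below are assumptions about a standard
OpenMPI install and may need adjusting (e.g. extra Fortran libraries):

MPI_DIR      = /work/00945/rhaas/software/openmpi-1.8
MPI_LIB_DIRS = /work/00945/rhaas/software/openmpi-1.8/lib
MPI_LIBS     = mpi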

Yours,
Roland

- -- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1
Comment: Using GnuPG with Icedove - http://www.enigmail.net/

iEYEARECAAYFAlNjzQQACgkQTiFSTN7SboX2XgCg0WD0JvfhAiVHHDc5jDoB8SdB
tKcAoLynhEE3K8+1/6U3KqZYK35Euhff
=MI0p
-END PGP SIGNATURE-
___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-05-02 Thread Ian Hinder

On 02 May 2014, at 16:57, Yosef Zlochower  wrote:

> On 05/02/2014 10:07 AM, Ian Hinder wrote:
>> 
>> On 02 May 2014, at 14:08, Yosef Zlochower > > wrote:
>> 
>>> Hi
>>> 
>>> I have been having problems running on Stampede for a long time. I
>>> couldn't get the latest
>>> stable ET to run because during checkpointing, it would die.
>> 
>> OK that's very interesting.  Has something changed in the code related
>> to how checkpoint files are written?
>> 
>>> I had to backtrack to
>>> the Orsted version (unfortunately, that has a bug in the way the grid
>>> is set up, causing some of the
>>> intermediate levels to span both black holes, wasting a lot of memory).
>> 
>> That bug should have been fixed in a backport; are you sure you are
>> checking out the branch and not the tag?  In any case, it can be worked
>> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
>> same bug I am thinking of
>> (http://cactuscode.org/pipermail/users/2013-January/003290.html)
> 
> I was using an old executable so it wouldn't have had the backport
> fix.
> 
>> 
>>> Even with
>>> Orsted, stalling is a real issue. Currently, my "solution" is to run
>>> for 4 hours at a time.
>>> This would have been OK on Lonestar or Ranger,
>>> because when I chained a bunch of runs, the next in line would start
>>> almost right away, but on Stampede the delay is quite substantial. I
>>> believe Jim Healy opened
>>> a ticket concerning the RIT issues with running ET on Stampede.
>> 
>> I think this is the ticket:
>> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
>> there.  The current queue wait time on stampede is more than a day, so
>> splitting into 3 hour chunks is not feasible, as you say.
>> 
>> I'm starting to think it might be a code problem as well.  So the
>> summary is:
>> 
>> – Checkpointing causes jobs to die with code versions after Oersted
>> – All versions lead to eventual hung jobs after a few hours
>> 
>> Since Stampede is the major "capability" resource in Xsede, we should
>> put some effort into making sure the ET can run properly there.
> 
> We find issues with runs stalling on our local cluster too. The hardware
> setup is similar to stampede (Intel Nehalem with QDR IB and openMPI on
> top of a proprietary IB library). There's no guarantee that the issues
> are the same, but we can try to run some tests locally (note that we
> have no issues with runs failing to checkpoint).

I resubmitted, and the new job hangs later on.  gdb says it is in 
CarpetIOScalar while doing output of a maximum reduction.  I've disabled this 
and resubmitted.

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder

___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-05-02 Thread Yosef Zlochower
On 05/02/2014 10:07 AM, Ian Hinder wrote:
>
> On 02 May 2014, at 14:08, Yosef Zlochower  > wrote:
>
>> Hi
>>
>> I have been having problems running on Stampede for a long time. I
>> couldn't get the latest
>> stable ET to run because during checkpointing, it would die.
>
> OK that's very interesting.  Has something changed in the code related
> to how checkpoint files are written?
>
>> I had to backtrack to
>> the Orsted version (unfortunately, that has a bug in the way the grid
>> is set up, causing some of the
>> intermediate levels to span both black holes, wasting a lot of memory).
>
> That bug should have been fixed in a backport; are you sure you are
> checking out the branch and not the tag?  In any case, it can be worked
> around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
> same bug I am thinking of
> (http://cactuscode.org/pipermail/users/2013-January/003290.html)

I was using an old executable so it wouldn't have had the backport
fix.

>
>> Even with
>> Orsted, stalling is a real issue. Currently, my "solution" is to run
>> for 4 hours at a time.
>> This would have been OK on Lonestar or Ranger,
>> because when I chained a bunch of runs, the next in line would start
>> almost right away, but on Stampede the delay is quite substantial. I
>> believe Jim Healy opened
>> a ticket concerning the RIT issues with running ET on Stampede.
>
> I think this is the ticket:
> https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
> there.  The current queue wait time on stampede is more than a day, so
> splitting into 3 hour chunks is not feasible, as you say.
>
> I'm starting to think it might be a code problem as well.  So the
> summary is:
>
> – Checkpointing causes jobs to die with code versions after Oersted
> – All versions lead to eventual hung jobs after a few hours
>
> Since Stampede is the major "capability" resource in Xsede, we should
> put some effort into making sure the ET can run properly there.

We find issues with runs stalling on our local cluster too. The hardware
setup is similar to stampede (Intel Nehalem with QDR IB and openMPI on
top of a proprietary IB library). There's no guarantee that the issues
are the same, but we can try to run some tests locally (note that we
have no issues with runs failing to checkpoint).

> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
>


-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office:74-2067
Phone: +1 585-475-6103

yo...@astro.rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including
attachments, is intended only for the person(s) or entity to which it
is addressed and may contain confidential and/or privileged material.
Any review, retransmission, dissemination or other use of, or taking
of any action in reliance upon this information by persons or entities
other than the intended recipient is prohibited. If you received this
in error, please contact the sender and destroy any copies of this
information.
___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-05-02 Thread Frank Loeffler
On Fri, May 02, 2014 at 04:07:59PM +0200, Ian Hinder wrote:
> OK that's very interesting.  Has something changed in the code related to how 
> checkpoint files are written?

There have been some changes I believe - but I would probably look
first at the machine configuration (option list) - if something changed
there. If it did this should be easy to test, much faster than a code
change.

Frank



___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-05-02 Thread Ian Hinder

On 02 May 2014, at 14:08, Yosef Zlochower  wrote:

> Hi
> 
> I have been having problems running on Stampede for a long time. I couldn't 
> get the latest
> stable ET to run because during checkpointing, it would die.

OK that's very interesting.  Has something changed in the code related to how 
checkpoint files are written?

> I had to backtrack to 
> the Orsted version (unfortunately, that has a bug in the way the grid is set 
> up, causing some of the
> intermediate levels to span both black holes, wasting a lot of memory).

That bug should have been fixed in a backport; are you sure you are checking 
out the branch and not the tag?  In any case, it can be worked around by 
setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am 
thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html)
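Concretely, the workaround is a single parameter-file line; the comment is
only a paraphrase of the intended effect, not official documentation:

CarpetRegrid2::min_fraction = 1   # only merge refined boxes when the merged box would be (almost) completely filled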

> Even with
> Orsted, stalling is a real issue. Currently, my "solution" is to run for 4 
> hours at a time.
> This would have been OK on Lonestar or Ranger,
> because when I chained a bunch of runs, the next in line would start
> almost right away, but on Stampede the delay is quite substantial. I believe 
> Jim Healy opened
> a ticket concerning the RIT issues with running ET on Stampede.

I think this is the ticket: https://trac.einsteintoolkit.org/ticket/1547.  I 
will add my information there.  The current queue wait time on stampede is more 
than a day, so splitting into 3 hour chunks is not feasible, as you say.

I'm starting to think it might be a code problem as well.  So the summary is:

– Checkpointing causes jobs to die with code versions after Oersted
– All versions lead to eventual hung jobs after a few hours

Since Stampede is the major "capability" resource in Xsede, we should put some 
effort into making sure the ET can run properly there.
-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder

___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] Stampede

2014-05-02 Thread Yosef Zlochower

Hi

I have been having problems running on Stampede for a long time. I 
couldn't get the latest
stable ET to run because during checkpointing, it would die. I had to 
backtrack to
the Orsted version (unfortunately, that has a bug in the way the grid is 
set up, causing some of the
intermediate levels to span both black holes, wasting a lot of memory). 
Even with
Orsted, stalling is a real issue. Currently, my "solution" is to run 
for 4 hours at a time.

This would have been OK on Lonestar or Ranger, because when I chained 
a bunch of runs, the next in line would start almost right away, but on 
Stampede the delay is quite substantial. I believe Jim Healy opened a 
ticket concerning the RIT issues with running ET on Stampede.


On 05/02/2014 05:55 AM, Ian Hinder wrote:

Hi all,

Has anyone run into problems recently with Cactus jobs on Stampede? 
 I've had jobs die when checkpointing, and also hang mysteriously 
for no apparent reason.  These might be separate problems.  The 
checkpointing issue occurred when I submitted several jobs and they 
all started checkpointing at the same time after 3 hours.  The hang 
happened after a few hours of evolution, with GDB reporting



MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
  at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
296         for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)


Unfortunately I didn't ask for a backtrace. I'm using mvapich2.  I've 
been in touch with support, and they said the deaths during checkpointing 
coincided with the filesystems being hit hard by my jobs, which makes 
sense, but they didn't see any problems in their logs, and they have 
no idea about the mysterious hang.  I repeated the hanging job and it 
ran fine.
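For the next hang it should be possible to grab a backtrace without killing
the job, by ssh-ing to an affected compute node and attaching gdb to one of
the Cactus processes, e.g.

gdb -p <pid> -batch -ex "thread apply all bt"

where <pid> is whatever ps reports for the hung process.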


--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder



___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users