Re: [Users] Stampede

2014-07-28 Thread Ian Hinder

On 14 Jul 2014, at 22:14, Yosef Zlochower yo...@astro.rit.edu wrote:

 I tried a run on stampede today and it died during checkpoint with the
 error
  send desc error
 send desc error
 [0] Abort: Got completion with error 12, vendor code=81, dest rank=
 at line 892 in file ../../ofa_poll.c
 
 Have you been having success running production runs on stampede?

I have seen errors when several runs checkpoint at the same time, as can happen 
if many jobs start simultaneously and dump a checkpoint after 3 hours. 
According to TACC support, there was nothing unusual in the system logs.  I 
thought it would be useful to add a random delay to the checkpoint code.  For 
example, in addition to telling it to checkpoint every 3 hours, you could say 
checkpoint every 3 hours, plus a random number between -20 and +20 minutes.
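
To make that concrete, the logic could look roughly like the sketch below (this is
not the actual Cactus checkpoint code; the function name and parameters are
invented for illustration):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Sketch only: pick the next checkpoint walltime as the base interval plus a
   uniform random offset in [-max_jitter, +max_jitter] seconds, so that jobs
   which started together do not all hit the filesystem at the same moment.
   Rank 0 chooses the offset and broadcasts it so that all ranks agree on when
   to checkpoint. */
static double next_checkpoint_time(double now, double base_interval,
                                   double max_jitter, MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  double next = 0.0;
  if (rank == 0) {
    const double offset = (2.0 * drand48() - 1.0) * max_jitter;
    next = now + base_interval + offset;
  }
  MPI_Bcast(&next, 1, MPI_DOUBLE, 0, comm);
  return next;
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  /* seed from the clock and PID so that different jobs pick different offsets */
  srand48((long)time(NULL) ^ (long)getpid());

  /* e.g. checkpoint every 3 hours, plus or minus up to 20 minutes */
  const double next = next_checkpoint_time(MPI_Wtime(), 3.0 * 3600.0,
                                           20.0 * 60.0, MPI_COMM_WORLD);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    printf("next checkpoint at walltime %.0f s\n", next);

  MPI_Finalize();
  return 0;
}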

The error message above suggests something to do with communication (send 
desc).  Checkpointing itself shouldn't do any MPI communication, should it?  
Does it perform consistency checks across processes, or otherwise do 
communication?  I also saw freezes during scalar reduction output (see quoted 
text below).  Maybe some of the processes are taking much longer to checkpoint 
than others, and the ones which finish time out while trying to communicate?  
Maybe adding a barrier after checkpointing would make this clearer?
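
One way to test that would be to time the local checkpoint write on each rank and
follow it with an explicit barrier, roughly as in the sketch below (the
checkpoint_with_barrier wrapper and the dummy write function are invented for
illustration; this is not existing Carpet code):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch only: time the purely local checkpoint write on each rank, then
   synchronise and report how long each rank spent writing and how long it
   waited for the slowest rank.  A large spread would support the idea that
   fast ranks sit in MPI waiting for stragglers and eventually time out. */
static void checkpoint_with_barrier(void (*write_checkpoint)(void),
                                    MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  const double t0 = MPI_Wtime();
  write_checkpoint();                 /* local I/O only, no MPI inside */
  const double t_write = MPI_Wtime() - t0;

  MPI_Barrier(comm);                  /* wait until every rank has finished */
  const double t_wait = MPI_Wtime() - t0 - t_write;

  printf("rank %d: checkpoint write %.1f s, barrier wait %.1f s\n",
         rank, t_write, t_wait);
}

static void dummy_checkpoint(void)
{
  sleep(1);   /* stand-in for the real per-process checkpoint write */
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  checkpoint_with_barrier(dummy_checkpoint, MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}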



 
 
 On 05/02/2014 12:15 PM, Ian Hinder wrote:
 
 On 02 May 2014, at 16:57, Yosef Zlochower yo...@astro.rit.edu wrote:
 
 On 05/02/2014 10:07 AM, Ian Hinder wrote:
 
 On 02 May 2014, at 14:08, Yosef Zlochower yo...@astro.rit.edu wrote:
 
 Hi
 
 I have been having problems running on Stampede for a long time. I
 couldn't get the latest
 stable ET to run because during checkpointing, it would die.
 
 OK that's very interesting.  Has something changed in the code related
 to how checkpoint files are written?
 
 I had to backtrack to
 the Oersted version (unfortunately, that has a bug in the way the grid
 is set up, causing some of the
 intermediate levels to span both black holes, wasting a lot of memory).
 
 That bug should have been fixed in a backport; are you sure you are
 checking out the branch and not the tag?  In any case, it can be worked
 around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
 same bug I am thinking of
 (http://cactuscode.org/pipermail/users/2013-January/003290.html)
 
 I was using an old executable so it wouldn't have had the backport
 fix.
 
 
 Even with
 Oersted, stalling is a real issue. Currently, my solution is to run
 for 4 hours at a time.
 This would have been OK on Lonestar or Ranger,
 because when I chained a bunch of runs, the next in line would start
 almost right away, but on Stampede the delay is quite substantial. I
 believe Jim Healy opened
 a ticket concerning the RIT issues with running ET on Stampede.
 
 I think this is the ticket:
 https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
 there.  The current queue wait time on stampede is more than a day, so
 splitting into 3 hour chunks is not feasible, as you say.
 
 I'm starting to think it might be a code problem as well.  So the
 summary is:
 
 – Checkpointing causes jobs to die with code versions after Oersted
 – All versions lead to eventual hung jobs after a few hours
 
 Since Stampede is the major capability resource in XSEDE, we should
 put some effort into making sure the ET can run properly there.
 
 We find issues with runs stalling on our local cluster too. The hardware
 setup is similar to Stampede (Intel Nehalem with QDR IB and OpenMPI on
 top of a proprietary IB library). There's no guarantee that the issues
 are the same, but we can try to run some tests locally (note that we
 have no issues with runs failing to checkpoint).
 
 I resubmitted, and the new job hangs later on.  gdb says it is in
 CarpetIOScalar while doing output of a maximum reduction.  I've disabled
 this and resubmitted.
 
 --
 Ian Hinder
 http://numrel.aei.mpg.de/people/hinder
 
 
 
 -- 
 Dr. Yosef Zlochower
 Center for Computational Relativity and Gravitation
 Associate Professor
 School of Mathematical Sciences
 Rochester Institute of Technology
 85 Lomb Memorial Drive
 Rochester, NY 14623
 
 Office: 74-2067
 Phone: +1 585-475-6103
 
 yo...@astro.rit.edu
 

-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder


Re: [Users] Stampede

2014-07-28 Thread Yosef Zlochower
One thing I'm not sure of is when the "send desc error" is generated; it
could be generated when the queue kills the job. This was the only job I was
running at the time. It uses 16 nodes, with 32 MPI processes in total, so I
don't think it should have been able to overload the filesystem.


On 07/28/2014 06:14 AM, Ian Hinder wrote:

 On 14 Jul 2014, at 22:14, Yosef Zlochower yo...@astro.rit.edu wrote:

 I tried a run on stampede today and it died during checkpoint with the
 error
  send desc error
 send desc error
 [0] Abort: Got completion with error 12, vendor code=81, dest rank=
 at line 892 in file ../../ofa_poll.c

 Have you been having success running production runs on stampede?

 I have seen errors when several runs checkpoint at the same time, as can 
 happen if many jobs start simultaneously and dump a checkpoint after 3 hours. 
 According to TACC support, there was nothing unusual in the system logs.  I 
 thought it would be useful to add a random delay to the checkpoint code.  
 For example, in addition to telling it to checkpoint every 3 hours, you could 
 say checkpoint every 3 hours, plus a random number between -20 and +20 
 minutes.

 The error message above suggests something to do with communication (send 
 desc).  Checkpointing itself shouldn't do any MPI communication, should it?  
 Does it perform consistency checks across processes, or otherwise do 
 communication?  I also saw freezes during scalar reduction output (see quoted 
 text below).  Maybe some of the processes are taking much longer to 
 checkpoint than others, and the ones which finish time out while trying to 
 communicate?  Maybe adding a barrier after checkpointing would make this 
 clearer?





 On 05/02/2014 12:15 PM, Ian Hinder wrote:

 On 02 May 2014, at 16:57, Yosef Zlochower yo...@astro.rit.edu wrote:

 On 05/02/2014 10:07 AM, Ian Hinder wrote:

 On 02 May 2014, at 14:08, Yosef Zlochower yo...@astro.rit.edu wrote:

 Hi

 I have been having problems running on Stampede for a long time. I
 couldn't get the latest
 stable ET to run because during checkpointing, it would die.

 OK that's very interesting.  Has something changed in the code related
 to how checkpoint files are written?

 I had to backtrack to
 the Oersted version (unfortunately, that has a bug in the way the grid
 is set up, causing some of the
 intermediate levels to span both black holes, wasting a lot of memory).

 That bug should have been fixed in a backport; are you sure you are
 checking out the branch and not the tag?  In any case, it can be worked
 around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
 same bug I am thinking of
 (http://cactuscode.org/pipermail/users/2013-January/003290.html)

 I was using an old executable so it wouldn't have had the backport
 fix.


 Even with
 Oersted, stalling is a real issue. Currently, my solution is to run
 for 4 hours at a time.
 This would have been OK on Lonestar or Ranger,
 because when I chained a bunch of runs, the next in line would start
 almost right away, but on Stampede the delay is quite substantial. I
 believe Jim Healy opened
 a ticket concerning the RIT issues with running ET on Stampede.

 I think this is the ticket:
 https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
 there.  The current queue wait time on stampede is more than a day, so
 splitting into 3 hour chunks is not feasible, as you say.

 I'm starting to think it might be a code problem as well.  So the
 summary is:

 – Checkpointing causes jobs to die with code versions after Oersted
 – All versions lead to eventual hung jobs after a few hours

 Since Stampede is the major capability resource in XSEDE, we should
 put some effort into making sure the ET can run properly there.

 We find issues with runs stalling on our local cluster too. The hardware
 setup is similar to Stampede (Intel Nehalem with QDR IB and OpenMPI on
 top of a proprietary IB library). There's no guarantee that the issues
 are the same, but we can try to run some tests locally (note that we
 have no issues with runs failing to checkpoint).

 I resubmitted, and the new job hangs later on.  gdb says it is in
 CarpetIOScalar while doing output of a maximum reduction.  I've disabled
 this and resubmitted.

 --
 Ian Hinder
 http://numrel.aei.mpg.de/people/hinder



 --
 Dr. Yosef Zlochower
 Center for Computational Relativity and Gravitation
 Associate Professor
 School of Mathematical Sciences
 Rochester Institute of Technology
 85 Lomb Memorial Drive
 Rochester, NY 14623

 Office: 74-2067
 Phone: +1 585-475-6103

 yo...@astro.rit.edu


Re: [Users] Stampede

2014-07-16 Thread Yosef Zlochower
I also tried using Roland's OpenMPI library.
The errors were different, but the run died during
checkpointing as before.


On 07/14/2014 04:14 PM, Yosef Zlochower wrote:
 I tried a run on stampede today and it died during checkpoint with the
 error
  send desc error
 send desc error
 [0] Abort: Got completion with error 12, vendor code=81, dest rank=
at line 892 in file ../../ofa_poll.c

 Have you been having success running production runs on stampede?


 On 05/02/2014 12:15 PM, Ian Hinder wrote:

 On 02 May 2014, at 16:57, Yosef Zlochower yo...@astro.rit.edu wrote:

 On 05/02/2014 10:07 AM, Ian Hinder wrote:

 On 02 May 2014, at 14:08, Yosef Zlochower yo...@astro.rit.edu wrote:

 Hi

 I have been having problems running on Stampede for a long time. I
 couldn't get the latest
 stable ET to run because during checkpointing, it would die.

 OK that's very interesting.  Has something changed in the code related
 to how checkpoint files are written?

 I had to backtrack to
 the Oersted version (unfortunately, that has a bug in the way the grid
 is set up, causing some of the
 intermediate levels to span both black holes, wasting a lot of memory).

 That bug should have been fixed in a backport; are you sure you are
 checking out the branch and not the tag?  In any case, it can be worked
 around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
 same bug I am thinking of
 (http://cactuscode.org/pipermail/users/2013-January/003290.html)

 I was using an old executable so it wouldn't have had the backport
 fix.


 Even with
 Oersted, stalling is a real issue. Currently, my solution is to run
 for 4 hours at a time.
 This would have been OK on Lonestar or Ranger,
 because when I chained a bunch of runs, the next in line would start
 almost right away, but on Stampede the delay is quite substantial. I
 believe Jim Healy opened
 a ticket concerning the RIT issues with running ET on Stampede.

 I think this is the ticket:
 https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
 there.  The current queue wait time on stampede is more than a day, so
 splitting into 3 hour chunks is not feasible, as you say.

 I'm starting to think it might be a code problem as well.  So the
 summary is:

 – Checkpointing causes jobs to die with code versions after Oersted
 – All versions lead to eventual hung jobs after a few hours

 Since Stampede is the major capability resource in XSEDE, we should
 put some effort into making sure the ET can run properly there.

 We find issues with runs stalling on our local cluster too. The hardware
 setup is similar to Stampede (Intel Nehalem with QDR IB and OpenMPI on
 top of a proprietary IB library). There's no guarantee that the issues
 are the same, but we can try to run some tests locally (note that we
 have no issues with runs failing to checkpoint).

 I resubmitted, and the new job hangs later on.  gdb says it is in
 CarpetIOScalar while doing output of a maximum reduction.  I've disabled
 this and resubmitted.

 --
 Ian Hinder
 http://numrel.aei.mpg.de/people/hinder





-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office: 74-2067
Phone: +1 585-475-6103

yo...@astro.rit.edu



Re: [Users] Stampede

2014-05-02 Thread Yosef Zlochower

Hi

I have been having problems running on Stampede for a long time. I couldn't
get the latest stable ET to run because during checkpointing, it would die. I
had to backtrack to the Oersted version (unfortunately, that has a bug in the
way the grid is set up, causing some of the intermediate levels to span both
black holes, wasting a lot of memory). Even with Oersted, stalling is a real
issue. Currently, my solution is to run for 4 hours at a time. This would have
been OK on Lonestar or Ranger, because when I chained a bunch of runs, the next
in line would start almost right away, but on Stampede the delay is quite
substantial. I believe Jim Healy opened a ticket concerning the RIT issues with
running ET on Stampede.


On 05/02/2014 05:55 AM, Ian Hinder wrote:

Hi all,

Has anyone run into problems recently with Cactus jobs on Stampede? 
 I've had jobs die when checkpointing, and also hang mysteriously 
for no apparent reason.  These might be separate problems.  The 
checkpointing issue occurred when I submitted several jobs and they 
all started checkpointing at the same time after 3 hours.  The hang 
happened after a few hours of evolution, with GDB reporting



MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
  at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
296         for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)


Unfortunately I didn't ask for a backtrace. I'm using mvapich2.  I've 
been in touch with support and they said the dying while checkpointing 
coincided with the filesystems being hit hard by my jobs, which makes 
sense, but they didn't see any problems in their logs, and they have 
no idea about the mysterious hang.  I repeated the hanging job and it 
ran fine.


--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder







Re: [Users] Stampede

2014-05-02 Thread Ian Hinder

On 02 May 2014, at 14:08, Yosef Zlochower yo...@astro.rit.edu wrote:

 Hi
 
 I have been having problems running on Stampede for a long time. I couldn't 
 get the latest
 stable ET to run because during checkpointing, it would die.

OK that's very interesting.  Has something changed in the code related to how 
checkpoint files are written?

 I had to backtrack to 
 the Oersted version (unfortunately, that has a bug in the way the grid is set 
 up, causing some of the
 intermediate levels to span both black holes, wasting a lot of memory).

That bug should have been fixed in a backport; are you sure you are checking 
out the branch and not the tag?  In any case, it can be worked around by 
setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am 
thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html)

 Even with
 Oersted, stalling is a real issue. Currently, my solution is to run for 4 
 hours at a time.
 This would have been OK on Lonestar or Ranger,
 because when I chained a bunch of runs, the next in line would start
 almost right away, but on Stampede the delay is quite substantial. I believe 
 Jim Healy opened
 a ticket concerning the RIT issues with running ET on Stampede.

I think this is the ticket: https://trac.einsteintoolkit.org/ticket/1547.  I 
will add my information there.  The current queue wait time on stampede is more 
than a day, so splitting into 3 hour chunks is not feasible, as you say.

I'm starting to think it might be a code problem as well.  So the summary is:

– Checkpointing causes jobs to die with code versions after Oersted
– All versions lead to eventual hung jobs after a few hours

Since Stampede is the major capability resource in XSEDE, we should put some 
effort into making sure the ET can run properly there.
-- 
Ian Hinder
http://numrel.aei.mpg.de/people/hinder



Re: [Users] Stampede

2014-05-02 Thread Frank Loeffler
On Fri, May 02, 2014 at 04:07:59PM +0200, Ian Hinder wrote:
 OK that's very interesting.  Has something changed in the code related to how 
 checkpoint files are written?

There have been some changes, I believe, but I would probably look first at
the machine configuration (option list), in case something changed there. If
it did, this should be easy to test, much faster than a code change.

Frank





Re: [Users] Stampede

2014-05-02 Thread Yosef Zlochower
On 05/02/2014 10:07 AM, Ian Hinder wrote:

 On 02 May 2014, at 14:08, Yosef Zlochower yo...@astro.rit.edu wrote:

 Hi

 I have been having problems running on Stampede for a long time. I
 couldn't get the latest
 stable ET to run because during checkpointing, it would die.

 OK that's very interesting.  Has something changed in the code related
 to how checkpoint files are written?

 I had to backtrack to
 the Oersted version (unfortunately, that has a bug in the way the grid
 is set up, causing some of the
 intermediate levels to span both black holes, wasting a lot of memory).

 That bug should have been fixed in a backport; are you sure you are
 checking out the branch and not the tag?  In any case, it can be worked
 around by setting CarpetRegrid2::min_fraction = 1, assuming this is the
 same bug I am thinking of
 (http://cactuscode.org/pipermail/users/2013-January/003290.html)

I was using an old executable so it wouldn't have had the backport
fix.


 Even with
 Oersted, stalling is a real issue. Currently, my solution is to run
 for 4 hours at a time.
 This would have been OK on Lonestar or Ranger,
 because when I chained a bunch of runs, the next in line would start
 almost right away, but on Stampede the delay is quite substantial. I
 believe Jim Healy opened
 a ticket concerning the RIT issues with running ET on Stampede.

 I think this is the ticket:
 https://trac.einsteintoolkit.org/ticket/1547.  I will add my information
 there.  The current queue wait time on stampede is more than a day, so
 splitting into 3 hour chunks is not feasible, as you say.

 I'm starting to think it might be a code problem as well.  So the
 summary is:

 – Checkpointing causes jobs to die with code versions after Oersted
 – All versions lead to eventual hung jobs after a few hours

 Since Stampede is the major capability resource in XSEDE, we should
 put some effort into making sure the ET can run properly there.

We find issues with runs stalling on our local cluster too. The hardware
setup is similar to Stampede (Intel Nehalem with QDR IB and OpenMPI on
top of a proprietary IB library). There's no guarantee that the issues
are the same, but we can try to run some tests locally (note that we
have no issues with runs failing to checkpoint).

 --
 Ian Hinder
 http://numrel.aei.mpg.de/people/hinder



-- 
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623

Office: 74-2067
Phone: +1 585-475-6103

yo...@astro.rit.edu



Re: [Users] Stampede

2014-05-02 Thread Roland Haas

Hello Ian, Yosef, all,

 I resubmitted, and the new job hangs later on.  gdb says it is in 
 CarpetIOScalar while doing output of a maximum reduction.  I've 
 disabled this and resubmitted.
If you are feeling desperate (we were for SpEC) then you can also
compile a copy of OpenMPI and use that. If you want to give it a try,
there is a copy in /work/00945/rhaas/software/openmpi-1.8 which provides
its own ibrun replacement in
/work/00945/rhaas/software/openmpi-1.8/bin/ibrun. To test you can do
something like

/work/00945/rhaas/software/openmpi-1.8/bin/ibrun -n 32 \
    /work/00945/rhaas/software/packages/mpihello/a.out

Yours,
Roland

-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://keys.gnupg.net.


Re: [Users] Stampede

2014-05-02 Thread Philipp Moesta
Hi Ian, everyone,

I had those hanging jobs at random points quite often last year (Oct-Dec), and 
their support couldn't really help. They eventually went away, but Stampede does 
not seem to be the most reliable machine when it comes to these random errors.

Philipp

 On May 2, 2014, at 9:51, Roland Haas roland.h...@physics.gatech.edu wrote:
 
 
 Hello Ian, Yosef, all,
 
 I resubmitted, and the new job hangs later on.  gdb says it is in 
 CarpetIOScalar while doing output of a maximum reduction.  I've 
 disabled this and resubmitted.
 If you are feeling desperate (we were for SpEC) then you can also
 compile a copy of OpenMPI and use that. If you want to give it a try,
 there is a copy in /work/00945/rhaas/software/openmpi-1.8 which provides
 its own ibrun replacement in
 /work/00945/rhaas/software/openmpi-1.8/bin/ibrun. To test you can do
 something like
 
 /work/00945/rhaas/software/openmpi-1.8/bin/ibrun -n 32 \
     /work/00945/rhaas/software/packages/mpihello/a.out
 
 Yours,
 Roland
 
 -- 
 My email is as private as my paper mail. I therefore support encrypting
 and signing email messages. Get my PGP key from http://keys.gnupg.net.