Re: [Users] Stampede
On 14 Jul 2014, at 22:14, Yosef Zlochower yo...@astro.rit.edu wrote:

> I tried a run on stampede today and it died during checkpoint with the error
>
>   send desc error
>   send desc error
>   [0] Abort: Got completion with error 12, vendor code=81, dest rank=
>   at line 892 in file ../../ofa_poll.c
>
> Have you been having success running production runs on stampede?

I have seen errors when several runs checkpoint at the same time, as can happen if many jobs start simultaneously and dump a checkpoint after 3 hours. According to TACC support, there was nothing unusual in the system logs. I thought it would be useful to add a random delay to the checkpoint code. For example, in addition to telling it to checkpoint every 3 hours, you could say checkpoint every 3 hours, plus a random number between -20 and +20 minutes.

The error message above suggests something to do with communication (send desc). Checkpointing itself shouldn't do any MPI communication, should it? Does it perform consistency checks across processes, or otherwise do communication? I also saw freezes during scalar reduction output (see quoted text below). Maybe some of the processes are taking much longer to checkpoint than others, and the ones which finish time out while trying to communicate? Maybe adding a barrier after checkpointing would make this clearer? A rough standalone sketch of both ideas is below.
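This is not Cactus/ET code, just a minimal sketch of the two suggestions (a per-job random offset on the checkpoint interval, and a barrier with per-rank timings right after the checkpoint write). The "checkpoint write" here is a fake sleep, and all names are made up for illustration:

/* checkpoint_jitter.c -- standalone sketch, not Cactus/Carpet code.
 * (a) "checkpoint every 3 hours plus a random offset in [-20,+20] minutes"
 * (b) a barrier with per-rank timings after the checkpoint write, so that
 *     a straggling rank shows up as a long barrier wait on the others.
 * Build/run, e.g.:  mpicc checkpoint_jitter.c -o cj && mpirun -n 4 ./cj
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for the real per-process checkpoint write: sleep for a
 * rank-dependent time to mimic ranks finishing at very different times. */
static void write_checkpoint(int rank)
{
  sleep(1 + rank % 4);
}

int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* (a) Rank 0 draws one random offset per job and broadcasts it, so all
   * ranks agree on the schedule, but jobs that start together no longer
   * checkpoint in lock-step. */
  double interval = 3.0 * 3600.0;                                 /* 3 hours */
  if (rank == 0) {
    srand((unsigned)time(NULL) ^ (unsigned)getpid());
    interval += (2.0 * rand() / RAND_MAX - 1.0) * 20.0 * 60.0;    /* +/- 20 min */
  }
  MPI_Bcast(&interval, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("checkpoint interval for this job: %.1f minutes\n", interval / 60.0);

  /* (b) One checkpoint "event": write, then barrier, then report.  A rank
   * that is slow to write appears as a long barrier wait on all others. */
  double t0 = MPI_Wtime();
  write_checkpoint(rank);
  double t1 = MPI_Wtime();
  MPI_Barrier(MPI_COMM_WORLD);
  double t2 = MPI_Wtime();
  printf("rank %d of %d: checkpoint write %.1f s, barrier wait %.1f s\n",
         rank, size, t1 - t0, t2 - t1);

  MPI_Finalize();
  return 0;
}

If the straggler hypothesis is right, the fast ranks should report barrier waits comparable to the slowest rank's write time.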
> I resubmitted, and the new job hangs later on. gdb says it is in CarpetIOScalar while doing output of a maximum reduction. I've disabled this and resubmitted.

--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
Re: [Users] Stampede
One thing I'm not sure of is when the send desc error is generated. It could be generated when the queue kills the job. This was the only job I was running at the time. It uses 16 nodes, with 32 MPI processes in total. I don't think it should have been able to overload the filesystem.

On 07/28/2014 06:14 AM, Ian Hinder wrote:

> I have seen errors when several runs checkpoint at the same time, as can happen if many jobs start simultaneously and dump a checkpoint after 3 hours. According to TACC support, there was nothing unusual in the system logs. I thought it would be useful to add a random delay to the checkpoint code. For example, in addition to telling it to checkpoint every 3 hours, you could say checkpoint every 3 hours, plus a random number between -20 and +20 minutes.
>
> The error message above suggests something to do with communication (send desc). Checkpointing itself shouldn't do any MPI communication, should it? Does it perform consistency checks across processes, or otherwise do communication? I also saw freezes during scalar reduction output (see quoted text below). Maybe some of the processes are taking much longer to checkpoint than others, and the ones which finish time out while trying to communicate? Maybe adding a barrier after checkpointing would make this clearer?
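As a footnote to the scalar-output freezes quoted above: a scalar "maximum" output ultimately involves a blocking MPI reduction, so every process has to reach it before rank 0 can write anything, and a single rank still stuck in or after checkpointing would stall all the others there. A minimal standalone illustration (not Carpet/CarpetIOScalar code):

/* max_reduce.c -- standalone illustration, not Carpet/CarpetIOScalar code.
 * A scalar "maximum" output boils down to a blocking collective: MPI_Reduce
 * with MPI_MAX returns on rank 0 only after every rank has contributed, so
 * one rank that never reaches this point hangs the output for everyone.
 * Build/run, e.g.:  mpicc max_reduce.c -o mr && mpirun -n 4 ./mr
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double local_max = 1.0 + rank;   /* stand-in for a per-process grid maximum */
  double global_max = 0.0;

  /* Blocking collective over all ranks. */
  MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0,
             MPI_COMM_WORLD);

  if (rank == 0)
    printf("global maximum = %g\n", global_max);  /* value the scalar output would write */

  MPI_Finalize();
  return 0;
}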
--
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Associate Professor
School of Mathematical Sciences
Rochester Institute of Technology
85 Lomb Memorial Drive
Rochester, NY 14623
Office: 74-2067
Phone: +1 585-475-6103
yo...@astro.rit.edu
Re: [Users] Stampede
I also tried using Roland's OpenMPI library. The errors were different, but the run died during checkpoint as before.

--
Dr. Yosef Zlochower
Re: [Users] Stampede
Hi,

I have been having problems running on Stampede for a long time. I couldn't get the latest stable ET to run because during checkpointing, it would die. I had to backtrack to the Orsted version (unfortunately, that has a bug in the way the grid is set up, causing some of the intermediate levels to span both black holes, wasting a lot of memory). Even with Orsted, stalling is a real issue. Currently, my solution is to run for 4 hours at a time. This would have been OK on Lonestar or Ranger, because when I chained a bunch of runs, the next in line would start almost right away, but on stampede the delay is quite substantial. I believe Jim Healy opened a ticket concerning the RIT issues with running ET on stampede.

On 05/02/2014 05:55 AM, Ian Hinder wrote:

> Hi all,
>
> Has anyone run into problems recently with Cactus jobs on Stampede? I've had jobs die when checkpointing, and also mysteriously hanging for no apparent reason. These might be separate problems. The checkpointing issue occurred when I submitted several jobs and they all started checkpointing at the same time after 3 hours. The hang happened after a few hours of evolution, with GDB reporting
>
>   MPIDI_CH3I_MRAILI_Get_next_vbuf (vc_ptr=0x7fff00d9a8d8, vbuf_ptr=0x13)
>       at src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:296
>   296       for (; i < mv2_MPIDI_CH3I_RDMA_Process.polling_group_size; ++i)
>
> Unfortunately I didn't ask for a backtrace. I'm using mvapich2. I've been in touch with support and they said the dying while checkpointing coincided with the filesystems being hit hard by my jobs, which makes sense, but they didn't see any problems in their logs, and they have no idea about the mysterious hang. I repeated the hanging job and it ran fine.
>
> --
> Ian Hinder
> http://numrel.aei.mpg.de/people/hinder
Re: [Users] Stampede
On 02 May 2014, at 14:08, Yosef Zlochower yo...@astro.rit.edu wrote:

> Hi, I have been having problems running on Stampede for a long time. I couldn't get the latest stable ET to run because during checkpointing, it would die.

OK, that's very interesting. Has something changed in the code related to how checkpoint files are written?

> I had to backtrack to the Orsted version (unfortunately, that has a bug in the way the grid is set up, causing some of the intermediate levels to span both black holes, wasting a lot of memory).

That bug should have been fixed in a backport; are you sure you are checking out the branch and not the tag? In any case, it can be worked around by setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html).

> Even with Orsted, stalling is a real issue. Currently, my solution is to run for 4 hours at a time. This would have been OK on Lonestar or Ranger, because when I chained a bunch of runs, the next in line would start almost right away, but on stampede the delay is quite substantial. I believe Jim Healy opened a ticket concerning the RIT issues with running ET on stampede.

I think this is the ticket: https://trac.einsteintoolkit.org/ticket/1547. I will add my information there. The current queue wait time on stampede is more than a day, so splitting into 3 hour chunks is not feasible, as you say. I'm starting to think it might be a code problem as well.

So the summary is:

– Checkpointing causes jobs to die with code versions after Oersted
– All versions lead to eventual hung jobs after a few hours

Since Stampede is the major capability resource in XSEDE, we should put some effort into making sure the ET can run properly there.

--
Ian Hinder
http://numrel.aei.mpg.de/people/hinder
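For reference, a toy sketch of the box-merging criterion that min_fraction controls, as I understand it. This is purely illustrative, not the actual CarpetRegrid2 algorithm, and the real code certainly differs in detail:

/* merge_fraction.c -- toy 1-D illustration only; NOT CarpetRegrid2 code.
 * Rough idea: two refined regions may be merged into their bounding box
 * only if the bounding box is "full enough", i.e.
 *   (points requested by the regions) / (points in bounding box) >= min_fraction.
 * With min_fraction = 1, merging is allowed only when nothing is wasted,
 * so an intermediate level should not get stretched across both black holes.
 * Build/run:  cc merge_fraction.c -o mf && ./mf
 */
#include <stdio.h>

struct box { int lo, hi; };                       /* closed range of grid points */

static int npoints(struct box b) { return b.hi - b.lo + 1; }

static int should_merge(struct box a, struct box b, double min_fraction)
{
  struct box hull = { a.lo < b.lo ? a.lo : b.lo,
                      a.hi > b.hi ? a.hi : b.hi };
  double fraction = (double)(npoints(a) + npoints(b)) / (double)npoints(hull);
  return fraction >= min_fraction;
}

int main(void)
{
  /* Two refined regions, e.g. around two nearby "black holes". */
  struct box bh1 = {   0,  99 };
  struct box bh2 = { 120, 219 };                  /* 20-point gap between them */

  printf("min_fraction = 0.9: merge? %s\n",
         should_merge(bh1, bh2, 0.9) ? "yes" : "no");   /* yes: 200/220 ~ 0.91 */
  printf("min_fraction = 1.0: merge? %s\n",
         should_merge(bh1, bh2, 1.0) ? "yes" : "no");   /* no: the gap would be wasted */
  return 0;
}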
Re: [Users] Stampede
On Fri, May 02, 2014 at 04:07:59PM +0200, Ian Hinder wrote:

> OK, that's very interesting. Has something changed in the code related to how checkpoint files are written?

There have been some changes, I believe, but I would probably look first at the machine configuration (option list) to see if something changed there. If it did, this should be easy to test, and much faster than a code change.

Frank
Re: [Users] Stampede
On 05/02/2014 10:07 AM, Ian Hinder wrote:

> That bug should have been fixed in a backport; are you sure you are checking out the branch and not the tag? In any case, it can be worked around by setting CarpetRegrid2::min_fraction = 1, assuming this is the same bug I am thinking of (http://cactuscode.org/pipermail/users/2013-January/003290.html).

I was using an old executable so it wouldn't have had the backport fix.

> So the summary is:
>
> – Checkpointing causes jobs to die with code versions after Oersted
> – All versions lead to eventual hung jobs after a few hours
>
> Since Stampede is the major capability resource in XSEDE, we should put some effort into making sure the ET can run properly there.

We find issues with runs stalling on our local cluster too. The hardware setup is similar to stampede (Intel Nehalem with QDR IB and OpenMPI on top of a proprietary IB library). There's no guarantee that the issues are the same, but we can try to run some tests locally (note that we have no issues with runs failing to checkpoint).

--
Dr. Yosef Zlochower
Center for Computational Relativity and Gravitation
Rochester Institute of Technology
Re: [Users] Stampede
Hello Ian, Yosef, all,

> I resubmitted, and the new job hangs later on. gdb says it is in CarpetIOScalar while doing output of a maximum reduction. I've disabled this and resubmitted.

If you are feeling desperate (we were for SpEC) then you can also compile a copy of OpenMPI and use that. If you want to give it a try, there is a copy in /work/00945/rhaas/software/openmpi-1.8, which provides its own ibrun replacement in /work/00945/rhaas/software/openmpi-1.8/bin/ibrun. To test, you can do something like

  /work/00945/rhaas/software/openmpi-1.8/bin/ibrun -n 32 /work/00945/rhaas/software/packages/mpihello/a.out

Yours,
Roland

--
My email is as private as my paper mail. I therefore support encrypting and signing email messages. Get my PGP key from http://keys.gnupg.net.
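For completeness, the mpihello test binary referenced above is presumably just a minimal MPI hello world along these lines; the actual source under /work/00945/rhaas/software/packages/mpihello may differ:

/* mpihello.c -- a guess at what the mpihello test program looks like.
 * Build it with the OpenMPI in question, e.g.
 *   /work/00945/rhaas/software/openmpi-1.8/bin/mpicc mpihello.c -o a.out
 * and launch it with the matching ibrun replacement as shown above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, size, namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(name, &namelen);

  /* One line per rank confirms that all 32 processes actually started
   * and that they landed on the nodes you expect. */
  printf("Hello from rank %d of %d on %s\n", rank, size, name);

  MPI_Finalize();
  return 0;
}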
Re: [Users] Stampede
Hi Ian, everyone,

I had those hanging jobs at random points quite often last year, Oct-Dec, and their support couldn't really help. They eventually went away, but stampede is not the most reliable machine, it seems, when it comes to these random errors.

Philipp

On May 2, 2014, at 9:51, Roland Haas roland.h...@physics.gatech.edu wrote:

> If you are feeling desperate (we were for SpEC) then you can also compile a copy of OpenMPI and use that. If you want to give it a try, there is a copy in /work/00945/rhaas/software/openmpi-1.8, which provides its own ibrun replacement in /work/00945/rhaas/software/openmpi-1.8/bin/ibrun.