[slurm-dev] Re: SPANK plug-in, slurm_spank_job_prolog
Hello,

We are running 2.6 (though I assume newer versions keep this behaviour), and the prolog script is not invoked after salloc; an srun afterwards will trigger it. If your users want to use native MPI launchers (mpiexec, etc.) you will have to run it manually somehow. Not sure whether this applies to your case, but we use Xeon Phis in our clusters and use the PrologSlurmctld directive to prepare the environment. It runs the script on the master node, so you have to be a bit creative there, but it is invoked even after salloc.

-- Nikita Burtsev

On 26 Jan 2014 at 06:00:55, Filippo Spiga (spiga.fili...@gmail.com) wrote:

On Jan 26, 2014, at 12:42 AM, Mark A. Grondona mgrond...@llnl.gov wrote:
> However, one thing to be aware of is that the prolog script is not executed on a node until at least one job step runs there.

Meaning that the salloc command does not invoke the prolog until I invoke srun. Correct? Thanks for the explanation!

Cheers, Filippo

-- Mr. Filippo SPIGA, M.Sc. ~ http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
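For reference, a minimal slurm.conf sketch of the two hooks discussed above; the script paths are placeholders, not from the thread. Prolog runs on a compute node only when the first job step lands there, while PrologSlurmctld runs on the controller host as soon as the allocation is created, so it fires even for a bare salloc:

    # slurm.conf (sketch; paths are placeholders)
    # Runs on each allocated compute node, but only when the first job step starts there
    Prolog=/etc/slurm/prolog.sh
    # Runs on the slurmctld host at allocation time, even for salloc with no job step
    PrologSlurmctld=/etc/slurm/prolog_slurmctld.sh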
[slurm-dev] documentation translations
Hello,

For one of our projects we needed to translate parts of the Slurm documentation into Russian, and we would like to publish it for everyone to use. I've looked around http://slurm.schedmd.com but did not find anything related to localised versions of the docs. What we have covers version 2.5, but I'm guessing adapting it to 2.6 would not be too time-consuming. Is there a way to put it on the official web site? Also, since it's not 100% done, something like Transifex could come in handy for collaborating on it.

-- Nikita Burtsev
[slurm-dev] Re: Required node not available (down or drained)
https://computing.llnl.gov/linux/slurm/troubleshoot.html#nodes

-- Nikita Burtsev

On Monday, August 26, 2013 at 7:43 PM, Sivasangari Nandy wrote:

And the log file is not informative:

tail -f /var/log/slurm-llnl/slurmd.log
...
[2013-08-26T11:52:16] Slurmd shutdown completing
[2013-08-26T11:52:56] slurmd version 2.3.4 started
[2013-08-26T11:52:56] slurmd started on Mon 26 Aug 2013 11:52:56 +0200
[2013-08-26T11:52:56] Procs=1 Sockets=1 Cores=1 Threads=1 Memory=2012 TmpDisk=9069 Uptime=1122626

From: Sivasangari Nandy sivasangari.na...@irisa.fr
To: slurm-dev slurm-dev@schedmd.com
Sent: Monday 26 August 2013 14:28:28
Subject: Re: [slurm-dev] Re: Required node not available (down or drained)

Hi,

I have checked some things; slurmctld and slurmd now run on a single machine (using just one node), so the test is easier. For that I modified the conf file (/etc/slurm-llnl/slurm.conf). Slurmctld and slurmd are both running; here is my ps result:

root@VM-667:/etc/slurm-llnl# ps -ef | grep slurm
root  31712 31706 0 11:44 pts/1 00:00:00 tail -f /var/log/slurm-llnl/slurmd.log
slurm 31990     1 0 11:52 ?     00:00:00 /usr/sbin/slurmctld
root  32103     1 0 11:52 ?     00:00:00 /usr/sbin/slurmd -c
root  32125 30346 0 11:53 pts/0 00:00:00 grep slurm

So I tried srun again but still got this error:

!srun
srun /omaha-beach/test.sh
srun: Required node not available (down or drained)
srun: job 64 queued and waiting for resources

Have you got any idea of the problem?

Thanks, Siva
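The troubleshooting page linked above boils down to a short sequence once slurmd is confirmed running on the nodes; a sketch using the node names from this thread (note that State=RESUME only clears the down/drained flag, it does not fix the underlying fault):

    # why does the controller consider the nodes down?
    sinfo -R
    scontrol show node VM-669
    # after fixing the underlying problem, return the nodes to service
    scontrol update NodeName=VM-[669-671] State=RESUME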
[slurm-dev] Re: Required node not available (down or drained)
You need to have slurmd running on all nodes that will execute jobs, so you should start it with the init script.

-- Nikita Burtsev

On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:

> check if the slurmd daemon is running with the command ps -el | grep slurmd.

Nothing came back from ps -el:

root@VM-667:~# ps -el | grep slurmd
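Not spelled out in the thread, but "start it with the init script" on this Debian-style install looks like the following; running slurmd in the foreground is also a quick way to surface startup errors (a sketch, assuming the paths used elsewhere in the thread):

    # on each compute node
    /etc/init.d/slurm-llnl start
    # or run slurmd in the foreground with verbose logging to see why it fails to start
    /usr/sbin/slurmd -D -vvv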
[slurm-dev] Re: Required node not available (down or drained)
VM-667, where you have slurmctld running, is your master; you don't need the agent part on it. As I understand your setup, VM-[669-671] are your actual compute nodes, so you need to check whether slurmd is running on those three and start it if needed.

-- Nikita Burtsev

On Thursday, August 22, 2013 at 12:02 PM, Sivasangari Nandy wrote:

That's what I did yesterday, actually:

/etc/init.d/slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
/usr/sbin/slurmctld already running.

-- Sivasangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69. Bureau: D152
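A minimal sketch of that per-node check, assuming root SSH access from the master to the three nodes (hostnames as given in the thread):

    for n in VM-669 VM-670 VM-671; do
      # report whether slurmd is up on each node; start it via the init script if not
      ssh "$n" 'pgrep -x slurmd >/dev/null && echo "slurmd running" || /etc/init.d/slurm-llnl start'
    done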
[slurm-dev] Re: Required node not available (down or drained)
slurmctld is the management process, and since you have access to squeue/sinfo information it is running just fine. You need to check whether slurmd (the agent part) is running on your nodes, i.e. VM-[669-671].

-- Nikita Burtsev

On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:

I have tried:

/etc/init.d/slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
/usr/sbin/slurmctld already running.

And:

scontrol show slurmd
scontrol: error: slurm_slurmd_info: Connection refused
slurm_load_slurmd_status: Connection refused

Hmm, how do I go about fixing that problem?

From: Danny Auble d...@schedmd.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday 21 August 2013 15:36:53
Subject: Re: [slurm-dev] Re: Required node not available (down or drained)

Check your slurmd log. It doesn't appear the slurmd is running.

Sivasangari Nandy sivasangari.na...@irisa.fr wrote:

Hello, I'm trying to use Slurm for the first time, and I think I have a problem with the nodes. I get this message when I use squeue:

root@VM-667:~# squeue
JOBID PARTITION    NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb test.sh root PD 0:00     1 (ReqNodeNotAvail)

or this one from another squeue:

root@VM-671:~# squeue
JOBID PARTITION    NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb test.sh root PD 0:00     1 (Resources)

sinfo gives me:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
SLURM-de*    up  infinite     3  down VM-[669-671]

I have already used Slurm once with the same configuration and I was able to run my job. But now, the second time, I always get:

srun: Required node not available (down or drained)
srun: job 51 queued and waiting for resources

Thanks in advance for your help, Siva

-- Sivasangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69. Bureau: D152
[slurm-dev] Re: Job submit plugin to improve backfill
Hello,

Why not enable this functionality by setting DefaultTime=0 in slurm.conf, which would let us set this on a per-partition basis rather than through a job submit plugin? (Unless I'm missing something obvious here.) Also, setting DefaultTime=0 currently (on 2.5.6 at least) gives the following message:

# srun -N2 hostname
srun: error: Unable to create job step: Job/step already completing or completed

I suppose that is the way it should be, but then it seems rather illogical that the value can be set at all.

-- Nikita Burtsev

On Friday, June 28, 2013 at 7:25 PM, Daniel M. Weeks wrote:

Hi Ryan,

Thanks. We had considered this approach but went in a different direction for a couple of reasons. We have a good number of users who script job submissions and may blast out several hundred jobs; a user might not realize their jobs are getting cut off until many of them have run, and that is a waste of resources. Also, we have many users who are relatively new to HPC/Slurm and work from guides or tutorials that don't explain things very well. A distinct error message at job submission, rather than a related error after a failure (from the user's perspective), keeps a lot of support emails out of my inbox. Of course I'd like them to learn to use Slurm better, but they usually want to focus on their own research first.

- Dan

On 06/28/2013 11:00 AM, Ryan Cox wrote:

An alternative that we use is to choose very low defaults for people:

PartitionName=Default DefaultTime=30:00 DefMemPerCPU=512  # plus other options

The disadvantage of this approach is that it doesn't give an obvious error message at submit time. However, it's not hard to figure out what happened when a job hits the time limit or the error output says it went over its memory limit.

Ryan

On 06/28/2013 08:29 AM, Daniel M. Weeks wrote:

At CCNI, we use backfill scheduling on all our systems. However, we have found that users typically do not specify a time limit for their jobs, so the scheduler assumes the maximum from QoS/user limits/partition limits/etc. This really hurts backfilling, since the scheduler remains ignorant of short jobs. Attached is a small patch I wrote containing a job submit plugin and a new error message. The plugin rejects a job submission when it is missing a time limit and gives the user a clear and distinct error. I've just re-tested, and the patch applies and builds cleanly on the slurm-2.5, slurm-2.6, and master branches. Please let me know if you find this useful, run across problems, or have suggestions/improvements. Thanks.

-- Ryan Cox, Operations Director, Fulton Supercomputing Lab, Brigham Young University

-- Daniel M. Weeks, Systems Programmer, Computational Center for Nanotechnology Innovations, Rensselaer Polytechnic Institute, Troy, NY 12180, 518-276-4458
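For reference, a minimal sketch of the low-defaults approach Ryan describes, written out as a full slurm.conf partition line; the node list, Default=YES, and MaxTime here are placeholder assumptions, not from the thread:

    # slurm.conf (sketch): conservative defaults so jobs without explicit limits stay backfill-friendly
    PartitionName=Default Nodes=node[01-16] Default=YES DefaultTime=30:00 MaxTime=24:00:00 DefMemPerCPU=512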