[slurm-dev] Re: SPANK plug-in, slurm_spank_job_prolog

2014-01-26 Thread Nikita Burtsev
Hello,

We are running 2.6 (but I'm assuming newer versions keep this behaviour), and the 
prolog script is not invoked after salloc; an srun afterwards will trigger it. If 
your users want to use native MPI launchers (mpiexec, etc.), you'll have to run it 
manually somehow. Not sure whether this applies to your case, but we use Xeon Phis 
in our clusters and rely on the PrologSlurmctld directive to prepare the 
environment. It runs the script on the master node, so you have to be a bit 
creative there, but it is invoked even after salloc.
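
Roughly, the behaviour and the workaround look like this (just a sketch, not our 
exact setup; the script path below is made up):

salloc -N1        # allocation is granted, but the per-node Prolog has not run yet
srun hostname     # the first job step reaches the node, and the Prolog runs now

# slurm.conf: run a preparation script on the controller as soon as the job is
# allocated, i.e. even for a bare salloc (example path)
PrologSlurmctld=/etc/slurm/prolog_ctld.sh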

-- 
Nikita Burtsev

On 26 Jan 2014 at 06:00:55, Filippo Spiga (spiga.fili...@gmail.com) wrote:

On Jan 26, 2014, at 12:42 AM, Mark A. Grondona mgrond...@llnl.gov wrote:
However, one thing to be aware of is that the prolog script is not
executed on a node until at least one job step runs there.

Meaning that the salloc command does not invoke the prolog until I invoke srun. 
Correct?

Thanks for the explanation!

Cheers,
Filippo

--
Mr. Filippo SPIGA, M.Sc.
http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert



[slurm-dev] documentation translations

2013-10-22 Thread Nikita Burtsev
Hello, 

For one of our projects we needed to translate parts of the Slurm documentation 
into Russian, and we would like to publish it for everyone to use. I've looked 
around http://slurm.schedmd.com but did not find anything related to localised 
versions of the docs. What we have covers version 2.5, but I'm guessing adapting 
it to 2.6 is not that time-consuming.

Is there a way to put it on the official web site? Also, since it's not 100% done, 
something like Transifex could come in handy for collaborating on it.

-- 
Nikita Burtsev



[slurm-dev] Re: Required node not available (down or drained)

2013-08-26 Thread Nikita Burtsev
https://computing.llnl.gov/linux/slurm/troubleshoot.html#nodes  

--  
Nikita Burtsev


On Monday, August 26, 2013 at 7:43 PM, Sivasangari Nandy wrote:

 And the log file is not informative  
  
 tail -f /var/log/slurm-llnl/slurmd.log
  
 ...
 [2013-08-26T11:52:16] Slurmd shutdown completing
 [2013-08-26T11:52:56] slurmd version 2.3.4 started
 [2013-08-26T11:52:56] slurmd started on Mon 26 Aug 2013 11:52:56 +0200
 [2013-08-26T11:52:56] Procs=1 Sockets=1 Cores=1 Threads=1 Memory=2012 
 TmpDisk=9069 Uptime=1122626
  
  
   From: Sivasangari Nandy sivasangari.na...@irisa.fr
   To: slurm-dev slurm-dev@schedmd.com
   Sent: Monday, 26 August 2013 14:28:28
   Subject: Re: [slurm-dev] Re: Required node not available (down or drained)
   
  Hi,  
   
  I have checked some things; now my slurmctld and slurmd are on a single 
  machine (using just one node), so the test is easier.
  For that I have modified the conf file: vi /etc/slurm-llnl/slurm.conf
   
  Slurmctld and slurmd are both running; here is my ps result:

  root@VM-667:/etc/slurm-llnl# ps -ef | grep slurm
  root  31712 31706  0 11:44 pts/1  00:00:00 tail -f /var/log/slurm-llnl/slurmd.log
  slurm 31990     1  0 11:52 ?      00:00:00 /usr/sbin/slurmctld
  root  32103     1  0 11:52 ?      00:00:00 /usr/sbin/slurmd -c
  root  32125 30346  0 11:53 pts/0  00:00:00 grep slurm
   
  So I have tried srun again but still get this error:
   
  !srun
  srun /omaha-beach/test.sh
  srun: Required node not available (down or drained)
  srun: job 64 queued and waiting for resources
   
   
  Have you got any idea of the problem?
  Thanks,
   
  Siva
   
    From: Nikita Burtsev nikita.burt...@gmail.com
    To: slurm-dev slurm-dev@schedmd.com
    Sent: Thursday, 22 August 2013 09:59:52
    Subject: [slurm-dev] Re: Required node not available (down or drained)

    You need to have slurmd running on all nodes that will execute jobs, so 
    you should start it with the init script.

   --  
   Nikita Burtsev
   Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


   On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:

check if the slurmd daemon is running with the command ps -el | grep 
slurmd.
 
Nothing happens with ps -el ...
 
root@VM-667:~# ps -el | grep slurmd
 
 From: Nikita Burtsev nikita.burt...@gmail.com
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Wednesday, 21 August 2013 18:58:52
 Subject: [slurm-dev] Re: Required node not available (down or drained)
  
 slurmctld is the management process, and since you have access to 
 squeue/sinfo information it is running just fine. You need to check 
 whether slurmd (which is the agent part) is running on your nodes, i.e. 
 VM-[669-671]
  
 --  
 Nikita Burtsev
  
  
 On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:
  
  I have tried :  
   
  /etc/init.d/slurm-llnl start
   
  [ ok ] Starting slurm central management daemon: slurmctld.
  /usr/sbin/slurmctld already running.
   
  And :  
   
  scontrol show slurmd
   
  scontrol: error: slurm_slurmd_info: Connection refused
  slurm_load_slurmd_status: Connection refused
   
   
  Hmm, how should I proceed to fix that problem?
   
   
    From: Danny Auble d...@schedmd.com
    To: slurm-dev slurm-dev@schedmd.com
    Sent: Wednesday, 21 August 2013 15:36:53
    Subject: [slurm-dev] Re: Required node not available (down or drained)

   Check your slurmd log. It doesn't appear the slurmd is running.

    Sivasangari Nandy sivasangari.na...@irisa.fr wrote:
  Hello,  
   
  I'm trying to use Slurm for the first time, and I think I have a 
  problem with the nodes.
  I get this message when I use squeue:
   
  root@VM-667:~# squeue
    JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
       50 SLURM-deb  test.sh root PD  0:00     1 (ReqNodeNotAvail)
   
  or this one with another squeue:

  root@VM-671:~# squeue
    JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
       50 SLURM-deb  test.sh root PD  0:00     1 (Resources)
   
  sinfo gives me:

  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  SLURM-de*    up   infinite      3   down VM-[669-671]

[slurm-dev] Re: Required node not available (down or drained)

2013-08-22 Thread Nikita Burtsev
You need to have slurmd running on all nodes that will execute jobs, so you 
should start it with the init script.
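
For example, on each compute node (a sketch; the init script name is the one from 
the Debian packaging used elsewhere in this thread):

/etc/init.d/slurm-llnl start    # starts slurmd on a compute node
ps -el | grep slurmd            # verify it is actually running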

--  
Nikita Burtsev
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:

 check if the slurmd daemon is running with the command ps -el | grep 
 slurmd.
  
 Nothing happens with ps -el ...
  
 root@VM-667:~# ps -el | grep slurmd
  
  From: Nikita Burtsev nikita.burt...@gmail.com
  To: slurm-dev slurm-dev@schedmd.com
  Sent: Wednesday, 21 August 2013 18:58:52
  Subject: [slurm-dev] Re: Required node not available (down or drained)
   
  slurmctld is the management process, and since you have access to 
  squeue/sinfo information it is running just fine. You need to check if 
  slurmd (which is the agent part) is running on your nodes, i.e. 
  VM-[669-671]
   
  --  
  Nikita Burtsev
   
   
  On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:
   
   I have tried :  

   /etc/init.d/slurm-llnl start

   [ ok ] Starting slurm central management daemon: slurmctld.
   /usr/sbin/slurmctld already running.

   And :  

   scontrol show slurmd

   scontrol: error: slurm_slurmd_info: Connection refused
   slurm_load_slurmd_status: Connection refused


    Hmm, how should I proceed to fix that problem?


From: Danny Auble d...@schedmd.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 15:36:53
Subject: [slurm-dev] Re: Required node not available (down or drained)
 
Check your slurmd log. It doesn't appear the slurmd is running.
 
Sivasangari Nandy sivasangari.na...@irisa.fr wrote:
   Hello,  

    I'm trying to use Slurm for the first time, and I think I have a problem 
    with the nodes.
    I get this message when I use squeue:

    root@VM-667:~# squeue
      JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
         50 SLURM-deb  test.sh root PD  0:00     1 (ReqNodeNotAvail)

    or this one with another squeue:

    root@VM-671:~# squeue
      JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
         50 SLURM-deb  test.sh root PD  0:00     1 (Resources)

    sinfo gives me:

    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    SLURM-de*    up   infinite      3   down VM-[669-671]


    I have already used Slurm once with the same configuration and I was 
    able to run my job.
    But now, the second time, I always get:

   srun: Required node not available (down or drained)
   srun: job 51 queued and waiting for resources


   Advance thanks for your help,  
   Siva


   
   
  
  
  
 
 




   --  
   Sivasangari NANDY -  Plate-forme GenOuest
   IRISA-INRIA, Campus de Beaulieu
   263 Avenue du Général Leclerc
   35042 Rennes cedex, France
    Tel: +33 (0) 2 99 84 25 69
    Office: D152

   
  
  
  
 --  
 Sivasangari NANDY -  Plate-forme GenOuest
 IRISA-INRIA, Campus de Beaulieu
 263 Avenue du Général Leclerc
 35042 Rennes cedex, France
 Tel: +33 (0) 2 99 84 25 69
 Office: D152
  



[slurm-dev] Re: Required node not available (down or drained)

2013-08-22 Thread Nikita Burtsev
VM-667, where you have slurmctld running, is your master; you don't need the 
agent part on it. As I understand your setup, VM-[669-671] are your actual 
compute nodes, so you need to check whether slurmd is running on those three and 
start it if needed.
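
Something like this, as a sketch (the node names are the ones from your setup):

# on each of VM-669, VM-670 and VM-671
ps -el | grep slurmd            # no output means slurmd is not running
/etc/init.d/slurm-llnl start    # start it if needed

# then, from VM-667, clear the down state once slurmd is back
scontrol update NodeName=VM-[669-671] State=RESUME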

--  
Nikita Burtsev


On Thursday, August 22, 2013 at 12:02 PM, Sivasangari Nandy wrote:

 That's what I did yesterday, actually:
  
 /etc/init.d/slurm-llnl start
  
 [ ok ] Starting slurm central management daemon: slurmctld.
 /usr/sbin/slurmctld already running.
  
  From: Nikita Burtsev nikita.burt...@gmail.com
  To: slurm-dev slurm-dev@schedmd.com
  Sent: Thursday, 22 August 2013 09:59:52
  Subject: [slurm-dev] Re: Required node not available (down or drained)
   
  You need to have slurmd running on all nodes that will execute jobs, so you 
  should start it with the init script.
   
  --  
  Nikita Burtsev
  Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
   
   
  On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:
   
   check if the slurmd daemon is running with the command ps -el | grep 
   slurmd.

    Nothing happens with ps -el ...

   root@VM-667:~# ps -el | grep slurmd

From: Nikita Burtsev nikita.burt...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 18:58:52
Subject: [slurm-dev] Re: Required node not available (down or drained)
 
slurmctld is the management process, and since you have access to 
squeue/sinfo information it is running just fine. You need to check if 
slurmd (which is the agent part) is running on your nodes, i.e. 
VM-[669-671]
 
--  
Nikita Burtsev
 
 
On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:
 
 I have tried :  
  
 /etc/init.d/slurm-llnl start
  
 [ ok ] Starting slurm central management daemon: slurmctld.
 /usr/sbin/slurmctld already running.
  
 And :  
  
 scontrol show slurmd
  
 scontrol: error: slurm_slurmd_info: Connection refused
 slurm_load_slurmd_status: Connection refused
  
  
 Hmm, how should I proceed to fix that problem?
  
  
  From: Danny Auble d...@schedmd.com
  To: slurm-dev slurm-dev@schedmd.com
  Sent: Wednesday, 21 August 2013 15:36:53
  Subject: [slurm-dev] Re: Required node not available (down or drained)
   
  Check your slurmd log. It doesn't appear the slurmd is running.
   
  Sivasangari Nandy sivasangari.na...@irisa.fr wrote:
 Hello,  
  
 I'm trying to use Slurm for the first time, and I think I have a 
 problem with the nodes.
 I get this message when I use squeue:
  
 root@VM-667:~# squeue
   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
      50 SLURM-deb  test.sh root PD  0:00     1 (ReqNodeNotAvail)
  
 or this one with another squeue:

 root@VM-671:~# squeue
   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
      50 SLURM-deb  test.sh root PD  0:00     1 (Resources)
  
 sinfo gives me:

 PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 SLURM-de*    up   infinite      3   down VM-[669-671]
  
  
 I have already used Slurm once with the same configuration and I was 
 able to run my job.
 But now, the second time, I always get:
  
 srun: Required node not available (down or drained)
 srun: job 51 queued and waiting for resources
  
  
 Advance thanks for your help,  
 Siva
  
  
 
 



   
   
  
  
  
  
 --  
 Sivasangari NANDY -  Plate-forme GenOuest
 IRISA-INRIA, Campus de Beaulieu
 263 Avenue du Général Leclerc
 35042 Rennes cedex, France
 Tel: +33 (0) 2 99 84 25 69
 Office: D152
  
 



   --  
   Sivasangari NANDY -  Plate-forme GenOuest
   IRISA-INRIA, Campus de Beaulieu
   263 Avenue du Général Leclerc
   35042 Rennes cedex, France
    Tel: +33 (0) 2 99 84 25 69
    Office: D152

   
  
  
  
 --  
 Sivasangari NANDY -  Plate-forme GenOuest
 IRISA-INRIA, Campus de Beaulieu
 263 Avenue du Général Leclerc
 35042 Rennes cedex, France
 Tel: +33 (0) 2 99 84 25 69
 Office: D152
  



[slurm-dev] Re: Required node not available (down or drained)

2013-08-21 Thread Nikita Burtsev
slurmctld is the management process, and since you have access to squeue/sinfo 
information it is running just fine. You need to check whether slurmd (which is the 
agent part) is running on your nodes, i.e. VM-[669-671]
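
A quick way to check this from the master is something like (sketch):

sinfo -R                    # lists down/drained nodes together with the recorded reason
scontrol show node VM-669   # per-node detail, including its State and Reason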

--  
Nikita Burtsev


On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:

 I have tried :  
  
 /etc/init.d/slurm-llnl start
  
 [ ok ] Starting slurm central management daemon: slurmctld.
 /usr/sbin/slurmctld already running.
  
 And :  
  
 scontrol show slurmd
  
 scontrol: error: slurm_slurmd_info: Connection refused
 slurm_load_slurmd_status: Connection refused
  
  
 Hmm, how should I proceed to fix that problem?
  
  
  From: Danny Auble d...@schedmd.com
  To: slurm-dev slurm-dev@schedmd.com
  Sent: Wednesday, 21 August 2013 15:36:53
  Subject: [slurm-dev] Re: Required node not available (down or drained)
   
  Check your slurmd log. It doesn't appear the slurmd is running.
   
  Sivasangari Nandy sivasangari.na...@irisa.fr wrote:
 Hello,  
  
 I'm trying to use Slurm for the first time, and I think I have a problem 
 with the nodes.
 I get this message when I use squeue:
  
 root@VM-667:~# squeue
   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
      50 SLURM-deb  test.sh root PD  0:00     1 (ReqNodeNotAvail)
  
 or this one with another squeue:

 root@VM-671:~# squeue
   JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
      50 SLURM-deb  test.sh root PD  0:00     1 (Resources)
  
 sinfo gives me:

 PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 SLURM-de*    up   infinite      3   down VM-[669-671]
  
  
 I have already used Slurm once with the same configuration and I was able to 
 run my job.
 But now, the second time, I always get:
  
 srun: Required node not available (down or drained)
 srun: job 51 queued and waiting for resources
  
  
 Advance thanks for your help,  
 Siva
  
  
 
 



   
   
  
  
  
  
 --  
 Sivasangari NANDY -  Plate-forme GenOuest
 IRISA-INRIA, Campus de Beaulieu
 263 Avenue du Général Leclerc
 35042 Rennes cedex, France
 Tel: +33 (0) 2 99 84 25 69
 Office: D152
  



[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Nikita Burtsev
Hello, 

Why not enable this functionality by setting DefaultTime=0 in slurm.conf, which 
would let us set it on a per-partition basis rather than through a job submit 
plugin? (Unless I'm missing something obvious here.)

Also, currently setting DefaultTime=0 (on 2.5.6 at least) gives the following 
message:
# srun -N2 hostname
srun: error: Unable to create job step: Job/step already completing or completed


I suppose that is the way it should be, but it seems rather illogical to be able to 
set such a value at all.
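
For reference, the kind of per-partition setting I have in mind looks like this 
(a sketch; the partition and node names are made up, and as noted above 
DefaultTime=0 itself misbehaves on 2.5.6):

PartitionName=batch Nodes=node[01-16] DefaultTime=30:00 MaxTime=24:00:00 State=UP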

-- 
Nikita Burtsev
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Friday, June 28, 2013 at 7:25 PM, Daniel M. Weeks wrote:

 
 Hi Ryan,
 
 Thanks. We had considered this approach but went in a different
 direction for a couple reasons:
 
 We have a good number of users who script job submissions and may blast
 out up to several hundred jobs. A user might not realize their jobs are
 getting cut off until many of them have run, and that is a waste of resources.
 
 Also, we have many users who are relatively new to HPC/Slurm and work
 from guides or tutorials that don't explain things very well. A distinct
 error message at job submission, rather than a related error after a
 failure (from the user's perspective), keeps a lot of support
 emails out of my inbox. Of course I'd like them to learn to use Slurm
 better, but they usually want to focus on their own research first.
 
 - Dan
 
 On 06/28/2013 11:00 AM, Ryan Cox wrote:
   An alternative that we use is to choose very low defaults for people:
  PartitionName=Default DefaultTime=30:00 #plus other options 
  DefMemPerCPU=512
  
  The disadvantage to this approach is that it doesn't give an obvious
  error message at submit time. However, it's not hard to figure out what
  happened when they hit the time limit or the error output says they went
  over their memory limit.
  
  Ryan
  
  On 06/28/2013 08:29 AM, Daniel M. Weeks wrote:
   At CCNI, we use backfill scheduling on all our systems. However, we have
   found that users typically do not specify a time limit for their job so
   the scheduler assumes the maximum from QoS/user limits/partition
   limits/etc. This really hurts backfilling since the scheduler remains
   ignorant of short jobs.
   
   Attached is a small patch I wrote containing a job submit plugin and a
   new error message. The plugin rejects a job submission when it is
   missing a time limit and will provide the user with a clear and distinct
   error.
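
    From the user's side the intent is roughly the following (a sketch; the exact
    error text comes from the patch and is not reproduced here):

    sbatch job.sh                   # no --time given: submission is rejected with a clear message
    sbatch --time=00:30:00 job.sh   # accepted, and backfill now knows the job is short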
   
   I've just re-tested and the patch applies and builds cleanly on the
   slurm-2.5, slurm-2.6, and master branches.
   
   Please let me know if you find this useful, run across problems, or have
   suggestions/improvements. Thanks.
   
  
  
  -- 
  Ryan Cox
  Operations Director
  Fulton Supercomputing Lab
  Brigham Young University
  
 
 
 
 -- 
 Daniel M. Weeks
 Systems Programmer
 Computational Center for Nanotechnology Innovations
 Rensselaer Polytechnic Institute
 Troy, NY 12180
 518-276-4458