[slurm-dev] mpich2 for job running in all nodes
I have tried with salloc directly:

salloc -N 2 mpiexec /path/my_application

With sview I can see that the 2 nodes are allocated simultaneously, which is great, but my output log shows that my application ran 2 times. I don't want it to run 2 times; I just want to use the 2 nodes to run my single application faster... :( Does anyone have an idea?

Thanks in advance,
Siva

----- Original Message -----
From: Moe Jette je...@schedmd.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Friday, 20 September 2013 17:27:53
Subject: [slurm-dev] Re: mpich2 for job running in all nodes

Slurm use with the various MPI distributions is documented here:
http://slurm.schedmd.com/mpi_guide.html#mpich2

Quoting Sivasangari Nandy sivasangari.na...@irisa.fr:

Hi,

Does anyone know how to use the mpich2 command, please? I wanted to execute Bowtie2 (a mapping tool) on my three nodes (VM-669, VM-670 and VM-671) with this command:

srun mpiexec -n 3 -machinefile Mname.txt /omaha-beach/workflow/bowtie2.sh

But I got this error:

srun: error: Only allocated 1 nodes asked for 3

-n is for the number of processors (I have one processor per node; see the attached slurm.conf). Mname.txt is my file with the node names (VM-...). bowtie2.sh looks like this:

#!/bin/bash
bowtie2 -p 3 -x Indexed_Bowtie -q -I 0 -X 1000 -1 r1.fastq -2 r2.fastq -S bowtie2.sam

Thanks for your help,
Siva

--
Siva sangari NANDY - Plate-forme GenOuest
IRISA-INRIA, Campus de Beaulieu
263 Avenue du Général Leclerc
35042 Rennes cedex, France
Tél: +33 (0) 2 99 84 25 69  Bureau : D152
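For reference, the usual pattern for running a single MPI job across two nodes is to request the allocation once and let srun (or an MPICH2 mpiexec built with Slurm support, per the mpi_guide page above) start the ranks inside it. This is only a sketch: my_mpi_app is a placeholder for a genuinely MPI-parallel binary.

#!/bin/bash
#SBATCH -N 2              # two nodes
#SBATCH --ntasks=2        # one task (MPI rank) per node
# srun starts the tasks on the allocated nodes; the script itself runs only once.
srun ./my_mpi_app
# (or: mpiexec -n $SLURM_NTASKS ./my_mpi_app, with MPICH2's Slurm/PMI support)

Submit it with "sbatch job.sh". A program that is not MPI-parallel (bowtie2, for example) cannot be spread across nodes this way; launching it under mpiexec or srun with two tasks simply starts two independent copies, which is exactly the duplication described above. Bowtie2's -p option only uses threads within a single node.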
[slurm-dev] sacct error
root@VM-667:/omaha-beach/workflow# sacct 208
SLURM accounting storage is disabled

I don't understand why I get this error... I've attached my conf file. Have you already had this problem?

Attachment: slur.conf (binary data)
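The message comes from AccountingStorageType=accounting_storage/none in the attached slurm.conf: with that plugin sacct has no store to read from. A minimal sketch of the simplest, file-based setup (option names are from the slurm.conf man page; the path is illustrative):

AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

# create the file writable by SlurmUser, then restart the daemons on the
# controller and on every compute node:
touch /var/log/slurm-llnl/slurm_jobacct.log
chown slurm: /var/log/slurm-llnl/slurm_jobacct.log
/etc/init.d/slurm-llnl stop && /etc/init.d/slurm-llnl start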
[slurm-dev] RE: can't make sacct
I get this error when I try "sacct JOBID":

SLURM accounting storage is disabled

Here is my slurmctld log file (tail -f /var/log/slurm-llnl/slurmctld.log):

[2013-09-19T10:58:20] sched: _slurm_rpc_allocate_resources JobId=179 NodeList=VM-669 usec=70
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.0 VM-669 usec=197
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.1 VM-669 usec=187
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.1 usec=17
[2013-09-19T10:59:00] completing job 179
[2013-09-19T10:59:00] sched: job_complete for JobId=179 successful
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.0 usec=8

And my conf file (/etc/slurm-llnl/slurm.conf):

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=VM-667
ControlAddr=192.168.2.26
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP

----- Original Message -----
From: Nancy Kritkausky nancy.kritkau...@bull.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 18 September 2013 18:29:56
Subject: [slurm-dev] RE: can't make sacct

Hello Siva,

There is not a lot of information to go on from your email. What type of accounting do you have configured? What do your slurm.conf and slurmdbd.conf files look like? I would also suggest looking at your slurmdbd.log and slurmd.log to see what is going on, or sending them to the dev list.

Nancy

From: Sivasangari Nandy [mailto:sivasangari.na...@irisa.fr]
Sent: Wednesday, September 18, 2013 9:02 AM
To: slurm-dev
Subject: [slurm-dev] can't make sacct

Hi,

Does anyone know why my sacct command doesn't work? I got this:

root@VM-667:/omaha-beach/workflow# sacct
JobID JobName
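Nancy's questions point at the two relevant lines in the conf above: AccountingStorageType=accounting_storage/none and JobAcctGatherType=jobacct_gather/none together disable accounting entirely, so there is no slurmdbd.conf to show. A sketch of the database-backed alternative she is alluding to (host names, password and database name are illustrative; a running slurmdbd and MySQL are assumed):

# slurm.conf, on the controller and all nodes:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=VM-667
JobAcctGatherType=jobacct_gather/linux

# slurmdbd.conf, on the host running slurmdbd:
AuthType=auth/munge
DbdHost=VM-667
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=secret
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm-llnl/slurmdbd.log

# then start slurmdbd, restart slurmctld, and register the cluster once:
sacctmgr add cluster cluster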
[slurm-dev] can't make sacct
Hi,

Does anyone know why my sacct command doesn't work? I got this:

root@VM-667:/omaha-beach/workflow# sacct
JobID JobName Partition Account AllocCPUS State ExitCode
-- -- -- -- --
/var/log/slurm_jobacct.log: No such file or directory

Thanks in advance,
Siva
--
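Here sacct is looking for a plain-text accounting file and not finding it. A quick way to see which accounting plugins the running controller actually uses (standard commands):

scontrol show config | grep -i AccountingStorage
scontrol show config | grep -i JobAcctGather
# With accounting_storage/none there is nothing for sacct to read.
# With accounting_storage/filetxt, the file named by AccountingStorageLoc
# (the path in the error above) must exist and be writable by SlurmUser.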
[slurm-dev] mpich2 to use multiple machines
Hello,

I have a small problem with mpich2 for Slurm. I want to run my jobs on more than one machine (here, for the test, I just wanted VM-669, so the file Mname.txt contains only VM-669). From the master (VM-667) I run:

mpiexec -machinefile Mname.txt -np 1 /bin/sleep 60

but I get these errors:

[proxy:0:0@VM-669] launch_procs (./pm/pmiserv/pmip_cb.c:687): unable to change wdir to /root/omaha-beach (No such file or directory)
[proxy:0:0@VM-669] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:935): launch_procs returned error
[proxy:0:0@VM-669] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@VM-669] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@VM-667] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec@VM-667] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@VM-667] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@VM-667] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion

Have you got any idea?

Thanks in advance,
Siva

--
Siva sangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69, Bureau: D152
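The first line of the trace is the actual failure: the hydra proxy on VM-669 tries to change into the directory mpiexec was started from on VM-667 (/root/omaha-beach), and that directory does not exist on VM-669. Two sketches of a workaround, assuming /omaha-beach is the path shared by all the VMs; -wdir is hydra's flag for setting the remote working directory:

# start mpiexec from a directory that exists on every node
cd /omaha-beach && mpiexec -machinefile Mname.txt -np 1 /bin/sleep 60

# or set the remote working directory explicitly
mpiexec -machinefile Mname.txt -np 1 -wdir /omaha-beach /bin/sleep 60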
[slurm-dev] Re: Failed to allocate resource : Unable to contact slurm controller
Hello,

Try this:

/etc/init.d/slurm-llnl start
/etc/init.d/slurm-llnl stop
/etc/init.d/slurm-llnl startclean

----- Original Message -----
From: Arjun J Rao rectangle.k...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Tuesday, 17 September 2013 11:33:53
Subject: [slurm-dev] Failed to allocate resource : Unable to contact slurm controller

I want to run Slurm on a single machine as a proof of concept, to run some trivial MPI programs on my machine. I keep getting the message:

Failed to allocate resources: Unable to contact slurm controller

In my slurm.conf file I have named the ControlMachine as localhost and the ControlAddr as 127.0.0.1, and the compute node name as localhost with NodeAddr 127.0.0.1 too. What am I doing wrong?

Scientific Linux 6.4 64-bit

--
Siva sangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69, Bureau: D152
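"Unable to contact slurm controller" means the client commands cannot reach slurmctld at all, so the first thing to verify on a one-box setup is that the daemons are actually running and that the node definition matches the hostname. A sketch (the slurm-llnl init script above is the Debian/Ubuntu packaging; on Scientific Linux the script is usually /etc/init.d/slurm, and the paths are illustrative):

scontrol ping                        # should report the primary controller as UP
ps -ef | grep -E 'slurmctld|slurmd'  # both daemons should appear
tail -n 50 /var/log/slurm/slurmctld.log

# illustrative one-machine slurm.conf fragment:
ControlMachine=localhost
NodeName=localhost CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP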
[slurm-dev] Re: nodes often down
Whaou, thanks a lot Hongjia Cao! It's working now with CPUs=1 and Sockets=1. Physically (cat /proc/cpuinfo) there is just 1 CPU, so with 2 CPUs in the Slurm conf file it did not work.

Thanks again, bye.

----- Original Message -----
From: Hongjia Cao hj...@nudt.edu.cn
To: slurm-dev slurm-dev@schedmd.com
Sent: Friday, 13 September 2013 06:24:53
Subject: [slurm-dev] Re: nodes often down

I guess that you have the wrong number of CPUs per node configured. Please try changing the configuration file. Or you may try FastSchedule=1.

On Tue, 2013-09-10 at 23:54 -0700, Sivasangari Nandy wrote:

root@VM-667:/omaha-beach/workflow# sinfo -R
REASON    USER   TIMESTAMP            NODELIST
Low CPUs  slurm  2013-09-10T19:38:37  VM-[669-671]

Then when I change the nodes to idle and type sinfo -R, there is nothing. I wanted to know how to have permanently idle nodes.

----- Original Message -----
From: Hongjia Cao hj...@nudt.edu.cn
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 11 September 2013 02:26:54
Subject: [slurm-dev] Re: nodes often down

You may run sinfo -R to see the reason that the node is left down. ReturnToService=1 cannot recover all down nodes.

----- Original Message -----
From: Sivasangari Nandy sivasangari.na...@irisa.fr
Sent: 2013-09-10 20:38:43 (Tuesday)
To: slurm-dev slurm-dev@schedmd.com
Cc: slurm-dev slurm-dev@schedmd.com
Subject: [slurm-dev] Re: nodes often down

No, it was like that at first, so by default. And yes, I've restarted Slurm but no change; after 10 minutes the nodes are all down. Here is my conf file if needed:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=VM-667
ControlAddr=192.168.2.26
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram
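The mismatch Hongjia Cao diagnosed can be checked directly on a compute node before editing slurm.conf. A sketch (slurmd -C prints the hardware slurmd detects; if that option is not available in this 2.3.4 build, cat /proc/cpuinfo gives the same information):

# on VM-669:
slurmd -C
# expected output of the form:
# NodeName=VM-669 CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 ...

# slurm.conf should then declare no more than what the node really has:
NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN

A node that registers with fewer CPUs than configured is marked DOWN with the "Low CPUs" reason seen in the sinfo -R output above, which is why the nodes kept dropping out while CPUs=2 was configured.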
[slurm-dev] nodes often down
Hello,

My nodes are often in the down state, so I have to open sview and activate all the nodes manually to put them to 'idle' in order to run jobs. I've seen in the FAQ that I can change the slurm.conf file:

"The configuration parameter ReturnToService in slurm.conf controls how DOWN nodes are handled. Set its value to one in order for DOWN nodes to automatically be returned to service once the slurmd daemon registers with a valid node configuration."

However, in my file ReturnToService is already set to 1.

Thanks in advance,
Siva

--
Siva sangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69, Bureau: D152
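ReturnToService=1 only returns a node to service when its slurmd registration is valid, so if the registration itself is rejected the node goes straight back down. The usual first steps, instead of flipping states in sview (standard commands, node names taken from this thread):

sinfo -R                                             # show the Reason each node is down
scontrol show node VM-669 | grep -E 'State|Reason'
scontrol update NodeName=VM-[669-671] State=RESUME   # clear DOWN once the cause is fixed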
[slurm-dev] Re: nodes often down
No, it was like that at first, so by default. And yes, I've restarted Slurm but no change; after 10 minutes the nodes are all down. Here is my conf file if needed:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=VM-667
ControlAddr=192.168.2.26
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#MaxJobId=99
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=4
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=VM-[669-671] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP

----- Original Message -----
From: Alan V. Cowles alan.cow...@duke.edu
To: slurm-dev slurm-dev@schedmd.com
Cc: Sivasangari Nandy sivasangari.na...@irisa.fr
Sent: Tuesday, 10 September 2013 13:17:36
Subject: Re: [slurm-dev] nodes often down

Siva,

Was that the default setting you had in place with your original config, or a change you made recently to combat the downed-nodes problem? And did you restart Slurm or do a reconfigure to re-read the slurm.conf file? I've found some changes don't take effect with a reconfigure, and you have to restart.

AC

On 09/10/2013 04:01 AM, Sivasangari Nandy wrote:

Hello,

My nodes are often in the down state, so I have to open sview and activate all the nodes manually to put them to 'idle' in order to run jobs. I've seen in the FAQ that I can change the slurm.conf file:

"The configuration parameter ReturnToService in slurm.conf controls how DOWN nodes are handled. Set its value to one in order for DOWN nodes to automatically be returned to service once the slurmd daemon registers with a valid node configuration."

However, in my file ReturnToService is already set to 1.

Thanks in advance,
Siva

--
Siva sangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69, Bureau: D152

-- Siva sangari NANDY - Plate
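Alan's distinction between a reconfigure and a restart matters for the node definitions above: changes to lines such as NodeName=... CPUs=... are generally only picked up by restarting the daemons, not by a reconfigure. A sketch with the slurm-llnl init scripts used in this thread:

scontrol reconfigure                 # asks the running daemons to re-read slurm.conf
# for NodeName/CPUs changes, restart instead, on the controller and on every node:
/etc/init.d/slurm-llnl stop
/etc/init.d/slurm-llnl start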
[slurm-dev] Re: Required node not available (down or drained)
And the log file is not informative (tail -f /var/log/slurm-llnl/slurmd.log):

...
[2013-08-26T11:52:16] Slurmd shutdown completing
[2013-08-26T11:52:56] slurmd version 2.3.4 started
[2013-08-26T11:52:56] slurmd started on Mon 26 Aug 2013 11:52:56 +0200
[2013-08-26T11:52:56] Procs=1 Sockets=1 Cores=1 Threads=1 Memory=2012 TmpDisk=9069 Uptime=1122626

----- Original Message -----
From: Sivasangari Nandy sivasangari.na...@irisa.fr
To: slurm-dev slurm-dev@schedmd.com
Sent: Monday, 26 August 2013 14:28:28
Subject: Re: [slurm-dev] Re: Required node not available (down or drained)

Hi,

I have checked some things; now my slurmctld and slurmd are on a single machine (using just one node), so the test is easier. For that I modified the conf file (vi /etc/slurm-llnl/slurm.conf). Slurmctld and slurmd are both running; here is my ps result:

root@VM-667:/etc/slurm-llnl# ps -ef | grep slurm
root  31712 31706  0 11:44 pts/1  00:00:00 tail -f /var/log/slurm-llnl/slurmd.log
slurm 31990     1  0 11:52 ?      00:00:00 /usr/sbin/slurmctld
root  32103     1  0 11:52 ?      00:00:00 /usr/sbin/slurmd -c
root  32125 30346  0 11:53 pts/0  00:00:00 grep slurm

So I have tried srun again but still get this error:

!srun
srun /omaha-beach/test.sh
srun: Required node not available (down or drained)
srun: job 64 queued and waiting for resources

Have you got any idea of the problem?

Thanks,
Siva

----- Original Message -----
From: Nikita Burtsev nikita.burt...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Thursday, 22 August 2013 09:59:52
Subject: [slurm-dev] Re: Required node not available (down or drained)

You need to have slurmd running on all nodes that will execute jobs, so you should start it with the init script.

-- Nikita Burtsev

On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:

"check if the slurmd daemon is running with the command ps -el | grep slurmd". Nothing happens with ps -el...

root@VM-667:~# ps -el | grep slurmd

From: Nikita Burtsev nikita.burt...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 18:58:52
Subject: [slurm-dev] Re: Required node not available (down or drained)

slurmctld is the management process, and since you have access to squeue/sinfo information it is running just fine. You need to check whether slurmd (which is the agent part) is running on your nodes, i.e. VM-[669-671].

-- Nikita Burtsev

On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:

I have tried:

/etc/init.d/slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
/usr/sbin/slurmctld already running.

And:

scontrol show slurmd
scontrol: error: slurm_slurmd_info: Connection refused
slurm_load_slurmd_status: Connection refused

Hum, how do I proceed to repair that problem?

From: Danny Auble d...@schedmd.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 15:36:53
Subject: [slurm-dev] Re: Required node not available (down or drained)

Check your slurmd log. It doesn't appear the slurmd is running.

Sivasangari Nandy sivasangari.na...@irisa.fr wrote:

Hello,

I'm trying to use Slurm for the first time, and I think I have a problem with the nodes. I have this message when I use squeue:

root@VM-667:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (ReqNodeNotAvail)

or this one with another squeue:

root@VM-671:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (Resources)

sinfo gives me:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
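Since the test has been reduced to the single machine VM-667, the quickest check is what state the controller holds that one node in; "Required node not available (down or drained)" means the node entry itself is unusable even though both daemons are up. Standard commands, node name taken from the thread:

sinfo                          # node state as the controller sees it
scontrol show node VM-667      # look at the State= and Reason= fields
# if it shows DOWN or DRAINED although slurmd is running:
scontrol update NodeName=VM-667 State=RESUME
squeue                         # job 64 should leave PD once the node is idle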
[slurm-dev] Re: Required node not available (down or drained)
That's what I did yesterday, actually:

/etc/init.d/slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
/usr/sbin/slurmctld already running.

----- Original Message -----
From: Nikita Burtsev nikita.burt...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Thursday, 22 August 2013 09:59:52
Subject: [slurm-dev] Re: Required node not available (down or drained)

You need to have slurmd running on all nodes that will execute jobs, so you should start it with the init script.

-- Nikita Burtsev

On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:

"check if the slurmd daemon is running with the command ps -el | grep slurmd". Nothing happens with ps -el...

root@VM-667:~# ps -el | grep slurmd

From: Nikita Burtsev nikita.burt...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 18:58:52
Subject: [slurm-dev] Re: Required node not available (down or drained)

slurmctld is the management process, and since you have access to squeue/sinfo information it is running just fine. You need to check whether slurmd (which is the agent part) is running on your nodes, i.e. VM-[669-671].

-- Nikita Burtsev

On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:

I have tried:

/etc/init.d/slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
/usr/sbin/slurmctld already running.

And:

scontrol show slurmd
scontrol: error: slurm_slurmd_info: Connection refused
slurm_load_slurmd_status: Connection refused

Hum, how do I proceed to repair that problem?

From: Danny Auble d...@schedmd.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 15:36:53
Subject: [slurm-dev] Re: Required node not available (down or drained)

Check your slurmd log. It doesn't appear the slurmd is running.

Sivasangari Nandy sivasangari.na...@irisa.fr wrote:

Hello,

I'm trying to use Slurm for the first time, and I think I have a problem with the nodes. I have this message when I use squeue:

root@VM-667:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (ReqNodeNotAvail)

or this one with another squeue:

root@VM-671:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (Resources)

sinfo gives me:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
SLURM-de*    up  infinite     3  down VM-[669-671]

I have already used Slurm once with the same configuration and I was able to run my job. But now, the second time, I always get:

srun: Required node not available (down or drained)
srun: job 51 queued and waiting for resources

Thanks in advance for your help,
Siva

--
Siva sangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69, Bureau: D152
[slurm-dev] Re: Required node not available (down or drained)
So I have done /etc/init.d/slurm-llnl start on each node and tried again, but I have:

JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (Resources)
   53 SLURM-deb  test.sh root PD 0:00     1 (Resources)

and I have this when I try:

root@VM-671:~# ps -el | grep slurmd
5 S 0 8223 1 0 80 0 - 22032 - ? 00:00:01 slurmd

----- Original Message -----
From: Nikita Burtsev nikita.burt...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Thursday, 22 August 2013 09:59:52
Subject: [slurm-dev] Re: Required node not available (down or drained)

You need to have slurmd running on all nodes that will execute jobs, so you should start it with the init script.

-- Nikita Burtsev

On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:

"check if the slurmd daemon is running with the command ps -el | grep slurmd". Nothing happens with ps -el...

root@VM-667:~# ps -el | grep slurmd

From: Nikita Burtsev nikita.burt...@gmail.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 18:58:52
Subject: [slurm-dev] Re: Required node not available (down or drained)

slurmctld is the management process, and since you have access to squeue/sinfo information it is running just fine. You need to check whether slurmd (which is the agent part) is running on your nodes, i.e. VM-[669-671].

-- Nikita Burtsev

On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:

I have tried:

/etc/init.d/slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
/usr/sbin/slurmctld already running.

And:

scontrol show slurmd
scontrol: error: slurm_slurmd_info: Connection refused
slurm_load_slurmd_status: Connection refused

Hum, how do I proceed to repair that problem?

From: Danny Auble d...@schedmd.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 15:36:53
Subject: [slurm-dev] Re: Required node not available (down or drained)

Check your slurmd log. It doesn't appear the slurmd is running.

Sivasangari Nandy sivasangari.na...@irisa.fr wrote:

Hello,

I'm trying to use Slurm for the first time, and I think I have a problem with the nodes. I have this message when I use squeue:

root@VM-667:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (ReqNodeNotAvail)

or this one with another squeue:

root@VM-671:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (Resources)

sinfo gives me:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
SLURM-de*    up  infinite     3  down VM-[669-671]

I have already used Slurm once with the same configuration and I was able to run my job. But now, the second time, I always get:

srun: Required node not available (down or drained)
srun: job 51 queued and waiting for resources

Thanks in advance for your help,
Siva

--
Siva sangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69, Bureau: D152
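slurmd is now confirmed running on VM-671, yet both jobs stay pending with reason (Resources), so the next step is to ask the controller why it cannot place them; the nodes may still be flagged down from before, or already occupied. Standard commands, job IDs taken from this thread:

scontrol show job 50 | grep -E 'JobState|Reason'
sinfo        # with SelectType=select/linear every job gets a whole node,
             # so three 1-CPU nodes can run at most three such jobs at a time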
[slurm-dev] Required node not available (down or drained)
Hello,

I'm trying to use Slurm for the first time, and I think I have a problem with the nodes. I have this message when I use squeue:

root@VM-667:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (ReqNodeNotAvail)

or this one with another squeue:

root@VM-671:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (Resources)

sinfo gives me:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
SLURM-de*    up  infinite     3  down VM-[669-671]

I have already used Slurm once with the same configuration and I was able to run my job. But now, the second time, I always get:

srun: Required node not available (down or drained)
srun: job 51 queued and waiting for resources

Thanks in advance for your help,
Siva
[slurm-dev] Re: Required node not available (down or drained)
I have tried:

/etc/init.d/slurm-llnl start
[ ok ] Starting slurm central management daemon: slurmctld.
/usr/sbin/slurmctld already running.

And:

scontrol show slurmd
scontrol: error: slurm_slurmd_info: Connection refused
slurm_load_slurmd_status: Connection refused

Hum, how do I proceed to repair that problem?

----- Original Message -----
From: Danny Auble d...@schedmd.com
To: slurm-dev slurm-dev@schedmd.com
Sent: Wednesday, 21 August 2013 15:36:53
Subject: [slurm-dev] Re: Required node not available (down or drained)

Check your slurmd log. It doesn't appear the slurmd is running.

Sivasangari Nandy sivasangari.na...@irisa.fr wrote:

Hello,

I'm trying to use Slurm for the first time, and I think I have a problem with the nodes. I have this message when I use squeue:

root@VM-667:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (ReqNodeNotAvail)

or this one with another squeue:

root@VM-671:~# squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
   50 SLURM-deb  test.sh root PD 0:00     1 (Resources)

sinfo gives me:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
SLURM-de*    up  infinite     3  down VM-[669-671]

I have already used Slurm once with the same configuration and I was able to run my job. But now, the second time, I always get:

srun: Required node not available (down or drained)
srun: job 51 queued and waiting for resources

Thanks in advance for your help,
Siva

--
Siva sangari NANDY - Plate-forme GenOuest, IRISA-INRIA, Campus de Beaulieu, 263 Avenue du Général Leclerc, 35042 Rennes cedex, France. Tél: +33 (0) 2 99 84 25 69, Bureau: D152
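The init script on VM-667 only reports slurmctld, and "scontrol show slurmd" queries the slurmd on the local host, so the Connection refused answer simply means no slurmd is running there. Assuming the compute nodes are VM-[669-671] as in the posted config, slurmd has to be started on each of them; a sketch with the same Debian-style init script:

# on each of VM-669, VM-670 and VM-671:
/etc/init.d/slurm-llnl start
ps -el | grep slurmd                        # confirm it stays up
tail -n 50 /var/log/slurm-llnl/slurmd.log
# then, from any host:
sinfo                                       # the nodes should move from down to idle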