[slurm-dev] Re: salloc: Required job not available (Down or drained)

2013-10-03 Thread Sivasangari Nandy
Hello, 

No problem, I had this issue at first too and now it's working!
Run "sview", then click on your nodes and change their state to "idle".
This will not be permanent; after some time the nodes will go down again.
So you must check the number of CPUs physically available by running: cat /proc/cpuinfo
and then set CPUs= and Sockets= accordingly in the last part of your config (/etc/slurm-llnl/slurm.conf).
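For reference, a minimal command-line sketch of those two steps (the node name below is just the one from this thread; adapt it to your own):

grep -c ^processor /proc/cpuinfo               # number of CPUs the kernel actually sees
scontrol update NodeName=node33 State=RESUME   # return a down node to service, same effect as flipping it in sview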

With that change, my nodes are always available now.
Best of luck,

- Original Message -

> De: "Arjun J Rao" 
> À: "slurm-dev" 
> Envoyé: Jeudi 3 Octobre 2013 12:18:56
> Objet: [slurm-dev] Re: salloc: Required job not available (Down or
> drained)


> sinfo says
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 down node33

> REASON USER TIMESTAMP NODELIST
> Node unexpectedly rebooted root 2013-10-03T10:06:22 node33

> On Thu, Oct 3, 2013 at 3:22 PM, Andy Riebs < andy.ri...@hp.com >
> wrote:

> > What does `sinfo` and `sinfo -R` say?
> 

> > (Please be sure to reply to the list, not just to me.)
> 

> > Andy
> 

> > On 10/03/2013 05:28 AM, Arjun J Rao wrote:
> 

> > > I'm running SLURM on a single computer in Scientific Linux 6.4
> > 
> 
> > > Till yesterday, I was able to run salloc -N1 bash and it would
> > > allocate the resources to me immediately. But today I get errors
> > > :
> > 
> 

> > > salloc: Required node not available (down or drained)
> > 
> 
> > > salloc: Pending job allocation 117
> > 
> 
> > > salloc: job 117 queued and waiting for resources
> > 
> 

> > > My computer is named localhost and also known by the alias node33
> > > (33
> > > has no significance here, it just had to be some number)
> > 
> 

> > > Version of slurm is 2.6.2 and i installed from source
> > 
> 

> > > My slurm.conf file is attached
> > 
> 

> > --
> 
> > Andy Riebs
> 
> > Hewlett-Packard Company
> 
> > High Performance Computing
> 
> > +1 404 648 9024
> 
> > My opinions are not necessarily those of HP
> 
-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 


[slurm-dev] sacct error

2013-09-24 Thread Sivasangari Nandy

root@VM-667:/omaha-beach/workflow# sacct 208 
SLURM accounting storage is disabled 


I don't understand why I get this error ... 
I've attached my conf file. 
Have you run into this problem before? 
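For context, the attached slurm.conf (shown in full further down in this archive) sets AccountingStorageType=accounting_storage/none, which is exactly what this message reports. A minimal sketch of the kind of change that enables basic accounting without running slurmdbd (the log path is only an example; restart slurmctld and slurmd afterwards):

AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm-llnl/slurm_jobacct.log
JobAcctGatherType=jobacct_gather/linux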







[slurm-dev] slurm-dev mpich2 for job running in all nodes

2013-09-24 Thread Sivasangari Nandy

I have tried salloc directly: "salloc -N 2 mpiexec /path/my_application"
With "sview" I can see that 2 nodes are allocated simultaneously, which is great, but my
output log shows that my application ran twice.
However, I don't want it to run twice; I just want to use the 2 nodes to run my
single application faster... :(
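A minimal sketch for comparison, assuming the goal is a single run whose parallelism comes from bowtie2's own -p threads rather than from multiple copies (the two runs observed above are consistent with one copy of the non-MPI script being started per allocated node):

salloc -N 1 srun -n 1 /omaha-beach/workflow/bowtie2.sh    # exactly one task, on one node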

Does anyone have an idea?
Thanks in advance, 
Siva


- Original Message -
> From: "Moe Jette" 
> To: "slurm-dev" 
> Sent: Friday, September 20, 2013 17:27:53
> Subject: [slurm-dev] Re: mpich2 for job running in all nodes
> 
> 
> Slurm use with the various MPI distributions is documented here:
> 
> http://slurm.schedmd.com/mpi_guide.html#mpich2
> 
> 
> Quoting Sivasangari Nandy :
> 
> > Hi,
> >
> >
> > Does anyone know how to use the mpich2 command pls ?
> > I wanted to execute Bowtie2 (a mapping tool) using my three nodes :
> > VM-669, VM-670 and VM-671
> > and with this command :
> >
> >
> > srun mpiexec -n 3 -machinefile Mname.txt
> > /omaha-beach/workflow/bowtie2.sh
> >
> >
> >
> > But I got this error :
> >
> >
> > srun: error: Only allocated 1 nodes asked for 3
> >
> >
> > -n for the number of processor (I got one processor per node cf
> > joined file for slurm conf file)
> > Mname.txt is my file with nodes name VM ...
> > bowtie2.sh is like this :
> >
> >
> >
> > #!/bin/bash
> >
> > bowtie2 -p 3 -x Indexed_Bowtie -q -I 0 -X 1000 -1 r1.fastq -2
> > r2.fastq -S bowtie2.sam
> >
> >
> >
> >
> > Thanks for your help,
> > Siva
> 
> 
> 

-- 
Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 

[slurm-dev] mpich2 for job running in all nodes

2013-09-20 Thread Sivasangari Nandy
Hi, 


Does anyone know how to use the mpich2 command, please? 
I wanted to run Bowtie2 (a mapping tool) on my three nodes VM-669, 
VM-670 and VM-671 
with this command: 


srun mpiexec -n 3 -machinefile Mname.txt /omaha-beach/workflow/bowtie2.sh 



But I got this error : 


srun: error: Only allocated 1 nodes asked for 3 


-n is the number of processes (I have one processor per node; see the attached 
slurm conf file). 
Mname.txt is my file with the node names (VM-...). 
bowtie2.sh looks like this: 



#!/bin/bash 

bowtie2 -p 3 -x Indexed_Bowtie -q -I 0 -X 1000 -1 r1.fastq -2 r2.fastq -S bowtie2.sam 
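For reference, a minimal sketch of how to get the three-node allocation first, following the mpi_guide link posted earlier in this archive (whether -machinefile is still needed under a Slurm allocation depends on how MPICH2 was built, so treat the exact form as an assumption). Note that since bowtie2.sh is a plain shell script rather than an MPI program, -n 3 will start three independent copies of it:

salloc -N 3 mpiexec -n 3 -machinefile Mname.txt /omaha-beach/workflow/bowtie2.sh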




Thanks for your help, 
Siva 



[slurm-dev] RE: can't make "sacct"

2013-09-19 Thread Sivasangari Nandy

I get this error when I run sacct with a job ID: SLURM accounting storage is disabled 

Here is my slurmctld log file (tail -f /var/log/slurm-llnl/slurmctld.log): 



[2013-09-19T10:58:20] sched: _slurm_rpc_allocate_resources JobId=179 NodeList=VM-669 usec=70 
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.0 VM-669 usec=197 
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.1 VM-669 usec=187 
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.1 usec=17 
[2013-09-19T10:59:00] completing job 179 
[2013-09-19T10:59:00] sched: job_complete for JobId=179 successful 
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.0 usec=8 




And my conf file (/etc/slurm-llnl/slurm.conf): 



# slurm.conf file generated by configurator.html. 
# Put this file on all nodes of your cluster. 
# See the slurm.conf man page for more information. 
# 
ControlMachine=VM-667 
ControlAddr=192.168.2.26 
#BackupController= 
#BackupAddr= 
# 
AuthType=auth/munge 
CacheGroups=0 
#CheckpointType=checkpoint/none 
CryptoType=crypto/munge 
#DisableRootJobs=NO 
#EnforcePartLimits=NO 
#Epilog= 
#PrologSlurmctld= 
#FirstJobId=1 
#MaxJobId=99 
#GresTypes= 
#GroupUpdateForce=0 
#GroupUpdateTime=600 
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 
#JobCredentialPrivateKey= 
#JobCredentialPublicCertificate= 
#JobFileAppend=0 
#JobRequeue=1 
#JobSubmitPlugins=1 
#KillOnBadExit=0 
#Licenses=foo*4,bar 
#MailProg=/usr/bin/mail 
#MaxJobCount=5000 
#MaxStepCount=4 
#MaxTasksPerNode=128 
MpiDefault=none 
#MpiParams=ports=#-# 
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/pgid 
#Prolog= 
#PrologSlurmctld= 
#PropagatePrioProcess=0 
#PropagateResourceLimits= 
#PropagateResourceLimitsExcept= 
ReturnToService=1 
#SallocDefaultCommand= 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 
SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 
SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 
SlurmUser=slurm 
#SrunEpilog= 
#SrunProlog= 
StateSaveLocation=/var/lib/slurm-llnl/slurmctld 
SwitchType=switch/none 
#TaskEpilog= 
TaskPlugin=task/none 
#TaskPluginParam= 
#TaskProlog= 
#TopologyPlugin=topology/tree 
#TmpFs=/tmp 
#TrackWCKey=no 
#TreeWidth= 
#UnkillableStepProgram= 
#UsePAM=0 
# 
# 
# TIMERS 
#BatchStartTimeout=10 
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
#HealthCheckInterval=0 
#HealthCheckProgram= 
InactiveLimit=0 
KillWait=30 
#MessageTimeout=10 
#ResvOverRun=0 
MinJobAge=300 
#OverTimeLimit=0 
SlurmctldTimeout=120 
SlurmdTimeout=300 
#UnkillableStepTimeout=60 
#VSizeFactor=0 
Waittime=0 
# 
# 
# SCHEDULING 
#DefMemPerCPU=0 
FastSchedule=1 
#MaxMemPerCPU=0 
#SchedulerRootFilter=1 
#SchedulerTimeSlice=30 
SchedulerType=sched/backfill 
SchedulerPort=7321 
SelectType=select/linear 
#SelectTypeParameters= 
# 
# 
# JOB PRIORITY 
#PriorityType=priority/basic 
#PriorityDecayHalfLife= 
#PriorityCalcPeriod= 
#PriorityFavorSmall= 
#PriorityMaxAge= 
#PriorityUsageResetPeriod= 
#PriorityWeightAge= 
#PriorityWeightFairshare= 
#PriorityWeightJobSize= 
#PriorityWeightPartition= 
#PriorityWeightQOS= 
# 
# 
# LOGGING AND ACCOUNTING 
#AccountingStorageEnforce=0 
#AccountingStorageHost= 
#AccountingStorageLoc= 
#AccountingStoragePass= 
#AccountingStoragePort= 
AccountingStorageType=accounting_storage/none 
#AccountingStorageUser= 
AccountingStoreJobComment=YES 
ClusterName=cluster 
#DebugFlags= 
#JobCompHost= 
#JobCompLoc= 
#JobCompPass= 
#JobCompPort= 
JobCompType=jobcomp/none 
#JobCompUser= 
JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
SlurmctldDebug=3 
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 
SlurmdDebug=3 
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 
#SlurmSchedLogFile= 
#SlurmSchedLogLevel= 
# 
# 
# POWER SAVE SUPPORT FOR IDLE NODES (optional) 
#SuspendProgram= 
#ResumeProgram= 
#SuspendTimeout= 
#ResumeTimeout= 
#ResumeRate= 
#SuspendExcNodes= 
#SuspendExcParts= 
#SuspendRate= 
#SuspendTime= 
# 
# 
# COMPUTE NODES 
NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN 
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP 




- Original Message -


From: "Nancy Kritkausky"  
To: "slurm-dev"  
Sent: Wednesday, September 18, 2013 18:29:56 
Subject: [slurm-dev] RE: can't make "sacct" 



Hello Siva, 
There is not a lot of information to go on from your email. What type of 
accounting do you have configured? What does your slurm.conf and slurmdbd.conf 
file look like? I would also suggest looking at your slurmdbd.log and 
slurmd.log to see what is going on, or sending them to the dev list. 
Nancy 



From: Sivasangari Nandy [mailto:sivasangari.na...@irisa.fr] 
Sent: Wednesday, September 18, 2013 9:02 AM 
To: slurm-dev 
Subject: [slurm-dev] can't make "sacct" 


Hi, 



Hey does anyone know why my "sacct" command doesn't work ? 

I got this : 

root@VM-667:/omaha-beach/wo

[slurm-dev] can't make "sacct"

2013-09-18 Thread Sivasangari Nandy
Hi, 


Does anyone know why my "sacct" command doesn't work? 
I get this: 


root@VM-667:/omaha-beach/workflow# sacct 


JobID JobName Partition Account AllocCPUS State ExitCode 
 -- -- -- -- --  
/var/log/slurm_jobacct.log: No such file or directory 

Thanks in advance, 
Siva 

-- 

[slurm-dev] Re: Failed to allocate resource : Unable to contact slurm controller

2013-09-17 Thread Sivasangari Nandy
Hello, 

Try this: 
/etc/init.d/slurm-llnl start 

/etc/init.d/slurm-llnl stop 

/etc/init.d/slurm-llnl startclean 
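If the controller still cannot be reached after that, a minimal diagnostic sketch (paths follow the Debian slurm-llnl packaging used in this thread):

scontrol ping                              # ask whether slurmctld answers
ps -ef | grep slurm                        # check that slurmctld/slurmd are actually running
tail /var/log/slurm-llnl/slurmctld.log     # look for hostname/port errors on the controller side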

- Original Message -

> From: "Arjun J Rao" 
> To: "slurm-dev" 
> Sent: Tuesday, September 17, 2013 11:33:53
> Subject: [slurm-dev] Failed to allocate resource : Unable to contact
> slurm controller


> I want to run SLURM on a single machine as a proof of concept to
> run some trivial MPI programs on my machine.
> I keep getting the message :
> Failed to allocate resources ; Unable to contact slurm controller
> In my slurm.conf file, I have named the ControlMachine as localhost
> and ControlAddr as 127.0.0.1
> Compute Node name as localhost and NodeAddr as 127.0.0.1 too.

> What am i doing wrong ?

> Scientific Linux 6.4
> 64-bit

-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 


[slurm-dev] mpich2 to use multiple machines

2013-09-17 Thread Sivasangari Nandy
Hello, 


I have a small problem with mpich2 for Slurm. I want to run my jobs on more than 
one machine (for this test I just wanted VM-669, so I put only VM-669 in the file 
Mname.txt). 
From the master (VM-667) I run: 


mpiexec -machinefile Mname.txt -np 1 /bin/sleep 60 

but I get these errors: 



[proxy:0:0@VM-669] launch_procs (./pm/pmiserv/pmip_cb.c:687): unable to change wdir to /root/omaha-beach (No such file or directory) 
[proxy:0:0@VM-669] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:935): launch_procs returned error 
[proxy:0:0@VM-669] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status 
[proxy:0:0@VM-669] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event 
[mpiexec@VM-667] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed 
[mpiexec@VM-667] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status 
[mpiexec@VM-667] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event 
[mpiexec@VM-667] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion 
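For what it's worth, the first line of the trace is the actual failure: Hydra tries to chdir on VM-669 to the directory mpiexec was started from (/root/omaha-beach), which does not exist there. A minimal sketch of two ways around that (-wdir is Hydra's working-directory option; treat its exact spelling for this MPICH2 build as an assumption):

cd /tmp && mpiexec -machinefile Mname.txt -np 1 /bin/sleep 60
mpiexec -machinefile Mname.txt -np 1 -wdir /tmp /bin/sleep 60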

Do you have any idea? 
Thanks in advance, 
Siva 


-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 



[slurm-dev] Re: nodes often down

2013-09-13 Thread Sivasangari Nandy

Wow, thanks a lot Hongjia Cao! It's working now with CPUs=1 and Sockets=1. 
Indeed, physically (cat /proc/cpuinfo) there is just 1 CPU, so with 2 CPUs 
in the slurm conf file it did not work.
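For reference, the matching node definition (it is the same line that appears in the corrected slurm.conf elsewhere in this archive):

NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN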

thanks again, 
bye.





- Original Message -
> From: "Hongjia Cao" 
> To: "slurm-dev" 
> Sent: Friday, September 13, 2013 06:24:53
> Subject: [slurm-dev] Re: nodes often down
> 
> 
> I guess that you have the wrong number of CPUs per node configured.
> Please try changing the configuration file. Or you may try FastSchedule=1.
> 
> On Tue, 2013-09-10 at 23:54 -0700, Sivasangari Nandy wrote:
> > root@VM-667:/omaha-beach/workflow# sinfo -R
> > REASON   USER  TIMESTAMP   NODELIST
> > Low CPUs slurm 2013-09-10T19:38:37 VM-[669-671]
> > 
> > 
> > Then when i changed nodes to idle, and type sinfo -R,
> > there is nothing.
> > I wanted to know how to have permanently idle nodes.
> > 
> > 
> > __
> > De: "曹宏嘉" 
> > À: "slurm-dev" 
> > Envoyé: Mercredi 11 Septembre 2013 02:26:54
> > Objet: [slurm-dev] Re: nodes often down
> > 
> > You may run "sinfo -R" to see the reason that the node is
> > left
> > down. ReturnToService=1 can not recover all down nodes.
> > 
> > 
> > - Original Message -
> > From: "Sivasangari Nandy"
> > 
> > Sent: 2013-09-10 20:38:43 (Tuesday)
> > To: slurm-dev 
> > Cc: slurm-dev 
> > Subject: [slurm-dev] Re: nodes often down
> > 
> > No it was like that at first so by default.
> > And yea I've restarted slurm but no changes, after
> > 10
> > min nodes are all down.
> > 
> > 
> > Here my conf file if needed :
> > 
> > 
> > # slurm.conf file generated by configurator.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > ControlMachine=VM-667
> > ControlAddr=192.168.2.26
> > #BackupController=
> > #BackupAddr=
> > #
> > AuthType=auth/munge
> > CacheGroups=0
> > #CheckpointType=checkpoint/none
> > CryptoType=crypto/munge
> > #DisableRootJobs=NO
> > #EnforcePartLimits=NO
> > #Epilog=
> > #PrologSlurmctld=
> > #FirstJobId=1
> > #MaxJobId=99
> > #GresTypes=
> > #GroupUpdateForce=0
> > #GroupUpdateTime=600
> > JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> > #JobCredentialPrivateKey=
> > #JobCredentialPublicCertificate=
> > #JobFileAppend=0
> > #JobRequeue=1
> > #JobSubmitPlugins=1
> > #KillOnBadExit=0
> > #Licenses=foo*4,bar
> > #MailProg=/usr/bin/mail
> > #MaxJobCount=5000
> > #MaxStepCount=4
> > #MaxTasksPerNode=128
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > #PluginDir=
> > #PlugStackConfig=
> > #PrivateData=jobs
> > ProctrackType=proctrack/pgid
> > #Prolog=
> > #PrologSlurmctld=
> > #PropagatePrioProcess=0
> > #PropagateResourceLimits=
> > #PropagateResourceLimitsExcept=
> > ReturnToService=1
> > #SallocDefaultCommand=
> > SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> > SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> > SlurmdPort=6818
> > SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> > SlurmUser=slurm
> > #SrunEpilog=
> > #SrunProlog=
> >

[slurm-dev] Re: nodes often down

2013-09-10 Thread Sivasangari Nandy
root@VM-667:/omaha-beach/workflow# sinfo -R 
REASON USER TIMESTAMP NODELIST 
Low CPUs slurm 2013-09-10T19:38:37 VM-[669-671] 

Then, after I change the nodes to idle and type sinfo -R, 
there is nothing. 
I would like to know how to keep the nodes permanently idle. 
- Original Message -

> From: "曹宏嘉" 
> To: "slurm-dev" 
> Sent: Wednesday, September 11, 2013 02:26:54
> Subject: [slurm-dev] Re: nodes often down

> You may run "sinfo -R" to see the reason that the node is left down.
> ReturnToService=1 cannot recover all down nodes.

> > - Original Message -
> 
> > From: "Sivasangari Nandy" < sivasangari.na...@irisa.fr >
> 
> > Sent: 2013-09-10 20:38:43 (Tuesday)
> 
> > To: slurm-dev < slurm-dev@schedmd.com >
> 
> > Cc: slurm-dev < slurm-dev@schedmd.com >
> 
> > Subject: [slurm-dev] Re: nodes often down
> 

> > No it was like that at first so by default.
> 
> > And yea I've restarted slurm but no changes, after 10 min nodes are
> > all down.
> 

> > Here my conf file if needed :
> 

> > # slurm.conf file generated by configurator.html.
> 
> > # Put this file on all nodes of your cluster.
> 
> > # See the slurm.conf man page for more information.
> 
> > #
> 
> > ControlMachine=VM-667
> 
> > ControlAddr=192.168.2.26
> 
> > #BackupController=
> 
> > #BackupAddr=
> 
> > #
> 
> > AuthType=auth/munge
> 
> > CacheGroups=0
> 
> > #CheckpointType=checkpoint/none
> 
> > CryptoType=crypto/munge
> 
> > #DisableRootJobs=NO
> 
> > #EnforcePartLimits=NO
> 
> > #Epilog=
> 
> > #PrologSlurmctld=
> 
> > #FirstJobId=1
> 
> > #MaxJobId=99
> 
> > #GresTypes=
> 
> > #GroupUpdateForce=0
> 
> > #GroupUpdateTime=600
> 
> > JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> 
> > #JobCredentialPrivateKey=
> 
> > #JobCredentialPublicCertificate=
> 
> > #JobFileAppend=0
> 
> > #JobRequeue=1
> 
> > #JobSubmitPlugins=1
> 
> > #KillOnBadExit=0
> 
> > #Licenses=foo*4,bar
> 
> > #MailProg=/usr/bin/mail
> 
> > #MaxJobCount=5000
> 
> > #MaxStepCount=4
> 
> > #MaxTasksPerNode=128
> 
> > MpiDefault=none
> 
> > #MpiParams=ports=#-#
> 
> > #PluginDir=
> 
> > #PlugStackConfig=
> 
> > #PrivateData=jobs
> 
> > ProctrackType=proctrack/pgid
> 
> > #Prolog=
> 
> > #PrologSlurmctld=
> 
> > #PropagatePrioProcess=0
> 
> > #PropagateResourceLimits=
> 
> > #PropagateResourceLimitsExcept=
> 
> > ReturnToService=1
> 
> > #SallocDefaultCommand=
> 
> > SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> 
> > SlurmctldPort=6817
> 
> > SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> 
> > SlurmdPort=6818
> 
> > SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> 
> > SlurmUser=slurm
> 
> > #SrunEpilog=
> 
> > #SrunProlog=
> 
> > StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> 
> > SwitchType=switch/none
> 
> > #TaskEpilog=
> 
> > TaskPlugin=task/none
> 
> > #TaskPluginParam=
> 
> > #TaskProlog=
> 
> > #TopologyPlugin=topology/tree
> 
> > #TmpFs=/tmp
> 
> > #TrackWCKey=no
> 
> > #TreeWidth=
> 
> > #UnkillableStepProgram=
> 
> > #UsePAM=0
> 
> > #
> 
> > #
> 
> > # TIMERS
> 
> > #BatchStartTimeout=10
> 
> > #CompleteWait=0
> 
> > #EpilogMsgTime=2000
> 
> > #GetEnvTimeout=2
> 
> > #HealthCheckInterval=0
> 
> > #HealthCheckProgram=
> 
> > InactiveLimit=0
> 
> > KillWait=30
> 
> > #MessageTimeout=10
> 
> > #ResvOverRun=0
> 
> > MinJobAge=300
> 
> > #OverTimeLimit=0
> 
> > SlurmctldTimeout=120
> 
> > SlurmdTimeout=300
> 
> > #UnkillableStepTimeout=60
> 
> > #VSizeFactor=0
> 
> > Waittime=0
> 
> > #
> 
> > #
> 
> > # SCHEDULING
> 
> > #DefMemPerCPU=0
> 
> > FastSchedule=1
> 
> > #MaxMemPerCPU=0
> 
> > #SchedulerRootFilter=1
> 
> > #SchedulerTimeSlice=30
> 
> > SchedulerType=sched/backfill
> 
> > SchedulerPort=7321
> 
> > SelectType=select/linear
> 
> > #SelectTypeParameters=
> 
> > #
> 
> > #
> 
> > # JOB PRIORITY
> 
> > #PriorityType=priority/basic
> 
> > #PriorityDecayHalfLife=
> 
> > #PriorityCalcPeriod=
> 
> > #PriorityFavorSmall=
>

[slurm-dev] Re: nodes often down

2013-09-10 Thread Sivasangari Nandy
No, it was like that at first, so it's the default. 
And yes, I've restarted Slurm but no change; after 10 minutes all the nodes are down again. 

Here is my conf file if needed: 

# slurm.conf file generated by configurator.html. 
# Put this file on all nodes of your cluster. 
# See the slurm.conf man page for more information. 
# 
ControlMachine=VM-667 
ControlAddr=192.168.2.26 
#BackupController= 
#BackupAddr= 
# 
AuthType=auth/munge 
CacheGroups=0 
#CheckpointType=checkpoint/none 
CryptoType=crypto/munge 
#DisableRootJobs=NO 
#EnforcePartLimits=NO 
#Epilog= 
#PrologSlurmctld= 
#FirstJobId=1 
#MaxJobId=99 
#GresTypes= 
#GroupUpdateForce=0 
#GroupUpdateTime=600 
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 
#JobCredentialPrivateKey= 
#JobCredentialPublicCertificate= 
#JobFileAppend=0 
#JobRequeue=1 
#JobSubmitPlugins=1 
#KillOnBadExit=0 
#Licenses=foo*4,bar 
#MailProg=/usr/bin/mail 
#MaxJobCount=5000 
#MaxStepCount=4 
#MaxTasksPerNode=128 
MpiDefault=none 
#MpiParams=ports=#-# 
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/pgid 
#Prolog= 
#PrologSlurmctld= 
#PropagatePrioProcess=0 
#PropagateResourceLimits= 
#PropagateResourceLimitsExcept= 
ReturnToService=1 
#SallocDefaultCommand= 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 
SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 
SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 
SlurmUser=slurm 
#SrunEpilog= 
#SrunProlog= 
StateSaveLocation=/var/lib/slurm-llnl/slurmctld 
SwitchType=switch/none 
#TaskEpilog= 
TaskPlugin=task/none 
#TaskPluginParam= 
#TaskProlog= 
#TopologyPlugin=topology/tree 
#TmpFs=/tmp 
#TrackWCKey=no 
#TreeWidth= 
#UnkillableStepProgram= 
#UsePAM=0 
# 
# 
# TIMERS 
#BatchStartTimeout=10 
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
#HealthCheckInterval=0 
#HealthCheckProgram= 
InactiveLimit=0 
KillWait=30 
#MessageTimeout=10 
#ResvOverRun=0 
MinJobAge=300 
#OverTimeLimit=0 
SlurmctldTimeout=120 
SlurmdTimeout=300 
#UnkillableStepTimeout=60 
#VSizeFactor=0 
Waittime=0 
# 
# 
# SCHEDULING 
#DefMemPerCPU=0 
FastSchedule=1 
#MaxMemPerCPU=0 
#SchedulerRootFilter=1 
#SchedulerTimeSlice=30 
SchedulerType=sched/backfill 
SchedulerPort=7321 
SelectType=select/linear 
#SelectTypeParameters= 
# 
# 
# JOB PRIORITY 
#PriorityType=priority/basic 
#PriorityDecayHalfLife= 
#PriorityCalcPeriod= 
#PriorityFavorSmall= 
#PriorityMaxAge= 
#PriorityUsageResetPeriod= 
#PriorityWeightAge= 
#PriorityWeightFairshare= 
#PriorityWeightJobSize= 
#PriorityWeightPartition= 
#PriorityWeightQOS= 
# 
# 
# LOGGING AND ACCOUNTING 
#AccountingStorageEnforce=0 
#AccountingStorageHost= 
#AccountingStorageLoc= 
#AccountingStoragePass= 
#AccountingStoragePort= 
AccountingStorageType=accounting_storage/none 
#AccountingStorageUser= 
AccountingStoreJobComment=YES 
ClusterName=cluster 
#DebugFlags= 
#JobCompHost= 
#JobCompLoc= 
#JobCompPass= 
#JobCompPort= 
JobCompType=jobcomp/none 
#JobCompUser= 
JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
SlurmctldDebug=3 
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 
SlurmdDebug=3 
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 
#SlurmSchedLogFile= 
#SlurmSchedLogLevel= 
# 
# 
# POWER SAVE SUPPORT FOR IDLE NODES (optional) 
#SuspendProgram= 
#ResumeProgram= 
#SuspendTimeout= 
#ResumeTimeout= 
#ResumeRate= 
#SuspendExcNodes= 
#SuspendExcParts= 
#SuspendRate= 
#SuspendTime= 
# 
# 
# COMPUTE NODES 
NodeName=VM-[669-671] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN 
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE State=UP 
- Original Message -

> From: "Alan V. Cowles" 
> To: "slurm-dev" 
> Cc: "Sivasangari Nandy" 
> Sent: Tuesday, September 10, 2013 13:17:36
> Subject: Re: [slurm-dev] nodes often down

> Siva,

> Was that the default setting you had in place with your original
> config, or a change you made recently to combat the downed nodes
> problem, and did you restart slurm or do a reconfigure to re-read
> the slurm.conf file. I've found some changes don't take effect with
> a reconfigure, and you have to restart.

> AC

> On 09/10/2013 04:01 AM, Sivasangari Nandy wrote:

> > Hello,
> 

> > My nodes are often in "down" states, so I must make a 'sview' and
> > activate all nodes manually to put them 'idle' in order to run
> > jobs.
> 
> > I've seen in the FAQ that I can change the slurm.conf file.
> 

> > " The configuration parameter ReturnToService in slurm.conf
> > controls
> > how DOWN nodes are handled. Set its value to one in order for DOWN
> > nodes to automatically be returned to service once the slurmd
> > daemon
> > registers with a valid node configuration. "
> 

> > However in my file it's already "1" for ReturnToService.
> 

> > advance thanks,
> 
> > Siva
>

[slurm-dev] nodes often down

2013-09-10 Thread Sivasangari Nandy
Hello, 


My nodes are often in the "down" state, so I have to open 'sview' and manually set all 
the nodes to 'idle' in order to run jobs. 
I've seen in the FAQ that I can change the slurm.conf file. 


" The configuration parameter ReturnToService in slurm.conf controls how DOWN 
nodes are handled. Set its value to one in order for DOWN nodes to 
automatically be returned to service once the slurmd daemon registers with a 
valid node configuration. " 


However, ReturnToService is already set to "1" in my file. 


Thanks in advance, 
Siva 

-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 



[slurm-dev] Re: Required node not available (down or drained)

2013-08-26 Thread Sivasangari Nandy
And the log file is not informative 

tail -f /var/log/slurm-llnl/slurmd.log 

... 
[2013-08-26T11:52:16] Slurmd shutdown completing 
[2013-08-26T11:52:56] slurmd version 2.3.4 started 
[2013-08-26T11:52:56] slurmd started on Mon 26 Aug 2013 11:52:56 +0200 
[2013-08-26T11:52:56] Procs=1 Sockets=1 Cores=1 Threads=1 Memory=2012 TmpDisk=9069 Uptime=1122626 

- Original Message -

> From: "Sivasangari Nandy" 
> To: "slurm-dev" 
> Sent: Monday, August 26, 2013 14:28:28
> Subject: Re: [slurm-dev] Re: Required node not available (down or
> drained)

> Hi,

> I have checked some things, now my slurmctld and slurmd are in a
> single machine (using just one node) so the test is easier.
> For that I have modified the conf file : vi
> /etc/slurm-llnl/slurm.conf

> Slurmctld and slurmd are both running, here my ps result :

> root@VM-667:/etc/slurm-llnl# ps -ef | grep slurm
> root 31712 31706 0 11:44 pts/1 00:00:00 tail -f
> /var/log/slurm-llnl/slurmd.log
> slurm 31990 1 0 11:52 ? 00:00:00 /usr/sbin/slurmctld
> root 32103 1 0 11:52 ? 00:00:00 /usr/sbin/slurmd -c
> root 32125 30346 0 11:53 pts/0 00:00:00 grep slurm

> So i have tried srun again but got this error yet:

> !srun
> srun /omaha-beach/test.sh
> srun: Required node not available (down or drained)
> srun: job 64 queued and waiting for resources

> Have you got any idea of the problem ?
> thanks,

> Siva

> - Mail original -

> > De: "Nikita Burtsev" 
> 
> > À: "slurm-dev" 
> 
> > Envoyé: Jeudi 22 Août 2013 09:59:52
> 
> > Objet: [slurm-dev] Re: Required node not available (down or
> > drained)
> 

> > Re: [slurm-dev] Re: Required node not available (down or drained)
> 
> > You need to have slurmd running on all nodes that will execute
> > jobs,
> > so you should start it with init script.
> 

> > --
> 
> > Nikita Burtsev
> 
> > Sent with Sparrow
> 

> > On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:
> 
> > > " check if the slurmd daemon is running with the command " ps -el
> > > |
> > > grep slurmd ". "
> > 
> 

> > > Nothing is happened with ps -el ...
> > 
> 

> > > root@VM-667:~# ps -el | grep slurmd
> > 
> 

> > > > De: "Nikita Burtsev" < nikita.burt...@gmail.com >
> > > 
> > 
> 
> > > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > > 
> > 
> 
> > > > Envoyé: Mercredi 21 Août 2013 18:58:52
> > > 
> > 
> 
> > > > Objet: [slurm-dev] Re: Required node not available (down or
> > > > drained)
> > > 
> > 
> 

> > > > Re: [slurm-dev] Re: Required node not available (down or
> > > > drained)
> > > 
> > 
> 
> > > > slurmctld is the management process and since your have access
> > > > to
> > > > squeue/sinfo information it is running just fine. You need to
> > > > check
> > > > if slurmd (which is the agent part) is running on your nodes,
> > > > i.e.
> > > > VM-[669-671]
> > > 
> > 
> 

> > > > --
> > > 
> > 
> 
> > > > Nikita Burtsev
> > > 
> > 
> 

> > > > On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy
> > > > wrote:
> > > 
> > 
> 
> > > > > I have tried :
> > > > 
> > > 
> > 
> 

> > > > > /etc/init.d/slurm-llnl start
> > > > 
> > > 
> > 
> 

> > > > > [ ok ] Starting slurm central management daemon: slurmctld.
> > > > 
> > > 
> > 
> 
> > > > > /usr/sbin/slurmctld already running.
> > > > 
> > > 
> > 
> 

> > > > > And :
> > > > 
> > > 
> > 
> 

> > > > > scontrol show slurmd
> > > > 
> > > 
> > 
> 

> > > > > scontrol: error: slurm_slurmd_info: Connection refused
> > > > 
> > > 
> > 
> 
> > > > > slurm_load_slurmd_status: Connection refused
> > > > 
> > > 
> > 
> 

> > > > > Hum how to proceed to repair that problem ?
> > > > 
> > > 
> > 
> 

> > > > > > De: "Danny Auble" < d...@schedmd.com >
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > >

[slurm-dev] Re: Required node not available (down or drained)

2013-08-26 Thread Sivasangari Nandy
Hi, 

I have checked a few things; my slurmctld and slurmd are now on a single machine 
(using just one node), so testing is easier. 
For that I modified the conf file: vi /etc/slurm-llnl/slurm.conf 

Slurmctld and slurmd are both running; here is my ps output: 

root@VM-667:/etc/slurm-llnl# ps -ef | grep slurm 
root 31712 31706 0 11:44 pts/1 00:00:00 tail -f /var/log/slurm-llnl/slurmd.log 
slurm 31990 1 0 11:52 ? 00:00:00 /usr/sbin/slurmctld 
root 32103 1 0 11:52 ? 00:00:00 /usr/sbin/slurmd -c 
root 32125 30346 0 11:53 pts/0 00:00:00 grep slurm 

So I tried srun again but I still get this error: 

!srun 
srun /omaha-beach/test.sh 
srun: Required node not available (down or drained) 
srun: job 64 queued and waiting for resources 
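For reference, two commands that show why Slurm is holding the node down (the node name is only an example; use whatever NodeName the modified slurm.conf declares):

sinfo -R                       # the Reason the node was set down
scontrol show node VM-667      # full node state, including CPUs and the down Reason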

Do you have any idea what the problem is? 
Thanks, 

Siva 

- Original Message -

> From: "Nikita Burtsev" 
> To: "slurm-dev" 
> Sent: Thursday, August 22, 2013 09:59:52
> Subject: [slurm-dev] Re: Required node not available (down or drained)

> You need to have slurmd running on all nodes that will execute jobs,
> so you should start it with init script.

> --
> Nikita Burtsev
> Sent with Sparrow

> On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:
> > " check if the slurmd daemon is running with the command " ps -el |
> > grep slurmd ". "
> 

> > Nothing is happened with ps -el ...
> 

> > root@VM-667:~# ps -el | grep slurmd
> 

> > > De: "Nikita Burtsev" < nikita.burt...@gmail.com >
> > 
> 
> > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > 
> 
> > > Envoyé: Mercredi 21 Août 2013 18:58:52
> > 
> 
> > > Objet: [slurm-dev] Re: Required node not available (down or
> > > drained)
> > 
> 

> > > Re: [slurm-dev] Re: Required node not available (down or drained)
> > 
> 
> > > slurmctld is the management process and since your have access to
> > > squeue/sinfo information it is running just fine. You need to
> > > check
> > > if slurmd (which is the agent part) is running on your nodes,
> > > i.e.
> > > VM-[669-671]
> > 
> 

> > > --
> > 
> 
> > > Nikita Burtsev
> > 
> 

> > > On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy
> > > wrote:
> > 
> 
> > > > I have tried :
> > > 
> > 
> 

> > > > /etc/init.d/slurm-llnl start
> > > 
> > 
> 

> > > > [ ok ] Starting slurm central management daemon: slurmctld.
> > > 
> > 
> 
> > > > /usr/sbin/slurmctld already running.
> > > 
> > 
> 

> > > > And :
> > > 
> > 
> 

> > > > scontrol show slurmd
> > > 
> > 
> 

> > > > scontrol: error: slurm_slurmd_info: Connection refused
> > > 
> > 
> 
> > > > slurm_load_slurmd_status: Connection refused
> > > 
> > 
> 

> > > > Hum how to proceed to repair that problem ?
> > > 
> > 
> 

> > > > > De: "Danny Auble" < d...@schedmd.com >
> > > > 
> > > 
> > 
> 
> > > > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > > > 
> > > 
> > 
> 
> > > > > Envoyé: Mercredi 21 Août 2013 15:36:53
> > > > 
> > > 
> > 
> 
> > > > > Objet: [slurm-dev] Re: Required node not available (down or
> > > > > drained)
> > > > 
> > > 
> > 
> 

> > > > > Check your slurmd log. It doesn't appear the slurmd is
> > > > > running.
> > > > 
> > > 
> > 
> 

> > > > > Sivasangari Nandy < sivasangari.na...@irisa.fr > wrote:
> > > > 
> > > 
> > 
> 
> > > > > > > > Hello,
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > I'm trying to use Slurm for the first time, and I got a
> > > > > > > > problem
> > > > > > > > with
> > > > > > > > nodes I think.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > I have this message when I used squeue :
> > > > > > > 
> > > > > > 
> &

[slurm-dev] Re: Required node not available (down or drained)

2013-08-22 Thread Sivasangari Nandy
So I ran /etc/init.d/slurm-llnl start 
on each node and tried again, but I get: 

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 
50 SLURM-deb test.sh root PD 0:00 1 (Resources) 
53 SLURM-deb test.sh root PD 0:00 1 (Resources) 

and I get this when I try: root@VM-671:~# ps -el | grep slurmd 

5 S 0 8223 1 0 80 0 - 22032 - ? 00:00:01 slurmd 

- Original Message -

> From: "Nikita Burtsev" 
> To: "slurm-dev" 
> Sent: Thursday, August 22, 2013 09:59:52
> Subject: [slurm-dev] Re: Required node not available (down or drained)

> You need to have slurmd running on all nodes that will execute jobs,
> so you should start it with init script.

> --
> Nikita Burtsev
> Sent with Sparrow

> On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:
> > " check if the slurmd daemon is running with the command " ps -el |
> > grep slurmd ". "
> 

> > Nothing is happened with ps -el ...
> 

> > root@VM-667:~# ps -el | grep slurmd
> 

> > > De: "Nikita Burtsev" < nikita.burt...@gmail.com >
> > 
> 
> > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > 
> 
> > > Envoyé: Mercredi 21 Août 2013 18:58:52
> > 
> 
> > > Objet: [slurm-dev] Re: Required node not available (down or
> > > drained)
> > 
> 

> > > Re: [slurm-dev] Re: Required node not available (down or drained)
> > 
> 
> > > slurmctld is the management process and since your have access to
> > > squeue/sinfo information it is running just fine. You need to
> > > check
> > > if slurmd (which is the agent part) is running on your nodes,
> > > i.e.
> > > VM-[669-671]
> > 
> 

> > > --
> > 
> 
> > > Nikita Burtsev
> > 
> 

> > > On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy
> > > wrote:
> > 
> 
> > > > I have tried :
> > > 
> > 
> 

> > > > /etc/init.d/slurm-llnl start
> > > 
> > 
> 

> > > > [ ok ] Starting slurm central management daemon: slurmctld.
> > > 
> > 
> 
> > > > /usr/sbin/slurmctld already running.
> > > 
> > 
> 

> > > > And :
> > > 
> > 
> 

> > > > scontrol show slurmd
> > > 
> > 
> 

> > > > scontrol: error: slurm_slurmd_info: Connection refused
> > > 
> > 
> 
> > > > slurm_load_slurmd_status: Connection refused
> > > 
> > 
> 

> > > > Hum how to proceed to repair that problem ?
> > > 
> > 
> 

> > > > > De: "Danny Auble" < d...@schedmd.com >
> > > > 
> > > 
> > 
> 
> > > > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > > > 
> > > 
> > 
> 
> > > > > Envoyé: Mercredi 21 Août 2013 15:36:53
> > > > 
> > > 
> > 
> 
> > > > > Objet: [slurm-dev] Re: Required node not available (down or
> > > > > drained)
> > > > 
> > > 
> > 
> 

> > > > > Check your slurmd log. It doesn't appear the slurmd is
> > > > > running.
> > > > 
> > > 
> > 
> 

> > > > > Sivasangari Nandy < sivasangari.na...@irisa.fr > wrote:
> > > > 
> > > 
> > 
> 
> > > > > > > > Hello,
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > I'm trying to use Slurm for the first time, and I got a
> > > > > > > > problem
> > > > > > > > with
> > > > > > > > nodes I think.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > I have this message when I used squeue :
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > root@VM-667:~# squeue
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > JOBID PARTITION NAME USER ST TIME NODES
> > > > > > > > NODELIST(REASON)
> > > > > > > 

[slurm-dev] Re: Required node not available (down or drained)

2013-08-22 Thread Sivasangari Nandy
That's what I did yesterday, actually: 

/etc/init.d/slurm-llnl start 

[ ok ] Starting slurm central management daemon: slurmctld. 
/usr/sbin/slurmctld already running. 
- Original Message -

> From: "Nikita Burtsev" 
> To: "slurm-dev" 
> Sent: Thursday, August 22, 2013 09:59:52
> Subject: [slurm-dev] Re: Required node not available (down or drained)

> You need to have slurmd running on all nodes that will execute jobs,
> so you should start it with init script.

> --
> Nikita Burtsev
> Sent with Sparrow

> On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:
> > " check if the slurmd daemon is running with the command " ps -el |
> > grep slurmd ". "
> 

> > Nothing is happened with ps -el ...
> 

> > root@VM-667:~# ps -el | grep slurmd
> 

> > > De: "Nikita Burtsev" < nikita.burt...@gmail.com >
> > 
> 
> > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > 
> 
> > > Envoyé: Mercredi 21 Août 2013 18:58:52
> > 
> 
> > > Objet: [slurm-dev] Re: Required node not available (down or
> > > drained)
> > 
> 

> > > Re: [slurm-dev] Re: Required node not available (down or drained)
> > 
> 
> > > slurmctld is the management process and since your have access to
> > > squeue/sinfo information it is running just fine. You need to
> > > check
> > > if slurmd (which is the agent part) is running on your nodes,
> > > i.e.
> > > VM-[669-671]
> > 
> 

> > > --
> > 
> 
> > > Nikita Burtsev
> > 
> 

> > > On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy
> > > wrote:
> > 
> 
> > > > I have tried :
> > > 
> > 
> 

> > > > /etc/init.d/slurm-llnl start
> > > 
> > 
> 

> > > > [ ok ] Starting slurm central management daemon: slurmctld.
> > > 
> > 
> 
> > > > /usr/sbin/slurmctld already running.
> > > 
> > 
> 

> > > > And :
> > > 
> > 
> 

> > > > scontrol show slurmd
> > > 
> > 
> 

> > > > scontrol: error: slurm_slurmd_info: Connection refused
> > > 
> > 
> 
> > > > slurm_load_slurmd_status: Connection refused
> > > 
> > 
> 

> > > > Hum how to proceed to repair that problem ?
> > > 
> > 
> 

> > > > > De: "Danny Auble" < d...@schedmd.com >
> > > > 
> > > 
> > 
> 
> > > > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > > > 
> > > 
> > 
> 
> > > > > Envoyé: Mercredi 21 Août 2013 15:36:53
> > > > 
> > > 
> > 
> 
> > > > > Objet: [slurm-dev] Re: Required node not available (down or
> > > > > drained)
> > > > 
> > > 
> > 
> 

> > > > > Check your slurmd log. It doesn't appear the slurmd is
> > > > > running.
> > > > 
> > > 
> > 
> 

> > > > > Sivasangari Nandy < sivasangari.na...@irisa.fr > wrote:
> > > > 
> > > 
> > 
> 
> > > > > > > > Hello,
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > I'm trying to use Slurm for the first time, and I got a
> > > > > > > > problem
> > > > > > > > with
> > > > > > > > nodes I think.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > I have this message when I used squeue :
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > root@VM-667:~# squeue
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > JOBID PARTITION NAME USER ST TIME NODES
> > > > > > > > NODELIST(REASON)
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > >

[slurm-dev] Re: Required node not available (down or drained)

2013-08-22 Thread Sivasangari Nandy
" check if the slurmd daemon is running with the command " ps -el | grep slurmd 
". " 

Nothing is happened with ps -el ... 

root@VM-667:~# ps -el | grep slurmd 

- Original Message -

> From: "Nikita Burtsev" 
> To: "slurm-dev" 
> Sent: Wednesday, August 21, 2013 18:58:52
> Subject: [slurm-dev] Re: Required node not available (down or drained)

> slurmctld is the management process, and since you have access to
> squeue/sinfo information it is running just fine. You need to check
> if slurmd (which is the agent part) is running on your nodes, i.e.
> VM-[669-671]

> --
> Nikita Burtsev

> On Wednesday, August 21, 2013 at 8:13 PM, Sivasangari Nandy wrote:
> > I have tried :
> 

> > /etc/init.d/slurm-llnl start
> 

> > [ ok ] Starting slurm central management daemon: slurmctld.
> 
> > /usr/sbin/slurmctld already running.
> 

> > And :
> 

> > scontrol show slurmd
> 

> > scontrol: error: slurm_slurmd_info: Connection refused
> 
> > slurm_load_slurmd_status: Connection refused
> 

> > Hum how to proceed to repair that problem ?
> 

> > > De: "Danny Auble" < d...@schedmd.com >
> > 
> 
> > > À: "slurm-dev" < slurm-dev@schedmd.com >
> > 
> 
> > > Envoyé: Mercredi 21 Août 2013 15:36:53
> > 
> 
> > > Objet: [slurm-dev] Re: Required node not available (down or
> > > drained)
> > 
> 

> > > Check your slurmd log. It doesn't appear the slurmd is running.
> > 
> 

> > > Sivasangari Nandy < sivasangari.na...@irisa.fr > wrote:
> > 
> 
> > > > > > Hello,
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > I'm trying to use Slurm for the first time, and I got a
> > > > > > problem
> > > > > > with
> > > > > > nodes I think.
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > I have this message when I used squeue :
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > root@VM-667:~# squeue
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > 50 SLURM-deb test.sh root PD 0:00 1 (ReqNodeNotAvail)
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > or this one with an other squeue :
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > root@VM-671:~# squeue
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > 50 SLURM-deb test.sh root PD 0:00 1 (Resources)
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > sinfo gives me :
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > SLURM-de* up infinite 3 down VM-[669-671]
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > I have already used slurm one time with the same
> > > > > > configuration
> > > > > > and
> > > > > > I
> > > > > > wan able to run my job.
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > But now the second time I always got :
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > srun: Required node not available (down or drained)
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > srun: job 51 queued and waiting for resources
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > Advance thanks for your help,
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Siva
> > > > > 
> > > > 
> > > 
> > 
> 
> > --
> 

> > Siva sangari NANDY - Plate-forme GenOuest
> 
> > IRISA-INRIA, Campus de Beaulieu
> 
> > 263 Avenue du Général Leclerc
> 

> > 35042 Rennes cedex, France
> 
> > Tél: +33 (0) 2 99 84 25 69
> 

> > Bureau : D152
> 

-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 


[slurm-dev] Re: Required node not available (down or drained)

2013-08-21 Thread Sivasangari Nandy
I have tried : 

/etc/init.d/slurm-llnl start 

[ ok ] Starting slurm central management daemon: slurmctld. 
/usr/sbin/slurmctld already running. 

And : 

scontrol show slurmd 

scontrol: error: slurm_slurmd_info: Connection refused 
slurm_load_slurmd_status: Connection refused 

Hmm, how should I proceed to fix that problem? 
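As suggested in the quoted reply below, the thing to check is slurmd on the compute nodes themselves; a minimal sketch (run on each of VM-[669-671]):

/etc/init.d/slurm-llnl start            # start slurmd via the init script
ps -el | grep slurmd                    # confirm it is actually running
tail /var/log/slurm-llnl/slurmd.log     # see why it exits, if it does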

- Original Message -

> From: "Danny Auble" 
> To: "slurm-dev" 
> Sent: Wednesday, August 21, 2013 15:36:53
> Subject: [slurm-dev] Re: Required node not available (down or drained)

> Check your slurmd log. It doesn't appear the slurmd is running.

> Sivasangari Nandy < sivasangari.na...@irisa.fr > wrote:
> > > > Hello,
> > > 
> > 
> 

> > > > I'm trying to use Slurm for the first time, and I got a problem
> > > > with
> > > > nodes I think.
> > > 
> > 
> 
> > > > I have this message when I used squeue :
> > > 
> > 
> 

> > > > root@VM-667:~# squeue
> > > 
> > 
> 
> > > > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> > > 
> > 
> 
> > > > 50 SLURM-deb test.sh root PD 0:00 1 (ReqNodeNotAvail)
> > > 
> > 
> 

> > > > or this one with an other squeue :
> > > 
> > 
> 

> > > > root@VM-671:~# squeue
> > > 
> > 
> 
> > > > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> > > 
> > 
> 
> > > > 50 SLURM-deb test.sh root PD 0:00 1 (Resources)
> > > 
> > 
> 

> > > > sinfo gives me :
> > > 
> > 
> 

> > > > PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> > > 
> > 
> 
> > > > SLURM-de* up infinite 3 down VM-[669-671]
> > > 
> > 
> 

> > > > I have already used slurm one time with the same configuration
> > > > and
> > > > I
> > > > wan able to run my job.
> > > 
> > 
> 
> > > > But now the second time I always got :
> > > 
> > 
> 

> > > > srun: Required node not available (down or drained)
> > > 
> > 
> 
> > > > srun: job 51 queued and waiting for resources
> > > 
> > 
> 

> > > > Advance thanks for your help,
> > > 
> > 
> 
> > > > Siva
> > > 
> > 
> 
-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 


[slurm-dev] Required node not available (down or drained)

2013-08-21 Thread Sivasangari Nandy
> > Hello,
> 

> > I'm trying to use Slurm for the first time, and I got a problem
> > with
> > nodes I think.
> 
> > I have this message when I used squeue :
> 

> > root@VM-667:~# squeue
> 
> > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 
> > 50 SLURM-deb test.sh root PD 0:00 1 (ReqNodeNotAvail)
> 

> > or this one with an other squeue :
> 

> > root@VM-671:~# squeue
> 
> > JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 
> > 50 SLURM-deb test.sh root PD 0:00 1 (Resources)
> 

> > sinfo gives me :
> 

> > PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> 
> > SLURM-de* up infinite 3 down VM-[669-671]
> 

> > I have already used slurm once with the same configuration and I
> > was able to run my job.
> 
> > But now the second time I always got :
> 

> > srun: Required node not available (down or drained)
> 
> > srun: job 51 queued and waiting for resources
> 

> > Advance thanks for your help,
> 
> > Siva
>