[slurm-dev] slurm-dev mpich2 for job running in all nodes

2013-09-24 Thread Sivasangari Nandy

I have tried with salloc directly : salloc -N 2 mpiexec /path/my_application
And with sview I can see that 2 nodes are allocated simultaneously, which is
great, but my output log shows that my application ran 2 times.
However, I don't want it to run 2 times; I just want to use the 2 nodes to run my
single application faster... :(
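
In case it helps, a sketch of my reading (assuming the application is not itself an MPI program): mpiexec starts one copy per task, so under a 2-task allocation a non-MPI program simply runs twice; to get a single run, launch exactly one task.

salloc -N 2 srun -N 1 -n 1 /path/my_application   # 2 nodes allocated, one copy launched
# Actually spreading one run across both nodes requires my_application to be an MPI
# program; a plain binary started with "mpiexec -n 2" just runs two independent copies.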

Does anyone have an idea?
Thanks in advance, 
Siva


- Original Message -
 From: Moe Jette je...@schedmd.com
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Friday 20 September 2013 17:27:53
 Subject: [slurm-dev] Re: mpich2 for job running in all nodes
 
 
 Slurm use with the various MPI distributions is documented here:
 
 http://slurm.schedmd.com/mpi_guide.html#mpich2
 
 
 Quoting Sivasangari Nandy sivasangari.na...@irisa.fr:
 
  Hi,
 
 
   Does anyone know how to use the mpich2 command, please?
   I wanted to execute Bowtie2 (a mapping tool) on my three nodes
   (VM-669, VM-670 and VM-671)
   with this command:
 
 
  srun mpiexec -n 3 -machinefile Mname.txt
  /omaha-beach/workflow/bowtie2.sh
 
 
 
  But I got this error :
 
 
  srun: error: Only allocated 1 nodes asked for 3
 
 
   -n is for the number of processors (I have one processor per node; see the
   attached slurm.conf file)
   Mname.txt is my file with the node names (VM-...)
   bowtie2.sh looks like this:
 
 
 
  #!/bin/bash
 
  bowtie2 -p 3 -x Indexed_Bowtie -q -I 0 -X 1000 -1 r1.fastq -2
  r2.fastq -S bowtie2.sam
 
 
 
 
  Thanks for your help,
  Siva
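
A note on the error quoted above (my own sketch, not taken from the MPICH2 guide verbatim): "Only allocated 1 nodes asked for 3" means Slurm could hand out only one node for a request that needed three, which fits the period when VM-[669-671] kept going down (see the "nodes often down" thread). Once all three nodes are up, the allocation can be requested explicitly; the guide Moe links also describes launching MPICH2 either with mpiexec inside an allocation or, with a PMI-enabled build, directly with srun.

sinfo                                   # the three nodes must not be in the down state
salloc -N 3 mpiexec -n 3 -machinefile Mname.txt /omaha-beach/workflow/bowtie2.sh
# Caveat: bowtie2.sh is not an MPI program, so "-n 3" starts 3 independent copies
# (the same behaviour reported in the 2013-09-24 message at the top of this thread).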
 
 
 

-- 
Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 

[slurm-dev] sacct error

2013-09-24 Thread Sivasangari Nandy

root@VM-667:/omaha-beach/workflow# sacct 208 
SLURM accounting storage is disabled 


I don't understand why I get this error ... 
I've attached my conf file. 
Have you ever had this problem? 
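
A guess at what is happening here (a sketch, nothing verified on this machine): sacct only has data to show when an accounting storage plugin is configured, and the attached slurm.conf uses AccountingStorageType=accounting_storage/none; also, sacct normally takes the job id through -j.

sacct -j 208    # select the job explicitly
# With accounting_storage/none there is no backend to query, hence
# "SLURM accounting storage is disabled".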





slur.conf
Description: Binary data


[slurm-dev] RE: can't make sacct

2013-09-19 Thread Sivasangari Nandy

I get this error when I try sacct JOBID: SLURM accounting storage is 
disabled 

Here is my slurmctld log file (tail -f /var/log/slurm-llnl/slurmctld.log): 



[2013-09-19T10:58:20] sched: _slurm_rpc_allocate_resources JobId=179 
NodeList=VM-669 usec=70 
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.0 VM-669 
usec=197 
[2013-09-19T10:58:20] sched: _slurm_rpc_job_step_create: StepId=179.1 VM-669 
usec=187 
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.1 usec=17 
[2013-09-19T10:59:00] completing job 179 
[2013-09-19T10:59:00] sched: job_complete for JobId=179 successful 
[2013-09-19T10:59:00] sched: _slurm_rpc_step_complete StepId=179.0 usec=8 




and my conf file (/etc/slurm-llnl/slurm.conf) : 



# slurm.conf file generated by configurator.html. 
# Put this file on all nodes of your cluster. 
# See the slurm.conf man page for more information. 
# 
ControlMachine=VM-667 
ControlAddr=192.168.2.26 
#BackupController= 
#BackupAddr= 
# 
AuthType=auth/munge 
CacheGroups=0 
#CheckpointType=checkpoint/none 
CryptoType=crypto/munge 
#DisableRootJobs=NO 
#EnforcePartLimits=NO 
#Epilog= 
#PrologSlurmctld= 
#FirstJobId=1 
#MaxJobId=99 
#GresTypes= 
#GroupUpdateForce=0 
#GroupUpdateTime=600 
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 
#JobCredentialPrivateKey= 
#JobCredentialPublicCertificate= 
#JobFileAppend=0 
#JobRequeue=1 
#JobSubmitPlugins=1 
#KillOnBadExit=0 
#Licenses=foo*4,bar 
#MailProg=/usr/bin/mail 
#MaxJobCount=5000 
#MaxStepCount=4 
#MaxTasksPerNode=128 
MpiDefault=none 
#MpiParams=ports=#-# 
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/pgid 
#Prolog= 
#PrologSlurmctld= 
#PropagatePrioProcess=0 
#PropagateResourceLimits= 
#PropagateResourceLimitsExcept= 
ReturnToService=1 
#SallocDefaultCommand= 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 
SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 
SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 
SlurmUser=slurm 
#SrunEpilog= 
#SrunProlog= 
StateSaveLocation=/var/lib/slurm-llnl/slurmctld 
SwitchType=switch/none 
#TaskEpilog= 
TaskPlugin=task/none 
#TaskPluginParam= 
#TaskProlog= 
#TopologyPlugin=topology/tree 
#TmpFs=/tmp 
#TrackWCKey=no 
#TreeWidth= 
#UnkillableStepProgram= 
#UsePAM=0 
# 
# 
# TIMERS 
#BatchStartTimeout=10 
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
#HealthCheckInterval=0 
#HealthCheckProgram= 
InactiveLimit=0 
KillWait=30 
#MessageTimeout=10 
#ResvOverRun=0 
MinJobAge=300 
#OverTimeLimit=0 
SlurmctldTimeout=120 
SlurmdTimeout=300 
#UnkillableStepTimeout=60 
#VSizeFactor=0 
Waittime=0 
# 
# 
# SCHEDULING 
#DefMemPerCPU=0 
FastSchedule=1 
#MaxMemPerCPU=0 
#SchedulerRootFilter=1 
#SchedulerTimeSlice=30 
SchedulerType=sched/backfill 
SchedulerPort=7321 
SelectType=select/linear 
#SelectTypeParameters= 
# 
# 
# JOB PRIORITY 
#PriorityType=priority/basic 
#PriorityDecayHalfLife= 
#PriorityCalcPeriod= 
#PriorityFavorSmall= 
#PriorityMaxAge= 
#PriorityUsageResetPeriod= 
#PriorityWeightAge= 
#PriorityWeightFairshare= 
#PriorityWeightJobSize= 
#PriorityWeightPartition= 
#PriorityWeightQOS= 
# 
# 
# LOGGING AND ACCOUNTING 
#AccountingStorageEnforce=0 
#AccountingStorageHost= 
#AccountingStorageLoc= 
#AccountingStoragePass= 
#AccountingStoragePort= 
AccountingStorageType=accounting_storage/none 
#AccountingStorageUser= 
AccountingStoreJobComment=YES 
ClusterName=cluster 
#DebugFlags= 
#JobCompHost= 
#JobCompLoc= 
#JobCompPass= 
#JobCompPort= 
JobCompType=jobcomp/none 
#JobCompUser= 
JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
SlurmctldDebug=3 
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 
SlurmdDebug=3 
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 
#SlurmSchedLogFile= 
#SlurmSchedLogLevel= 
# 
# 
# POWER SAVE SUPPORT FOR IDLE NODES (optional) 
#SuspendProgram= 
#ResumeProgram= 
#SuspendTimeout= 
#ResumeTimeout= 
#ResumeRate= 
#SuspendExcNodes= 
#SuspendExcParts= 
#SuspendRate= 
#SuspendTime= 
# 
# 
# COMPUTE NODES 
NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 
State=UNKNOWN 
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE 
State=UP 
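
In case it helps, a minimal sketch of the accounting-related lines I would expect sacct to need (assumptions on my part; adjust the paths to your setup and restart slurmctld/slurmd afterwards):

# Simple flat-file job accounting, no database daemon required
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm_jobacct.log
JobAcctGatherType=jobacct_gather/linux

# Or full accounting through the database daemon:
# AccountingStorageType=accounting_storage/slurmdbd
# AccountingStorageHost=VM-667

The file named by AccountingStorageLoc has to exist and be writable by SlurmUser (slurm here).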




- Original Message -


From: Nancy Kritkausky nancy.kritkau...@bull.com 
To: slurm-dev slurm-dev@schedmd.com 
Sent: Wednesday 18 September 2013 18:29:56 
Subject: [slurm-dev] RE: can't make sacct 



Hello Siva, 
There is not a lot of information to go on from your email. What type of 
accounting do you have configured? What do your slurm.conf and slurmdbd.conf 
files look like? I would also suggest looking at your slurmdbd.log and 
slurmd.log to see what is going on, or sending them to the dev list. 
Nancy 



From: Sivasangari Nandy [mailto:sivasangari.na...@irisa.fr] 
Sent: Wednesday, September 18, 2013 9:02 AM 
To: slurm-dev 
Subject: [slurm-dev] can't make sacct 


Hi, 



Hey does anyone know why my sacct command doesn't work ? 

I got this : 

root@VM-667:/omaha-beach/workflow# sacct 



JobID JobName

[slurm-dev] can't make sacct

2013-09-18 Thread Sivasangari Nandy
Hi, 


Hey does anyone know why my sacct command doesn't work ? 
I got this : 


root@VM-667:/omaha-beach/workflow# sacct 


JobID JobName Partition Account AllocCPUS State ExitCode 
 -- -- -- -- --  
/var/log/slurm_jobacct.log: No such file or directory 
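
My reading of that message (a guess based on the path in the error, not verified): sacct is looking for the flat-file accounting log at /var/log/slurm_jobacct.log, the default AccountingStorageLoc for the filetxt plugin, and the file was never created; with the filetxt plugin enabled in slurm.conf, creating it should let sacct report newly run jobs.

touch /var/log/slurm_jobacct.log               # create the default accounting log
chown slurm:slurm /var/log/slurm_jobacct.log   # SlurmUser is 'slurm' here; group assumed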

Thanks in advance, 
Siva 

-- 

[slurm-dev] mpich2 to use multiple machines

2013-09-17 Thread Sivasangari Nandy
Hello, 


I have a small problem with mpich2 for Slurm. I want to run my jobs on more than 
one machine (here, for the test, I just wanted it on VM-669, so I've put only 
VM-669 in the file Mname.txt). 
From the master (VM-667) I run: 


mpiexec -machinefile Mname.txt -np 1 /bin/sleep 60 

but I get these errors: 



[proxy:0:0@VM-669] launch_procs (./pm/pmiserv/pmip_cb.c:687): unable to change 
wdir to /root/omaha-beach (No such file or directory) 
[proxy:0:0@VM-669] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:935): 
launch_procs returned error 
[proxy:0:0@VM-669] HYDT_dmxu_poll_wait_for_event 
(./tools/demux/demux_poll.c:77): callback returned error status 
[proxy:0:0@VM-669] main (./pm/pmiserv/pmip.c:226): demux engine error waiting 
for event 
[mpiexec@VM-667] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) 
failed 
[mpiexec@VM-667] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): 
callback returned error status 
[mpiexec@VM-667] HYD_pmci_wait_for_completion 
(./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event 
[mpiexec@VM-667] main (./ui/mpich/mpiexec.c:405): process manager error waiting 
for completion 
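
A sketch of the likely fix (just my reading of the error, nothing site-specific): the Hydra proxy on VM-669 tries to chdir into the directory mpiexec was started from (/root/omaha-beach), which does not exist on that node, so either launch from a directory that exists on every node or tell Hydra which working directory to use remotely.

cd /omaha-beach && mpiexec -machinefile Mname.txt -np 1 /bin/sleep 60   # assuming /omaha-beach is shared
mpiexec -wdir /tmp -machinefile Mname.txt -np 1 /bin/sleep 60           # or pass an explicit wdir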

Do you have any idea? 
Thanks in advance, 
Siva 


-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 



[slurm-dev] Re: Failed to allocate resource : Unable to contact slurm controller

2013-09-17 Thread Sivasangari Nandy
hello, 

try this : 
/etc/init.d/slurm-llnl start 

/etc/init.d/slurm-llnl stop 

/etc/init.d/slurm-llnl startclean 
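
If that doesn't help, a couple of generic checks (a sketch; the log path is the Debian slurm-llnl default used elsewhere on this list):

scontrol ping                                   # does slurmctld on the ControlMachine answer?
tail -n 50 /var/log/slurm-llnl/slurmctld.log    # look for bind, munge or hostname errors if not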

- Original Message -

 From: Arjun J Rao rectangle.k...@gmail.com
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Tuesday 17 September 2013 11:33:53
 Subject: [slurm-dev] Failed to allocate resource : Unable to contact
 slurm controller

 Failed to allocate resource : Unable to contact slurm controller

 I want to run SLURM on a single machine as a proof of concept to
 run some trivial MPI programs on my machine.
 I keep getting the message :
 Failed to allocate resources ; Unable to contact slurm controller
 In my slurm.conf file, I have named the ControlMachine as localhost
 and ControlAddr as 127.0.0.1
 Compute Node name as localhost and NodeAddr as 127.0.0.1 too.

 What am I doing wrong?

 Scientific Linux 6.4
 64-bit

-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 


[slurm-dev] Re: nodes often down

2013-09-13 Thread Sivasangari Nandy

Wow, thanks a lot Hongjia Cao! It's working now with CPUs=1 and Sockets=1. 
Physically (cat /proc/cpuinfo) there is in fact just 1 CPU, so with 2 CPUs 
declared in the slurm.conf file it did not work.
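
For anyone hitting the same problem, a quick sketch of the check and the corrected line (the NodeName values are the ones reported working in this thread; slurmd -C is, I believe, available in newer slurmd builds to print the hardware the daemon actually detects):

slurmd -C    # print the CPUs/Sockets/Cores/Threads this node really reports

NodeName=VM-[669-671] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN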

thanks again, 
bye.





- Original Message -
 From: Hongjia Cao hj...@nudt.edu.cn
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Friday 13 September 2013 06:24:53
 Subject: [slurm-dev] Re: nodes often down
 
 
 I guess you have the wrong number of CPUs per node configured. Please
 try changing the configuration file. Or you may try FastSchedule=1.
 
 On Tuesday, 2013-09-10 at 23:54 -0700, Sivasangari Nandy wrote:
  root@VM-667:/omaha-beach/workflow# sinfo -R
  REASON   USER  TIMESTAMP   NODELIST
  Low CPUs slurm 2013-09-10T19:38:37 VM-[669-671]
  
  
  Then when I changed the nodes to idle and typed sinfo -R again, there was nothing.
  I wanted to know how to keep the nodes permanently idle.
  
  
  __
  From: 曹宏嘉 hj...@nudt.edu.cn
  To: slurm-dev slurm-dev@schedmd.com
  Sent: Wednesday 11 September 2013 02:26:54
  Subject: [slurm-dev] Re: nodes often down
  
  You may run sinfo -R to see the reason that the node is left
  down. ReturnToService=1 cannot recover all down nodes.
  
  

[slurm-dev] nodes often down

2013-09-10 Thread Sivasangari Nandy
Hello, 


My nodes are often in the down state, so I must open sview and manually set all 
the nodes to 'idle' in order to run jobs. 
I've seen in the FAQ that I can change the slurm.conf file. 


 The configuration parameter ReturnToService in slurm.conf controls how DOWN 
nodes are handled. Set its value to one in order for DOWN nodes to 
automatically be returned to service once the slurmd daemon registers with a 
valid node configuration.  


However, in my file ReturnToService is already set to 1. 
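
In the meantime, a sketch of the usual way to see why the nodes were marked down and to bring them back without sview (plain sinfo/scontrol usage, nothing site-specific):

sinfo -R                                            # show the reason each node was set DOWN
scontrol update NodeName=VM-[669-671] State=RESUME  # return the nodes to service

ReturnToService=1 only helps once the slurmd registers with a configuration that matches slurm.conf, so if the reason is a hardware mismatch the nodes will keep going down until the NodeName line is fixed.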


Thanks in advance, 
Siva 

-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 



[slurm-dev] Re: nodes often down

2013-09-10 Thread Sivasangari Nandy
No, it was like that at first, so by default. 
And yes, I've restarted Slurm but there is no change; after 10 minutes all the nodes are down. 

Here is my conf file if needed: 

# slurm.conf file generated by configurator.html. 
# Put this file on all nodes of your cluster. 
# See the slurm.conf man page for more information. 
# 
ControlMachine=VM-667 
ControlAddr=192.168.2.26 
#BackupController= 
#BackupAddr= 
# 
AuthType=auth/munge 
CacheGroups=0 
#CheckpointType=checkpoint/none 
CryptoType=crypto/munge 
#DisableRootJobs=NO 
#EnforcePartLimits=NO 
#Epilog= 
#PrologSlurmctld= 
#FirstJobId=1 
#MaxJobId=99 
#GresTypes= 
#GroupUpdateForce=0 
#GroupUpdateTime=600 
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 
#JobCredentialPrivateKey= 
#JobCredentialPublicCertificate= 
#JobFileAppend=0 
#JobRequeue=1 
#JobSubmitPlugins=1 
#KillOnBadExit=0 
#Licenses=foo*4,bar 
#MailProg=/usr/bin/mail 
#MaxJobCount=5000 
#MaxStepCount=4 
#MaxTasksPerNode=128 
MpiDefault=none 
#MpiParams=ports=#-# 
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/pgid 
#Prolog= 
#PrologSlurmctld= 
#PropagatePrioProcess=0 
#PropagateResourceLimits= 
#PropagateResourceLimitsExcept= 
ReturnToService=1 
#SallocDefaultCommand= 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 
SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 
SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 
SlurmUser=slurm 
#SrunEpilog= 
#SrunProlog= 
StateSaveLocation=/var/lib/slurm-llnl/slurmctld 
SwitchType=switch/none 
#TaskEpilog= 
TaskPlugin=task/none 
#TaskPluginParam= 
#TaskProlog= 
#TopologyPlugin=topology/tree 
#TmpFs=/tmp 
#TrackWCKey=no 
#TreeWidth= 
#UnkillableStepProgram= 
#UsePAM=0 
# 
# 
# TIMERS 
#BatchStartTimeout=10 
#CompleteWait=0 
#EpilogMsgTime=2000 
#GetEnvTimeout=2 
#HealthCheckInterval=0 
#HealthCheckProgram= 
InactiveLimit=0 
KillWait=30 
#MessageTimeout=10 
#ResvOverRun=0 
MinJobAge=300 
#OverTimeLimit=0 
SlurmctldTimeout=120 
SlurmdTimeout=300 
#UnkillableStepTimeout=60 
#VSizeFactor=0 
Waittime=0 
# 
# 
# SCHEDULING 
#DefMemPerCPU=0 
FastSchedule=1 
#MaxMemPerCPU=0 
#SchedulerRootFilter=1 
#SchedulerTimeSlice=30 
SchedulerType=sched/backfill 
SchedulerPort=7321 
SelectType=select/linear 
#SelectTypeParameters= 
# 
# 
# JOB PRIORITY 
#PriorityType=priority/basic 
#PriorityDecayHalfLife= 
#PriorityCalcPeriod= 
#PriorityFavorSmall= 
#PriorityMaxAge= 
#PriorityUsageResetPeriod= 
#PriorityWeightAge= 
#PriorityWeightFairshare= 
#PriorityWeightJobSize= 
#PriorityWeightPartition= 
#PriorityWeightQOS= 
# 
# 
# LOGGING AND ACCOUNTING 
#AccountingStorageEnforce=0 
#AccountingStorageHost= 
#AccountingStorageLoc= 
#AccountingStoragePass= 
#AccountingStoragePort= 
AccountingStorageType=accounting_storage/none 
#AccountingStorageUser= 
AccountingStoreJobComment=YES 
ClusterName=cluster 
#DebugFlags= 
#JobCompHost= 
#JobCompLoc= 
#JobCompPass= 
#JobCompPort= 
JobCompType=jobcomp/none 
#JobCompUser= 
JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
SlurmctldDebug=3 
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 
SlurmdDebug=3 
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 
#SlurmSchedLogFile= 
#SlurmSchedLogLevel= 
# 
# 
# POWER SAVE SUPPORT FOR IDLE NODES (optional) 
#SuspendProgram= 
#ResumeProgram= 
#SuspendTimeout= 
#ResumeTimeout= 
#ResumeRate= 
#SuspendExcNodes= 
#SuspendExcParts= 
#SuspendRate= 
#SuspendTime= 
# 
# 
# COMPUTE NODES 
NodeName=VM-[669-671] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1 
State=UNKNOWN 
PartitionName=SLURM-debug Nodes=VM-[669-671] Default=YES MaxTime=INFINITE 
State=UP 
- Original Message -

 From: Alan V. Cowles alan.cow...@duke.edu
 To: slurm-dev slurm-dev@schedmd.com
 Cc: Sivasangari Nandy sivasangari.na...@irisa.fr
 Sent: Tuesday 10 September 2013 13:17:36
 Subject: Re: [slurm-dev] nodes often down

 Siva,

 Was that the default setting you had in place with your original
 config, or a change you made recently to combat the downed-nodes
 problem? And did you restart slurm or do a reconfigure to re-read
 the slurm.conf file? I've found some changes don't take effect with
 a reconfigure, and you have to restart.

 AC

 On 09/10/2013 04:01 AM, Sivasangari Nandy wrote:

  Hello,
 

  My nodes are often in down states, so I must make a 'sview' and
  activate all nodes manually to put them 'idle' in order to run
  jobs.
 
  I've seen in the FAQ that I can change the slurm.conf file.
 

   The configuration parameter ReturnToService in slurm.conf
  controls
  how DOWN nodes are handled. Set its value to one in order for DOWN
  nodes to automatically be returned to service once the slurmd
  daemon
  registers with a valid node configuration. 
 

  However in my file it's already 1 for ReturnToService.
 

  advance thanks,
 
  Siva
 

  --
 

  Siva sangari NANDY - Plate-forme GenOuest
 
  IRISA-INRIA, Campus de Beaulieu
 
  263 Avenue du Général Leclerc
 

  35042 Rennes cedex, France
 
  Tél: +33 (0) 2 99 84 25 69
 

  Bureau : D152
 

-- 

Siva sangari NANDY - Plate

[slurm-dev] Re: Required node not available (down or drained)

2013-08-26 Thread Sivasangari Nandy
And the log file is not informative 

tail -f /var/log/slurm-llnl/slurmd.log 

... 
[2013-08-26T11:52:16] Slurmd shutdown completing 
[2013-08-26T11:52:56] slurmd version 2.3.4 started 
[2013-08-26T11:52:56] slurmd started on Mon 26 Aug 2013 11:52:56 +0200 
[2013-08-26T11:52:56] Procs=1 Sockets=1 Cores=1 Threads=1 Memory=2012 
TmpDisk=9069 Uptime=1122626 
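
One observation on that registration line (just my reading, based on the slurm.conf posted elsewhere in this digest): the node reports one CPU while the configuration declares two, and that mismatch is what later shows up as the 'Low CPUs' down reason in sinfo -R; the 2013-09-13 "nodes often down" message confirms that CPUs=1 Sockets=1 fixes it.

# Reported by slurmd    : Procs=1 Sockets=1 Cores=1 Threads=1
# Declared in slurm.conf: NodeName=VM-[669-671] CPUs=2 Sockets=2 CoresPerSocket=1 ThreadsPerCore=1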

- Original Message -

 From: Sivasangari Nandy sivasangari.na...@irisa.fr
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Monday 26 August 2013 14:28:28
 Subject: Re: [slurm-dev] Re: Required node not available (down or
 drained)

 Hi,

 I have checked some things, now my slurmctld and slurmd are in a
 single machine (using just one node) so the test is easier.
 For that I have modified the conf file : vi
 /etc/slurm-llnl/slurm.conf

 Slurmctld and slurmd are both running; here is my ps result:

 root@VM-667:/etc/slurm-llnl# ps -ef | grep slurm
 root 31712 31706 0 11:44 pts/1 00:00:00 tail -f
 /var/log/slurm-llnl/slurmd.log
 slurm 31990 1 0 11:52 ? 00:00:00 /usr/sbin/slurmctld
 root 32103 1 0 11:52 ? 00:00:00 /usr/sbin/slurmd -c
 root 32125 30346 0 11:53 pts/0 00:00:00 grep slurm

 So I have tried srun again but I still get this error:

 !srun
 srun /omaha-beach/test.sh
 srun: Required node not available (down or drained)
 srun: job 64 queued and waiting for resources

 Have you got any idea of the problem ?
 thanks,

 Siva

 - Original Message -

  From: Nikita Burtsev nikita.burt...@gmail.com
  To: slurm-dev slurm-dev@schedmd.com
  Sent: Thursday 22 August 2013 09:59:52
  Subject: [slurm-dev] Re: Required node not available (down or
  drained)

  Re: [slurm-dev] Re: Required node not available (down or drained)

  You need to have slurmd running on all nodes that will execute
  jobs, so you should start it with the init script.

  --
  Nikita Burtsev
  Sent with Sparrow

[slurm-dev] Re: Required node not available (down or drained)

2013-08-22 Thread Sivasangari Nandy
That's what I did yesterday, actually: 

/etc/init.d/slurm-llnl start 

[ ok ] Starting slurm central management daemon: slurmctld. 
/usr/sbin/slurmctld already running. 
- Original Message -

 From: Nikita Burtsev nikita.burt...@gmail.com
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Thursday 22 August 2013 09:59:52
 Subject: [slurm-dev] Re: Required node not available (down or drained)

 Re: [slurm-dev] Re: Required node not available (down or drained)
 You need to have slurmd running on all nodes that will execute jobs,
 so you should start it with init script.

 --
 Nikita Burtsev
 Sent with Sparrow

 On Thursday, August 22, 2013 at 11:55 AM, Sivasangari Nandy wrote:
    check if the slurmd daemon is running with the command  ps -el |
   grep slurmd . 

   Nothing happens with ps -el ...

   root@VM-667:~# ps -el | grep slurmd
 


-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 


[slurm-dev] Re: Required node not available (down or drained)

2013-08-22 Thread Sivasangari Nandy
So I have run /etc/init.d/slurm-llnl start 
on each node and tried again, but I get: 

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 
50 SLURM-deb test.sh root PD 0:00 1 (Resources) 
53 SLURM-deb test.sh root PD 0:00 1 (Resources) 

and I get this when I try: root@VM-671:~# ps -el | grep slurmd 

5 S 0 8223 1 0 80 0 - 22032 - ? 00:00:01 slurmd 
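
A sketch of what I would check next (generic commands, guessing from the symptoms): slurmd is now up, but a job keeps pending with (Resources) or (ReqNodeNotAvail) while the nodes are still flagged down or drained from before, so they may need to be resumed once the daemons are running.

sinfo -R                                            # why were the nodes set DOWN or DRAINED?
scontrol update NodeName=VM-[669-671] State=RESUME  # clear the flag now that slurmd is up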

- Original Message -

 From: Nikita Burtsev nikita.burt...@gmail.com
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Thursday 22 August 2013 09:59:52
 Subject: [slurm-dev] Re: Required node not available (down or drained)

 Re: [slurm-dev] Re: Required node not available (down or drained)
 You need to have slurmd running on all nodes that will execute jobs,
 so you should start it with init script.

 --
 Nikita Burtsev
 Sent with Sparrow


-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152 


[slurm-dev] Required node not available (down or drained)

2013-08-21 Thread Sivasangari Nandy
Hello,

I'm trying to use Slurm for the first time, and I think I have a problem with the nodes.
I get this message when I use squeue:

root@VM-667:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
50 SLURM-deb test.sh root PD 0:00 1 (ReqNodeNotAvail)

or this one with another squeue:

root@VM-671:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
50 SLURM-deb test.sh root PD 0:00 1 (Resources)

sinfo gives me:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
SLURM-de* up infinite 3 down VM-[669-671]

I have already used Slurm once with the same configuration and I was able to run my job.
But now, the second time, I always get:

srun: Required node not available (down or drained)
srun: job 51 queued and waiting for resources

Thanks in advance for your help,
Siva

[slurm-dev] Re: Required node not available (down or drained)

2013-08-21 Thread Sivasangari Nandy
I have tried : 

/etc/init.d/slurm-llnl start 

[ ok ] Starting slurm central management daemon: slurmctld. 
/usr/sbin/slurmctld already running. 

And : 

scontrol show slurmd 

scontrol: error: slurm_slurmd_info: Connection refused 
slurm_load_slurmd_status: Connection refused 

Hmm, how do I proceed to fix that problem? 
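
A sketch of the checks that usually sort this out (drawing on the other replies in this thread; paths are the ones used on this slurm-llnl install): scontrol show slurmd queries the local slurmd, so 'Connection refused' on the controller only means no slurmd is listening there; what matters is that slurmd runs on each compute node.

# On each compute node (VM-669, VM-670, VM-671):
/etc/init.d/slurm-llnl start               # starts the slurmd daemon there
ps -el | grep slurmd                       # confirm it is running
tail -n 20 /var/log/slurm-llnl/slurmd.log  # check why it died, if it is not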

- Original Message -

 From: Danny Auble d...@schedmd.com
 To: slurm-dev slurm-dev@schedmd.com
 Sent: Wednesday 21 August 2013 15:36:53
 Subject: [slurm-dev] Re: Required node not available (down or drained)

 Check your slurmd log. It doesn't appear the slurmd is running.

 Sivasangari Nandy  sivasangari.na...@irisa.fr  wrote:
 Hello,

 I'm trying to use Slurm for the first time, and I think I have a problem with the nodes.
 I get this message when I use squeue:

 root@VM-667:~# squeue
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
 50 SLURM-deb test.sh root PD 0:00 1 (ReqNodeNotAvail)

 sinfo gives me:

 PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
 SLURM-de* up infinite 3 down VM-[669-671]

 I have already used Slurm once with the same configuration and I was able to run my job.
 But now, the second time, I always get:

 srun: Required node not available (down or drained)
 srun: job 51 queued and waiting for resources

 Siva
-- 

Siva sangari NANDY - Plate-forme GenOuest 
IRISA-INRIA, Campus de Beaulieu 
263 Avenue du Général Leclerc 

35042 Rennes cedex, France 
Tél: +33 (0) 2 99 84 25 69 

Bureau : D152