Hello,

I'm currently trying to set up SLURM on a small GPU cluster (the nodes are named smurf0[1-7]), using the gpu gres plugin to allocate jobs that require GPUs.

Unfortunately, when I try to run a GPU job (with any number of GPUs; --gres=gpu:N), SLURM doesn't execute it and reports that the requested configuration is unavailable. I have attached logs and configuration files to provide the information needed to analyze this issue.

Note: Cross-posted here: http://serverfault.com/questions/685258

Example (using a test.sh that echoes $CUDA_VISIBLE_DEVICES):

    srun -n1 --gres=gpu:1 test.sh
    --> srun: error: Unable to allocate resources: Requested node configuration is not available

The slurmctld log for such calls shows:

    gres: gpu state for job X
      gres_cnt:1 node_cnt:1 type:(null)
    _pick_best_nodes: job X never runnable
    _slurm_rpc_allocate_resources: Requested node configuration is not available

Jobs requesting any other configured generic resource complete successfully:

    srun -n1 --gres=gram:500 test.sh
    --> CUDA_VISIBLE_DEVICES=NoDevFiles

The node and gres configuration in slurm.conf (which is attached as well) looks like this:

    GresTypes=gpu,ram,gram,scratch
    ...
    NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
    NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300

The respective gres.conf files are:

    Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
    Name=ram Count=48
    Name=gram Count=6000
    Name=scratch Count=1300

The output of "scontrol show node" lists all the nodes with the correct gres configuration, e.g.:

    NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
       CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
       Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
       ...etc.

As far as I can tell, the slurmd daemon on the nodes recognizes the GPUs (and the other generic resources) correctly.

My slurmd.log on node smurf01 says:

    Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]

The log for slurmctld shows:

    gres/gpu: state for smurf01
      gres_cnt found:8 configured:8 avail:8 alloc:0
      gres_bit_alloc:
      gres_used:(null)

I can't figure out why the controller states that jobs using --gres=gpu:N are "never runnable" and why "the requested node configuration is not available".

Any help is appreciated.

Kind regards,
Daniel Weber

PS: If further information is required, don't hesitate to ask.

# GENERAL
ClusterName=egpc
ControlMachine=wtch020
ControlAddr=192.168.1.1
AuthType=auth/munge
CryptoType=crypto/munge
CacheGroups=0
DisableRootJobs=YES
MpiDefault=none
Proctracktype=proctrack/cgroup

# DAEMONS
StateSaveLocation=/var/spool/slurmctld
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SwitchType=switch/none

# SCRIPTS
Prolog=/apps/.slurm/job_prolog.sh
Epilog=/apps/.slurm/job_epilog.sh
TaskProlog=/apps/.slurm/task_prolog.sh
TaskEpilog=/apps/.slurm/task_epilog.sh

# TIMERS AND LIMITS
MaxJobCount=5000
ReturnToService=1
TaskPlugin=task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=120
OverTimeLimit=60
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
FastSchedule=1
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=

# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityFavorSmall=YES
#PriorityCalcPeriod=
#PriorityUsageResetPeriod=
PriorityMaxAge=3-0
PriorityWeightAge=1
#PriorityWeightFairshare=0
PriorityWeightJobSize=1
#PriorityWeightQOS=
#PriorityWeightPartition=

# ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoragePort=6819
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux

# LOGGING
SlurmctldDebug=5
SlurmdDebug=5
DebugFlags=Gres
SlurmctldLogFile=/var/log/slurm/ctld
SlurmdLogFile=/var/log/slurmd
#SlurmSchedLogFile=
#SlurmSchedLogLevel=

# COMPUTE NODES
GresTypes=gpu,ram,gram,scratch
NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
NodeName=smurf03 NodeAddr=192.168.1.103 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 Gres=gpu:gtx:3,ram:94,gram:no_consume:1500,scratch:280
NodeName=smurf04 NodeAddr=192.168.1.104 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 Gres=gpu:gtx:4,ram:94,gram:no_consume:1500,scratch:280
NodeName=smurf05 NodeAddr=192.168.1.105 Feature="intel,kepler" Boards=1 SocketsPerBoard=2 CoresperSocket=8 ThreadsPerCore=2 Gres=gpu:gtx:4,ram:256,gram:no_consume:6000,scratch:2400
NodeName=smurf06 NodeAddr=192.168.1.106 Feature="intel,fermi" Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 Gres=gpu:gtx:2,ram:8,gram:no_consume:1250,scratch:1800
NodeName=smurf07 NodeAddr=192.168.1.107 Feature="amd,fermi" Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:gtx:2,ram:16,gram:no_consume:1250,scratch:54

# PARTITIONS
PartitionName=work Nodes=smurf0[1-7] Default=YES MaxTime=INFINITE State=UP
#PartitionName=supermicro Nodes=smurf03,smurf04,smurf05 MaxTime=INFINITE State=UP
#PartitionName=tower Nodes=smurf06,smurf07 MaxTime=INFINITE State=UP

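
For anyone who wants to cross-check the Gres= fields of the NodeName lines against the other attachments, here is a minimal Python sketch that extracts them from slurm.conf text. It is not part of SLURM; `parse_node_gres` and `SAMPLE` are names made up for this illustration, and it assumes plain integer counts (no K/M suffixes) and at most one qualifier (a type such as tesla, or a flag such as no_consume) per resource.

```python
def parse_node_gres(conf_text):
    """Return {node_name: {gres_name: (qualifier_or_None, count)}}.

    Hypothetical helper: only handles NodeName lines whose Gres entries
    look like name[:qualifier][:count] with an integer count.
    """
    nodes = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line.lower().startswith("nodename="):
            continue
        # slurm.conf keys are case-insensitive, so normalise them
        fields = {}
        for tok in line.split():
            if "=" in tok:
                key, value = tok.split("=", 1)
                fields[key.lower()] = value
        gres = {}
        for item in fields.get("gres", "").split(","):
            if not item:
                continue
            parts = item.split(":")
            if parts[-1].isdigit():
                count, middle = int(parts[-1]), parts[1:-1]
            else:
                count, middle = 1, parts[1:]  # a bare name means one unit
            gres[parts[0]] = (middle[0] if middle else None, count)
        nodes[fields["nodename"]] = gres
    return nodes


# One NodeName line copied from the slurm.conf above.
SAMPLE = (
    'NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 '
    "SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 "
    "Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300"
)

if __name__ == "__main__":
    for node, gres in parse_node_gres(SAMPLE).items():
        print(node, gres)
```

The result can then be compared by hand with what each node's gres.conf declares and with the "configured" counts that slurmctld logs.
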
[2015-05-06T14:58:13.476] slurmctld version 14.11.5 started on cluster egpc [2015-05-06T14:58:13.477] Munge cryptographic signature plugin loaded [2015-05-06T14:58:13.477] debug: init: Gres GPU plugin loaded [2015-05-06T14:58:13.477] debug: gres: Couldn't find the specified plugin name for gres/ram looking at all files [2015-05-06T14:58:13.477] debug: Cannot find plugin of type gres/ram, just track gres counts [2015-05-06T14:58:13.477] debug: gres: Couldn't find the specified plugin name for gres/gram looking at all files [2015-05-06T14:58:13.477] debug: Cannot find plugin of type gres/gram, just track gres counts [2015-05-06T14:58:13.478] debug: gres: Couldn't find the specified plugin name for gres/scratch looking at all files [2015-05-06T14:58:13.478] debug: Cannot find plugin of type gres/scratch, just track gres counts [2015-05-06T14:58:13.478] preempt/none loaded [2015-05-06T14:58:13.478] debug: Checkpoint plugin loaded: checkpoint/none [2015-05-06T14:58:13.478] debug: AcctGatherEnergy NONE plugin loaded [2015-05-06T14:58:13.478] debug: AcctGatherProfile NONE plugin loaded [2015-05-06T14:58:13.479] debug: AcctGatherInfiniband NONE plugin loaded [2015-05-06T14:58:13.479] debug: AcctGatherFilesystem NONE plugin loaded [2015-05-06T14:58:13.479] debug: Job accounting gather LINUX plugin loaded [2015-05-06T14:58:13.479] ExtSensors NONE plugin loaded [2015-05-06T14:58:13.479] debug: switch NONE plugin loaded [2015-05-06T14:58:13.479] debug: No backup controller to shutdown [2015-05-06T14:58:13.479] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null) [2015-05-06T14:58:13.480] debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded [2015-05-06T14:58:13.481] debug: slurmdbd: Sent DbdInit msg [2015-05-06T14:58:13.481] slurmdbd: recovered 0 pending RPCs [2015-05-06T14:58:13.805] debug: Reading slurm.conf file: /etc/slurm/slurm.conf [2015-05-06T14:58:13.807] layouts: no layout to initialize [2015-05-06T14:58:13.807] topology NONE plugin loaded 
[2015-05-06T14:58:13.807] debug: No DownNodes [2015-05-06T14:58:13.807] sched: Backfill scheduler plugin loaded [2015-05-06T14:58:13.807] route default plugin loaded [2015-05-06T14:58:13.808] layouts: loading entities/relations information [2015-05-06T14:58:13.808] debug: layouts: 7/7 nodes in hash table, rc=0 [2015-05-06T14:58:13.808] debug: layouts: loading stage 1 [2015-05-06T14:58:13.808] debug: layouts: loading stage 2 [2015-05-06T14:58:13.808] Recovered state of 7 nodes [2015-05-06T14:58:13.808] gres: gpu state for job 120 [2015-05-06T14:58:13.808] gres_cnt:1 node_cnt:0 type:(null) [2015-05-06T14:58:13.808] Recovered JobID=120 State=0x5 NodeCnt=0 Assoc=0 [2015-05-06T14:58:13.808] Recovered information about 1 jobs [2015-05-06T14:58:13.808] init_requeue_policy: kill_invalid_depend is set to 0 [2015-05-06T14:58:13.808] gres/gpu: state for smurf01 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:8 avail:8 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc: [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] type_cnt_alloc[0]:0 [2015-05-06T14:58:13.808] type_cnt_avail[0]:8 [2015-05-06T14:58:13.808] type[0]:tesla [2015-05-06T14:58:13.808] gres/ram: state for smurf01 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:48 avail:48 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/gram: state for smurf01 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:6000 avail:6000 no_consume [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/scratch: state for smurf01 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:1300 avail:1300 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/gpu: state for smurf02 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:8 avail:8 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc: 
[2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] type_cnt_alloc[0]:0 [2015-05-06T14:58:13.808] type_cnt_avail[0]:8 [2015-05-06T14:58:13.808] type[0]:tesla [2015-05-06T14:58:13.808] gres/ram: state for smurf02 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:48 avail:48 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/gram: state for smurf02 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:6000 avail:6000 no_consume [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/scratch: state for smurf02 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:1300 avail:1300 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/gpu: state for smurf03 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:3 avail:3 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc: [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] type_cnt_alloc[0]:0 [2015-05-06T14:58:13.808] type_cnt_avail[0]:3 [2015-05-06T14:58:13.808] type[0]:gtx [2015-05-06T14:58:13.808] gres/ram: state for smurf03 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:94 avail:94 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/gram: state for smurf03 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:1500 avail:1500 no_consume [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/scratch: state for smurf03 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:280 avail:280 alloc:0 [2015-05-06T14:58:13.808] gres_bit_alloc:NULL [2015-05-06T14:58:13.808] gres_used:(null) [2015-05-06T14:58:13.808] gres/gpu: state for smurf04 [2015-05-06T14:58:13.808] gres_cnt found:TBD configured:4 avail:4 alloc:0 
[2015-05-06T14:58:13.809] gres_bit_alloc: [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] type_cnt_alloc[0]:0 [2015-05-06T14:58:13.809] type_cnt_avail[0]:4 [2015-05-06T14:58:13.809] type[0]:gtx [2015-05-06T14:58:13.809] gres/ram: state for smurf04 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:94 avail:94 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/gram: state for smurf04 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:1500 avail:1500 no_consume [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/scratch: state for smurf04 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:280 avail:280 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/gpu: state for smurf05 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:4 avail:4 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc: [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] type_cnt_alloc[0]:0 [2015-05-06T14:58:13.809] type_cnt_avail[0]:4 [2015-05-06T14:58:13.809] type[0]:gtx [2015-05-06T14:58:13.809] gres/ram: state for smurf05 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:256 avail:256 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/gram: state for smurf05 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:6000 avail:6000 no_consume [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/scratch: state for smurf05 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:2400 avail:2400 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/gpu: state for smurf06 [2015-05-06T14:58:13.809] gres_cnt 
found:TBD configured:2 avail:2 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc: [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] type_cnt_alloc[0]:0 [2015-05-06T14:58:13.809] type_cnt_avail[0]:2 [2015-05-06T14:58:13.809] type[0]:gtx [2015-05-06T14:58:13.809] gres/ram: state for smurf06 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:8 avail:8 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/gram: state for smurf06 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:1250 avail:1250 no_consume [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/scratch: state for smurf06 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:1800 avail:1800 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/gpu: state for smurf07 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:2 avail:2 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc: [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] type_cnt_alloc[0]:0 [2015-05-06T14:58:13.809] type_cnt_avail[0]:2 [2015-05-06T14:58:13.809] type[0]:gtx [2015-05-06T14:58:13.809] gres/ram: state for smurf07 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:16 avail:16 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/gram: state for smurf07 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:1250 avail:1250 no_consume [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] gres/scratch: state for smurf07 [2015-05-06T14:58:13.809] gres_cnt found:TBD configured:54 avail:54 alloc:0 [2015-05-06T14:58:13.809] gres_bit_alloc:NULL [2015-05-06T14:58:13.809] gres_used:(null) [2015-05-06T14:58:13.809] de[2015-05-06T14:58:13.809] 
Recovered state of 0 reservations [2015-05-06T14:58:13.809] State of 0 triggers recovered [2015-05-06T14:58:13.809] read_slurm_conf: backup_controller not specified. [2015-05-06T14:58:13.809] Running as primary controller [2015-05-06T14:58:13.809] Registering slurmctld at port 6817 with slurmdbd. [2015-05-06T14:58:13.982] debug: Priority MULTIFACTOR plugin loaded [2015-05-06T14:58:13.983] debug: power_save module disabled, SuspendTime < 0 [2015-05-06T14:58:17.005] debug: Spawning registration agent for smurf[01-07] 7 hosts [2015-05-06T14:58:17.009] gres/gpu: state for smurf06 [2015-05-06T14:58:17.009] gres_cnt found:2 configured:2 avail:2 alloc:0 [2015-05-06T14:58:17.009] gres_bit_alloc: [2015-05-06T14:58:17.009] gres_used:(null) [2015-05-06T14:58:17.009] topo_cpus_bitmap[0]:NULL [2015-05-06T14:58:17.009] topo_gres_bitmap[0]:0-1 [2015-05-06T14:58:17.009] topo_gres_cnt_alloc[0]:0 [2015-05-06T14:58:17.009] topo_gres_cnt_avail[0]:2 [2015-05-06T14:58:17.009] type[0]:gtx [2015-05-06T14:58:17.009] type_cnt_alloc[0]:0 [2015-05-06T14:58:17.009] type_cnt_avail[0]:2 [2015-05-06T14:58:17.009] type[0]:gtx [2015-05-06T14:58:17.009] gres/ram: state for smurf06 [2015-05-06T14:58:17.009] gres_cnt found:8 configured:8 avail:8 alloc:0 [2015-05-06T14:58:17.009] gres_bit_alloc:NULL [2015-05-06T14:58:17.009] gres_used:(null) [2015-05-06T14:58:17.009] gres/gram: state for smurf06 [2015-05-06T14:58:17.009] gres_cnt found:1250 configured:1250 avail:1250 no_consume [2015-05-06T14:58:17.009] gres_bit_alloc:NULL [2015-05-06T14:58:17.009] gres_used:(null) [2015-05-06T14:58:17.009] gres/scratch: state for smurf06 [2015-05-06T14:58:17.009] gres_cnt found:1800 configured:1800 avail:1800 alloc:0 [2015-05-06T14:58:17.009] gres_bit_alloc:NULL [2015-05-06T14:58:17.009] gres_used:(null) [2015-05-06T14:58:17.009] debug: validate_node_specs: node smurf06 registered with 0 jobs [2015-05-06T14:58:17.009] gres/gpu: state for smurf04 [2015-05-06T14:58:17.009] gres_cnt found:4 configured:4 avail:4 alloc:0 
[2015-05-06T14:58:17.009] gres_bit_alloc: [2015-05-06T14:58:17.009] gres_used:(null) [2015-05-06T14:58:17.009] topo_cpus_bitmap[0]:NULL [2015-05-06T14:58:17.009] topo_gres_bitmap[0]:0-3 [2015-05-06T14:58:17.009] topo_gres_cnt_alloc[0]:0 [2015-05-06T14:58:17.009] topo_gres_cnt_avail[0]:4 [2015-05-06T14:58:17.009] type[0]:gtx [2015-05-06T14:58:17.009] type_cnt_alloc[0]:0 [2015-05-06T14:58:17.009] type_cnt_avail[0]:4 [2015-05-06T14:58:17.009] type[0]:gtx [2015-05-06T14:58:17.009] gres/ram: state for smurf04 [2015-05-06T14:58:17.009] gres_cnt found:94 configured:94 avail:94 alloc:0 [2015-05-06T14:58:17.009] gres_bit_alloc:NULL [2015-05-06T14:58:17.009] gres_used:(null) [2015-05-06T14:58:17.009] gres/gram: state for smurf04 [2015-05-06T14:58:17.009] gres_cnt found:1500 configured:1500 avail:1500 no_consume [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] gres/scratch: state for smurf04 [2015-05-06T14:58:17.010] gres_cnt found:280 configured:280 avail:280 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] debug: validate_node_specs: node smurf04 registered with 0 jobs [2015-05-06T14:58:17.010] gres/gpu: state for smurf07 [2015-05-06T14:58:17.010] gres_cnt found:2 configured:2 avail:2 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc: [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] topo_cpus_bitmap[0]:NULL [2015-05-06T14:58:17.010] topo_gres_bitmap[0]:0-1 [2015-05-06T14:58:17.010] topo_gres_cnt_alloc[0]:0 [2015-05-06T14:58:17.010] topo_gres_cnt_avail[0]:2 [2015-05-06T14:58:17.010] type[0]:gtx [2015-05-06T14:58:17.010] type_cnt_alloc[0]:0 [2015-05-06T14:58:17.010] type_cnt_avail[0]:2 [2015-05-06T14:58:17.010] type[0]:gtx [2015-05-06T14:58:17.010] gres/ram: state for smurf07 [2015-05-06T14:58:17.010] gres_cnt found:16 configured:16 avail:16 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] 
gres_used:(null) [2015-05-06T14:58:17.010] gres/gram: state for smurf07 [2015-05-06T14:58:17.010] gres_cnt found:1250 configured:1250 avail:1250 no_consume [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] gres/scratch: state for smurf07 [2015-05-06T14:58:17.010] gres_cnt found:54 configured:54 avail:54 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] debug: validate_node_specs: node smurf07 registered with 0 jobs [2015-05-06T14:58:17.010] gres/gpu: state for smurf05 [2015-05-06T14:58:17.010] gres_cnt found:4 configured:4 avail:4 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc: [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] topo_cpus_bitmap[0]:NULL [2015-05-06T14:58:17.010] topo_gres_bitmap[0]:0-3 [2015-05-06T14:58:17.010] topo_gres_cnt_alloc[0]:0 [2015-05-06T14:58:17.010] topo_gres_cnt_avail[0]:4 [2015-05-06T14:58:17.010] type[0]:gtx [2015-05-06T14:58:17.010] type_cnt_alloc[0]:0 [2015-05-06T14:58:17.010] type_cnt_avail[0]:4 [2015-05-06T14:58:17.010] type[0]:gtx [2015-05-06T14:58:17.010] gres/ram: state for smurf05 [2015-05-06T14:58:17.010] gres_cnt found:256 configured:256 avail:256 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] gres/gram: state for smurf05 [2015-05-06T14:58:17.010] gres_cnt found:6000 configured:6000 avail:6000 no_consume [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] gres/scratch: state for smurf05 [2015-05-06T14:58:17.010] gres_cnt found:2400 configured:2400 avail:2400 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] debug: validate_node_specs: node smurf05 registered with 0 jobs [2015-05-06T14:58:17.010] gres/gpu: state for smurf03 [2015-05-06T14:58:17.010] gres_cnt 
found:3 configured:3 avail:3 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc: [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] topo_cpus_bitmap[0]:NULL [2015-05-06T14:58:17.010] topo_gres_bitmap[0]:0-2 [2015-05-06T14:58:17.010] topo_gres_cnt_alloc[0]:0 [2015-05-06T14:58:17.010] topo_gres_cnt_avail[0]:3 [2015-05-06T14:58:17.010] type[0]:gtx [2015-05-06T14:58:17.010] type_cnt_alloc[0]:0 [2015-05-06T14:58:17.010] type_cnt_avail[0]:3 [2015-05-06T14:58:17.010] type[0]:gtx [2015-05-06T14:58:17.010] gres/ram: state for smurf03 [2015-05-06T14:58:17.010] gres_cnt found:94 configured:94 avail:94 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] gres/gram: state for smurf03 [2015-05-06T14:58:17.010] gres_cnt found:1500 configured:1500 avail:1500 no_consume [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] gres/scratch: state for smurf03 [2015-05-06T14:58:17.010] gres_cnt found:280 configured:280 avail:280 alloc:0 [2015-05-06T14:58:17.010] gres_bit_alloc:NULL [2015-05-06T14:58:17.010] gres_used:(null) [2015-05-06T14:58:17.010] debug: validate_node_specs: node smurf03 registered with 0 jobs [2015-05-06T14:58:17.011] gres/gpu: state for smurf01 [2015-05-06T14:58:17.011] gres_cnt found:8 configured:8 avail:8 alloc:0 [2015-05-06T14:58:17.011] gres_bit_alloc: [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] topo_cpus_bitmap[0]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[0]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[0]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[0]:1 [2015-05-06T14:58:17.011] type[0]:tesla [2015-05-06T14:58:17.011] topo_cpus_bitmap[1]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[1]:1 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[1]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[1]:1 [2015-05-06T14:58:17.011] type[1]:tesla [2015-05-06T14:58:17.011] topo_cpus_bitmap[2]:NULL 
[2015-05-06T14:58:17.011] topo_gres_bitmap[2]:2 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[2]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[2]:1 [2015-05-06T14:58:17.011] type[2]:tesla [2015-05-06T14:58:17.011] topo_cpus_bitmap[3]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[3]:3 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[3]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[3]:1 [2015-05-06T14:58:17.011] type[3]:tesla [2015-05-06T14:58:17.011] topo_cpus_bitmap[4]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[4]:4 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[4]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[4]:1 [2015-05-06T14:58:17.011] type[4]:tesla [2015-05-06T14:58:17.011] topo_cpus_bitmap[5]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[5]:5 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[5]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[5]:1 [2015-05-06T14:58:17.011] type[5]:tesla [2015-05-06T14:58:17.011] topo_cpus_bitmap[6]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[6]:6 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[6]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[6]:1 [2015-05-06T14:58:17.011] type[6]:tesla [2015-05-06T14:58:17.011] topo_cpus_bitmap[7]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[7]:7 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[7]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[7]:1 [2015-05-06T14:58:17.011] type[7]:tesla [2015-05-06T14:58:17.011] type_cnt_alloc[0]:0 [2015-05-06T14:58:17.011] type_cnt_avail[0]:8 [2015-05-06T14:58:17.011] type[0]:tesla [2015-05-06T14:58:17.011] gres/ram: state for smurf01 [2015-05-06T14:58:17.011] gres_cnt found:48 configured:48 avail:48 alloc:0 [2015-05-06T14:58:17.011] gres_bit_alloc:NULL [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] gres/gram: state for smurf01 [2015-05-06T14:58:17.011] gres_cnt found:6000 configured:6000 avail:6000 no_consume [2015-05-06T14:58:17.011] gres_bit_alloc:NULL [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] 
gres/scratch: state for smurf01 [2015-05-06T14:58:17.011] gres_cnt found:1300 configured:1300 avail:1300 alloc:0 [2015-05-06T14:58:17.011] gres_bit_alloc:NULL [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] debug: validate_node_specs: node smurf01 registered with 0 jobs [2015-05-06T14:58:17.011] gres/gpu: state for smurf02 [2015-05-06T14:58:17.011] gres_cnt found:8 configured:8 avail:8 alloc:0 [2015-05-06T14:58:17.011] gres_bit_alloc: [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] topo_cpus_bitmap[0]:NULL [2015-05-06T14:58:17.011] topo_gres_bitmap[0]:0-7 [2015-05-06T14:58:17.011] topo_gres_cnt_alloc[0]:0 [2015-05-06T14:58:17.011] topo_gres_cnt_avail[0]:8 [2015-05-06T14:58:17.011] type[0]:tesla [2015-05-06T14:58:17.011] type_cnt_alloc[0]:0 [2015-05-06T14:58:17.011] type_cnt_avail[0]:8 [2015-05-06T14:58:17.011] type[0]:tesla [2015-05-06T14:58:17.011] gres/ram: state for smurf02 [2015-05-06T14:58:17.011] gres_cnt found:48 configured:48 avail:48 alloc:0 [2015-05-06T14:58:17.011] gres_bit_alloc:NULL [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] gres/gram: state for smurf02 [2015-05-06T14:58:17.011] gres_cnt found:6000 configured:6000 avail:6000 no_consume [2015-05-06T14:58:17.011] gres_bit_alloc:NULL [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] gres/scratch: state for smurf02 [2015-05-06T14:58:17.011] gres_cnt found:1300 configured:1300 avail:1300 alloc:0 [2015-05-06T14:58:17.011] gres_bit_alloc:NULL [2015-05-06T14:58:17.011] gres_used:(null) [2015-05-06T14:58:17.011] debug: validate_node_specs: node smurf02 registered with 0 jobs [2015-05-06T14:58:18.013] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0 [2015-05-06T14:58:18.013] debug: sched: Running job scheduler
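
Since the controller log above repeats the gres state once at startup (found:TBD) and once after each node registers, a small Python sketch like the following can reduce it to the latest state per node and resource. This is only an illustration under the assumption that the log lines keep the "gres/NAME: state for NODE" / "gres_cnt found:... configured:... avail:..." layout seen in this 14.11.5 log; `latest_gres_state`, `mismatches` and `SAMPLE_LOG` are hypothetical names.

```python
import re

# Matches a "state for" entry followed (optionally after a timestamp)
# by its gres_cnt entry; whitespace may be spaces or newlines.
STATE_RE = re.compile(
    r"gres/(\w+): state for (\S+)\s+"
    r"(?:\[[^\]]+\]\s+)?"
    r"gres_cnt\s+found:(\S+) configured:(\d+) avail:(\d+)"
)


def latest_gres_state(log_text):
    """Return {(node, gres): (found, configured, avail)}; later entries win."""
    state = {}
    for gres, node, found, configured, avail in STATE_RE.findall(log_text):
        state[(node, gres)] = (found, int(configured), int(avail))
    return state


def mismatches(state):
    """List the (node, gres) keys whose found count differs from configured."""
    return sorted(
        key for key, (found, configured, _avail) in state.items()
        if found != str(configured)
    )


# Shortened excerpt in the style of the slurmctld log above.
SAMPLE_LOG = (
    "[2015-05-06T14:58:13.808] gres/gpu: state for smurf01 "
    "[2015-05-06T14:58:13.808] gres_cnt found:TBD configured:8 avail:8 alloc:0 "
    "[2015-05-06T14:58:17.011] gres/gpu: state for smurf01 "
    "[2015-05-06T14:58:17.011] gres_cnt found:8 configured:8 avail:8 alloc:0 "
    "[2015-05-06T14:58:17.011] gres/gram: state for smurf01 "
    "[2015-05-06T14:58:17.011] gres_cnt found:6000 configured:6000 avail:6000 no_consume"
)

if __name__ == "__main__":
    state = latest_gres_state(SAMPLE_LOG)
    print(state)
    print("mismatches:", mismatches(state))
```

An empty mismatch list only says that the counts agree; it does not explain why the scheduler still rejects the job.
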
Name=gpu Type=tesla File=/dev/nvidia0
Name=gpu Type=tesla File=/dev/nvidia1
Name=gpu Type=tesla File=/dev/nvidia2
Name=gpu Type=tesla File=/dev/nvidia3
Name=gpu Type=tesla File=/dev/nvidia4
Name=gpu Type=tesla File=/dev/nvidia5
Name=gpu Type=tesla File=/dev/nvidia6
Name=gpu Type=tesla File=/dev/nvidia7
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

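
The gres.conf above lists one device file per line, while other layouts use a bracket range such as File=/dev/nvidia[0-7]. To check that either layout adds up to the count given in slurm.conf, a rough Python sketch could total the entries per name and type. `expand_files` and `gres_totals` are hypothetical helpers, and only the simple PREFIX[LO-HI] range form is handled here (comma lists inside the brackets are not).

```python
import re
from collections import Counter

RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\]$")


def expand_files(spec):
    """Expand '/dev/nvidia[0-7]' into eight paths; other specs pass through."""
    match = RANGE_RE.match(spec)
    if not match:
        return [spec]
    prefix, low, high = match.group(1), int(match.group(2)), int(match.group(3))
    return [f"{prefix}{i}" for i in range(low, high + 1)]


def gres_totals(conf_text):
    """Return Counter {(name, type_or_None): total count} for gres.conf text."""
    totals = Counter()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = {}
        for tok in line.split():
            if "=" in tok:
                key, value = tok.split("=", 1)
                fields[key.lower()] = value
        if "name" not in fields:
            continue
        if "file" in fields:
            # Assumption for this sketch: one unit per device file.
            count = len(expand_files(fields["file"]))
        else:
            count = int(fields.get("count", 1))
        totals[(fields["name"], fields.get("type"))] += count
    return totals


PER_LINE = "\n".join(
    f"Name=gpu Type=tesla File=/dev/nvidia{i}" for i in range(8)
) + "\nName=ram Count=48\nName=gram Count=6000\nName=scratch Count=1300"

RANGED = "Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]\nName=ram Count=48"

if __name__ == "__main__":
    print(gres_totals(PER_LINE))
    print(gres_totals(RANGED))
```

Both layouts come out as eight tesla GPUs, which matches Gres=gpu:tesla:8 for smurf01 in slurm.conf.
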
[2015-05-06T14:52:52.502] debug: init: Gres GPU plugin loaded
[2015-05-06T14:52:52.502] debug: gres: Couldn't find the specified plugin name for gres/ram looking at all files
[2015-05-06T14:52:52.503] debug: Cannot find plugin of type gres/ram, just track gres counts
[2015-05-06T14:52:52.503] debug: gres: Couldn't find the specified plugin name for gres/gram looking at all files
[2015-05-06T14:52:52.503] debug: Cannot find plugin of type gres/gram, just track gres counts
[2015-05-06T14:52:52.503] debug: gres: Couldn't find the specified plugin name for gres/scratch looking at all files
[2015-05-06T14:52:52.503] debug: Cannot find plugin of type gres/scratch, just track gres counts
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia0
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia1
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia2
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia3
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia4
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia5
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia6
[2015-05-06T14:52:52.504] Gres Name=gpu Type=tesla Count=1 ID=7696487 File=/dev/nvidia7
[2015-05-06T14:52:52.504] Gres Name=ram Type=(null) Count=48 ID=7168370
[2015-05-06T14:52:52.504] Gres Name=gram Type=(null) Count=6000 ID=1835102823
[2015-05-06T14:52:52.504] Gres Name=scratch Type=(null) Count=1300 ID=1641727719
[2015-05-06T14:52:52.504] gpu 0 is device number 0
[2015-05-06T14:52:52.504] gpu 1 is device number 1
[2015-05-06T14:52:52.504] gpu 2 is device number 2
[2015-05-06T14:52:52.504] gpu 3 is device number 3
[2015-05-06T14:52:52.504] gpu 4 is device number 4
[2015-05-06T14:52:52.504] gpu 5 is device number 5
[2015-05-06T14:52:52.504] gpu 6 is device number 6
[2015-05-06T14:52:52.504] gpu 7 is device number 7
[2015-05-06T14:52:52.504] topology NONE plugin loaded
[2015-05-06T14:52:52.504] route default plugin loaded
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:0 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:1 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:2 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:3 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:4 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:5 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:6 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:7 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:8 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:9 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:10 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.505] debug: cpu_freq_init: CPU:11 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:12 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:13 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:14 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:15 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:16 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:17 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:18 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:19 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:20 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:21 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:22 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] debug: cpu_freq_init: CPU:23 reset_freq:1600000 avail_gov:1f orig_governor:conservative
[2015-05-06T14:52:52.506] No specialized cores configured by default on this node
[2015-05-06T14:52:52.506] Resource spec: system memory limit not configured for this node
[2015-05-06T14:52:52.507] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2015-05-06T14:52:52.507] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2015-05-06T14:52:52.508] debug: task/cgroup: now constraining jobs allocated cores
[2015-05-06T14:52:52.508] debug: task/cgroup: loaded
[2015-05-06T14:52:52.508] debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-05-06T14:52:52.508] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
[2015-05-06T14:52:52.508] Munge cryptographic signature plugin loaded
[2015-05-06T14:52:52.509] Warning: Core limit is only 0 KB
[2015-05-06T14:52:52.509] slurmd version 14.11.5 started
[2015-05-06T14:52:52.509] debug: Job accounting gather LINUX plugin loaded
[2015-05-06T14:52:52.509] debug: job_container none plugin loaded
[2015-05-06T14:52:52.510] debug: switch NONE plugin loaded
[2015-05-06T14:52:52.510] slurmd started on Wed, 06 May 2015 14:52:52 +0200
[2015-05-06T14:52:52.512] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=48128 TmpDisk=51175 Uptime=9544 CPUSpecList=(null)
[2015-05-06T14:52:52.512] debug: AcctGatherEnergy NONE plugin loaded
[2015-05-06T14:52:52.512] debug: AcctGatherProfile NONE plugin loaded
[2015-05-06T14:52:52.513] debug: AcctGatherInfiniband NONE plugin loaded
[2015-05-06T14:52:52.513] debug: AcctGatherFilesystem NONE plugin loaded
