Hello, I am currently trying to set up SLURM on our 4-frame, 8-midplane BlueGene/Q. I am quite new to BlueGene, and although I have an existing background in HPC sysadmin, I'm finding the configuration of SLURM on BGQ a bit confusing.
Currently I have 8 blocks already set up in the MMCS and would like SLURM to manage each of these 8 separate resources. The reason is that we wish to map any application failures to a specific midplane for hardware fault-finding, as we are still in the process of cabling and configuring the racks. I have read through the BlueGene-specific docs but I am at a loss as to how to set up my bluegene.conf and slurm.conf files to achieve a single 'queue' for each of the 8 midplanes. Running smap -Dc doesn't seem to generate a bluegene.conf file, so I have run slurmctld in the foreground to see if the verbose output would help:

[jsweet@bgqsn ~]$ /opt/slurm/2.3.3/sbin/slurmctld -D -f /opt/slurm/2.3.3/etc/slurm.conf -vvvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
slurmctld: slurmctld version 2.3.3 started on cluster bgq
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/select_bluegene.so
slurmctld: BlueGene node selection plugin loading...
slurmctld: debug: Setting dimensions from slurm.conf file
slurmctld: Attempting to contact MMCS
slurmctld: BlueGene configured with 2122 midplanes
slurmctld: debug: We are using 1112 of the system.
slurmctld: BlueGene plugin loaded successfully
slurmctld: BlueGene node selection plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No backup controller to shutdown
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/switch_none.so
slurmctld: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Prefix is bgq bgq[0000x0001] 4
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/topology_none.so
slurmctld: topology NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No DownNodes
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/jobcomp_none.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /opt/slurm/2.3.3/lib/slurm/sched_backfill.so
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug3: Success.
slurmctld: error: read_slurm_conf: default partition not set.
slurmctld: error: Could not open node state file /tmp/node_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
slurmctld: No node state file (/tmp/node_state.old) to recover
slurmctld: error: Incomplete node data checkpoint file
slurmctld: Recovered state of 0 nodes
slurmctld: error: Could not open front_end state file /tmp/front_end_state: No such file or directory
slurmctld: error: NOTE: Trying backup front_end_state save file. Information may be lost!
slurmctld: No node state file (/tmp/front_end_state.old) to recover
slurmctld: error: Incomplete front_end node data checkpoint file
slurmctld: Recovered state of 0 front_end nodes
slurmctld: error: Could not open job state file /tmp/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) to recover
slurmctld: error: hostlist.c:1727 Invalid range: `000x001': Invalid argument
slurmctld: hostlist.c:3069: hostlist_ranged_string_dims: Assertion `hl != ((void *)0)' failed.

**bluegene.conf**
MloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
LayoutMode=STATIC
BasePartitionNodeCnt=512
NodeCardNodeCnt=32
Numpsets=8 # used for IO-poor systems (can't create 32 c-node blocks)
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=0
BPs=[000x001] Type=TORUS # 1x1x1 = 4-32 c-node blocks 3-128 c-node blocks

**slurm.conf**
ClusterName=bgq
ControlMachine=bgqsn
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/tmp/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=0
Prolog=/opt/slurm/2.3.3/etc/bg_prolog
Epilog=/opt/slurm/2.3.3/etc/bg_epilog
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill
SelectType=select/bluegene
FastSchedule=1
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none
FrontEndName=bgqsn State=UNKNOWN
NodeName=bgq[0000x0001] CPUS=9216 State=UNKNOWN
PartitionName=R00-M0

What changes do I need to make to my bluegene.conf and slurm.conf files to get the 8-'queue' setup that I'm looking for?

Thanks for your help,

James

--
James Sweet
ACF Systems Administrator
EPCC, School of Physics, The University of Edinburgh,
James Clerk Maxwell Building, Mayfield Road, Edinburgh. EH9 3JZ.
Tel: 0131 445 7831

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
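P.S. For what it's worth, this is roughly the shape I imagined the two files taking, based on my reading of the docs. Everything below is a guess on my part and is not working: I am unsure whether the per-midplane block lines should still use the old BPs= keyword or the MPs= keyword on BGQ, whether the coordinates need four digits, and whether Nodes= in a partition line can name a single midplane like this.

```
# bluegene.conf (sketch) -- one static block per midplane; coordinates guessed
LayoutMode=STATIC
MPs=[0000x0000] Type=TORUS
MPs=[0001x0001] Type=TORUS
# ... and so on for the remaining six midplanes

# slurm.conf (sketch) -- one partition per midplane; names guessed
PartitionName=R00-M0 Nodes=bgq0000 Default=YES State=UP
PartitionName=R00-M1 Nodes=bgq0001 State=UP
# ... and so on for the remaining six midplanes
```

If someone can confirm or correct the keywords and coordinate format, that alone would probably unblock me.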
