So, I built a multi-node, single-head Warewulf cluster and installed SLURM, minus the accounting bits, and after a bit of learning, I got it up and running just fine and dandy. Jobs submitted fine, MPI ran fine, etc.
However, when I tore SLURM out and recompiled it to be a bit more production-ready (integrating with MySQL for accounting and BLCR for checkpoint/restart), things went awry. According to scontrol, all my nodes show up and register. I added the cluster to the accounting database, added a test account, and added a user to that account:

    sacctmgr add cluster proto
    sacctmgr add account test Description="test" Organization="none"
    sacctmgr add user protoadmin DefaultAccount=test Cluster=proto Partition=protonodes

Then I attempted to submit a job:

    salloc -N8 sh

It failed. The error message:

    salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified

So then I tried manually specifying the account:

    salloc -N8 --account=test sh

Same error. So I decided to check the slurmdbd log, and all I see is a long string of "DBD_CLUSTER_CPUS: cluster not registered" messages. That didn't seem useful, so I decided to check the slurmctld.log file. This provided some lovely error messages:

    error: User 500 not found
    _job_create: invalid account or partition for user 500, account '(null)', and partition 'protonodes'
    _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified

(These were then repeated, except with account '(test)', for when I tried specifying the account.)

Which I found odd: userid 500 DOES belong to protoadmin, and the associations for it clearly show that it is a member of the test account and associated with the protonodes partition. Weird. I tried the same thing but with root, and got the exact same type of error message (s/500/0/g), minus the "error: User xxx not found" bit.

I've been poking around at the config files, and as far as I can tell from reading the documentation, nothing seems inconsistent. Does anyone have any ideas what might be holding this up?
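One way to double-check the "User 500 not found" line is to confirm that the UID resolves through the name service on the machine where the check is run (here that would be the head node), and to list what the accounting database actually holds. The following is only a minimal sketch: `uid_from_log` is a hypothetical helper name, the sample line is copied from the slurmctld.log excerpt above, and the sacctmgr call is skipped when the command is not installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of SLURM): pull the numeric UID out of a
# slurmctld "User N not found" line, in the format quoted above.
uid_from_log() {
    printf '%s\n' "$1" | sed -n 's/.*User \([0-9][0-9]*\) not found.*/\1/p'
}

# Sample line copied from the log excerpt in this post.
line='error: User 500 not found'
uid=$(uid_from_log "$line")
echo "uid from log: $uid"

# Ask the local name service (passwd files, LDAP, ...) who owns that UID.
# If this prints nothing, the OS on this host cannot map the UID to a name.
getent passwd "$uid" || echo "uid $uid does not resolve on $(uname -n)"

# Show the associations stored in the accounting database, if sacctmgr is
# available on this host.
if command -v sacctmgr >/dev/null 2>&1; then
    sacctmgr show associations format=cluster,account,user,partition
fi
```

Running this on the head node and on a compute node and comparing the getent output would show whether the two hosts see the same user.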
It seems to me like the database for some reason can't associate the user id with the user name it stores in the database... but isn't it supposed to only care what the user name is? Also, I have the /etc/passwd, /etc/group, /etc/shadow, etc. files managed with Warewulf's file provisioning, so they are identical across the cluster. Same with the slurm.conf file.

Some other tidbits:

Packages installed on the head node (yum list installed | grep slurm):

    slurm.x86_64 14.11.3-1.el6 (won't retype the version unless it changes, which it doesn't)
    slurm-blcr
    slurm-devel
    slurm-munge
    slurm-pam_slurm
    slurm-perlapi
    slurm-plugins
    slurm-sjobexit
    slurm-sjstat
    slurm-slurmdb-direct
    slurm-slurmdbd
    slurm-sql
    slurm-torque (I actually need to remove this as I won't be using torque)

On the nodes (yum --installroot=/var/chroots/nodecent65/ list installed | grep slurm):

    slurm
    slurm-munge
    slurm-plugins

Slurm.conf:

    SlurmUser=slurm
    usepam=no
    AccountingStorageEnforce=associations
    AccountingStorageHost=protohead
    AccountingStoragePort=6819
    AccountingStorageType=accounting_storage/slurmdbd
    AccountStorageUser=slurmdbd
    ClusterName=proto

Slurmdbd.conf:

    DbdAddr=protohead
    DbdHost=protohead
    DbdPort=6819
    SlurmUser=slurm
    DebugLevel=5
    StorageType=accounting_storage/MySQL
    StorageHost=protohead
    StoragePort=3306
    StoragePass=(password)
    StorageUser=slurmdbd
    StorageLoc=slurm_acct_db

If someone could help me figure this out, it'd be great! I've been beating my head against the wall for the last day and a half with this.

Thanks!
-Dave
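The "nothing seems inconsistent" check on the two config files can also be done mechanically. This is a rough sketch under a few assumptions: `conf_get` is a hypothetical helper, it only handles plain `Key=value` lines with no spaces or comments, and the sample files are temporary copies of a few of the values quoted above rather than the live files. It compares the host and port that slurmctld is told to use for accounting against the ones slurmdbd listens on.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a Key=value file.
# Keys are matched case-insensitively; only simple "Key=value" lines
# are handled.
conf_get() {
    awk -F= -v k="$1" 'tolower($1) == tolower(k) { print $2; exit }' "$2"
}

# Temporary sample files holding values copied from this post.
tmp=$(mktemp -d)
cat > "$tmp/slurm.conf" <<'EOF'
AccountingStorageHost=protohead
AccountingStoragePort=6819
ClusterName=proto
EOF
cat > "$tmp/slurmdbd.conf" <<'EOF'
DbdHost=protohead
DbdPort=6819
EOF

ctl_host=$(conf_get AccountingStorageHost "$tmp/slurm.conf")
ctl_port=$(conf_get AccountingStoragePort "$tmp/slurm.conf")
dbd_host=$(conf_get DbdHost "$tmp/slurmdbd.conf")
dbd_port=$(conf_get DbdPort "$tmp/slurmdbd.conf")

# The controller should point at the same host:port the daemon uses.
if [ "$ctl_host" = "$dbd_host" ] && [ "$ctl_port" = "$dbd_port" ]; then
    echo "slurmctld and slurmdbd agree on $dbd_host:$dbd_port"
else
    echo "mismatch: slurm.conf has $ctl_host:$ctl_port, slurmdbd.conf has $dbd_host:$dbd_port"
fi

rm -rf "$tmp"
```

Pointing `conf_get` at the real files instead of the samples gives the same comparison for the running setup.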
