So, I built a multi-node, single-head Warewulf cluster and installed
SLURM, minus the accounting bits, and after a bit of learning, I got it up
and running just fine and dandy.  Jobs submitted fine, MPI ran fine, etc.

However, when I tore SLURM out and recompiled it to be a bit more
production-ready (integrating with MySQL for accounting and BLCR for
checkpoint/restart), things went awry.

According to scontrol, all my nodes show up and register.  I added the
cluster to the accounting database, added a test account, and added a user
to that account:

  sacctmgr add cluster proto
  sacctmgr add account test Description="test" Organization="none"
  sacctmgr add user protoadmin DefaultAccount=test Cluster=proto \
      Partition=protonodes

Then I attempted to submit a job (salloc -N8 sh), and it failed with:

  salloc: error: Job submit/allocate failed: Invalid account or
  account/partition combination specified

Manually specifying the account (salloc -N8 --account=test sh) gave the
same error.  The slurmdbd log shows only a long string of
"DBD_CLUSTER_CPUS: cluster not registered" messages, which didn't seem
useful, so I checked the slurmctld.log file.  It had these error messages:

  error: User 500 not found
  _job_create: invalid account or partition for user 500, account '(null)', and partition 'protonodes'
  _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified.

The same messages repeat with account '(test)' in place of account
'(null)' for the attempt where I specified the account.

Which I found odd--userid 500 DOES belong to protoadmin, and the
associations for it clearly show that it is a member of the test account
and associated with the protonodes partition.  Weird.
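
For reference, here is a minimal sketch of how the uid-to-name mapping and
the accounting records can be cross-checked on the head node.  The helper
name uid_to_name is hypothetical (it is not part of SLURM), and the
sacctmgr queries are skipped on hosts where the command is not installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of SLURM): print the user name that a
# passwd-format file maps to a numeric uid. Defaults to /etc/passwd.
uid_to_name() {
    uid=$1
    passwd_file=${2:-/etc/passwd}
    awk -F: -v uid="$uid" '$3 == uid { print $1; exit }' "$passwd_file"
}

# Only query the accounting database where sacctmgr is available.
if command -v sacctmgr >/dev/null 2>&1; then
    # Clusters known to slurmdbd; "proto" should be listed here.
    sacctmgr -n -p show cluster format=cluster
    # Associations for the user, to compare with the slurmctld errors.
    sacctmgr -n -p show associations user=protoadmin \
        format=cluster,account,user,partition
fi
```

Running "uid_to_name 500" on the head node and on a compute node should
print protoadmin in both places if the provisioned passwd file is what I
think it is.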

I tried the same thing but with root, and got the exact same type of error
message (s/500/0/g), minus the "error: User xxx not found" bit.

I've been poking around at the config files, and as far as I can tell,
reading the documentation, nothing seems inconsistent.  Does anyone have
any ideas what might be holding this up?  It seems to me like the database
for some reason can't associate the user id with the user name it stores in
the database...but isn't it supposed to only care what the user name is?

Also, /etc/passwd, /etc/group, /etc/shadow, and similar files are managed
with Warewulf's file provisioning, so they are identical across the cluster.
The same goes for the slurm.conf file.

Some other tidbits:
Packages installed on head node:
yum list installed | grep slurm
slurm.x86_64 14.11.3-1.el6 (won't retype version unless it changes, which
it doesn't)
slurm-blcr
slurm-devel
slurm-munge
slurm-pam_slurm
slurm-perlapi
slurm-plugins
slurm-sjobexit
slurm-sjstat
slurm-slurmdb-direct
slurm-slurmdbd
slurm-sql
slurm-torque (I actually need to remove this as I won't be using torque)

On the nodes:
yum --installroot=/var/chroots/nodecent65/ list installed | grep slurm
slurm
slurm-munge
slurm-plugins

slurm.conf (relevant lines):
SlurmUser=slurm
usepam=no
AccountingStorageEnforce=associations
AccountingStorageHost=protohead
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountStorageUser=slurmdbd
ClusterName=proto
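
A minimal sketch for sanity-checking the key names in an excerpt like the
one above.  The function name unknown_keys is hypothetical, and the
expected list is my own short selection of slurm.conf parameter names, not
an exhaustive one.  Names are compared case-insensitively.

```shell
#!/bin/sh
# Hypothetical helper: print keys from a slurm.conf-style file that are not
# in a space-separated list of expected parameter names.
unknown_keys() {
    conf_file=$1
    expected=$2
    awk -F= -v expected="$expected" '
        BEGIN {
            # Build a lookup table of lower-cased expected names.
            n = split(tolower(expected), names, " ")
            for (i = 1; i <= n; i++) known[names[i]] = 1
        }
        /^[ \t]*(#|$)/ { next }    # skip comments and blank lines
        {
            key = $1
            gsub(/[ \t]/, "", key)
            if (!(tolower(key) in known)) print key
        }
    ' "$conf_file"
}

# Assumed short list covering only the lines quoted in this message.
EXPECTED="SlurmUser UsePAM AccountingStorageEnforce AccountingStorageHost \
AccountingStoragePort AccountingStorageType AccountingStorageUser ClusterName"

# Example call, skipped when the file is absent:
if [ -r /etc/slurm/slurm.conf ]; then
    unknown_keys /etc/slurm/slurm.conf "$EXPECTED"
fi
```

Against a full slurm.conf this prints every parameter missing from the
short list, so any name it prints is only a prompt to check the spelling
against the slurm.conf man page.  The /etc/slurm/slurm.conf path is also an
assumption and may differ per install.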

slurmdbd.conf:
DbdAddr=protohead
DbdHost=protohead
DbdPort=6819
SlurmUser=slurm
DebugLevel=5
StorageType=accounting_storage/MySQL
StorageHost=protohead
StoragePort=3306
StoragePass=(password)
StorageUser=slurmdbd
StorageLoc=slurm_acct_db

If someone could help me figure this out, it'd be great!  Been beating my
head against the wall the last day and a half with this.

Thanks!
-Dave
