[slurm-dev] Re: Setting up a test cluster

hgieselman Tue, 17 Jul 2012 07:23:05 -0700

First, I want to thank Mark and Moe for helping me get a test cluster set up. I 
got so focused on getting this running that I failed to show my appreciation in 
a timely manner.


I took 5 nodes from my production cluster and configured them as a separate 
installation of SLURM 2.3.3. I was not successful in getting the test cluster 
to use the same MySQL database, so I built a new one. Everything seems to be 
communicating now, but I have a strange problem. No matter which user I submit 
jobs as, I get the following error on the screen:

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

The Slurmctld.log shows this:

[2012-07-17T09:30:10] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=20053
[2012-07-17T09:30:10] debug3: JobDesc: user_id=20053 job_id=-1 partition=test1 name=sleep
[2012-07-17T09:30:10] debug3:    cpus=1-4294967294 pn_min_cpus=-1
[2012-07-17T09:30:10] debug3:    -N min-[max]: 1-[4294967294]:65534:65534:65534
[2012-07-17T09:30:10] debug3:    pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2012-07-17T09:30:10] debug3:    immediate=0 features=(null) reservation=(null)
[2012-07-17T09:30:10] debug3:    req_nodes=(null) exc_nodes=(null) gres=(null)
[2012-07-17T09:30:10] debug3:    time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2012-07-17T09:30:10] debug3:    kill_on_node_fail=-1 script=(null)
[2012-07-17T09:30:10] debug3:    argv="/bin/sleep"
[2012-07-17T09:30:10] debug3:    stdin=(null) stdout=(null) stderr=(null)
[2012-07-17T09:30:10] debug3:    work_dir=/home/hgieselman alloc_node:sid=myvtlin20:28611
[2012-07-17T09:30:10] debug3:    resp_host=10.88.226.20 alloc_resp_port=47269 other_port=36729
[2012-07-17T09:30:10] debug3:    dependency=(null) account=(null) qos=(null) comment=(null)
[2012-07-17T09:30:10] debug3:    mail_type=0 mail_user=(null) nice=55534 num_tasks=4294967294 open_mode=0 overcommit=-1 acctg_freq=-1
[2012-07-17T09:30:10] debug3:    network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2012-07-17T09:30:10] debug3:    end_time=Unknown signal=0@0 wait_all_nodes=-1
[2012-07-17T09:30:10] debug3:    ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2012-07-17T09:30:10] debug3:    cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
[2012-07-17T09:30:10] error: User 20053 not found
[2012-07-17T09:30:10] _job_create: invalid account or partition for user 20053, account '(null)', and partition 'test1'
[2012-07-17T09:30:10] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified

The issue seems to be that the jobs get submitted with the uid instead of the 
username as the user. We use NIS and all of my testing of NIS comes out clean. 
As I mentioned earlier, these were production compute nodes and did not have 
this problem then. Any ideas on where to look for the root cause?
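
For reference, a minimal sketch of the kind of name-service check described above, assuming Python is available on the controller. The helper name resolve_uid is hypothetical. It asks the system name service (the path NIS normally feeds through) whether the uid from the log maps to a username.

```python
import pwd

def resolve_uid(uid, lookup=pwd.getpwuid):
    """Return the username for uid, or None if the name service has no entry.

    `lookup` defaults to the system password database (which includes NIS
    when the host is configured to consult it); it is a parameter only so
    the function can be exercised without a real account.
    """
    try:
        return lookup(uid).pw_name
    except KeyError:
        # pwd.getpwuid raises KeyError when no entry exists for the uid
        return None

if __name__ == "__main__":
    uid = 20053  # uid taken from the slurmctld log above
    name = resolve_uid(uid)
    print(f"uid {uid} -> {name if name else 'NOT FOUND'}")
```

If the uid resolves on the controller, the failing lookup in slurmctld may be against a different source than NIS, such as the newly built accounting database. That is an assumption to verify, not a confirmed diagnosis.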

Thanks,

Howard Gieselman


-----Original Message-----
From: Mark A. Grondona [mailto:[email protected]] 
Sent: Thursday, July 12, 2012 4:47 PM
To: slurm-dev
Subject: [slurm-dev] Re: Setting up a test cluster

[email protected] writes:

> How do I get another slurmctld running for the test cluster?

You just need to point the slurmctld at a different slurm.conf
that uses a different port for slurmctld and slurmd. Then point
slurmd and all slurm commands to this new slurm.conf. If you
are running out of a build directory or alternate path, you
may also need to update PluginDir or other directories (perhaps
epilog and prolog location as well, you decide for your testing)

I used to have a script that would "boot" a test slurm instance
as a SLURM job, but that script doesn't work anymore with
recent versions of SLURM (however, there really wasn't much
to it)
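
A rough sketch of the first step above: deriving a test slurm.conf from the production one with different ports. The helper name retarget_conf, the port numbers 7817/7818, the cluster name, and the /tmp/slurm-test path are all hypothetical; SlurmctldPort, SlurmdPort, ClusterName and StateSaveLocation are standard slurm.conf parameters. It only handles simple one-key-per-line settings.

```python
import re

def retarget_conf(conf_text, overrides):
    """Return conf_text with each Key=Value line named in overrides replaced.

    Keys are matched case-insensitively; keys not present in conf_text are
    appended at the end. Comment lines (starting with '#') are left alone.
    """
    remaining = dict(overrides)
    out = []
    for line in conf_text.splitlines():
        m = re.match(r"\s*([A-Za-z_]+)\s*=", line)
        key = None
        if m:
            key = next((k for k in remaining if k.lower() == m.group(1).lower()), None)
        if key is not None:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"

# Hypothetical production fragment and test overrides
prod = "ClusterName=prod\nSlurmctldPort=6817\nSlurmdPort=6818\n"
test_conf = retarget_conf(prod, {
    "ClusterName": "test",
    "SlurmctldPort": 7817,
    "SlurmdPort": 7818,
    "StateSaveLocation": "/tmp/slurm-test/state",
})
print(test_conf)
```

The daemons can then be started against the new file with the -f option (slurmctld -f and slurmd -f), and the user commands pointed at it through the SLURM_CONF environment variable.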



mark

> Howard Gieselman
>
> -----Original Message-----
> From: Mark A. Grondona [mailto:[email protected]]
> Sent: Thursday, July 12, 2012 12:20 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: Setting up a test cluster
>
> Moe Jette <[email protected]> writes:
>
>> Quoting [email protected]:
>>
>>> I have SLURM 2.3.3 installed and running in production but have found
>>> that I need to do some more testing and tweaking before I can migrate
>>> all of our LSF jobs to SLURM. I would like to install a test cluster but
>>> am unsure about the following:
>>>
>>> 1. Will I need to install the test cluster on a separate
>>> controller or is it enough that I just install it to a different path
>>> and use different port numbers?
>>
>> Different paths and ports are sufficient.
>>
>>> 2. Can I use existing nodes from my production cluster in the test
>>> cluster?
>>
>> Yes.
>>
>>> 3. Are there any other things to look out for in running a
>>> parallel cluster?
>>
>> On pretty much all system types you will just over-subscribe resources.
>
> Also, be sure to audit your epilog and prolog scripts to make sure
> they won't cause harm when running in parallel with another SLURM
> instance. Be careful if you are using SLURM cgroups support, as cgroups
> created for each parallel instance of SLURM will exist in the same
> namespace, and that may cause unexpected issues.
>
> mark
>
>>> Thanks,
>>>
>>> Howard Gieselman
