First, I want to thank Mark and Moe for helping me get a test cluster set up. I got so focused on getting this running that I failed to show my appreciation in a timely manner.
I took 5 nodes from my production cluster and configured them as a separate installation of SLURM 2.3.3. I was not successful in getting the test cluster to use the same MySQL database, so I built a new one. Everything seems to be communicating now, but I have a strange problem. No matter which user I submit jobs from, I get the following error on the screen:

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

The Slurmctld.log shows this:

[2012-07-17T09:30:10] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=20053
[2012-07-17T09:30:10] debug3: JobDesc: user_id=20053 job_id=-1 partition=test1 name=sleep
[2012-07-17T09:30:10] debug3: cpus=1-4294967294 pn_min_cpus=-1
[2012-07-17T09:30:10] debug3: -N min-[max]: 1-[4294967294]:65534:65534:65534
[2012-07-17T09:30:10] debug3: pn_min_memory_job=-1 pn_min_tmp_disk=-1
[2012-07-17T09:30:10] debug3: immediate=0 features=(null) reservation=(null)
[2012-07-17T09:30:10] debug3: req_nodes=(null) exc_nodes=(null) gres=(null)
[2012-07-17T09:30:10] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2012-07-17T09:30:10] debug3: kill_on_node_fail=-1 script=(null)
[2012-07-17T09:30:10] debug3: argv="/bin/sleep"
[2012-07-17T09:30:10] debug3: stdin=(null) stdout=(null) stderr=(null)
[2012-07-17T09:30:10] debug3: work_dir=/home/hgieselman alloc_node:sid=myvtlin20:28611
[2012-07-17T09:30:10] debug3: resp_host=10.88.226.20 alloc_resp_port=47269 other_port=36729
[2012-07-17T09:30:10] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2012-07-17T09:30:10] debug3: mail_type=0 mail_user=(null) nice=55534 num_tasks=4294967294 open_mode=0 overcommit=-1 acctg_freq=-1
[2012-07-17T09:30:10] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2012-07-17T09:30:10] debug3: end_time=Unknown signal=0@0 wait_all_nodes=-1
[2012-07-17T09:30:10] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2012-07-17T09:30:10] debug3: cpus_bind=65534:(null) mem_bind=65534:(null) plane_size:65534
[2012-07-17T09:30:10] error: User 20053 not found
[2012-07-17T09:30:10] _job_create: invalid account or partition for user 20053, account '(null)', and partition 'test1'
[2012-07-17T09:30:10] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified

The issue seems to be that the jobs get submitted with the uid instead of the username as the user. We use NIS, and all of my testing of NIS comes out clean. As I mentioned earlier, these were production compute nodes and did not have this problem then.

Any ideas on where to look for the root cause?

Thanks,

Howard Gieselman

-----Original Message-----
From: Mark A. Grondona [mailto:[email protected]]
Sent: Thursday, July 12, 2012 4:47 PM
To: slurm-dev
Subject: [slurm-dev] Re: Setting up a test cluster

[email protected] writes:

> How do I get another slurmctld running for the test cluster?

You just need to point the slurmctld at a different slurm.conf
that uses a different port for slurmctld and slurmd. Then point
slurmd and all slurm commands to this new slurm.conf. If you are
running out of a build directory or alternate path, you may also
need to update PluginDir or other directories (perhaps epilog
and prolog location as well, you decide for your testing)

I used to have a script that would "boot" a test slurm instance
as a SLURM job, but that script doesn't work anymore with recent
versions of SLURM (however, there really wasn't much to it)

mark

> Howard Gieselman
>
> -----Original Message-----
> From: Mark A. Grondona [mailto:[email protected]]
> Sent: Thursday, July 12, 2012 12:20 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: Setting up a test cluster
>
>
> Moe Jette <[email protected]> writes:
> >
> >> Quoting [email protected]:
> >>
> >>> I have SLURM 2.3.3 installed and running in production but have found
> >>> that I need to do some more testing and tweaking before I can migrate
> >>> all of our LSF jobs to SLURM.
> >>> I would like to install a test cluster but
> >>> am unsure about the following:
> >>>
> >>>
> >>>
> >>> 1. Will I need to install the test cluster on a separate
> >>> controller or is it enough that I just install it to a different path
> >>> and use different port numbers?
> >>
> >> Different paths and ports are sufficient.
> >>
> >>
> >>> 2. Can I use existing nodes from my production cluster in the test
> >>> cluster?
> >>
> >> Yes.
> >>
> >>
> >>> 3. Are there any other things to look out for in running a
> >>> parallel cluster?
> >>
> >> On pretty much all system types you will just over-subscribe resources.
> >
> > Also, be sure to audit your epilog and prolog scripts to make sure
> > they won't cause harm when running in parallel with another SLURM
> > instance. Be careful if you are using SLURM cgroups support, as cgroups
> > created for each parallel instance of SLURM will exist in the same
> > namespace, and that may cause unexpected issues.
> >
> > mark
> >
> >
> >>
> >>>
> >>> Thanks,
> >>>
> >>>
> >>>
> >>> Howard Gieselman
> >>>
> >>>
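
One way to double-check the name-service side of the "User 20053 not found" error from the log at the top of this message is to ask each host how it resolves that numeric uid. The sketch below is only an illustration and is not part of SLURM: the helper name describe_uid is made up here, and it assumes Python is available on the controller and the compute nodes. It only tests the local passwd/NIS lookup; it says nothing about whether the user is known to the newly built accounting database.

```python
import pwd


def describe_uid(uid: int) -> str:
    """Return a one-line summary of how this host resolves a numeric uid.

    Hypothetical helper for illustration; it goes through the normal
    passwd lookup, which includes NIS when the host is configured for it.
    """
    try:
        entry = pwd.getpwuid(uid)
    except KeyError:
        # The lookup found no matching passwd entry on this host.
        return f"uid {uid}: no passwd entry on this host"
    return f"uid {uid}: name={entry.pw_name} home={entry.pw_dir}"


if __name__ == "__main__":
    # 20053 is the uid that appears in the slurmctld log above.
    print(describe_uid(20053))
```

Running the same script on the test controller and on one of the five nodes and comparing the output would show whether the two hosts agree on the uid-to-name mapping.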
