[slurm-dev] How to avoid using shared GPU cards with multi-slurmd
I am now running two slurmds on one node with 4 GPUs, to simulate two nodes with 3 GPUs each. The configuration is as follows:

    #slurm.conf
    ...
    NodeName=bnode1,Gres=gpu:4
    NodeName=foo1,NodeHostName=bnode1,Gres=gpu:3
    NodeName=foo2,NodeHostName=bnode1,Gres=gpu:3

    #gres.conf
    NodeName=bnode1,Name=gpu,File=/dev/nvidia[0-3]
    NodeName=foo1, Name=gpu,File=/dev/nvidia[0-2]
    NodeName=foo1, Name=gpu,File=/dev/nvidia[1-3]

I expect that when I run a 3-card program on foo1, another 3-card program should be queued waiting for resources.

    #srun -wfoo1 --gres=gpu:3 ./myprog1
    srun -wfoo2 --gres=gpu:3 ./myprog2

However, the two programs are actually sharing GPU cards [1-2]. Does anyone have an idea about this?

--
James Zhang
At University of Science and Technology of China
15556980026
zxia...@mail.ustc.edu.cn
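To make the overlap concrete, here is a minimal Python sketch that expands the two 3-GPU File= ranges from the gres.conf above and intersects them. The helper expand_device_range is a hypothetical name and only handles a single low-high bracket range; it is not Slurm's own range parser.

```python
import re


def expand_device_range(spec):
    """Expand a bracketed range such as '/dev/nvidia[0-2]' into device paths.

    Hypothetical helper: handles one 'low-high' range only.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]  # no bracket range, so this is a single device
    prefix, low, high, suffix = match.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(low), int(high) + 1)]


# File= values copied from the two 3-GPU gres.conf entries in the message.
first = expand_device_range("/dev/nvidia[0-2]")
second = expand_device_range("/dev/nvidia[1-3]")

# Devices that appear in both ranges.
shared = sorted(set(first) & set(second))
print(shared)
```

The two ranges have /dev/nvidia1 and /dev/nvidia2 in common, which matches the cards [1-2] reported as shared. With only 4 physical GPUs, any two 3-device ranges must overlap on at least two devices.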
[slurm-dev] Restrict nodes for users
Could someone please point me towards the right documentation for setting up Slurm so that some of the users do not have permission to run jobs on some of the nodes? Google points me to the Accounting pages, and I am not really sure how they fit in. I appreciate any help I can get with this.

Regards,
Krishna
[slurm-dev] [ sshare ] RAW Usage
Hello SLURM users,

http://slurm.schedmd.com/sshare.html

Raw Usage
    The number of cpu-seconds of all the jobs that charged the account by the user. This number will decay over time when PriorityDecayHalfLife is defined.

I am getting different RAW Usage values for the same job every time it is executed. The job I am using is a CPU stress test that runs for 1 minute.

It would be very useful to understand the formula for how this RAW Usage is calculated when we are using the plugin PriorityType=priority/multifactor.

Snippet of my slurm.conf file:

    # Activate the Multi-factor Job Priority Plugin with decay
    PriorityType=priority/multifactor

    # apply no decay
    PriorityDecayHalfLife=0
    PriorityCalcPeriod=1
    PriorityUsageResetPeriod=MONTHLY

    # The larger the job, the greater its job size priority.
    PriorityFavorSmall=NO

    # The job's age factor reaches 1.0 after waiting in the
    # queue for 2 weeks.
    PriorityMaxAge=14-0

    # This next group determines the weighting of each of the
    # components of the Multi-factor Job Priority Plugin.
    # The default value for each of the following is 1.
    PriorityWeightAge=0
    PriorityWeightFairshare=100
    PriorityWeightJobSize=0
    PriorityWeightPartition=0
    PriorityWeightQOS=0 # don't use the qos factor

Thanks!
[slurm-dev] RE: RAW Usage reported by sshare
Hello SLURM users,

It would be very useful to understand the formula for how this RAW Usage is calculated when we are using the plugin PriorityType=priority/multifactor.

Snippet of my slurm.conf file:

    # Activate the Multi-factor Job Priority Plugin with decay
    PriorityType=priority/multifactor

    # apply no decay
    PriorityDecayHalfLife=0
    PriorityCalcPeriod=1
    PriorityUsageResetPeriod=MONTHLY

    # The larger the job, the greater its job size priority.
    PriorityFavorSmall=NO

    # The job's age factor reaches 1.0 after waiting in the
    # queue for 2 weeks.
    PriorityMaxAge=14-0

    # This next group determines the weighting of each of the
    # components of the Multi-factor Job Priority Plugin.
    # The default value for each of the following is 1.
    PriorityWeightAge=0
    PriorityWeightFairshare=100
    PriorityWeightJobSize=0
    PriorityWeightPartition=0
    PriorityWeightQOS=0 # don't use the qos factor

Thanks!

From: Roshan Thomas Mathew roshanthomasmat...@gmail.com
Sent: 21 November 2014 17:36
To: slurm-dev
Subject: [slurm-dev] RAW Usage reported by sshare

Hello SLURM users,

I am not sure if I have understood this correctly.

http://slurm.schedmd.com/sshare.html

Raw Usage
    The number of cpu-seconds of all the jobs that charged the account by the user. This number will decay over time when PriorityDecayHalfLife is defined.

I am getting different RAW Usage values for the same job every time it is executed. The job I am using is a CPU stress test that runs for 1 minute.

Can someone please point out how the RAW Usage is calculated and which parameters this value depends on?

Thanks,
Roshan
[slurm-dev] Re: [ sshare ] RAW Usage
Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c to see exactly what happens.

Ryan

On 11/25/2014 10:34 AM, Roshan Mathew wrote:
> I am getting different RAW Usage values for the same job every time it is executed. [...] It would be very useful to understand the formula for how this RAW Usage is calculated when we are using the plugin PriorityType=priority/multifactor.
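As a rough illustration of why the same one-minute job can report different raw usage, here is a small Python sketch of cpu-seconds accumulation with half-life decay. It is a simplified model built from the sshare description quoted in this thread, not a transcription of _apply_new_usage(); the function names and the 4-CPU job size are assumptions made for the example.

```python
def job_cpu_seconds(cpus, elapsed_seconds):
    """Usage charged by one job, in cpu-seconds (CPUs times wall-clock seconds)."""
    return cpus * elapsed_seconds


def decayed_usage(previous, seconds_since_update, half_life_seconds):
    """Apply half-life decay to previously accumulated usage.

    Assumption: a half-life of 0 means no decay, as in the posted
    slurm.conf (PriorityDecayHalfLife=0, commented "apply no decay").
    """
    if half_life_seconds <= 0:
        return previous
    return previous * 0.5 ** (seconds_since_update / half_life_seconds)


# Hypothetical 4-CPU stress test: a nominal 60 s run versus one whose
# recorded run time is 2 s longer.
nominal = job_cpu_seconds(4, 60)
slightly_longer = job_cpu_seconds(4, 62)
print(nominal, slightly_longer, slightly_longer - nominal)
```

In this model a 2-second difference in recorded run time on 4 CPUs already shifts the charge by 8 cpu-seconds, so identical jobs need not add identical amounts. With decay disabled, the accumulated value only grows until the next PriorityUsageResetPeriod reset.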
[slurm-dev] Re: [ sshare ] RAW Usage
Thanks Ryan,

Is this value stored anywhere in the SLURM accounting DB? I could not find any value for the JOB that corresponds to this RAW usage.

Roshan

From: Ryan Cox ryan_...@byu.edu
Sent: 25 November 2014 17:43
To: slurm-dev
Subject: [slurm-dev] Re: [ sshare ] RAW Usage

Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c to see exactly what happens.

Ryan
[slurm-dev] RE: Restrict nodes for users
1. Group the nodes into a partition.
2. Group the users into an account.
3. Restrict access to that partition in slurm.conf.

Example skeleton for the configuration line:

    PartitionName=partition_name DenyAccounts=account_name

Hope this helps.

From: Krishna Teja teja...@gmail.com
Sent: 25 November 2014 17:25
To: slurm-dev
Subject: [slurm-dev] Restrict nodes for users

Could someone please point me towards the right documentation for setting up Slurm so that some of the users do not have permission to run jobs on some of the nodes? Google points me to the Accounting pages, and I am not really sure how they fit in. I appreciate any help I can get with this.

Regards,
Krishna
[slurm-dev] Re: Segmentation fault in scancel
Yes, thanks. I just switched the tests in the if statement to fix that:
https://github.com/SchedMD/slurm/commit/6fdc4a4fa490bb4f3b040d9e09350835bab9d8c6

Quoting Dominik Bartkiewicz d.bartkiew...@icm.edu.pl:

On 11/21/2014 06:00 PM, je...@schedmd.com wrote:

Thank you! I committed a slight variation of your patch in order to show the job array ID information when applicable:
https://github.com/SchedMD/slurm/commit/51da758614ce0c65a63e6069b9897f91967f387f

Hi,

Are you sure that in line 225 IS_JOB_FINISHED(jp) won't segfault for i >= job_buffer_ptr->record_count?

cheers,
DB

Quoting Dominik Bartkiewicz d.bartkiew...@icm.edu.pl:

We have observed in the logs that scancel sometimes segfaults. In the scancel function _verify_job_ids, IS_JOB_FINISHED(jp) is checked even if i >= job_buffer_ptr->record_count, which can cause a segmentation fault for an invalid job_id. Another problem is that _verify_job_ids returns 1 only when opt.verbose >= 0.

cheers,
DB

I fixed the _verify_job_ids function:

    static int
    _verify_job_ids (void)
    {
        /* If a list of jobs was given, make sure each job is actually in
         * our list of job records. */
        int i, j;
        job_info_t *job_ptr = job_buffer_ptr->job_array;
        int rc = 0;

        for (j = 0; j < opt.job_cnt; j++ ) {
            job_info_t *jp;

            for (i = 0; i < job_buffer_ptr->record_count; i++) {
                if (_match_job(j, i))
                    break;
            }
            jp = &job_ptr[i];
            /* Test the index first so that jp is only dereferenced
             * when it points at a valid job record. */
            if (i >= job_buffer_ptr->record_count) {
                if (opt.verbose >= 0)
                    error("Kill job error on job id %u: %s",
                          opt.job_id[j],
                          slurm_strerror(ESLURM_INVALID_JOB_ID));
                rc = 1;
            } else if ((IS_JOB_FINISHED(jp)) ||
                       (job_ptr[i].array_task_id == NO_VAL)) {
                if (opt.verbose >= 0) {
                    if (opt.step_id[j] == SLURM_BATCH_SCRIPT)
                        error("Kill job error on job id %u: %s",
                              opt.job_id[j],
                              slurm_strerror(ESLURM_INVALID_JOB_ID));
                    else
                        error("Kill job error on job step id %u.%u: %s",
                              opt.job_id[j], opt.step_id[j],
                              slurm_strerror(ESLURM_INVALID_JOB_ID));
                }
                rc = 1;
            }
        }
        return rc;
    }

--
Morris "Moe" Jette
CTO, SchedMD LLC
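The point about the order of the tests in the if statement can be shown with a small Python analogue. This is only an illustration under assumed names (find_job, is_invalid_or_finished and the dict-based job records are hypothetical); in Python the out-of-range access raises IndexError, where the C code reads past the end of the job array.

```python
def find_job(jobs, job_id):
    """Return the index of job_id, or len(jobs) when there is no match.

    This mirrors the C loop, which leaves i equal to record_count
    when no record matches.
    """
    for i, job in enumerate(jobs):
        if job["job_id"] == job_id:
            return i
    return len(jobs)


def is_invalid_or_finished(jobs, i):
    """Bounds test first: 'or' short-circuits, so jobs[i] is only read
    when i is a valid index."""
    return i >= len(jobs) or jobs[i]["finished"]


def unsafe_check(jobs, i):
    """Same tests in the opposite order: jobs[i] is read before the
    bounds test, so an unknown job ID raises IndexError."""
    return jobs[i]["finished"] or i >= len(jobs)


# Hypothetical job records for the example.
jobs = [{"job_id": 10, "finished": False}, {"job_id": 11, "finished": True}]

print(is_invalid_or_finished(jobs, find_job(jobs, 99)))
try:
    unsafe_check(jobs, find_job(jobs, 99))
except IndexError:
    print("unsafe order touched an out-of-range record")
```

C's || operator short-circuits in the same way, so putting the index test first is enough to keep the job record from being inspected for an invalid job ID.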
[slurm-dev] Re: [ sshare ] RAW Usage
I believe that the info share data is kept by slurmctld in memory. As far as I could tell from the code, it should be checkpointing the info to the assoc_usage file wherever slurm is saving state information. I couldn’t find any docs on that, you’d have to check the code for more information.

However, if you just want to see what was used, you can get the raw usage using sacct. For example, for a given job, you can do something like:

    sacct -X -a -j 1182128 --format Jobid,jobname,partition,account,alloccpus,state,exitcode,cputimeraw

- Gary Skouson

From: Roshan Mathew [mailto:r.t.mat...@bath.ac.uk]
Sent: Tuesday, November 25, 2014 9:51 AM
To: slurm-dev
Subject: [slurm-dev] Re: [ sshare ] RAW Usage

Thanks Ryan,

Is this value stored anywhere in the SLURM accounting DB? I could not find any value for the JOB that corresponds to this RAW usage.

Roshan
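If the sacct command from Gary's message is run with the additional options --parsable2 and --noheader, each job comes back as one pipe-delimited line that is easy to post-process. The Python sketch below parses such a line; parse_sacct_line is a hypothetical helper, and the sample line is made up for illustration (only the job ID comes from the message).

```python
# Field order taken from the --format list in the sacct command above.
FIELDS = ["jobid", "jobname", "partition", "account",
          "alloccpus", "state", "exitcode", "cputimeraw"]


def parse_sacct_line(line):
    """Parse one pipe-delimited sacct record into a dict.

    Assumes output produced with --parsable2 --noheader and no '|'
    characters inside the job name.
    """
    values = line.rstrip("\n").split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["alloccpus"] = int(record["alloccpus"])
    record["cputimeraw"] = int(record["cputimeraw"])
    return record


# Hypothetical sample line: every value except the job ID is invented
# to show the shape of the output, not taken from a real cluster.
sample = "1182128|stress|batch|physics|4|COMPLETED|0:0|240"
record = parse_sacct_line(sample)

# cputimeraw is in cpu-seconds, so dividing by the allocated CPUs
# gives the elapsed seconds the job was charged for.
print(record["cputimeraw"] / record["alloccpus"])
```

Comparing this per-job figure across repeated runs of the same stress test is one way to see how much of the variation in sshare's raw usage comes from the recorded run time itself.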