[slurm-dev] How to avoid using shared GPU cards with multi-slurmd

2014-11-25 Thread zxiaoci
I am running two slurmd daemons on one node with 4 GPUs, to simulate two nodes 
with 3 GPUs each. The configuration is as follows:


#slurm.conf
...
NodeName=bnode1,Gres=gpu:4
NodeName=foo1,NodeHostName=bnode1,Gres=gpu:3
NodeName=foo2,NodeHostName=bnode1,Gres=gpu:3


#gres.conf
NodeName=bnode1,Name=gpu,File=/dev/nvidia[0-3]
NodeName=foo1,  Name=gpu,File=/dev/nvidia[0-2]
NodeName=foo1,  Name=gpu,File=/dev/nvidia[1-3]

I expect that while a 3-card program is running on foo1, a second 3-card 
program submitted to foo2 should be queued waiting for resources.
# srun -wfoo1 --gres=gpu:3 ./myprog1
# srun -wfoo2 --gres=gpu:3 ./myprog2


However, the two programs actually end up sharing GPU cards [1-2].
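
The overlap follows directly from the two File= ranges listed in the gres.conf above. The Python sketch below is only an illustration of that arithmetic, not how Slurm parses gres.conf; the helper name expand_device_range is hypothetical, and it handles just the single bracketed range form used here.

```python
import re


def expand_device_range(spec):
    """Expand a bracketed range such as '/dev/nvidia[0-2]' into a set of paths.

    Hypothetical helper for illustration; it handles only a single
    '[low-high]' suffix, which is all the configuration above uses.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", spec)
    if match is None:
        return {spec}  # no bracketed range: a single device file
    prefix, low, high = match.group(1), int(match.group(2)), int(match.group(3))
    return {f"{prefix}{index}" for index in range(low, high + 1)}


# The two 3-GPU File= ranges from the gres.conf shown above.
first = expand_device_range("/dev/nvidia[0-2]")
second = expand_device_range("/dev/nvidia[1-3]")

# Devices that appear in both ranges.
shared = sorted(first & second)
print(shared)  # prints ['/dev/nvidia1', '/dev/nvidia2']
```

With only 4 physical cards, any two sets of 3 devices must have at least 2 in common (3 + 3 - 4 = 2), so two simultaneous 3-card jobs cannot get disjoint devices on this host.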


Does anyone have an idea on this?



--
James Zhang At University of Science and Technology of China
zxia...@mail.ustc.edu.cn

[slurm-dev] Restrict nodes for users

2014-11-25 Thread Krishna Teja
Could someone please point me to the right documentation for setting up
Slurm so that some users do not have permission to run jobs on some of the
nodes? Google points me to the Accounting pages, and I am not really sure
how they fit in. I would appreciate any help I can get with this.

Regards
Krishna


[slurm-dev] [ sshare ] RAW Usage

2014-11-25 Thread Roshan Mathew
Hello SLURM users,

The sshare documentation (http://slurm.schedmd.com/sshare.html) describes 
Raw Usage as follows:

    Raw Usage
    The number of cpu-seconds of all the jobs that charged the account by the
    user. This number will decay over time when PriorityDecayHalfLife is
    defined.

I am getting a different Raw Usage value for the same job every time it is 
executed. The job I am using is a 1-minute CPU stress test.

It would be very useful to understand the formula used to calculate Raw Usage 
when the plugin PriorityType=priority/multifactor is in use.

Snippet of my slurm.conf file:

# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor

# apply no decay
PriorityDecayHalfLife=0

PriorityCalcPeriod=1
PriorityUsageResetPeriod=MONTHLY

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=0
PriorityWeightFairshare=100
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=0 # don't use the qos factor

Thanks!



[slurm-dev] RE: RAW Usage reported by sshare

2014-11-25 Thread Roshan Mathew
Hello SLURM users,


It would be very useful to understand the formula used to calculate Raw Usage 
when the plugin PriorityType=priority/multifactor is in use.


Snippet of my slurm.conf file:


# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor

# apply no decay
PriorityDecayHalfLife=0

PriorityCalcPeriod=1
PriorityUsageResetPeriod=MONTHLY

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=0
PriorityWeightFairshare=100
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=0 # don't use the qos factor


Thanks!



From: Roshan Thomas Mathew roshanthomasmat...@gmail.com
Sent: 21 November 2014 17:36
To: slurm-dev
Subject: [slurm-dev] RAW Usage reported by sshare

Hello SLURM users,

I am not sure if I have understood this correctly.

http://slurm.schedmd.com/sshare.html

Raw Usage
The number of cpu-seconds of all the jobs that charged the account by the user. 
This number will decay over time when PriorityDecayHalfLife is defined.

I am getting a different Raw Usage value for the same job every time it is 
executed. The job I am using is a 1-minute CPU stress test.

Can someone please point out how Raw Usage is calculated and which 
parameters this value depends on?

Thanks,
Roshan



[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-25 Thread Ryan Cox
Raw usage is a long double and the time added by jobs can be off by a 
few seconds.  You can take a look at _apply_new_usage() in 
src/plugins/priority/multifactor/priority_multifactor.c to see exactly 
what happens.
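
As a rough illustration of the sshare definition quoted earlier in this thread (cpu-seconds, optionally decayed with a half-life), here is a minimal Python sketch. It is a generic half-life model and not a transcription of _apply_new_usage(); the function names, the 4-CPU job size, and the 60 s and 62 s run times are all assumptions chosen for the example.

```python
def job_cpu_seconds(cpus, elapsed_seconds):
    """CPU-seconds charged by one job: allocated CPUs times wall-clock run time."""
    return cpus * elapsed_seconds


def decayed_usage(usage, seconds_since_update, half_life_seconds):
    """Apply a generic exponential half-life decay to accumulated usage.

    A half-life of 0 is treated as 'no decay', matching the
    'apply no decay' comment in the slurm.conf snippet in this thread.
    """
    if half_life_seconds == 0:
        return usage
    return usage * 0.5 ** (seconds_since_update / half_life_seconds)


# Hypothetical numbers: a nominally 1-minute stress test on 4 CPUs.
run_a = job_cpu_seconds(cpus=4, elapsed_seconds=60)
run_b = job_cpu_seconds(cpus=4, elapsed_seconds=62)  # recorded 2 s longer
print(run_a, run_b)  # prints 240 248

# With a half-life of 0 the accumulated value is left as is.
print(decayed_usage(run_a + run_b, seconds_since_update=3600, half_life_seconds=0))  # prints 488

# With a one-hour half-life, usage recorded one hour ago counts half.
print(decayed_usage(488, seconds_since_update=3600, half_life_seconds=3600))  # prints 244.0
```

Under this model, a recorded run time that differs by a couple of seconds changes the charge by that many seconds times the CPU count, which is consistent with the remark above that the time added by jobs can be off by a few seconds.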


Ryan





[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-25 Thread Roshan Mathew
Thanks Ryan,


Is this value stored anywhere in the Slurm accounting database? I could not 
find any per-job value that corresponds to this Raw Usage.


Roshan






[slurm-dev] RE: Restrict nodes for users

2014-11-25 Thread Roshan Mathew
1. Group the nodes into a partition.

2. Group the users into an account.


Then restrict that account's access to the partition in slurm.conf.


Example skeleton for the configuration line:

PartitionName=partition_name DenyAccounts=account_name
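
To make the effect of such a line concrete, the Python sketch below models the account check as the slurm.conf man page describes it for the partition options AllowAccounts and DenyAccounts. It is an illustration only, not Slurm code: the function name and the account names are hypothetical, and the rule that a deny list is not enforced once an allow list is set is an assumption taken from the documentation that is worth verifying for your Slurm version.

```python
def account_may_use_partition(account, allow_accounts=None, deny_accounts=None):
    """Return True if jobs charged to `account` may run in the partition.

    allow_accounts / deny_accounts are collections of account names, or None
    when the corresponding option is not set on the partition line.
    Assumption: when an allow list is set, the deny list is not consulted.
    """
    if allow_accounts is not None:
        return account in allow_accounts
    if deny_accounts is not None:
        return account not in deny_accounts
    return True  # neither option set: every account is accepted


# Hypothetical accounts: 'guests' is denied on this partition.
print(account_may_use_partition("guests", deny_accounts={"guests"}))  # prints False
print(account_may_use_partition("staff", deny_accounts={"guests"}))   # prints True
```

A deny list suits the case asked about here, where most users keep access and only a few are excluded; an allow list is the natural choice when only a few accounts should get in.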


Hope this helps.





[slurm-dev] Re: Segmentation fault in scancel

2014-11-25 Thread jette


Yes, thanks. I just switched the tests in the if statement to fix that:
https://github.com/SchedMD/slurm/commit/6fdc4a4fa490bb4f3b040d9e09350835bab9d8c6


Quoting Dominik Bartkiewicz d.bartkiew...@icm.edu.pl:


On 11/21/2014 06:00 PM, je...@schedmd.com wrote:


Thank you! I committed a slight variation of your patch in order to show
the job array ID information when applicable:
https://github.com/SchedMD/slurm/commit/51da758614ce0c65a63e6069b9897f91967f387f



Hi
Are you sure that IS_JOB_FINISHED(jp) in line 225 won't segfault 
when i >= job_buffer_ptr->record_count?


cheers,
DB



Quoting Dominik Bartkiewicz d.bartkiew...@icm.edu.pl:


We have observed in the logs that scancel sometimes segfaults.

In the scancel function _verify_job_ids, IS_JOB_FINISHED(jp) is checked even 
if i >= job_buffer_ptr->record_count, which can cause a segmentation fault 
for an invalid job_id.

Another problem is that _verify_job_ids returns 1 only when opt.verbose >= 0.

cheers,
DB


I fixed the _verify_job_ids function:

static int
_verify_job_ids (void)
{
    /* If a list of jobs was given, make sure each job is actually in
     * our list of job records. */
    int i, j;
    job_info_t *job_ptr = job_buffer_ptr->job_array;
    int rc = 0;

    for (j = 0; j < opt.job_cnt; j++ ) {
        job_info_t *jp;

        for (i = 0; i < job_buffer_ptr->record_count; i++) {
            if (_match_job(j, i))
                break;
        }
        jp = &job_ptr[i];
        if (i >= job_buffer_ptr->record_count) {
            if (opt.verbose >= 0)
                error("Kill job error on job id %u: %s",
                      opt.job_id[j],
                      slurm_strerror(ESLURM_INVALID_JOB_ID));
            rc = 1;
        } else if ((IS_JOB_FINISHED(jp)) ||
                   (job_ptr[i].array_task_id == NO_VAL)) {
            if (opt.verbose >= 0) {
                if (opt.step_id[j] == SLURM_BATCH_SCRIPT)
                    error("Kill job error on job id %u: %s",
                          opt.job_id[j],
                          slurm_strerror(ESLURM_INVALID_JOB_ID));
                else
                    error("Kill job error on job step id %u.%u: %s",
                          opt.job_id[j],
                          opt.step_id[j],
                          slurm_strerror(ESLURM_INVALID_JOB_ID));
            }
            rc = 1;
        }
    }

    return rc;
}
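
The ordering problem described above is not specific to Slurm: a search loop that finds no match leaves the index one past the last record, so the index has to be tested before the record is inspected. The Python sketch below shows that pattern with a hypothetical verify_job_ids helper, hypothetical record dictionaries, and made-up error texts; in Python an unchecked access would raise IndexError where the C code would read past the end of the array.

```python
def verify_job_ids(requested_ids, records):
    """Return 1 if any requested job id is unknown or already finished, else 0."""
    rc = 0
    for job_id in requested_ids:
        # Linear search; index ends at len(records) when nothing matches.
        index = 0
        while index < len(records) and records[index]["job_id"] != job_id:
            index += 1

        if index >= len(records):
            # Test the index first: records[index] does not exist here.
            print(f"Kill job error on job id {job_id}: invalid job id")
            rc = 1
        elif records[index]["finished"]:
            # Only now is it safe to look at the matched record.
            print(f"Kill job error on job id {job_id}: job already finished")
            rc = 1
    return rc


# Hypothetical job table for the example.
records = [
    {"job_id": 101, "finished": False},
    {"job_id": 102, "finished": True},
]
print(verify_job_ids([101], records))       # prints 0
print(verify_job_ids([102, 999], records))  # prints two error lines, then 1
```

Note that the return code is set whether or not a message is printed, which mirrors the second point raised above about the return value depending on the verbosity setting.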






--
Morris Moe Jette
CTO, SchedMD LLC


[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-25 Thread Skouson, Gary B
I believe that the share information is kept in memory by slurmctld.  As far as 
I could tell from the code, it should be checkpointing that information to the 
assoc_usage file wherever Slurm saves its state information.  I couldn't find 
any docs on that; you'd have to check the code for more information.

However, if you just want to see what was used, you can get the raw usage using 
sacct.  For example, for a given job, you can do something like:

sacct -X -a -j 1182128 --format Jobid,jobname,partition,account,alloccpus,state,exitcode,cputimeraw
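
For scripting, the same query can be wrapped in a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, the --parsable2 and --noheader options are added here (they are not part of the command above) to get pipe-delimited output without a header row, and the job id is the one from the example.

```python
import subprocess

FIELDS = ["jobid", "jobname", "partition", "account",
          "alloccpus", "state", "exitcode", "cputimeraw"]


def build_sacct_command(job_id, fields=FIELDS):
    """Build the sacct argument list for one job."""
    return ["sacct", "-X", "-a", "-j", str(job_id),
            "--parsable2", "--noheader", "--format", ",".join(fields)]


def parse_sacct_line(line, fields=FIELDS):
    """Split one pipe-delimited sacct output line into a dict keyed by field."""
    return dict(zip(fields, line.rstrip("\n").split("|")))


def cpu_seconds_for_job(job_id):
    """Run sacct and return cputimeraw (CPU-seconds) for the job allocation.

    Needs a host where sacct can reach the accounting database.
    """
    result = subprocess.run(build_sacct_command(job_id),
                            capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    if not lines:
        raise LookupError(f"sacct returned no record for job {job_id}")
    return int(parse_sacct_line(lines[0])["cputimeraw"])


# Show the command that would be run; nothing is executed here.
print(" ".join(build_sacct_command(1182128)))
```

Calling cpu_seconds_for_job(1182128) on a cluster login node would then give the per-job CPU-seconds, which can be compared with the change in Raw Usage reported by sshare.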

-
Gary Skouson

