[slurm-dev] Re: Restrict access for a user group to certain nodes

2016-12-01 Thread Magnus Jonsson


Hi!

You could either set up a partition for your tests with group 
restrictions, or use the reservation feature, depending on your 
exact use case.
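
A rough sketch of both options (group, user and node names are made up, 
adjust to your site). AllowGroups uses the ordinary Linux groups on the 
cluster, so no separate Slurm group setup is needed:

8<
# slurm.conf: a partition that only members of the "admins" group may use
PartitionName=test Nodes=t-cn[0101-0104] AllowGroups=admins State=UP

# or, without a new partition, a reservation limited to certain users:
scontrol create reservation reservationname=admin_test \
    users=root,alice nodes=t-cn[0101-0104] \
    starttime=now duration=30-00:00:00 flags=ignore_jobs
8<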


/Magnus

On 2016-12-01 15:54, Felix Willenborg wrote:


Dear everybody,

I'd like to restrict submissions from a certain user group, or allow only
one certain user group to submit jobs to certain nodes. Does Slurm offer
groups which can handle such an occasion? It would be preferred if there is
Linux user group support, because this would save time setting up a
new user group environment.

The intention is that only administrators can submit jobs to those
nodes to perform some tests, which might otherwise be disturbed by users
submitting their jobs to those nodes. Various search engines didn't
offer answers to my question, which is why I'm writing to you here.

Looking forward to some answers!

Best,
Felix Willenborg



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet


[slurm-dev] Re: sacct vs sacct -X

2016-04-12 Thread Magnus Jonsson


On 2016-03-23 16:17, Skouson, Gary B wrote:

Yes, but why are we not getting the job information from sacct when we 
are running without -X in this case?


The problem with running with -X is that we don't get all the cumulative 
statistics for the job; we are missing some of the information, such as UserCPU.
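
A quick way to see what I mean is to compare the per-step and the 
allocation-only output (the job id is just an example):

sacct    -j 7364851 --format=JobID,Elapsed,AllocCPUS,UserCPU,TotalCPU
sacct -X -j 7364851 --format=JobID,Elapsed,AllocCPUS,UserCPU,TotalCPU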


/Magnus


The man page says:

-X, --allocations
     Only show cumulative statistics for each job, not the intermediate steps.


What's allocated to the job may not match the utilization of the job steps.

-
Gary Skouson



-----Original Message-----
From: Magnus Jonsson [mailto:mag...@hpc2n.umu.se]
Sent: Wednesday, March 23, 2016 7:10 AM
To: slurm-dev 
Subject: [slurm-dev] Re: sacct vs sacct -X


The behaviour seems to be different in Slurm 15.08, at least.

sacct  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j  7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0
7364851.0      00:00:00          1          0

sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0

/Magnus

On 2016-03-23 09:29, Magnus Jonsson wrote:

Hi!

From this simple example, could someone explain to me whether this is the
expected behaviour or a bug?

$ srun -n1 --exclusive hostname
srun: job 4232239 queued and waiting for resources
srun: job 4232239 has been allocated resources
host0001.example.com

$ sacct -X  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03         48        144

$ sacct  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03          1        144

We are currently running 14.03, but the same behaviour exists in 14.11 as
well.

I see that the TRES feature changes a lot of this in the 15.x releases, but
does it change this behaviour (I don't have access to any 15.x cluster
right now)?

Best regards,
Magnus





--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet


[slurm-dev] Re: sacct vs sacct -X

2016-03-23 Thread Magnus Jonsson


The behaviour seems to be different in Slurm 15.08, at least.

sacct  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j  7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0
7364851.0      00:00:00          1          0

sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0

/Magnus

On 2016-03-23 09:29, Magnus Jonsson wrote:

Hi!

From this simple example, could someone explain to me whether this is the
expected behaviour or a bug?

$ srun -n1 --exclusive hostname
srun: job 4232239 queued and waiting for resources
srun: job 4232239 has been allocated resources
host0001.example.com

$ sacct -X  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03         48        144

$ sacct  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03          1        144

We are currently running 14.03, but the same behaviour exists in 14.11 as
well.

I see that the TRES feature changes a lot of this in the 15.x releases, but
does it change this behaviour (I don't have access to any 15.x cluster
right now)?

Best regards,
Magnus



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet


[slurm-dev] sacct vs sacct -X

2016-03-23 Thread Magnus Jonsson

Hi!

From this simple example, could someone explain to me whether this is the
expected behaviour or a bug?

$ srun -n1 --exclusive hostname
srun: job 4232239 queued and waiting for resources
srun: job 4232239 has been allocated resources
host0001.example.com

$ sacct -X  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03         48        144

$ sacct  --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03          1        144

We are currently running 14.03, but the same behaviour exists in 14.11 as
well.

I see that the TRES feature changes a lot of this in the 15.x releases, but
does it change this behaviour (I don't have access to any 15.x cluster
right now)?

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: As a user how can I re-order my job submissions

2015-08-28 Thread Magnus Jonsson

Hi!

You could always use the "nice" feature to change the priority of your 
old jobs.


It might be a little bit of work to make a script that sets the nice 
value of all your jobs, but nothing that some cut/grep/xargs can't fix ;-)
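
Something along these lines (an untested sketch; the nice value is arbitrary):

# raise the nice value (i.e. lower the priority) of all your pending jobs,
# so that the newly submitted batch is considered first
squeue -h -u $USER -t PENDING -o "%i" | \
    xargs -n1 -I{} scontrol update jobid={} nice=100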


/Magnus

On 2015-08-28 15:48, Kumar, Amit wrote:

Dear SLURM,

If I am a regular user and imagine I have tons of jobs submitted, and then
I come up with another batch of jobs that I want to run before the
batch I submitted a few hours back, which is still in the queue waiting for
resources and priority. Is there a way to do this?  From an admin
perspective I wouldn't want this, because users could misuse this
feature. But from a user perspective I could genuinely have some
dependencies that I would like to have addressed before beginning my
batch of thousands of jobs.

Any help here is greatly appreciated.

Regards,
Amit



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] sbcast, prolog and SPANK.

2015-04-29 Thread Magnus Jonsson


Hi everybody.

As some of you may know from my presentation in Lugano, we are using a 
SPANK plugin to give private /tmp directories to our users.


One of our users was using the sbcast command to send files to nodes in 
the allocation.


This works badly, as the SPANK plugin is not used at all for sbcast.

I'm unsure exactly which part of Slurm receives the data, how this is 
implemented, and whether SPANK should be involved at all, but the files do 
not show up where the user expects them to be. Is this solvable in any way 
with sbcast? For now we just recommended that the user use 
"srun cp ${PATH_TO_FILES}/* $TMPDIR/".


This also has the side effect that the prolog on the node is not run 
until you actually send a job to the node, i.e. you can send data to a 
node with sbcast before the prolog has run, which might not be an 
expected/wanted behaviour.


Best,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: prevent slurm from parsing the full script

2015-04-21 Thread Magnus Jonsson
A better approach would be to add to Slurm a "#SBATCH END-OF-OPTIONS" 
directive, or something similar, to mark the end of the sbatch options so 
that sbatch can stop parsing at that point.


/Magnus

On 2015-04-21 14:40, Andy Riebs wrote:

Never mind; when I changed "#sbatch" to the correct "#SBATCH", I got 4
tasks. According to the man page, this is a bug. For now, I like
Magnus's suggestion :-)

On 04/21/2015 08:21 AM, Andy Riebs wrote:

Hendryk, what sbatch command line options are you using? How are you
determining that job 1 got 2 tasks? I just tried the following script,
and it correctly ran just 1 task:

$ cat test.sh
#!/bin/bash
#SBATCH --ntasks=1

srun hostname

#sbatch --ntasks=4

## end of script
$ sbatch test.sh
Submitted batch job 18720
$ cat slurm-18720.out
node09
$

For further discussion on this topic, please

 1. Reply to the whole list, not just me
 2. Indicate what OS and Slurm versions you are using
 3. Provide a copy of your slurm.conf file with any sensitive
information, like node names or IP addresses, removed

Andy

On 04/21/2015 07:50 AM, Hendryk Bockelmann wrote:

Hello,

is there a way to prevent slurm from parsing the whole jobscript for
#SBATCH statements?
Assume I have the following jobscript "job1.sh":

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=job1

srun -l echo "slurm jobid $SLURM_JOB_ID named: $SLURM_JOB_NAME"

cat > job2.sh <





--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: prevent slurm from parsing the full script

2015-04-21 Thread Magnus Jonsson

Hi!

A simple solution would be to do:

SBATCH="#SBATCH"

cat << EOF
...

$SBATCH --nodes=1

EOF
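
Spelled out in full, the workaround could look something like this (a 
sketch; the inner script's options are only examples):

#!/bin/bash
#SBATCH --ntasks=1

SBATCH="#SBATCH"   # keeps the literal string "#SBATCH" out of this script

cat > job2.sh << EOF
#!/bin/bash
$SBATCH --nodes=1
$SBATCH --ntasks=4

srun hostname
EOF

sbatch job2.sh

The unquoted here-document expands $SBATCH when job2.sh is written, so the 
inner script gets proper #SBATCH directives while the outer script never 
contains that literal string for sbatch to pick up.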

/Magnus

On 2015-04-21 13:50, Hendryk Bockelmann wrote:

Hello,

is there a way to prevent slurm from parsing the whole jobscript for
#SBATCH statements?
Assume I have the following jobscript "job1.sh":

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=job1

srun -l echo "slurm jobid $SLURM_JOB_ID named: $SLURM_JOB_NAME"

cat > job2.sh <

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] A way of abuse the priority option in Slurm?

2015-03-31 Thread Magnus Jonsson


Hi!

I just discovered a possible way for a user to abuse the priority in Slurm.

This is the scenario:

1. A user has not run any jobs in a long time and therefore has a 
high fairshare priority. Let's say: 1.


2. The user submits 1000 jobs into the queue, which is far above his 
fairshare target.


3. The user changes the priority of his jobs (it's OK for a user to lower 
the priority of jobs as long as the user is the owner) to, let's say,  
(still a high priority; +-1 is in practice nothing): scontrol update 
jobid=1 priority=


4. The user's jobs start and the fairshare priority drops. But here is 
the big _BUT_: the priority of the changed jobs does not seem to be 
recalculated, leaving the user's jobs with maximum priority until all of 
the jobs are completed.


Have I missed something in this scenario?

If this is true, what do we do about it? Should users be able to change 
the priority at all?


The user can use the 'nice' option to alter the priority of a job within 
a small limit, which does not alter the priority in the way described above.


Please let me be wrong :-)

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Slurm restart count in SPANK

2015-02-27 Thread Magnus Jonsson

Hi Aaron,

From my spank code: spank_get_item(sp, S_SLURM_RESTART_COUNT, 
&restartcount)


The S_SLURM_RESTART_COUNT "item" was added to the plugstack on my 
request/patch.


But thanks for the concern :-)

Best regards,
Magnus

On 2015-02-27 14:44, Aaron Knister wrote:


Hi Magnus,

While I can't tell you OTTOMH why the behavior changed, I can suggest a 
different perhaps more spank-y way to do that. From within your spank 
function(s) use the spank_get_item call to get the restart count:

int restart_count;

// sp is the spank_t argument to your SPANK function
spank_get_item(sp, S_SLURM_RESTART_COUNT, &restart_count);

Hope that helps!

Sent from my iPhone


On Feb 27, 2015, at 8:14 AM, Magnus Jonsson  wrote:


It seems that the restart count in SPANK (prolog) is missing in recent versions 
of Slurm.

It always returns 0 even if the job has restarted.

It also seems that the "SLURM_RESTART_COUNT" environment variable is missing in the 
epilog script (might be related).

I'm not sure when this was changed, but I'm pretty sure it worked in 2.6 (which 
is when we developed our tmpdir SPANK plugin).

"SLURM_RESTART_COUNT" is available in the job's user environment.

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet


--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Slurm restart count in SPANK

2015-02-27 Thread Magnus Jonsson


It seems that the restart count in SPANK (prolog) is missing in recent 
versions of Slurm.

It always returns 0 even if the job has restarted.

It also seems that the "SLURM_RESTART_COUNT" environment variable is missing in 
the epilog script (might be related).

I'm not sure when this was changed, but I'm pretty sure it worked in 
2.6 (which is when we developed our tmpdir SPANK plugin).

"SLURM_RESTART_COUNT" is available in the job's user environment.

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Two patches for jobacct_gather.

2015-02-05 Thread Magnus Jonsson


Hi!

I have attached two patches to the jobacct_gather plugin (common).

The first uses Proportional Set Size (PSS) instead of RSS to determine 
the memory footprint of a job.


More information about PSS can be found here:
http://lwn.net/Articles/230975/

Gathering the PSS information is a little more complicated (and CPU 
intensive) than just reading the RSS value, and might be a problem for 
some applications.


We have a subset of jobs that load the data set in the first process and 
then just fork() for the number of cores available and do parallel 
computation on the data set.


This makes the RSS value go sky high, as Slurm calculates the sum of the 
RSS values of all processes in the job, and Slurm then kills the job :-(



The second patch adds an option not to kill jobs that are over the memory 
limit. This works well for us, since we have working cgroup memory limits.
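
With these patches applied, the options would be switched on via 
JobAcctGatherParams in slurm.conf, something like this (the gather plugin 
and frequency are just examples; the option names are the ones introduced 
by the patches):

JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
JobAcctGatherParams=UsePss,NoOverMemoryKill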


Best regards,
Magnus Jonsson

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 29f730d..cb85598 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1046,6 +1046,9 @@ Exclude shared memory from accounting.
 .TP
 \fBUsePss\fR
 Use PSS value instead of RSS (saved as RSS) to calculate real usage of memory.
+.TP
+\fBNoOverMemoryKill\fR
+Do not kill process that uses more then requested memory but do JobAcctGather.
 .RE
 
 .TP
diff --git a/src/plugins/jobacct_gather/common/common_jag.c b/src/plugins/jobacct_gather/common/common_jag.c
index b6204d6..36864b0 100644
--- a/src/plugins/jobacct_gather/common/common_jag.c
+++ b/src/plugins/jobacct_gather/common/common_jag.c
@@ -671,6 +671,7 @@ extern void jag_common_poll_data(
 	char		sbuf[72];
 	int energy_counted = 0;
 	static int first = 1;
+	static int no_over_memory_kill = -1;
 
 	xassert(callbacks);
 
@@ -685,6 +686,15 @@ extern void jag_common_poll_data(
 	}
 	processing = 1;
 
+	if (no_over_memory_kill == -1) {
+		char *acct_params = slurm_get_jobacct_gather_params();
+		if (acct_params && strstr(acct_params, "NoOverMemoryKill"))
+			no_over_memory_kill = 1;
+		else
+			no_over_memory_kill = 0;
+		xfree(acct_params);
+	}
+
 	if (!callbacks->get_precs)
 		callbacks->get_precs = _get_precs;
 
@@ -783,7 +793,9 @@ extern void jag_common_poll_data(
 	}
 	list_iterator_destroy(itr);
 
-	jobacct_gather_handle_mem_limit(total_job_mem, total_job_vsize);
+	if(!no_over_memory_kill) {
+		jobacct_gather_handle_mem_limit(total_job_mem, total_job_vsize);
+	}
 
 finished:
 	list_destroy(prec_list);
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index ee7674b..29f730d 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1043,6 +1043,9 @@ Acceptable values at present include:
 .TP 20
 \fBNoShared\fR
 Exclude shared memory from accounting.
+.TP
+\fBUsePss\fR
+Use PSS value instead of RSS (saved as RSS) to calculate real usage of memory.
 .RE
 
 .TP
diff --git a/src/plugins/jobacct_gather/common/common_jag.c b/src/plugins/jobacct_gather/common/common_jag.c
index 84b6775..b6204d6 100644
--- a/src/plugins/jobacct_gather/common/common_jag.c
+++ b/src/plugins/jobacct_gather/common/common_jag.c
@@ -95,6 +95,46 @@ static char *_skipdot (char *str)
 	return str;
 }
 
+/*
+ * collects the Pss value from /proc//smaps
+ */
+static int _get_pss(char *proc_smaps_file, jag_prec_t *prec) {
+uint64_t pss=0;
+char line[128];
+
+FILE *fp = fopen(proc_smaps_file, "r");
+if(!fp) {
+return -1;
+}
+	fcntl(fileno(fp), F_SETFD, FD_CLOEXEC);
+while(fgets(line,sizeof(line),fp)) {
+if(strncmp(line,"Pss:",4)) {
+continue;
+}
+int i=4;
+for(;i 0 && prec->rss > pss) {
+prec->rss = pss;
+}
+return 0;
+}
+
 static int _get_sys_interface_freq_line(uint32_t cpu, char *filename,
 	char * sbuf)
 {
@@ -359,10 +399,11 @@ static int _get_process_io_data_line(int in, jag_prec_t *prec) {
 	return 1;
 }
 
-static void _handle_stats(List prec_list, char *proc_stat_file,
-			  char *proc_io_file, jag_callbacks_t *callbacks)
+static void _handle_stats(List prec_list, char *proc_stat_file, char *proc_io_file, 
+char *proc_smaps_file, jag_callbacks_t *callbacks)
 {
 	static int no_share_data = -1;
+	static int use_pss = -1;
 	FILE *stat_fp = NULL;
 	FILE *io_fp = NULL;
 	int fd, fd2;
@@ -374,6 +415,11 @@ static void _handle_stats(List prec_list, char *proc_stat_file,
 			no_share_data = 1;
 		else
 			no_share_data = 0;
+
+		if (acct_params && strstr(acct_params, "UsePss"))
+			use_pss = 1;
+		else
+			use_pss = 0;
 		xfree(acct_params);
 	}
 
@@ -393,22 +439,35 @@ static void _handle_stats(List prec_list, char *proc_stat_file,
 	fcntl(fd, F_SETFD, FD_CLOEXEC);
 
 	prec = xmalloc(sizeof(jag_prec_t));
-	if (_get_process_data_line(fd, prec)) {
-		if (no_sha

[slurm-dev] Re: Job on wrong node

2015-02-04 Thread Magnus Jonsson

It would be nice to eliminate most of the slurm.conf on the nodes.

Most of the information could just as easily be fetched from the slurmctld 
on the master node (or is not needed at all).


An API to make a call to the master node and fetch configuration options 
could eliminate the need for NO_CONF_HASH :-)


All that should be needed is a slim slurm.conf with information about where 
the slurmctld lives (and how to contact it (munge/...)).
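
Something like the following is all I would like to have to ship to the 
nodes (a sketch, using our controller name):

# slurm.conf on the nodes, stripped down to "where is slurmctld
# and how do I talk to it"
ControlMachine=t-mn02
AuthType=auth/munge
# everything else would be fetched from slurmctld when slurmd starts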


/Magnus

On 2015-02-04 20:54, Danny Auble wrote:



On 02/04/2015 11:23 AM, Ulf Markwardt wrote:



DebugFlags=NO_CONF_HASH

But we do have different slurm.conf files due to different energy
sensors, prolog/epilog scripts.

The NO_CONF_HASH is very dangerous in most systems.  It should be
avoided at all cost.

It is interesting you have different sensors per node.  I could
understand in this case to have NO_CONF_HASH set.  We are thinking of
adding a new kind of slurm.conf include that doesn't get added to the
hash which you could put node specific information like this and could
remove the NO_CONF_HASH.

You might be able to get around the pro/epilog issue by having a master
pro/epilog that in turn calls different ones depending on the node.
Adding the new file would also eliminate this issue as well. This
doesn't exist today, but is being thought about.





I am guessing the slurm.conf file on your nodes may be insync, but
perhaps the slurmd on the troubled nodes may be running with an old
version.

All show slurm 14.11.3

I meant an older version of the file, not Slurm :).  With NO_CONF_HASH
set there isn't a real good way to verify the slurmd's are all running
the same slurm.conf.

I would suggest issuing a "scontrol shutdown" then restarting all your
nodes and your controller.  If you still see the problem after that then
indeed something else is the matter.  Perhaps routing tables or
something else.


U



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Changed behaviour of --exclusive in srun (job step context)

2014-10-02 Thread Magnus Jonsson

Is no one else affected by this?

/Magnus

On 2014-09-11 14:46, Magnus Jonsson wrote:

Hi!

A user found a "strange" new behaviour when using --exclusive with srun.

I have an example submit-script[1] that shows this.

I have tested this on 2.6.4 with the output [2] & [3] (stderr) and on
14.03.7 with the output [4] & [5] (stderr).

In 14.03.7, srun without --exclusive behaves like 2.6.4 with --exclusive.
In 14.03.7 with --exclusive you get some kind of "node exclusive" behaviour
within the job. --overcommit gives the same behaviour on both versions.

In 14.03.7, -c3 does not seem to work at all in the job step context. I see
warnings about this in the man page for srun, but in 2.6.4 this works as
I expect. Stderr output from srun: "srun: error: Unable to create job
step: Requested node configuration is not available"


If you need more information please let me know.

Best regards,
Magnus

1, http://www.hpc2n.umu.se/staff/magnus/slurm/submit.sh
2, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.2.6.4
3, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.2.6.4
4, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.14.03.7
5, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.14.03.7



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Changed behaviour of --exclusive in srun (job step context)

2014-09-11 Thread Magnus Jonsson

Hi!

A user found a "strange" new behaviour when using --exclusive with srun.

I have an example submit-script[1] that shows this.

I have tested this on 2.6.4 with the output [2] & [3] (stderr) and on
14.03.7 with the output [4] & [5] (stderr).

In 14.03.7, srun without --exclusive behaves like 2.6.4 with --exclusive.
In 14.03.7 with --exclusive you get some kind of "node exclusive" behaviour 
within the job. --overcommit gives the same behaviour on both versions.


In 14.03.7, -c3 does not seem to work at all in the job step context. I see 
warnings about this in the man page for srun, but in 2.6.4 this works as 
I expect. Stderr output from srun: "srun: error: Unable to create job 
step: Requested node configuration is not available"
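
For reference, a minimal sketch of the kind of job script that shows the 
difference (the real submit.sh is linked below [1]; task and core counts 
here are made up):

#!/bin/bash
#SBATCH -N 1
#SBATCH --exclusive

# several concurrent steps that should each get their own cores
srun -n1 -c3 --exclusive sleep 30 &
srun -n1 -c3 --exclusive sleep 30 &
srun -n1 -c3 --exclusive sleep 30 &
wait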



If you need more information please let me know.

Best regards,
Magnus

1, http://www.hpc2n.umu.se/staff/magnus/slurm/submit.sh
2, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.2.6.4
3, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.2.6.4
4, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.14.03.7
5, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.14.03.7

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Killing the backfill...

2014-05-20 Thread Magnus Jonsson

On 2014-05-20 14:54, Tommi T wrote:



On Tuesday, May 20, 2014 1:51 PM, Magnus Jonsson  wrote:

Hi!

While investigating another matter I found that if you have lots of
jobs running with short job steps, they kill the backfill scheduler very effectively.


Hi,

Do you use bf_continue-flag?

http://slurm.schedmd.com/sched_config.html


Yes and no. I implemented the first version of bf_continue, but it was 
while debugging some strange behaviour of bf_continue that I started looking 
more into what exactly caused last_job_update to be updated all the 
time.


I will return with more information about my bf_continue-findings.

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Killing the backfill...

2014-05-20 Thread Magnus Jonsson


Hi!

While investigating another matter I found that if you have lots of 
jobs running with short job steps, they kill the backfill scheduler very effectively.


As every action on a job step modifies the last_job_update global 
variable, this effectively stops the backfill loop.


This can be demonstrated very simply with this batch script on 
a system with some jobs in the queue.


8<
#!/bin/bash

for n in `seq 120`; do
srun sleep 1
done
8<

In the 2.6.7 version I can only find a few places where last_job_update is 
used, and only one that is directly related to a job step.


Is there a need for the code to update last_job_update for every 
action on a job step?


Should there be a last_job_step_update as well? Are there actions on a job 
step that affect the queue?


Could there be another variable that could be used to trigger a 
rescheduling of the queue, based on events that actually affect the 
scheduling of the queue?


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Change node weight based on partition? QOS? other?

2014-05-19 Thread Magnus Jonsson


Hi!

We have a scenario where we would like to use the node weight feature in 
Slurm to pack one group of jobs onto one half of the machine and other jobs 
onto the other half, although some overlap is OK.


Is there a way of altering the node weight for one job, via a partition 
or via a QOS?
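
For reference, today the weights are static per node in slurm.conf, 
something like this (node names and values are made up; lower weight means 
allocated first):

NodeName=t-cn[0101-0150] CPUs=48 Weight=10
NodeName=t-cn[0201-0250] CPUs=48 Weight=20

What I'm after is a way to flip or override these weights depending on 
which partition or QOS a job comes in through.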


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Added spank_item.

2014-03-04 Thread Magnus Jonsson
I have made a patch for SPANK that allows fetching SLURM_RESTART_COUNT 
in my SPANK plugin.


The patch is attached (against 2.6.6).

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff a/slurm/spank.h b/slurm/spank.h
--- a/slurm/spank.h
+++ b/slurm/spank.h
@@ -169,7 +169,8 @@ enum spank_item {
 S_JOB_ALLOC_CORES,   /* Job allocated cores in list format (char **) */
 S_JOB_ALLOC_MEM, /* Job allocated memory in MB (uint32_t *)  */
 S_STEP_ALLOC_CORES,  /* Step alloc'd cores in list format  (char **) */
-S_STEP_ALLOC_MEM /* Step alloc'd memory in MB (uint32_t *)   */
+S_STEP_ALLOC_MEM,/* Step alloc'd memory in MB (uint32_t *)   */
+S_SLURM_RESTART_COUNT/* Job restart count (uint32_t *)   */
 };
 
 typedef enum spank_item spank_item_t;
diff a/src/common/plugstack.c b/src/common/plugstack.c
--- a/src/common/plugstack.c
+++ b/src/common/plugstack.c
@@ -2133,6 +2133,13 @@ spank_err_t spank_get_item(spank_t spank, spank_item_t item, ...)
 		else
 			*p2uint32 = 0;
 		break;
+	case S_SLURM_RESTART_COUNT:
+		p2uint32 = va_arg(vargs, uint32_t *);
+		if (slurmd_job)
+			*p2uint32 = slurmd_job->restart_cnt;
+		else
+			*p2uint32 = 0;
+		break;
 	case S_SLURM_VERSION:
 		p2vers = va_arg(vargs, char  **);
 		*p2vers = SLURM_VERSION_STRING;




[slurm-dev] RE: --exclusive together with --ntasks-per-node not working as expected.

2014-02-20 Thread Magnus Jonsson

Yes, but I fail to see the absence.

I say that I want 8 tasks per node, not 16.

Saying that I want the nodes exclusively should not invalidate that, in my opinion.

As a basic rule we tell our users not to think in terms of nodes but in 
terms of tasks, and Slurm will give them the number of nodes they need.

This might not be true for more advanced users/use cases, but...

Best regards,
Magnus

On 2014-02-19 15:49, Rod Schultz wrote:

In the absence of other directives, slurm tries to use the minimum number of 
nodes.

Instead of -n16, try -N 2
That tells slurm to use two nodes.

Here's a demo case

srun -l -N2 --tasks-per-node=2 hostname
1: trek0
0: trek0
2: trek1
3: trek1



-----Original Message-----
From: Magnus Jonsson [mailto:mag...@hpc2n.umu.se]
Sent: Wednesday, February 19, 2014 1:28 AM
To: slurm-dev
Subject: [slurm-dev] --exclusive together with --ntasks-per-node not working as 
expected.

Hi!

We have a user that submitted a job that did not start as expected.

He was using --exclusive together with --ntasks-per-node but ended up
with all tasks on one node anyway.

8<
#SBATCH -n 16
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
8<

See the attached files for more information about how the job was submitted.

We are currently running version 2.6.3.

Best regards,
Magnus



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] --exclusive together with --ntasks-per-node not working as expected.

2014-02-19 Thread Magnus Jonsson

Hi!

We have a user that submitted a job that did not start as expected.

He was using --exclusive together with --ntasks-per-node but ended up 
with all tasks on one node anyway.


8<
#SBATCH -n 16
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
8<

See the attached files for more information about how the job was submitted.

We are currently running version 2.6.3.

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
JobId=1603907 Name=submit_e
   UserId=magnus(2066) GroupId=folk(3001)
   Priority=658834 Account=sysop QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:44 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2014-02-19T09:03:10 EligibleTime=2014-02-19T09:03:10
   StartTime=2014-02-19T09:05:45 EndTime=2014-02-19T09:35:45
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=devel AllocNode:Sid=t-mn01:14395
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t-cn0304
   BatchHost=t-cn0304
   NumNodes=6 NumCPUs=48 CPUs/Task=1 ReqS:C:T=*:*:*
 Nodes=t-cn0304 CPU_IDs=0-47 Mem=127200
   MinCPUsNode=8 MinMemoryCPU=2650M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/nobackup/home/m/magnus/y/submit_e
   WorkDir=/pfs/nobackup/home/m/magnus/y
   BatchScript=
#!/bin/bash

#SBATCH -A sysop
#SBATCH -p devel
#SBATCH -o e.out
#SBATCH -n 16
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8

scontrol show job -d -d $SLURM_JOBID

srun hostname


t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
JobId=1603906 Name=submit
   UserId=magnus(2066) GroupId=folk(3001)
   Priority=658834 Account=sysop QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2014-02-19T09:03:09 EligibleTime=2014-02-19T09:03:09
   StartTime=2014-02-19T09:03:44 EndTime=2014-02-19T09:33:44
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=devel AllocNode:Sid=t-mn01:14395
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t-cn[1015,1017]
   BatchHost=t-cn1015
   NumNodes=2 NumCPUs=24 CPUs/Task=1 ReqS:C:T=*:*:*
 Nodes=t-cn[1015,1017] CPU_IDs=0-11 Mem=31800
   MinCPUsNode=8 MinMemoryCPU=2650M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/nobackup/home/m/magnus/y/submit
   WorkDir=/pfs/nobackup/home/m/magnus/y
   BatchScript=
#!/bin/bash

#SBATCH -A sysop
#SBATCH -p devel
#SBATCH -o o.out
#SBATCH -n 16
#SBATCH --ntasks-per-node=8

scontrol show job -d -d $SLURM_JOBID

srun hostname


t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se




[slurm-dev] Only allow some nodes in an partition to run jobs that stay within one node.

2014-02-05 Thread Magnus Jonsson

Hi!

We have a part of our cluster that has limited interconnect.

Is there a way to make part of a partition only allow jobs that 
stay within one node, without making a new partition?


I know I can make a submit plugin that changes the partition if the job 
seems to fit within the limits, but this might also confuse the users.


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Bad behaviour of slurm with -c

2013-10-22 Thread Magnus Jonsson

Got the same behaviour with 2.6.3.

/Magnus

On 2013-10-04 22:51, Moe Jette wrote:


There were bug fixes related to socket-based allocations in both version
2.6.2 and 2.6.3. I am not sure if these changes will fix the problem
that you report, but it is probably worth a look.

Quoting Magnus Jonsson :



Hi!

I have a case where Slurm allocates fewer cores than required.

It doesn't happen every time, but (right now) about 1 in 10 submissions
fails this way.

Probably because of the layout of the currently running jobs on the nodes.

Here is some information I collected. I also attach our slurm.conf.

We are running Slurm 2.6.1.

Best regards,
Magnus

==> submit <==
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=out.%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00

echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---

srun echo ""

==> out.1313514 <==
---
SLURM_CHECKPOINT_IMAGE_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_JOB_NAME=84212
SLURMD_NODENAME=t-cn0113
SLURM_TOPOLOGY_ADDR=t-isw0501.t-isw0101.t-cn0113
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_MEM_PER_CPU=2500
SLURM_NNODES=5
SLURM_JOBID=1313514
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=3,4(x2),3,2
SLURM_JOB_ID=1313514
SLURM_CPUS_PER_TASK=12
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_TASK_PID=19827
SLURM_NPROCS=16
SLURM_CPUS_ON_NODE=24
SLURM_PROCID=0
SLURM_JOB_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=24,48(x2),36,24
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=t-mn01.hpc2n.umu.se
SLURM_JOB_NUM_NODES=5
---
JobId=1313514 Name=84212
   UserId=magnus(2066) GroupId=folk(3001)
   Priority=10 Account=default QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2013-10-04T15:16:43 EligibleTime=2013-10-04T15:16:43
   StartTime=2013-10-04T15:23:02 EndTime=2013-10-04T15:28:02
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=t-mn01:8853
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t-cn[0113,0423-0424,0433-0434]
   BatchHost=t-cn0113
   NumNodes=5 NumCPUs=180 CPUs/Task=12 ReqS:C:T=*:*:*
 Nodes=t-cn0113 CPU_IDs=12-35 Mem=6
 Nodes=t-cn[0423-0424] CPU_IDs=0-47 Mem=12
 Nodes=t-cn0433 CPU_IDs=6-41 Mem=9
 Nodes=t-cn0434 CPU_IDs=6-11,24-41 Mem=6
   MinCPUsNode=12 MinMemoryCPU=2500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/nobackup/home/m/magnus/84212/submit
   WorkDir=/pfs/nobackup/home/m/magnus/84212
   BatchScript=
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00

echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---

srun echo ""

---

==> err.1313514 <==
srun: error: Unable to create job step: More processors requested than
permitted


==> slurm.log <==
Oct  4 15:23:02 t-mn02 slurmctld[28426]: backfill test for job 1313514
Oct  4 15:23:02 t-mn02 slurmctld[28426]: error: cons_res:
_compute_c_b_task_dist oversubscribe for job 1313514
Oct  4 15:23:02 t-mn02 slurmctld[28426]: backfill: Started
JobId=1313514 on t-cn[0113,0423-0424,0433-0434]
Oct  4 15:23:03 t-mn02 slurmctld[28426]: _slurm_rpc_job_step_create
for job 1313514: More processors requested than permitted
Oct  4 15:23:03 t-mn02 slurmctld[28426]: completing job 1313514
Oct  4 15:23:03 t-mn02 slurmctld[28426]: sched: job_complete for
JobId=1313514 successful, exit code=256

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Bad behaviour of slurm with -c

2013-10-04 Thread Magnus Jonsson


Hi!

I have a case where Slurm allocates fewer cores than required.

It doesn't happen every time, but (right now) about 1 in 10 submissions 
fails this way.

Probably because of the layout of the currently running jobs on the nodes.

Here is some information I collected. I also attach our slurm.conf.

We are running Slurm 2.6.1.

Best regards,
Magnus

==> submit <==
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=out.%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00

echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---

srun echo ""

==> out.1313514 <==
---
SLURM_CHECKPOINT_IMAGE_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_JOB_NAME=84212
SLURMD_NODENAME=t-cn0113
SLURM_TOPOLOGY_ADDR=t-isw0501.t-isw0101.t-cn0113
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_MEM_PER_CPU=2500
SLURM_NNODES=5
SLURM_JOBID=1313514
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=3,4(x2),3,2
SLURM_JOB_ID=1313514
SLURM_CPUS_PER_TASK=12
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_TASK_PID=19827
SLURM_NPROCS=16
SLURM_CPUS_ON_NODE=24
SLURM_PROCID=0
SLURM_JOB_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=24,48(x2),36,24
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=t-mn01.hpc2n.umu.se
SLURM_JOB_NUM_NODES=5
---
JobId=1313514 Name=84212
   UserId=magnus(2066) GroupId=folk(3001)
   Priority=10 Account=default QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2013-10-04T15:16:43 EligibleTime=2013-10-04T15:16:43
   StartTime=2013-10-04T15:23:02 EndTime=2013-10-04T15:28:02
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=t-mn01:8853
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t-cn[0113,0423-0424,0433-0434]
   BatchHost=t-cn0113
   NumNodes=5 NumCPUs=180 CPUs/Task=12 ReqS:C:T=*:*:*
 Nodes=t-cn0113 CPU_IDs=12-35 Mem=6
 Nodes=t-cn[0423-0424] CPU_IDs=0-47 Mem=12
 Nodes=t-cn0433 CPU_IDs=6-41 Mem=9
 Nodes=t-cn0434 CPU_IDs=6-11,24-41 Mem=6
   MinCPUsNode=12 MinMemoryCPU=2500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/nobackup/home/m/magnus/84212/submit
   WorkDir=/pfs/nobackup/home/m/magnus/84212
   BatchScript=
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00

echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---

srun echo ""

---

==> err.1313514 <==
srun: error: Unable to create job step: More processors requested than 
permitted



==> slurm.log <==
Oct  4 15:23:02 t-mn02 slurmctld[28426]: backfill test for job 1313514
Oct  4 15:23:02 t-mn02 slurmctld[28426]: error: cons_res: 
_compute_c_b_task_dist oversubscribe for job 1313514
Oct  4 15:23:02 t-mn02 slurmctld[28426]: backfill: Started JobId=1313514 
on t-cn[0113,0423-0424,0433-0434]
Oct  4 15:23:03 t-mn02 slurmctld[28426]: _slurm_rpc_job_step_create for 
job 1313514: More processors requested than permitted

Oct  4 15:23:03 t-mn02 slurmctld[28426]: completing job 1313514
Oct  4 15:23:03 t-mn02 slurmctld[28426]: sched: job_complete for 
JobId=1313514 successful, exit code=256


--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
#
# See the slurm.conf man page for more information.
#
ControlMachine=t-mn02
#ControlAddr=
#BackupController=
#BackupAddr=
# 
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none 
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=YES
RebootProgram=/sbin/reboot
# our cleanup epilog
Epilog=/var/conf/slurm/hpc2n-epilog
#PrologSlurmctld= 
#FirstJobId=1 
#GresTypes= 
#GroupUpdateForce=0 
#GroupUpdateTime=600 
#JobCheckpointDir=/var/slurm/checkpoint 
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0 
#JobRequeue=1 
#JobSubmitPlugins=1 
#KillOnBadExit=0 
###
#Optimization Toolbox   2
#Partial Differential Equation Toolbox  2
#Statistics Toolbox 2
#Image Processing Toolbox   2
#Curve Fitting Toolbox  5
#Signal Processing Toolbox  5
#Communications Toolbox 5
#Parallel Computing Toolbox 15
# One more then we actually have!
Licenses=matlab*21,matlab-pct*16,matlab-ct*6,matlab-spt*6,matlab-cft*6,matlab-ipt*3,matlab-st*3,matlab-pdet*3,matlab-ot*3
MailProg=/usr/bin/mail 
MaxJobCount=2
#MaxTasksPerNode=128 
#MpiDefault=none
MpiDefault=openmpi
#MpiParams=ports=#-# 
## needs openmpi-1.5+
MpiParams=ports=12000-12999
#PluginDir= 
#PlugStackConfig= 
#PrivateData=jobs 
ProctrackType=proctrack/cgroup
#P

[slurm-dev] Expected start time far, far away...

2013-10-04 Thread Magnus Jonsson

We have a reservation for some nodes in our cluster

ReservationName=test StartTime=2013-09-20T09:22:50
   EndTime=2014-09-20T09:22:50 Duration=365-00:00:00
   Nodes=t-cn[0301,0710,0715-0716,0736,0828] NodeCnt=6 CoreCnt=288
   Features=(null) PartitionName=(null) Flags=IGNORE_JOBS,SPEC_NODES
   Users=xxx Accounts=sysop Licenses=(null) State=ACTIVE

For some users, squeue reports that their jobs will start after this 
reservation ends (2014-09-20), and the users are asking us "Will my 
jobs not start before that?".


Is there a way to prevent this from happening? Maybe expected start times 
that are that far in the future should not be allowed to appear.


If the start time is more than a week or two in the future, it will 
probably not be that accurate anyway.


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Bug in squeue command.

2013-08-22 Thread Magnus Jonsson

Hi!

A user reported a strange behaviour of squeue in our newly installed 
2.6.1 version.


The following submit file:
-8<-
#!/bin/bash -l

#SBATCH -N 1
#SBATCH -n 12
#SBATCH --time=5-00:00:00

hostname
-8<-

Results in the following output from squeue if I use -j or -u:

-8<-
% squeue -u magnus
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
1190903     batch  submit2   magnus PD  0:00     12 (Priority)

-8<-

-8<-
% squeue -j 1190903
  JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
1190903     batch  submit2   magnus PD  0:00     12 (Priority)

-8<-

But if I use grep:

-8<-
% squeue | grep 1190903
1190903     batch  submit2   magnus PD  0:00      1 (Priority)

-8<-

I get the expected behaviour.

I have tracked this down to a commit ("To minimize overhead"):

https://github.com/SchedMD/slurm/commit/ac44db862c8d1f460e55ad09017d058942ff6499

On lines 397/416 in src/squeue/opts.c, max_cpus is used in _get_node_cnt() 
to estimate the number of nodes required.

Reverting the params.max_cpus code, I get the expected behaviour.

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: cons_res: Can't use Partition SelectType

2013-08-07 Thread Magnus Jonsson


Hi!

To use a per-partition SelectTypeParameters you must use CR_Socket or 
(CR_Core and CR_ALLOCATE_FULL_SOCKET) as the default SelectTypeParameters.


You are using CR_CPU_Memory.
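
If I read the check correctly, a combination like the following would 
satisfy it (a sketch; keep the memory tracking if you rely on it, and 
adjust the partition line to your setup):

# slurm.conf (global default):
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_ALLOCATE_FULL_SOCKET

# partitions.conf (per-partition override):
PartitionName=special Nodes=node[01-10] SelectTypeParameters=CR_Core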

Best regards,
Magnus

On 2013-08-05 23:13, Eva Hocks wrote:




I am getting spam messages in the logs:

[2013-08-05T14:04:32.000] cons_res: Can't use Partition SelectType
unless using CR_Socket or CR_Core and CR_ALLOCATE_FULL_SOCKET



The slurm.conf settings are:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory



and I have set one partition in partitions.conf to
SelectTypeParameters=CR_Core


Why does slurm complain? I HAVE set CR_Core and I even checked the
spelling.


Thanks
Eva



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: strftime issue(s).

2013-06-17 Thread Magnus Jonsson

Works for me :-)

/M

On 2013-06-17 10:25, Moe Jette wrote:

Hi Magnus,

Thank you for reporting this problem and providing a patch. Most places
that use strftime already test for a return value of zero, but I see two
that do not. Perhaps using a macro that can be used from multiple places
will be a better solution than changing the code in various places. See
attached variation of your patch.

Moe

Quoting Magnus Jonsson :


Hi!

We found an issue in sacct that we pinned down to a strftime call in
'src/common/parse_time.c' (slurm_make_time_str).

Reproducible with (in 2.5.{6,7}):

OK:
% sacct -Po end%19 -s failed,completed,timeout,cancelled -S 2013-06-13
| head | cat -vet
2013-06-14T11:40:14$
2013-06-14T11:40:14$
2013-06-14T11:40:11$


Fail:
sacct -Po end%18 -s failed,completed,timeout,cancelled -S 2013-06-13 |
head | cat -vet
End$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$

The problem is that the output of strftime with a buffer smaller than the
required length of the output has been undefined in libc since libc 4.4.4.

Different libc implementations seem to handle this differently. On
Solaris, for example, the buffer is truncated at the given length but
strftime still returns 0.

From man page of strftime (Ubuntu Precise):

--8<
RETURN VALUE
The  strftime() function returns the number of characters placed in the
array s, not including the terminating null byte, provided the  string,
including  the  terminating  null byte, fits.  Otherwise, it returns 0,
and the contents of the array is  undefined.   (This  behavior  applies
since  at  least  libc  4.4.4;  very old versions of libc, such as libc
4.4.1, would return max if the array was too small.)

Note that the return value 0 does not necessarily  indicate  an  error;
for example, in many locales %p yields an empty string.
--8<

From man page of strftime (Solaris)
--8<
If  the total number of resulting characters including the terminating
null character is more than maxsize, strftime() returns 0 and the
contents of the array are indeterminate.
--8<

The return value of strftime is not checked in slurm_make_time_str,
making the returned string undefined.

As I see it the problem can be solved in several ways:

1. Using a "large" temporary buffer for the output; the expected
behaviour of sacct end%N will then be fine for most normal cases.

2. Only checking the return value and returning an error or a
well-defined output.

I have attached a patch for case 1 that sets the output to "#"
if the output still does not fit into the buffer.

There seem to be other places in the Slurm code base that use
strftime without checking the return code. Some of them might be OK
due to the format string and the size of the buffer, but this might need
to be looked into in more depth.

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] strftime issue(s).

2013-06-17 Thread Magnus Jonsson

Hi!

We found an issue in sacct that we pinned down to a strftime call in 
'src/common/parse_time.c' (slurm_make_time_str).


Reproducible with (in 2.5.{6,7}):

OK:
% sacct -Po end%19 -s failed,completed,timeout,cancelled -S 2013-06-13 | 
head | cat -vet

2013-06-14T11:40:14$
2013-06-14T11:40:14$
2013-06-14T11:40:11$


Fail:
sacct -Po end%18 -s failed,completed,timeout,cancelled -S 2013-06-13 | 
head | cat -vet

End$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$

The problem is that the output of strftime with a buffer smaller than the 
required length of the output has been undefined in libc since libc 4.4.4.

Different libc implementations seem to handle this differently. On 
Solaris, for example, the buffer is truncated at the given length but 
strftime still returns 0.


From man page of strftime (Ubuntu Precise):

--8<
RETURN VALUE
The  strftime() function returns the number of characters placed in the
array s, not including the terminating null byte, provided the  string,
including  the  terminating  null byte, fits.  Otherwise, it returns 0,
and the contents of the array is  undefined.   (This  behavior  applies
since  at  least  libc  4.4.4;  very old versions of libc, such as libc
4.4.1, would return max if the array was too small.)

Note that the return value 0 does not necessarily  indicate  an  error;
for example, in many locales %p yields an empty string.
--8<

From man page of strftime (Solaris)
--8<
If  the total number of resulting characters including the terminating 
null character is more than maxsize, strftime() returns 0 and the 
contents of the array are indeterminate.

--8<

The return value of strftime is not checked in slurm_make_time_str, 
making the returned string undefined.

As I see it the problem can be solved in several ways:

1. Using a "large" temporary buffer for the output; the expected 
behaviour of sacct end%N will then be fine for most normal cases.

2. Only checking the return value and returning an error or a 
well-defined output.

I have attached a patch for case 1 that sets the output to "#" if 
the output still does not fit into the buffer.

There seem to be other places in the Slurm code base that use strftime 
without checking the return code. Some of them might be OK due to the 
format string and the size of the buffer, but this might need to be looked 
into in more depth.


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff -ru site/src/common/parse_time.c amd64_ubuntu1004/src/common/parse_time.c
--- site/src/common/parse_time.c	2013-06-05 21:43:00.0 +0200
+++ amd64_ubuntu1004/src/common/parse_time.c	2013-06-14 16:56:20.0 +0200
@@ -597,6 +597,7 @@
 		static char fmt_buf[32];
 		static const char *display_fmt;
 		static bool use_relative_format;
+		char tmp_string[(size<256?256:size+1)];
 
 		if (!display_fmt) {
 			char *fmt = getenv("SLURM_TIME_FORMAT");
@@ -626,7 +627,11 @@
 		if (use_relative_format)
 			display_fmt = _relative_date_fmt(&time_tm);
 
-		strftime(string, size, display_fmt, &time_tm);
+		if(strftime(tmp_string, sizeof(tmp_string), display_fmt, &time_tm) == 0) {
+			memset(tmp_string,'#',size);
+		}
+		strncpy(string,tmp_string,size);
+		string[size-1] = 0;
 	}
 }




[slurm-dev] sbatch --exclusive --mem-per-cpu

2013-03-21 Thread Magnus Jonsson

Hi!

If a user asks for more than the available memory per core (in our case 
2500 MB/core) with -N1 --mem-per-cpu and also adds --exclusive, Slurm will 
allocate all the cores in the node but only account for the number of 
cores that fulfil the --mem-per-cpu requirement.


For example, if I say --mem-per-cpu=5000, only half of the available cores 
will be accounted for, but all of them will be blocked.


This is the (relevant) output of scontrol show job for a real job on our 
system:


-8<-
JobId=69211 Name=memory
   NumNodes=1 NumCPUs=30 CPUs/Task=1 ReqS:C:T=*:*:*
 Nodes=t-cn0102 CPU_IDs=0-47 Mem=12
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)

#!/bin/bash
#SBATCH --mem-per-cpu=4000
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --exclusive
-8<-

Total memory/node=128000M, 48 cores, default 2500M/core

As I understand it this will give the wrong input to the fairshare 
scheduler and result in the wrong (too high) priority for the user.


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] issue with task/affinity and srun --exclusive

2013-03-21 Thread Magnus Jonsson

Hi!

We have a problem with task/affinity and srun --exclusive

If I submit a job with sbatch that runs srun --exclusive, it looks like, 
from the output of hwloc-bind --get, that tasks are allocated (and 
bound) to cores before task/affinity gets a chance to distribute them 
according to the cpu_bind setting.


In the example below I use 'sbatch --exclusive' and get 48 cores in total.

srun -n1 -c6 --cpu_bind=rank_ldom sh -c "hwloc-bind --get | ./hex2bin"
results in:
00 00 00 00 00 00 00 11  = 0x003f

srun -n1 -c6 --exclusive --cpu_bind=rank_ldom sh -c "hwloc-bind --get | 
./hex2bin"

results in:
00 00 01 01 01 01 01 01  = 0x41041041

This also looks like the bitmask that task/affinity gets from Slurm.

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet






[slurm-dev] cons_res select_p_select_nodeinfo_set_all problem with multiple partitions.

2013-03-12 Thread Magnus Jonsson

Hi!

I found a bug in cons_res/select_p_select_nodeinfo_set_all.

If a node is part of two (or more) partitions, the code will only count 
the number of cores/CPUs in the partition that has the most running jobs 
on that node.


Patch attached to fix the problem.

I also added a new function to bitstring to count the number of bits in 
a range (bit_set_count_range) and made a minor improvement to 
bit_set_count while reviewing the range version.


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff -ru site/src/common/bitstring.c amd64_ubuntu1004/src/common/bitstring.c
--- site/src/common/bitstring.c	2013-03-08 20:29:51.0 +0100
+++ amd64_ubuntu1004/src/common/bitstring.c	2013-03-12 14:07:20.0 +0100
@@ -69,6 +69,7 @@
 strong_alias(bit_not,		slurm_bit_not);
 strong_alias(bit_or,		slurm_bit_or);
 strong_alias(bit_set_count,	slurm_bit_set_count);
+strong_alias(bit_set_count_range, slurm_bit_set_count_range);
 strong_alias(bit_clear_count,	slurm_bit_clear_count);
 strong_alias(bit_nset_max_count,slurm_bit_nset_max_count);
 strong_alias(int_and_set_count,	slurm_int_and_set_count);
@@ -662,15 +663,45 @@
 	_assert_bitstr_valid(b);
 
 	bit_cnt = _bitstr_bits(b);
-	for (bit = 0; bit < bit_cnt; bit += word_size) {
-		if ((bit + word_size - 1) >= bit_cnt)
-			break;
+	for (bit = 0; (bit + word_size) <= bit_cnt; bit += word_size) {
 		count += hweight(b[_bit_word(bit)]);
 	}
 	for ( ; bit < bit_cnt; bit++) {
 		if (bit_test(b, bit))
 			count++;
 	}
+	return count;
+}
+
+/*
+ * Count the number of bits set in a range of bitstring.
+ *   b (IN)		bitstring to check
+ *   start (IN) first bit to check
+ *   end (IN)	last bit to check+1
+ *   RETURN		count of set bits
+ */
+int
+bit_set_count_range(bitstr_t *b, int start, int end)
+{
+	int count = 0;
+	bitoff_t bit, bit_cnt;
+	const int word_size = sizeof(bitstr_t) * 8;
+
+	_assert_bitstr_valid(b);
+	_assert_bit_valid(b,start);
+
+	end = MIN(end,_bitstr_bits(b));
+	for ( bit = start; bit < end && bit < ((start+word_size-1)/word_size) * word_size; bit++) {
+		if (bit_test(b, bit))
+			count++;
+	}
+	for (; (bit + word_size) <= end ; bit += word_size) {
+		count += hweight(b[_bit_word(bit)]);
+	}
+	for ( ; bit < end; bit++) {
+		if (bit_test(b, bit))
+			count++;
+	}
 
 	return count;
 }
diff -ru site/src/common/bitstring.h amd64_ubuntu1004/src/common/bitstring.h
--- site/src/common/bitstring.h	2013-03-08 20:29:51.0 +0100
+++ amd64_ubuntu1004/src/common/bitstring.h	2013-03-12 14:09:18.0 +0100
@@ -172,6 +172,7 @@
 void	bit_not(bitstr_t *b);
 void	bit_or(bitstr_t *b1, bitstr_t *b2);
 int	bit_set_count(bitstr_t *b);
+int bit_set_count_range(bitstr_t *b, int start, int end);
 int	bit_clear_count(bitstr_t *b);
 int	bit_nset_max_count(bitstr_t *b);
 int	int_and_set_count(int *i1, int ilen, bitstr_t *b2);
diff -ru site/src/common/slurm_xlator.h amd64_ubuntu1004/src/common/slurm_xlator.h
--- site/src/common/slurm_xlator.h	2013-03-08 20:29:51.0 +0100
+++ amd64_ubuntu1004/src/common/slurm_xlator.h	2013-03-12 12:32:50.0 +0100
@@ -93,6 +93,7 @@
 #define	bit_not			slurm_bit_not
 #define	bit_or			slurm_bit_or
 #define	bit_set_count		slurm_bit_set_count
+#define	bit_set_count_range	slurm_bit_set_count_range
 #define	bit_clear_count		slurm_bit_clear_count
 #define	bit_nset_max_count	slurm_bit_nset_max_count
 #define	bit_and_set_count	slurm_bit_and_set_count
diff -ru site/src/plugins/select/cons_res/select_cons_res.c amd64_ubuntu1004/src/plugins/select/cons_res/select_cons_res.c
--- site/src/plugins/select/cons_res/select_cons_res.c	2013-03-11 11:13:31.0 +0100
+++ amd64_ubuntu1004/src/plugins/select/cons_res/select_cons_res.c	2013-03-12 13:30:06.0 +0100
@@ -2230,7 +2230,7 @@
 	struct part_res_record *p_ptr;
 	struct node_record *node_ptr = NULL;
 	int i=0, n=0, c, start, end;
-	uint16_t tmp, tmp_16 = 0;
+	uint16_t tmp, tmp_16 = 0, tmp_part;
 	static time_t last_set_all = 0;
 	uint32_t node_threads, node_cpus;
 	select_nodeinfo_t *nodeinfo = NULL;
@@ -2275,20 +2275,17 @@
 		for (p_ptr = select_part_record; p_ptr; p_ptr = p_ptr->next) {
 			if (!p_ptr->row)
 continue;
+			tmp_part = 0;
 			for (i = 0; i < p_ptr->num_rows; i++) {
 if (!p_ptr->row[i].row_bitmap)
 	continue;
-tmp = 0;
-for (c = start; c < end; c++) {
-	if (bit_test(p_ptr->row[i].row_bitmap,
-		 c))
-		tmp++;
-}
+tmp = bit_set_count_range(p_ptr->row[i].row_bitmap,
+	start,end);
 /* get the row with the largest cpu
    count on it. */
-if (tmp > tmp_16)
-	tmp_16 = tmp;
+tmp_part = MAX(tmp,tmp_part);
 			}
+			tmp_16 += tmp_part;
 		}
 
 		/* The minimum allocatable unit may a core, so scale




[slurm-dev] Re: Licenses verification mechanism

2013-03-08 Thread Magnus Jonsson
Yes, it's possible to check if it's not set, but for us not all users
need a license, and it's not as simple as just refusing to let people
start software based on the licence information in Slurm.


/Magnus

On 2013-03-08 12:24, Taras Shapovalov wrote:


Hi Magnus,

Thanks, this solution probably will work for us as well.

Also, when a user does not use the -L option, then this could be checked
(I believe) in contribs/lua/job_submit.lua with a few lines of code (in
the slurm_job_submit function).

--
Taras

On 03/08/2013 09:37 AM, Magnus Jonsson wrote:

We have solved this by using the license handler in Slurm and let our
users specify the licenses with -L.

Outside of Slurm we have a script that periodically checks our licence
server (FlexLM) for available licenses and for the licenses used in
Slurm, and blocks a number of licenses with a "licences" reservation
that no one can run in.

It also has the ability to make sure that there are available licenses
if run in the prolog, and to fail the job if there are no licenses left.

It's not a perfect solution but it seems to work fairly well for us.

The only problem is that a user can grab a license without specifying
the -L option, but this is better than nothing.

If anybody is interested in more details, just send me an email and I
will try to answer.

Best Regards,
Magnus

On 2013-03-08 02:58, Taras Shapovalov wrote:

Hi all,

Recently I was faced with a case where users run software which requires
licenses. The license server is running somewhere outside several
clusters, and jobs from those clusters should check the availability of
the licenses periodically. If there are no free licenses, then the job
should be re-queued (so that after some time the license availability
will be verified again).

Does anybody have experience with a case where a job (or some script)
checks some condition periodically and stays in the queue if the
condition has not been met yet?

--
Taras




--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Licenses verification mechanism

2013-03-08 Thread Magnus Jonsson
We have solved this by using the license handler in Slurm and let our
users specify the licenses with -L.

Outside of Slurm we have a script that periodically checks our licence
server (FlexLM) for available licenses and for the licenses used in
Slurm, and blocks a number of licenses with a "licences" reservation
that no one can run in.

It also has the ability to make sure that there are available licenses
if run in the prolog, and to fail the job if there are no licenses left.

It's not a perfect solution but it seems to work fairly well for us.

The only problem is that a user can grab a license without specifying
the -L option, but this is better than nothing.

If anybody is interested in more details, just send me an email and I
will try to answer.
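
For reference, here is a very rough sketch of the idea. It is not our
actual script; the feature name, license server address, lmstat parsing
and the "licences" reservation name are placeholders you would have to
adapt:

#!/bin/bash
# Rough sketch only. Assumes FlexLM's "lmutil lmstat" is available and
# that a reservation named "licences" already exists in Slurm. The awk
# field numbers depend on your lmstat output and are only illustrative.
FEATURE=somefeature
LMHOST=port@licenseserver

read issued inuse < <(lmutil lmstat -c "$LMHOST" -f "$FEATURE" \
        | awk '/Users of/ {print $6, $11; exit}')

# Licenses consumed outside of Slurm would be what the license server
# reports as in use minus what Slurm itself has handed out; the Slurm
# side of that bookkeeping is left out here for brevity.
external=$inuse

# Block that many licenses from Slurm's pool via the reservation.
scontrol update ReservationName=licences Licenses="$FEATURE:$external"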


Best Regards,
Magnus

On 2013-03-08 02:58, Taras Shapovalov wrote:

Hi all,

Recently I was faced with a case where users run software which requires
licenses. The license server is running somewhere outside several
clusters, and jobs from those clusters should check the availability of
the licenses periodically. If there are no free licenses, then the job
should be re-queued (so that after some time the license availability
will be verified again).

Does anybody have experience with a case where a job (or some script)
checks some condition periodically and stays in the queue if the
condition has not been met yet?

--
Taras


--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Problem with backfill and patch for solution

2013-03-01 Thread Magnus Jonsson

Hi!


We have what seems to be a similar type of load, and have in periods
experienced the same problem.

There are some parameters that can be used to tune the backfiller.

We have had good results with setting bf_max_job_user to a small value
(between 5 and 10), and bf_resolution to a large value (around 3600).

bf_max_job_user is similar to Maui MAXIJOB limit; the backfiller will
only try this many jobs for each user.  This is especially useful if
some users have many identical or nearly identical jobs in the queue.


I have tried tuning with bf_max_job_user and, as you say, it's especially
useful with users having many identical jobs in the queue, but I think it
is somewhat bad for the backfiller not to look at the whole queue.

Many of our users with many jobs do have more or less identical jobs, but
not all of them, and in that case not looking at the complete queue would
be bad for the user, especially if you put in small jobs for testing
purposes.



bf_resolution is the time resolution (in seconds) of the time slots used
for estimating when a job can start.  The default, 60 seconds, was way
too low for us.


I will try increasing the resolution value and see if it will pick up 
speed with that.
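
For example, something along these lines in slurm.conf (the exact values
are of course site specific; the numbers are just the ones suggested
above):

SchedulerParameters=bf_max_job_user=10,bf_resolution=3600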


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Problem with backfill and patch for solution

2013-03-01 Thread Magnus Jonsson


Hi!

We have a problem with backfill.

Jobs are not backfilled because the backfiller does not get through the
complete backlog of jobs in the queue before it is interrupted and starts
all over again. We sometimes have lots of jobs of various sizes and users
in the queue, and even with idle nodes short jobs will not start because
of this.


I have made a patch for backfill with a configuration option 
(bf_continue) to let backfill continue from the last JobID of the last 
cycle.


This will make backfill look at the whole queue eventually.
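
With the patch applied it is switched on via SchedulerParameters in
slurm.conf, for example (building on the parameters we already use):

SchedulerParameters=max_job_bf=2000,bf_window=20160,bf_continue=1

Setting bf_continue=0, or leaving it out, keeps the old behaviour.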

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff -r -u a/src/plugins/sched/backfill/backfill.c b/src/plugins/sched/backfill/backfill.c
--- a/src/plugins/sched/backfill/backfill.c	2013-02-05 23:59:05.0 +0100
+++ b/src/plugins/sched/backfill/backfill.c	2013-03-01 10:31:24.0 +0100
@@ -125,6 +125,7 @@
 static int backfill_window = BACKFILL_WINDOW;
 static int max_backfill_job_cnt = 50;
 static int max_backfill_job_per_user = 0;
+static bool backfill_continue = false;
 
 /*** local functions */
 static void _add_reservation(uint32_t start_time, uint32_t end_reserve,
@@ -410,6 +411,18 @@
 		  max_backfill_job_per_user);
 	}
 
+	/* bf_continue=true makes backfill continue where it was if interrupted
+	 */
+	if (sched_params && (strstr(sched_params, "bf_continue="))) {
+		if (strstr(sched_params, "bf_continue=1")) {
+			backfill_continue = true;
+		} else if (strstr(sched_params, "bf_continue=0")) {
+			backfill_continue = false;
+		} else {
+			fatal("Invalid bf_continue (use only 0 or 1)");
+		}
+	}
+
 	xfree(sched_params);
 }
 
@@ -530,6 +543,8 @@
 	uint32_t *uid = NULL, nuser = 0;
 	uint16_t *njobs = NULL;
 	bool already_counted;
+	static uint32_t last_job_id=0;
+	bool last_job_id_found = false;
 
 #ifdef HAVE_CRAY
 	/*
@@ -597,12 +612,33 @@
 		uid = xmalloc(BF_MAX_USERS * sizeof(uint32_t));
 		njobs = xmalloc(BF_MAX_USERS * sizeof(uint16_t));
 	}
+	/*
+	 * Reset last_job_id if not using bf_continue
+	 */
+	if (!backfill_continue) {
+		last_job_id = 0;
+	}
+	if (last_job_id == 0) {
+		last_job_id_found = true;
+	}
 	while ((job_queue_rec = (job_queue_rec_t *)
 list_pop_bottom(job_queue, sort_job_queue2))) {
 		job_test_count++;
 		job_ptr  = job_queue_rec->job_ptr;
 		part_ptr = job_queue_rec->part_ptr;
 		xfree(job_queue_rec);
+
+		/*
+		 * Skip job checked last time
+		 */
+		if (backfill_continue && !last_job_id_found) {
+			if (last_job_id == job_ptr->job_id) {
+last_job_id_found = true;
+last_job_id = 0;
+			}
+			continue;
+		}
+
 		if (!IS_JOB_PENDING(job_ptr))
 			continue;	/* started in other partition */
 		job_ptr->part_ptr = part_ptr;
@@ -783,6 +819,10 @@
 	 "breaking out after testing %d "
 	 "jobs", job_test_count);
 }
+/*
+ * Save last JobID for next turn
+ */
+last_job_id = job_ptr->job_id;
 rc = 1;
 break;
 			}
@@ -865,6 +905,10 @@
 
 		if (node_space_recs >= max_backfill_job_cnt) {
 			/* Already have too many jobs to deal with */
+			/*
+			 * Save last JobID for next turn
+			 */
+			last_job_id = job_ptr->job_id;
 			break;
 		}
 
@@ -890,6 +934,15 @@
 		if (debug_flags & DEBUG_FLAG_BACKFILL)
 			_dump_node_space_table(node_space);
 	}
+
+	/*
+	 * Reset last_job_id pointer if reached end of queue
+	 * without finding anything to do
+	 */
+	if (!last_job_id_found) {
+		debug("backfill: last_job_id=%d (reached end of queue without finding old job)",last_job_id);
+		last_job_id = 0;
+	}
 	xfree(uid);
 	xfree(njobs);
 	FREE_NULL_BITMAP(avail_bitmap);




[slurm-dev] Buffer overflow bug + patch.

2013-02-22 Thread Magnus Jonsson

Hi!

I just found a bug in Slurm that causes a buffer overflow when you run
'scontrol show config'.


Patch attached to fix the problem.

/Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff --git a/src/common/slurm_protocol_defs.c b/src/common/slurm_protocol_defs.c
index adf48e5..45ed46c 100644
--- a/src/common/slurm_protocol_defs.c
+++ b/src/common/slurm_protocol_defs.c
@@ -1163,7 +1163,7 @@ extern uint16_t log_string2num(char *name)
  * NOTE: Not reentrant */
 extern char *sched_param_type_string(uint16_t select_type_param)
 {
-	static char select_str[128];
+	static char select_str[64];
 
 	select_str[0] = '\0';
 	if ((select_type_param & CR_CPU) &&




[slurm-dev] Re: task/affinity, --cpu_bind=socket and -c > 1

2013-02-18 Thread Magnus Jonsson

Hi!

For us this is in most cases a bad default behaviour, and I have not
found a way to set a different default value either (other than changing
the code and recompiling).


One other thing that I'm still curious about: when will the default:
case of the switch statement in my first mail (see below) ever be reached?


task_dist seems to be initialized to SLURM_DIST_CYCLIC (1), and all the
other values of task_dist that reach this point are covered by explicit
cases in the switch.


I have not been able to reach it and activate the
_task_layout_lllp_multi() function.


/Magnus

On 2013-02-15 19:52, martin.pe...@bull.com wrote:

Assuming you're using the default allocation and distribution methods,
the behavior you describe sounds correct.  Available cpus will be
selected cyclically across the sockets for allocation to the job.
  Allocated cpus will be selected cyclically across the sockets for
distribution to tasks for binding.  And each task will be bound to all
of the allocated cpus on each socket from which a cpu was distributed to
it. For -n 8 -c 6, I would expect each of the 8 tasks to be bound to 36
cpus (6 cpus on each of 6 sockets).

See the CPU Management Guide in the Slurm documentation for more info.
  Examples 11 thru 13 illustrate socket binding.

Martin Perry
Bull Phoenix



From: Moe Jette 
To: "slurm-dev" ,
Date: 02/15/2013 10:33 AM
Subject: [slurm-dev] Re: task/affinity, --cpu_bind=socket and -c > 1





Have you tried the --ntasks-per-socket or --ntasks-per-core options?

Quoting Magnus Jonsson :

 > Hi!
 >
 > I have noticed strange behaviour in the task/affinity plugin if I
 > use --cpu_bind=socket and -c > 1.
 >
 > My task are distributed one on each socket (I have 8) and if I say
 > -c 6 six of my sockets are allocated to my first task. If I have 8
 > tasks each task get 6 of the 8 sockets.
 >
 > This sounds like a bad behaviour but is might be as design?
 >
 > I have traced it down to the lllp_distribution() function in
 > task/affinity/dist_task.c
 >
 > In this switch statement:
 >
 >  switch (req->task_dist) {
 >  case SLURM_DIST_BLOCK_BLOCK:
 >  case SLURM_DIST_CYCLIC_BLOCK:
 >  case SLURM_DIST_PLANE:
 > /* tasks are distributed in blocks within a plane */
 > rc = _task_layout_lllp_block(req, node_id, &masks);
 > break;
 >  case SLURM_DIST_CYCLIC:
 >  case SLURM_DIST_BLOCK:
 >  case SLURM_DIST_CYCLIC_CYCLIC:
 >  case SLURM_DIST_BLOCK_CYCLIC:
 > rc = _task_layout_lllp_cyclic(req, node_id, &masks);
 > break;
 >  default:
 > if (req->cpus_per_task > 1)
 >  rc = _task_layout_lllp_multi(req, node_id, &masks);
 > else
 >  rc = _task_layout_lllp_cyclic(req, node_id, &masks);
 > req->task_dist = SLURM_DIST_BLOCK_CYCLIC;
 > break;
 >  }
 >
 > in the default block there is a diffrent function called if
 > cpus_per_task > 1. Should the cyclic block be the same as the
 > default block?
 >
 > Or should SLURM_DIST_CYCLIC, SLURM_DIST_BLOCK be the same as default?
 >
 > Best regards,
 > Magnus
 >
 > --
 > Magnus Jonsson, Developer, HPC2N, Umeå Universitet
 >
 >





--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: slurmctld prolog delays job start

2013-02-18 Thread Magnus Jonsson

Any news on this?

/Magnus

On 2013-02-06 02:24, Michael Gutteridge wrote:


We have a prolog that the slurm controller runs (pretty
straightforward, just sets up some temporary directories).  However,
since upgrading from 2.3.5 to 2.5.1 we've got a situation where having
any slurmctld prolog configured causes long delays (60-120s)  between
when slurmctl allocates resources and starts the job. It seems to
occur in both srun and sbatch submitted jobs, though with different
symptoms.

I've distilled to a very generic config, using the FIFO scheduler to
eliminate any of that.  I've also reduced the prolog to a two-line
script:

#!/bin/bash
exit 0

The slurmctld.log has this:

[2013-02-05T15:26:27-08:00] debug2: Processing RPC:
REQUEST_SUBMIT_BATCH_JOB from uid=5
[2013-02-05T15:26:27-08:00] debug3: JobDesc: user_id=5 job_id=-1
partition=(null) name=sleeper.sh
[2013-02-05T15:26:27-08:00] debug3:cpus=1-4294967294 pn_min_cpus=-1

 snip

[2013-02-05T15:26:27-08:00] debug2: found 5 usable nodes from config
containing puck[2-6]
[2013-02-05T15:26:27-08:00] debug3: _pick_best_nodes: job 29
idle_nodes 4 share_nodes 5
[2013-02-05T15:26:27-08:00] debug2: select_p_job_test for job 29
[2013-02-05T15:26:27-08:00] debug2: sched: JobId=29 allocated
resources: NodeList=(null)
[2013-02-05T15:26:27-08:00] _slurm_rpc_submit_batch_job JobId=29 usec=1359
[2013-02-05T15:26:27-08:00] debug:  sched: Running job scheduler
[2013-02-05T15:26:27-08:00] debug2: found 5 usable nodes from config
containing puck[2-6]
[2013-02-05T15:26:27-08:00] debug3: _pick_best_nodes: job 29
idle_nodes 4 share_nodes 5
[2013-02-05T15:26:27-08:00] debug2: select_p_job_test for job 29
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: node[0]:
required cpus: 1, min req boards: 1,
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: node[0]: min
req sockets: 1, min avail cores: 7
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: using node[0]:
board[0]: socket[1]: 3 cores available
[2013-02-05T15:26:27-08:00] debug3: cons_res: _add_job_to_res: job 29 act 0
[2013-02-05T15:26:27-08:00] debug3: cons_res: adding job 29 to part campus row 0
[2013-02-05T15:26:27-08:00] debug3: sched: JobId=29 initiated
[2013-02-05T15:26:27-08:00] sched: Allocate JobId=29 NodeList=puck2 #CPUs=1
[2013-02-05T15:26:27-08:00] debug3: Writing job id 29 to header record
of job_state file
[2013-02-05T15:26:27-08:00] debug2: prolog_slurmctld job 29 prolog completed

The job shows running, but there are not processes running on the
allocated node (puck2 in this case).  In the allocated node's
slurmd.log there's nothing (despite running with 3 "v" flags).  A
little while later:

[2013-02-05T15:27:27-08:00] error: agent waited too long for nodes to
respond, sending batch request anyway...
[2013-02-05T15:27:27-08:00] Job 29 launch delayed by 60 secs, updating end_time
[2013-02-05T15:27:27-08:00] debug2: Spawning RPC agent for msg_type 4005
[2013-02-05T15:27:27-08:00] debug2: got 1 threads to send out
[2013-02-05T15:27:27-08:00] debug2: Tree head got back 0 looking for 1
[2013-02-05T15:27:27-08:00] debug3: Tree sending to puck2
[2013-02-05T15:27:27-08:00] debug2: Tree head got back 1
[2013-02-05T15:27:27-08:00] debug2: Tree head got them all
[2013-02-05T15:27:27-08:00] Node puck2 now responding
[2013-02-05T15:27:27-08:00] debug2: node_did_resp puck2

and on the allocated node, slurmd.log comes to life:

[2013-02-05T15:27:27-08:00] debug2: got this type of message 4005
[2013-02-05T15:27:27-08:00] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2013-02-05T15:27:27-08:00] debug:  task_slurmd_batch_request: 29
[2013-02-05T15:27:27-08:00] debug:  Calling /usr/sbin/slurmstepd spank prolog
[2013-02-05T15:27:27-08:00] Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
[2013-02-05T15:27:27-08:00] Running spank/prolog for jobid [29] uid [34152]
[2013-02-05T15:27:27-08:00] spank: opening plugin stack
/etc/slurm-llnl/plugstack.conf
[2013-02-05T15:27:27-08:00] spank: /usr/lib64/slurm-llnl/use-env.so:
no callbacks in this context
[2013-02-05T15:27:27-08:00] Launching batch job 29 for UID 34152
[2013-02-05T15:27:27-08:00] debug level is 6.

and the task starts running.  Removing "PrologSlurmctld" eliminates
this delay, and the job starts immediately.  The fact that the delay
is exactly 60 is suspicious and makes me suspect a misconfiguration.
However, outside of the prolog configuration directive, the config is
straight out of the config generator.

Any pointers would be greatly appreciated- I'm out of ideas...

Thanks

Michael



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: task/affinity, --cpu_bind=socket and -c > 1

2013-02-18 Thread Magnus Jonsson

Hi!

This does not make a difference.

And judging by the man page, I don't think it should make one either.

/Magnus

On 2013-02-15 18:32, Moe Jette wrote:


Have you tried the --ntasks-per-socket or --ntasks-per-core options?

Quoting Magnus Jonsson :


Hi!

I have noticed strange behaviour in the task/affinity plugin if I
use --cpu_bind=socket and -c > 1.

My task are distributed one on each socket (I have 8) and if I say
-c 6 six of my sockets are allocated to my first task. If I have 8
tasks each task get 6 of the 8 sockets.

This sounds like a bad behaviour but is might be as design?

I have traced it down to the lllp_distribution() function in
task/affinity/dist_task.c

In this switch statement:

switch (req->task_dist) {
case SLURM_DIST_BLOCK_BLOCK:
case SLURM_DIST_CYCLIC_BLOCK:
case SLURM_DIST_PLANE:
/* tasks are distributed in blocks within a plane */
rc = _task_layout_lllp_block(req, node_id, &masks);
break;
case SLURM_DIST_CYCLIC:
case SLURM_DIST_BLOCK:
case SLURM_DIST_CYCLIC_CYCLIC:
case SLURM_DIST_BLOCK_CYCLIC:
rc = _task_layout_lllp_cyclic(req, node_id, &masks);
break;
default:
if (req->cpus_per_task > 1)
rc = _task_layout_lllp_multi(req, node_id, &masks);
else
rc = _task_layout_lllp_cyclic(req, node_id, &masks);
req->task_dist = SLURM_DIST_BLOCK_CYCLIC;
break;
}

in the default block there is a diffrent function called if
cpus_per_task > 1. Should the cyclic block be the same as the
default block?

Or should SLURM_DIST_CYCLIC, SLURM_DIST_BLOCK be the same as default?

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet






--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] task/affinity, --cpu_bind=socket and -c > 1

2013-02-15 Thread Magnus Jonsson

Hi!

I have noticed strange behaviour in the task/affinity plugin if I use 
--cpu_bind=socket and -c > 1.


My tasks are distributed one per socket (I have 8), and if I say -c 6,
six of my sockets are allocated to my first task. If I have 8 tasks, each
task gets 6 of the 8 sockets.

This sounds like bad behaviour, but maybe it is by design?
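
For concreteness, the kind of job that triggers this on a node with 8
sockets and 6 cores per socket would be something like (./my_app is just
a placeholder):

srun -n8 -c6 --cpu_bind=socket ./my_app

where each of the 8 tasks ends up bound to CPUs on 6 of the 8 sockets
instead of to a single socket.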

I have traced it down to the lllp_distribution() function in 
task/affinity/dist_task.c


In this switch statement:

switch (req->task_dist) {
case SLURM_DIST_BLOCK_BLOCK:
case SLURM_DIST_CYCLIC_BLOCK:
case SLURM_DIST_PLANE:
/* tasks are distributed in blocks within a plane */
rc = _task_layout_lllp_block(req, node_id, &masks);
break;
case SLURM_DIST_CYCLIC:
case SLURM_DIST_BLOCK:
case SLURM_DIST_CYCLIC_CYCLIC:
case SLURM_DIST_BLOCK_CYCLIC:
rc = _task_layout_lllp_cyclic(req, node_id, &masks);
break;
default:
if (req->cpus_per_task > 1)
rc = _task_layout_lllp_multi(req, node_id, &masks);
else
rc = _task_layout_lllp_cyclic(req, node_id, &masks);
req->task_dist = SLURM_DIST_BLOCK_CYCLIC;
break;
}

in the default block a different function is called if cpus_per_task > 1.
Should the cyclic block be the same as the default block?


Or should SLURM_DIST_CYCLIC, SLURM_DIST_BLOCK be the same as default?

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Preemption bug

2013-02-12 Thread Magnus Jonsson
es: _can_job_run_on_node: 48 cpus on 
t-cn1033(0), mem 0/129000
[2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1 b=0 
e=0 r=-1
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass - idle 
resources found

[2013-02-12T14:54:48+01:00] no job_resources info for job 241
[2013-02-12T14:54:48+01:00] debug2: Testing job time limits and checkpoints

8<---
--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
#
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-kvm
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=YES
MailProg=/usr/bin/mail 
MpiDefault=openmpi
MpiParams=ports=12000-12999
ProctrackType=proctrack/cgroup
PropagateResourceLimitsExcept=CPU,MEMLOCK 
ReturnToService=1
SlurmctldPort=6817
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup,task/affinity
TmpFs=/scratch
UsePAM=1
HealthCheckInterval=3600
HealthCheckProgram=/var/conf/slurm/hpc2n-healthcheck
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=60
# SCHEDULING 
DefMemPerCPU=2500
FastSchedule=2
MaxMemPerCPU=2500
SchedulerType=sched/backfill
SchedulerParameters=max_job_bf=2000,bf_window=20160,default_queue_depth=2000
#
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK
# JOB PRIORITY 
PriorityType=priority/multifactor
PriorityDecayHalfLife=50-0
PriorityWeightFairshare=100 
PriorityWeightPartition=1
# LOGGING AND ACCOUNTING 
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=slurm-kvm
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=slurmtestcluster
DebugFlags=CPU_Bind
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=7
SlurmdDebug=7
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
 
# COMPUTE NODES 
# DEVEL
NodeName=t-cn[1033-1034] RealMemory=129000 Sockets=8 CoresPerSocket=6

# Partition Configurations
PartitionName=devel  Nodes=t-cn103[3,4] Default=YES DefaultTime=30:00 MaxTime=5-0 Priority=30 PreemptMode=OFF
PartitionName=core   Nodes=t-cn103[3,4] DefaultTime=30:00 MaxTime=5-0 Priority=20 PreemptMode=OFF
PartitionName=preemp Nodes=t-cn103[3,4] Priority=10 PreemptMode=CANCEL GraceTime=15
 
PreemptType=preempt/partition_prio
PreemptMode=CANCEL
#!/bin/bash
#SBATCH -p devel
#SBATCH --time=05:00:00
#SBATCH -N1
#SBATCH --exclusive

srun -n1 ./job.pl
#!/bin/bash
#SBATCH -p devel
#SBATCH --time=01:00:00
#SBATCH -N2
#SBATCH --exclusive

srun -n1 ./job.pl
#!/bin/bash
#SBATCH -p preemp
#SBATCH --time=01:00:00
#SBATCH -N1
#SBATCH -n48

srun -n1 ./job.pl
#!/bin/bash
#SBATCH -p devel
#SBATCH --time=04:00:00
#SBATCH --signal USR1@60

#SBATCH -N1
#SBATCH -n48
# #SBATCH --exclusive

if [ "$SLURM_JOBID" = "" ]; then
echo "Using sbatch to submit job"
sbatch $0
exit 0
fi

srun -n1 ./job.pl




[slurm-dev] Re: Disable black hole nodes automatically

2013-02-08 Thread Magnus Jonsson
You could put a health check in the epilog of a job, so that after every
job the node is checked. If it's in bad shape you can drain or down it.
In the normal case with long-running jobs this should not be a problem,
and only one job will fail.
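
A minimal sketch of such an epilog check (the path, the node-name
detection and the failure test are only placeholders; adapt them to your
site):

#!/bin/bash
# Epilog sketch: after every job, verify that the software area is still
# there; if not, drain the node so no further jobs land on it.
SOFTWARE_AREA=/cluster/software        # placeholder path

if [ ! -d "$SOFTWARE_AREA" ]; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="software area missing"
fi
exit 0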


/Magnus

On 2013-02-08 10:11, Mario Kadastik wrote:


Hi,

I'm wondering if there's a way to detect a fast churn rate for a node. Last night we had 
one node lose the software area so all jobs that were scheduled failed within a few 
minutes (the jobs use wrappers that do health checking of environment so the job exit 
code was 0, the wrapper propagated the actual error code to the users software). We have 
a self test run by slurm every 5 minutes and it did detect the node failure, but before 
it could the node had "failed" hundreds of jobs in that 5 minute window. We 
assume most jobs would run for at least tens of minutes so if slurm sees a node churning 
through jobs in less than a minute it should disable the node. Is there any way to handle 
this beyond moving self test script execution up from 5 minutes to say every 30 seconds?

Thanks,

Mario Kadastik, PhD
Researcher

---
   "Physics is like sex, sure it may have practical reasons, but that's not why we 
do it"
  -- Richard P. Feynman



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Patch for partition based SelectType (CR_Socket/CR_Core).

2013-02-07 Thread Magnus Jonsson

That's okay.

Jobs from different partitions will blend on the same node as long as
there are cores available.

Because of CR_ALLOCATE_FULL_SOCKET there will never be non-allocated
cores that are reserved for jobs.

Why CR_ALLOCATE_FULL_SOCKET is not the default I don't understand, but I
guess there is some good historical reason for that.


/Magnus

On 2013-02-07 16:42, Aaron Knister wrote:


That's awesome! (How) does it handle the case of nodes in multiple partitions?

Sent from my iPhone

On Feb 7, 2013, at 8:24 AM, Magnus Jonsson  wrote:



Hi everybody!

Here attached is a patch that enables partition based SelectType (currently 
CR_Socket/CR_Core) in select/cons_res.

The patch requires that CR_ALLOCATE_FULL_SOCKET is enabled to work and also 
this patch from master branch: 
https://github.com/SchedMD/slurm/commit/cdf679d0158a246e7389a15b62f127e5142003fe

It should however be easy to change it to use the old #define if you want to.

We are currently testing this in our development system but will go into 
production later this spring based on needs from some of our users.

One thing that I noticed during the development of this is that if a new option 
is added to the slurm.conf that is not supported with an earlier version of 
slurm programs/libs that are compiled with the earlier version stops working 
due to complaining of errors in slurm.conf.

We have the CR_ALLOCATE_FULL_SOCKET patch in our production system and some 
programs linked with openmpi stop working for some of our users.

It might we wise to try require less reading of the slurm.conf from the core 
parts of slurm and try to put more reading/parsing of the config file from the 
plugins (and other modular parts of slurm).

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Patch for partition based SelectType (CR_Socket/CR_Core).

2013-02-07 Thread Magnus Jonsson


Hi everybody!

Here attached is a patch that enables partition based SelectType 
(currently CR_Socket/CR_Core) in select/cons_res.


The patch requires that CR_ALLOCATE_FULL_SOCKET is enabled to work and 
also this patch from master branch: 
https://github.com/SchedMD/slurm/commit/cdf679d0158a246e7389a15b62f127e5142003fe


It should however be easy to change it to use the old #define if you 
want to.
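
With the patch, the per-partition setting is given in slurm.conf along
these lines (the partition and node names are just examples; the global
SelectTypeParameters still needs CR_Socket or CR_Core plus
CR_ALLOCATE_FULL_SOCKET):

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_ALLOCATE_FULL_SOCKET
PartitionName=core Nodes=t-cn103[3,4] SelectType=CR_Core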


We are currently testing this in our development system but will go into 
production later this spring based on needs from some of our users.


One thing that I noticed during the development of this is that if a new
option is added to slurm.conf that is not supported by an earlier version
of Slurm, programs/libs that are compiled against the earlier version
stop working because they complain about errors in slurm.conf.

We have the CR_ALLOCATE_FULL_SOCKET patch in our production system, and
some programs linked with openmpi stopped working for some of our users.

It might be wise to require less reading of slurm.conf from the core
parts of Slurm and to move more of the reading/parsing of the config file
into the plugins (and other modular parts of Slurm).


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff --git a/src/common/read_config.c b/src/common/read_config.c
index 2a54f69..b8d981b 100644
--- a/src/common/read_config.c
+++ b/src/common/read_config.c
@@ -903,6 +903,7 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 		{"ReqResv", S_P_BOOLEAN}, /* YES or NO */
 		{"Shared", S_P_STRING}, /* YES, NO, or FORCE */
 		{"State", S_P_STRING}, /* UP, DOWN, INACTIVE or DRAIN */
+		{"SelectType", S_P_STRING}, /* CR_Socket, CR_Core */
 		{NULL}
 	};
 
@@ -1125,6 +1126,22 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 		} else
 			p->state_up = PARTITION_UP;
 
+		if (s_p_get_string(&tmp, "SelectType", tbl)) {
+			if (strncasecmp(tmp, "CR_Socket", 9) == 0)
+p->cr_type = CR_SOCKET;
+			else if (strncasecmp(tmp, "CR_Core", 7) == 0)
+p->cr_type = CR_CORE;
+			else {
+error("Bad value \"%s\" for SelectType", tmp);
+_destroy_partitionname(p);
+s_p_hashtbl_destroy(tbl);
+xfree(tmp);
+return -1;
+			}
+			xfree(tmp);
+		} else
+			p->cr_type = 0;
+
 		s_p_hashtbl_destroy(tbl);
 
 		*dest = (void *)p;
diff --git a/src/common/read_config.h b/src/common/read_config.h
index 7d017dc..a39a3c9 100644
--- a/src/common/read_config.h
+++ b/src/common/read_config.h
@@ -227,6 +227,7 @@ typedef struct slurm_conf_partition {
 	uint16_t state_up;	/* for states see PARTITION_* in slurm.h */
 	uint32_t total_nodes;	/* total number of nodes in the partition */
 	uint32_t total_cpus;	/* total number of cpus in the partition */
+	uint16_t cr_type;	/* Custom CR values for partition (if supported by select plugin) */
 } slurm_conf_partition_t;
 
 typedef struct slurm_conf_downnodes {
diff --git a/src/plugins/select/cons_res/select_cons_res.c b/src/plugins/select/cons_res/select_cons_res.c
index 364a683..142f252 100644
--- a/src/plugins/select/cons_res/select_cons_res.c
+++ b/src/plugins/select/cons_res/select_cons_res.c
@@ -1451,8 +1451,18 @@ static int _test_only(struct job_record *job_ptr, bitstr_t *bitmap,
 {
 	int rc;
 
+	uint16_t tmp_cr_type = cr_type;
+	if(job_ptr->part_ptr->cr_type) {
+		if( ( (cr_type & CR_SOCKET) || (cr_type & CR_CORE) ) && (cr_type & CR_ALLOCATE_FULL_SOCKET) ) {
+			tmp_cr_type &= ~(CR_SOCKET|CR_CORE);
+			tmp_cr_type |= job_ptr->part_ptr->cr_type;
+		} else {
+			info("cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core and CR_ALLOCATE_FULL_SOCKET");
+		}
+	}
+
 	rc = cr_job_test(job_ptr, bitmap, min_nodes, max_nodes, req_nodes,
-			 SELECT_MODE_TEST_ONLY, cr_type, job_node_req,
+			 SELECT_MODE_TEST_ONLY, tmp_cr_type, job_node_req,
 			 select_node_cnt, select_part_record,
 			 select_node_usage, NULL);
 	return rc;
@@ -1489,14 +1499,24 @@ static int _run_now(struct job_record *job_ptr, bitstr_t *bitmap,
 	bool remove_some_jobs = false;
 	uint16_t pass_count = 0;
 	uint16_t mode;
+	uint16_t tmp_cr_type = cr_type;
 
 	save_bitmap = bit_copy(bitmap);
 top:	orig_map = bit_copy(save_bitmap);
 	if (!orig_map)
 		fatal("bit_copy: malloc failure");
 
+	if(job_ptr->part_ptr->cr_type) {
+		if( ( (cr_type & CR_SOCKET) || (cr_type & CR_CORE) ) && (cr_type & CR_ALLOCATE_FULL_SOCKET) ) {
+			tmp_cr_type &= ~(CR_SOCKET|CR_CORE);
+			tmp_cr_type |= job_ptr->part_ptr->cr_type;
+		} else {
+			info("cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core and CR_ALLOCATE_FULL_SOCKET");
+		}
+	}
+
 	rc = cr_job_test(job_ptr, bitmap, min_nodes, max_nodes, req_nodes,
-			 SELECT_MODE_RUN_NOW, cr_type, job_node_req,
+			 SELECT_MODE_RUN_NOW, tmp_cr_type, job_node

[slurm-dev] task_affinity bug in 2.5.1 and after..

2013-02-01 Thread Magnus Jonsson


Hi!

We are in the process of upgrading to Slurm 2.5.2, but I just found a
bug in the task/affinity plugin in combination with cgroups.


The commit 
https://github.com/SchedMD/slurm/commit/791322349856e14a3d50aadc4869d40b034a2f37 
which solves some Power7 specific problems breaks task affinity together 
with cgroups on x86_64.


This code seems to have been introduced in Slurm 2.5.1.

From our slurm.conf:

TaskPlugin=task/cgroup,task/affinity

from slurmd.log:

[2013-02-01T13:39:12+01:00] [57] sched_setaffinity(12516,128,0x0) 
failed: Invalid argument

[2013-02-01T13:39:12+01:00] [57] sched_getaffinity(12516) = 0xff00

With cgroups activated we get the input cpuset 0xff00, which
translates into 0x0 in the reset_cpuset function.

If this is a Power7-specific problem, an #ifdef around it might be a
good way to avoid the problem on other platforms.


Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Is it possible to get hold of parameters from sbatch/salloc in a spank plugin?

2013-01-30 Thread Magnus Jonsson

Hi!

I have not succeeded in getting the parameters from the sbatch
command/submit script into my spank plugin.

If you have some example that shows this, it would make my life easier.

Best regards,
Magnus

On 2013-01-30 15:34, Karl Schulz wrote:


If you really do want to have control over providing arbitrary strings back to 
the user, then a spank plugin might also be a possibility.  We have used the 
slurm_spank_init_post_opt() callback as a mechanism to create a custom job 
submission filter for srun/sbatch.

It's nothing fancy, but it gives us a way to do some quick sanity checking and 
apply some local site requirements like: verifying user is not over any of 
their disk quotas, verifying user provided a max runlimit, additional ACLs for 
the queue's, maximum jobs per user, etc.  In this approach, stdout will be seen 
by the user and you can customize as desired.

On Jan 29, 2013, at 11:31 AM, Moe Jette  wrote:



I would suggest an job_submit plugin:

http://www.schedmd.com/slurmdocs/job_submit_plugins.html

There is no mechanism to return a string to the user, only an exit
code, but adding a few new exit codes would be simple (see
slurm/slurm_errno.h and src/common/slurm_errno.c). We have also
discussed adding a mechanism to return an arbitrary string to the
user, but this is not possible today.

Quoting Magnus Jonsson :


Hi!

I am looking for a way to look at users' submitted parameters and, if
they are using them in a "bad" way, inform them that this might not be a
good usage of the system and point them to documentation about how Slurm
works and how best to use it on our system.

I have tried different approaches but failed with every one.

Any hints?

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet






--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Bugs in CR_ALLOCATE_FULL_SOCKET.

2013-01-18 Thread Magnus Jonsson

I have CR_ALLOCATE_FULL_SOCKET working correctly on block allocation.

I will fix cyclic after the weekend and supply a patch.

Best regards,
Magnus

On 2013-01-18 16:00, Magnus Jonsson wrote:

This patch fixes the behaviour with allocating 2 cores instead of one
with --ntasks-per-socket=1.

/Magnus

On 2013-01-18 13:59, Magnus Jonsson wrote:

Hi!

I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird
behaviour.

Currently running git/master but have seen the same behaviour on 2.4.3
with the #define.

My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET



This is my submitscript (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job):

NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
  Nodes=t-cn1033 CPU_IDs=42-3 Mem=15000

If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job):

NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:*
  Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000

I think this is caused by how the ntasks-per-socket code is selecting
nodes in job_test.c of the cons_res-plugin.

I will look into the code and see if I can fix this some how otherwise I
can bug test patches.

I have a small part of our cluster available for testing right now
(2 nodes, 8 sockets/node, 6 cores/socket).

Best regards,
Magnus





--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Re: Bugs in CR_ALLOCATE_FULL_SOCKET.

2013-01-18 Thread Magnus Jonsson
This patch fixes the behaviour of allocating 2 cores instead of one
with --ntasks-per-socket=1.


/Magnus

On 2013-01-18 13:59, Magnus Jonsson wrote:

Hi!

I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird
behaviour.

Currently running git/master but have seen the same behaviour on 2.4.3
with the #define.

My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET


This is my submitscript (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job):

NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
  Nodes=t-cn1033 CPU_IDs=42-3 Mem=15000

If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job):

NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:*
  Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000

I think this is caused by how the ntasks-per-socket code is selecting
nodes in job_test.c of the cons_res-plugin.

I will look into the code and see if I can fix this some how otherwise I
can bug test patches.

I have a small part of our cluster available for testing right now
(2 nodes, 8 sockets/node, 6 cores/socket).

Best regards,
Magnus



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet
diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 60ec0b1..96b0dfa 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -310,7 +310,7 @@ uint16_t _allocate_sockets(struct job_record *job_ptr, bitstr_t *core_map,
 	 *  allocating cores
 	 */
 	cps = num_tasks;
-	if (ntasks_per_socket > 1) {
+	if (ntasks_per_socket >= 1) {
 		cps = ntasks_per_socket;
 		if (cpus_per_task > 1)
 			cps = ntasks_per_socket * cpus_per_task;




[slurm-dev] Re: Bugs in CR_ALLOCATE_FULL_SOCKET.

2013-01-18 Thread Magnus Jonsson

Err... Wrong...

On 2013-01-18 13:59, Magnus Jonsson wrote:

Hi!

I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird
behaviour.

Currently running git/master but have seen the same behaviour on 2.4.3
with the #define.

My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET


This is my submitscript (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job):

NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
  Nodes=t-cn1033 CPU_IDs=42-43 Mem=5000


This is the correct output (but wrong :-). Copy'n'paste is hard sometimes...


If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job):

NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:*
  Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000

I think this is caused by how the ntasks-per-socket code is selecting
nodes in job_test.c of the cons_res-plugin.

I will look into the code and see if I can fix this some how otherwise I
can bug test patches.

I have a small part of our cluster available for testing right now
(2 nodes, 8 sockets/node, 6 cores/socket).

Best regards,
Magnus



--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet





[slurm-dev] Bugs in CR_ALLOCATE_FULL_SOCKET.

2013-01-18 Thread Magnus Jonsson

Hi!

I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird 
behaviour.


Currently running git/master but have seen the same behaviour on 2.4.3 
with the #define.


My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET

This is my submitscript (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job):

   NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:*
 Nodes=t-cn1033 CPU_IDs=42-47 Mem=15000

If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job):

   NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:*
 Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000

I think this is caused by how the ntasks-per-socket code is selecting 
nodes in job_test.c of the cons_res-plugin.


I will look into the code and see if I can fix this somehow; otherwise I
can help test patches for the bug.


I have a small part of our cluster available for testing right now
(2 nodes, 8 sockets/node, 6 cores/socket).

Best regards,
Magnus

--
Magnus Jonsson, Developer, HPC2N, Umeå Universitet


