[slurm-dev] Re: Restrict access for a user group to certain nodes
Hi! You could either set up a partition for your tests with group restrictions, or you could use the reservation feature, depending on your exact use case. /Magnus On 2016-12-01 15:54, Felix Willenborg wrote: Dear everybody, I'd like to restrict submissions from a certain user group, or allow only one certain user group to submit jobs to certain nodes. Does Slurm offer groups which can handle such an occasion? It would be preferred if there is Linux user group support, because this would save the time of setting up a new user group environment. The intention is that only administrators can submit jobs to those nodes to perform some tests, which might otherwise be disturbed by users submitting their jobs to those nodes. Various search engines didn't offer answers to my question, which is why I'm writing to you here. Looking forward to some answers! Best, Felix Willenborg -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
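A minimal sketch of the partition approach (node names and the "admins" group are hypothetical); AllowGroups takes ordinary Linux/Unix groups, so no separate Slurm group setup is needed:

-8<-
# slurm.conf fragment
PartitionName=testing Nodes=node[01-02] AllowGroups=admins State=UP
-8<-

The reservation alternative is a one-liner, following the same long-duration pattern used elsewhere on this list:

-8<-
scontrol create reservation reservationname=admintest users=root \
    nodes=node[01-02] starttime=now duration=365-00:00:00
-8<-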
[slurm-dev] Re: sacct vs sacct -X
On 2016-03-23 16:17, Skouson, Gary B wrote: Yes, but why are we not getting the job information from sacct when we are running without -X in this case? The problem with running with -X is that we don't get all the cumulative statistics for the job; we are missing some of the information, like UserCPU. /Magnus The man page says: -X, --allocations Only show cumulative statistics for each job, not the intermediate steps. What's allocated to the job may not match the utilization of the job steps. - Gary Skouson -----Original Message----- From: Magnus Jonsson [mailto:mag...@hpc2n.umu.se] Sent: Wednesday, March 23, 2016 7:10 AM To: slurm-dev Subject: [slurm-dev] Re: sacct vs sacct -X The behaviour seems to be different in Slurm 15.08, at least:

sacct --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0
7364851.0      00:00:00          1          0

sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0

/Magnus On 2016-03-23 09:29, Magnus Jonsson wrote: Hi! From this simple example, could someone explain to me whether this is the expected behaviour or a bug?

$ srun -n1 --exclusive hostname
srun: job 4232239 queued and waiting for resources
srun: job 4232239 has been allocated resources
host0001.example.com

$ sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03         48        144

$ sacct --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03          1        144

We are currently running 14.03, but the same behaviour exists in 14.11 as well. I see that the TRES feature changes a lot of this in the 15+ releases, but does it change this behaviour (I don't have access to any 15.x cluster right now)? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: sacct vs sacct -X
The behaviour seems to be different in Slurm 15.08, at least:

sacct --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0
7364851.0      00:00:00          1          0

sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 7364851
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
7364851        00:00:00         16          0

/Magnus On 2016-03-23 09:29, Magnus Jonsson wrote: Hi! From this simple example, could someone explain to me whether this is the expected behaviour or a bug?

$ srun -n1 --exclusive hostname
srun: job 4232239 queued and waiting for resources
srun: job 4232239 has been allocated resources
host0001.example.com

$ sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03         48        144

$ sacct --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03          1        144

We are currently running 14.03, but the same behaviour exists in 14.11 as well. I see that the TRES feature changes a lot of this in the 15+ releases, but does it change this behaviour (I don't have access to any 15.x cluster right now)? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] sacct vs sacct -X
Hi! From this simple example, could someone explain to me whether this is the expected behaviour or a bug?

$ srun -n1 --exclusive hostname
srun: job 4232239 queued and waiting for resources
srun: job 4232239 has been allocated resources
host0001.example.com

$ sacct -X --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03         48        144

$ sacct --format=JobID,Elapsed,AllocCPUS,CPUTimeRaw -j 4232239
       JobID    Elapsed  AllocCPUS CPUTimeRAW
------------ ---------- ---------- ----------
4232239        00:00:03          1        144

We are currently running 14.03, but the same behaviour exists in 14.11 as well. I see that the TRES feature changes a lot of this in the 15+ releases, but does it change this behaviour (I don't have access to any 15.x cluster right now)? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
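As a sanity check of the numbers above: CPUTimeRaw should equal the elapsed time in seconds multiplied by AllocCPUS, which matches the allocation view:

-8<-
$ echo $((3 * 48))   # Elapsed (3 s) * AllocCPUS (48)
144                  # matches CPUTimeRAW in both listings
-8<-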
[slurm-dev] Re: As a user how can I re-order my job submissions
Hi! You could always use the "nice" feature to change the priority of your old jobs. It might be a little bit of work to write a script that sets the nice value of all your jobs, but nothing that some cut/grep/xargs can't fix ;-) /Magnus On 2015-08-28 15:48, Kumar, Amit wrote: Dear SLURM, If I am a regular user and imagine I have tons of jobs submitted, and then I come up with another batch of jobs that I want to run before the batch I submitted a few hours back, which is still in the queue waiting for resources and priority — is there a way to do this? From an admin perspective I wouldn't want this, because users could misuse this feature. But from a user perspective I could genuinely have some dependencies that I would like addressed before beginning my batch of thousands of jobs. Any help here is greatly appreciated. Regards, Amit -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
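A rough sketch of such a script, assuming you want to deprioritize all of your currently pending jobs (the nice value 100 is arbitrary):

-8<-
#!/bin/bash
# Raise the nice value (i.e. lower the priority) of all of my pending
# jobs, so that jobs submitted afterwards run first.
squeue -h -t PD -u "$USER" -o '%A' | \
    xargs -r -n1 -I{} scontrol update jobid={} nice=100
-8<-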
[slurm-dev] sbcast, prolog and SPANK.
Hi everybody. As some of you may know from my presentation in Lugano, we are using a SPANK plugin to give private /tmp directories to our users. One of our users was using the sbcast command to send files to the nodes in the allocation. This works badly, as the SPANK plugin is not used at all for sbcast. I'm unsure exactly which part of Slurm receives the data, how this is implemented, and whether SPANK should be involved at all, but the files do not show up where the user expects them to be. Is this solvable in any way with sbcast? For now we just recommended that the user use "srun cp ${PATH_TO_FILES}/* $TMPDIR/". This also has the side effect that the prolog on the node is not run until you actually send a job step to the node, i.e. you can send data to a node with sbcast before the prolog has run; this might not be expected/wanted behaviour. Best, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
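Spelled out, the recommended workaround looks something like this inside a job script (PATH_TO_FILES is the user's own variable; the one-task-per-node flags are an assumption so that every node gets a copy):

-8<-
#!/bin/bash
#SBATCH -N 2
# Copy the input files into the per-node private /tmp set up by the
# SPANK plugin; run exactly one copy task on each allocated node.
srun -N "$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
    cp "${PATH_TO_FILES}"/* "$TMPDIR/"
-8<-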
[slurm-dev] Re: prevent slurm from parsing the full script
A better approach would be to add to Slurm a "#SBATCH END-OF-OPTIONS" marker, or something similar, to mark the end of the sbatch options so that sbatch can stop parsing at that point. /Magnus On 2015-04-21 14:40, Andy Riebs wrote: Never mind; when I changed "#sbatch" to the correct "#SBATCH", I got 4 tasks. According to the man page, this is a bug. For now, I like Magnus's suggestion :-) On 04/21/2015 08:21 AM, Andy Riebs wrote: Hendryk, what sbatch command line options are you using? How are you determining that job 1 got 2 tasks? I just tried the following script, and it correctly ran just 1 task:

$ cat test.sh
#!/bin/bash
#SBATCH --ntasks=1
srun hostname
#sbatch --ntasks=4
## end of script
$ sbatch test.sh
Submitted batch job 18720
$ cat slurm-18720.out
node09
$

For further discussion on this topic, please 1. Reply to the whole list, not just me 2. Indicate what OS and Slurm versions you are using 3. Provide a copy of your slurm.conf file with any sensitive information, like node names or IP addresses, removed Andy On 04/21/2015 07:50 AM, Hendryk Bockelmann wrote: Hello, is there a way to prevent slurm from parsing the whole jobscript for #SBATCH statements? Assume I have the following jobscript "job1.sh":

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=job1
srun -l echo "slurm jobid $SLURM_JOB_ID named: $SLURM_JOB_NAME"
cat > job2.sh <

-- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: prevent slurm from parsing the full script
Hi! A simple solution would be to do:

SBATCH="#SBATCH"
cat << EOF
...
$SBATCH --nodes=1
EOF

/Magnus On 2015-04-21 13:50, Hendryk Bockelmann wrote: Hello, is there a way to prevent slurm from parsing the whole jobscript for #SBATCH statements? Assume I have the following jobscript "job1.sh":

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=job1
srun -l echo "slurm jobid $SLURM_JOB_ID named: $SLURM_JOB_NAME"
cat > job2.sh <

-- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
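A complete version of the trick as a hedged sketch (job2's options are made up): because $SBATCH is only expanded when job1.sh actually runs, sbatch never sees a literal "#SBATCH" line inside the here-document while scanning job1.sh.

-8<-
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=job1
SBATCH="#SBATCH"
cat > job2.sh <<EOF
#!/bin/bash
$SBATCH --nodes=1
$SBATCH --job-name=job2
srun hostname
EOF
sbatch job2.sh
-8<-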
[slurm-dev] A way of abuse the priority option in Slurm?
Hi! I just discovered a possible way for a user to abuse the priority in Slurm. This is the scenario: 1. A user has not run any jobs in a long time and therefore has a high fairshare priority, let's say 1. 2. The user submits 1000 jobs into the queue, which is far above his fairshare target. 3. The user changes the priority of his jobs (it's OK for a user to lower the priority of jobs as long as the user is the owner) to, let's say: (still a high priority; +-1 is in practice nothing) (scontrol update jobid=1 priority= 4. The user's jobs start and the fairshare priority drops. But here is the big _BUT_: the priority of the jobs with a manually changed priority does not seem to change, leaving the user's jobs at maximum priority until all of the jobs have completed. Have I missed something in this scenario? If this is true, what do we do about it? Should users be able to change the priority at all? The user can use the 'nice' option to alter the priority of a job within a small limit, which does not alter the priority as described above. Please let me be wrong :-) /Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
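For what it's worth, a purely illustrative one-liner an admin could use to spot jobs boosted this way — list pending jobs sorted by raw priority:

-8<-
$ squeue -t PD -o '%A %u %Q' | sort -k3 -nr | head
-8<-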
[slurm-dev] Re: Slurm restart count in SPANK
Hi Aaron, From my SPANK code: spank_get_item(sp, S_SLURM_RESTART_COUNT, &restartcount) The S_SLURM_RESTART_COUNT "item" was added to the plugstack on my request/patch. But thanks for the concern :-) Best regards, Magnus On 2015-02-27 14:44, Aaron Knister wrote: Hi Magnus, While I can't tell you OTTOMH why the behavior changed, I can suggest a different, perhaps more SPANK-y way to do that. From within your SPANK function(s), use the spank_get_item call to get the restart count:

int restart_count;
// sp is the spank_t argument to your SPANK function
spank_get_item(sp, S_SLURM_RESTART_COUNT, &restart_count);

Hope that helps! Sent from my iPhone On Feb 27, 2015, at 8:14 AM, Magnus Jonsson wrote: It seems that the restart count in SPANK (prolog) is missing in recent versions of Slurm. It always returns 0, even if the job has restarted. It also seems that the "SLURM_RESTART_COUNT" environment variable is missing in the epilog script (might be related). I'm not sure when this was changed, but I'm pretty sure it worked in 2.6 (it was when we developed our tmpdir SPANK plugin). "SLURM_RESTART_COUNT" is available in the job user environment. /Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
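For context, a minimal self-contained SPANK plugin using the item, assuming a Slurm build that includes S_SLURM_RESTART_COUNT (the plugin name is made up):

-8<-
/* restart_demo.c - log the job restart count from a SPANK callback. */
#include <slurm/spank.h>

SPANK_PLUGIN(restart_demo, 1);

int slurm_spank_task_init(spank_t sp, int ac, char **av)
{
	uint32_t restartcount = 0;

	/* Returns ESPANK_SUCCESS when the item is known and filled in. */
	if (spank_get_item(sp, S_SLURM_RESTART_COUNT, &restartcount)
	    == ESPANK_SUCCESS)
		slurm_info("job restart count: %u", restartcount);
	return 0;
}
-8<-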
[slurm-dev] Slurm restart count in SPANK
It seems that the restart count in SPANK (prolog) is missing in recent versions of Slurm. It always returns 0, even if the job has restarted. It also seems that the "SLURM_RESTART_COUNT" environment variable is missing in the epilog script (might be related). I'm not sure when this was changed, but I'm pretty sure it worked in 2.6 (it was when we developed our tmpdir SPANK plugin). "SLURM_RESTART_COUNT" is available in the job user environment. /Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Two patches for jobacct_gather.
Hi! I have attached two patches for the jobacct_gather plugin (common). The first uses Proportional Set Size (PSS) instead of RSS to determine the memory footprint of a job. More information about PSS can be found here: http://lwn.net/Articles/230975/ Gathering the PSS information is a little more complicated (and CPU intensive) than just reading the RSS value, and might be a problem for some applications. We have a subset of jobs that load the data set in the first process and then just fork() once per available core and compute on the data set in parallel. This makes the summed RSS value go sky high, as Slurm calculates the sum of the RSS values of all processes in the job, and Slurm then kills the job :-( The second patch adds an option not to kill jobs that are over the memory limit. This works well for us, since we have working cgroup memory limits. Best regards, Magnus Jonsson -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 29f730d..cb85598 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1046,6 +1046,9 @@ Exclude shared memory from accounting.
 .TP
 \fBUsePss\fR
 Use PSS value instead of RSS (saved as RSS) to calculate real usage of memory.
+.TP
+\fBNoOverMemoryKill\fR
+Do not kill processes that use more than the requested memory, but still do JobAcctGather accounting.
 .RE

 .TP
diff --git a/src/plugins/jobacct_gather/common/common_jag.c b/src/plugins/jobacct_gather/common/common_jag.c
index b6204d6..36864b0 100644
--- a/src/plugins/jobacct_gather/common/common_jag.c
+++ b/src/plugins/jobacct_gather/common/common_jag.c
@@ -671,6 +671,7 @@ extern void jag_common_poll_data(
 	char sbuf[72];
 	int energy_counted = 0;
 	static int first = 1;
+	static int no_over_memory_kill = -1;

 	xassert(callbacks);
@@ -685,6 +686,15 @@
 	}
 	processing = 1;

+	if (no_over_memory_kill == -1) {
+		char *acct_params = slurm_get_jobacct_gather_params();
+		if (acct_params && strstr(acct_params, "NoOverMemoryKill"))
+			no_over_memory_kill = 1;
+		else
+			no_over_memory_kill = 0;
+		xfree(acct_params);
+	}
+
 	if (!callbacks->get_precs)
 		callbacks->get_precs = _get_precs;
@@ -783,7 +793,9 @@
 	}
 	list_iterator_destroy(itr);

-	jobacct_gather_handle_mem_limit(total_job_mem, total_job_vsize);
+	if (!no_over_memory_kill) {
+		jobacct_gather_handle_mem_limit(total_job_mem, total_job_vsize);
+	}

 finished:
 	list_destroy(prec_list);
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index ee7674b..29f730d 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1043,6 +1043,9 @@ Acceptable values at present include:
 .TP 20
 \fBNoShared\fR
 Exclude shared memory from accounting.
+.TP
+\fBUsePss\fR
+Use PSS value instead of RSS (saved as RSS) to calculate real usage of memory.
 .RE

 .TP
diff --git a/src/plugins/jobacct_gather/common/common_jag.c b/src/plugins/jobacct_gather/common/common_jag.c
index 84b6775..b6204d6 100644
--- a/src/plugins/jobacct_gather/common/common_jag.c
+++ b/src/plugins/jobacct_gather/common/common_jag.c
@@ -95,6 +95,46 @@ static char *_skipdot (char *str)
 	return str;
 }

+/*
+ * Collects the Pss values from /proc/<pid>/smaps
+ */
+static int _get_pss(char *proc_smaps_file, jag_prec_t *prec) {
+	uint64_t pss = 0;
+	char line[128];
+
+	FILE *fp = fopen(proc_smaps_file, "r");
+	if (!fp) {
+		return -1;
+	}
+	fcntl(fileno(fp), F_SETFD, FD_CLOEXEC);
+	while (fgets(line, sizeof(line), fp)) {
+		if (strncmp(line, "Pss:", 4)) {
+			continue;
+		}
+		/* skip to the first digit of the Pss value and add it up */
+		int i = 4;
+		for (; i < sizeof(line) && !isdigit(line[i]); i++)
+			;
+		if (i < sizeof(line))
+			pss += strtoull(&line[i], NULL, 10);
+	}
+	fclose(fp);
+	/* only replace RSS when a smaller PSS sum was found */
+	if (pss > 0 && prec->rss > pss) {
+		prec->rss = pss;
+	}
+	return 0;
+}
+
 static int _get_sys_interface_freq_line(uint32_t cpu, char *filename,
 					char *sbuf)
 {
@@ -359,10 +399,11 @@ static int _get_process_io_data_line(int in, jag_prec_t *prec) {
 	return 1;
 }

-static void _handle_stats(List prec_list, char *proc_stat_file,
-			  char *proc_io_file, jag_callbacks_t *callbacks)
+static void _handle_stats(List prec_list, char *proc_stat_file, char *proc_io_file,
+			  char *proc_smaps_file, jag_callbacks_t *callbacks)
 {
 	static int no_share_data = -1;
+	static int use_pss = -1;
 	FILE *stat_fp = NULL;
 	FILE *io_fp = NULL;
 	int fd, fd2;
@@ -374,6 +415,11 @@
 			no_share_data = 1;
 		else
 			no_share_data = 0;
+
+		if (acct_params && strstr(acct_params, "UsePss"))
+			use_pss = 1;
+		else
+			use_pss = 0;
 		xfree(acct_params);
 	}
@@ -393,22 +439,35 @@
 	fcntl(fd, F_SETFD, FD_CLOEXEC);
 	prec = xmalloc(sizeof(jag_prec_t));
-	if (_get_process_data_line(fd, prec)) {
-		if (no_sha
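With both patches applied, the new behaviour would be enabled in slurm.conf roughly like this (option names as defined by the patches above):

-8<-
# slurm.conf fragment
JobAcctGatherParams=UsePss,NoOverMemoryKill
-8<-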
[slurm-dev] Re: Job on wrong node
It would be nice to eliminate most of the slurm.conf on the nodes. Most of the information could just as easily be fetched (or is not needed at all) from the slurmctld on the master node. An API to make a call to the master node and fetch configuration options could eliminate the need for NO_CONF_HASH :-) All that should be needed is a slim slurm.conf with information about where the slurmctld lives (and how to contact it (munge/...)). /Magnus On 2015-02-04 20:54, Danny Auble wrote: On 02/04/2015 11:23 AM, Ulf Markwardt wrote: DebugFlags=NO_CONF_HASH But we do have different slurm.conf files due to different energy sensors, prolog/epilog scripts. The NO_CONF_HASH is very dangerous in most systems. It should be avoided at all cost. It is interesting that you have different sensors per node; I could understand in this case having NO_CONF_HASH set. We are thinking of adding a new kind of slurm.conf include that doesn't get added to the hash, in which you could put node-specific information like this, and then you could remove the NO_CONF_HASH. You might be able to get around the pro/epilog issue by having a master pro/epilog that in turn calls different ones depending on the node. Adding the new file would also eliminate this issue as well. This doesn't exist today, but is being thought about. I am guessing the slurm.conf files on your nodes may be in sync, but perhaps the slurmd on the troubled nodes may be running with an old version. All show slurm 14.11.3 I meant an older version of the file, not Slurm :). With NO_CONF_HASH set there isn't a really good way to verify that the slurmds are all running the same slurm.conf. I would suggest issuing a "scontrol shutdown" and then restarting all your nodes and your controller. If you still see the problem after that, then indeed something else is the matter. Perhaps routing tables or something else. U -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: Changed behaviour of --exclusive in srun (job step context)
Is no one else affected by this? /Magnus On 2014-09-11 14:46, Magnus Jonsson wrote: Hi! A user found a "strange" new behaviour when using --exclusive with srun. I have an example submit script [1] that shows this. I have tested this on 2.6.4, with the output in [2] & [3] (stderr), and on 14.03.7, with the output in [4] & [5] (stderr). In 14.03.7, srun without --exclusive behaves like 2.6.4 with --exclusive. In 14.03.7 with --exclusive, you get some kind of "node exclusive" behaviour within the job. --overcommit gives the same behaviour on both versions. In 14.03.7, -c3 does not seem to work at all in the job step context. I see warnings about this in the man page for srun, but in 2.6.4 this works as I expect. Stderr output from srun: "srun: error: Unable to create job step: Requested node configuration is not available" If you need more information, please let me know. Best regards, Magnus 1, http://www.hpc2n.umu.se/staff/magnus/slurm/submit.sh 2, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.2.6.4 3, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.2.6.4 4, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.14.03.7 5, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.14.03.7 -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Changed behaviour of --exclusive in srun (job step context)
Hi! A user found a "strange" new behaviour when using --exclusive with srun. I have an example submit script [1] that shows this. I have tested this on 2.6.4, with the output in [2] & [3] (stderr), and on 14.03.7, with the output in [4] & [5] (stderr). In 14.03.7, srun without --exclusive behaves like 2.6.4 with --exclusive. In 14.03.7 with --exclusive, you get some kind of "node exclusive" behaviour within the job. --overcommit gives the same behaviour on both versions. In 14.03.7, -c3 does not seem to work at all in the job step context. I see warnings about this in the man page for srun, but in 2.6.4 this works as I expect. Stderr output from srun: "srun: error: Unable to create job step: Requested node configuration is not available" If you need more information, please let me know. Best regards, Magnus 1, http://www.hpc2n.umu.se/staff/magnus/slurm/submit.sh 2, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.2.6.4 3, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.2.6.4 4, http://www.hpc2n.umu.se/staff/magnus/slurm/stdout.14.03.7 5, http://www.hpc2n.umu.se/staff/magnus/slurm/stderr.14.03.7 -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
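The linked submit script is not reproduced here; a minimal script in the same spirit (step counts are made up) shows the pattern whose behaviour differs between 2.6.4 and 14.03.7, per the outputs linked above:

8<
#!/bin/bash
#SBATCH -n 4
# Run several exclusive job steps concurrently inside the allocation;
# compare the step placement and srun stderr between the two versions.
for i in 1 2 3 4; do
    srun -n1 --exclusive sleep 10 &
done
wait
8<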
[slurm-dev] Re: Killing the backfill...
On 2014-05-20 14:54, Tommi T wrote: On Tuesday, May 20, 2014 1:51 PM, Magnus Jonsson wrote: Hi! While investigating another matter I found that if you have lots of jobs running with short job steps, they kill the backfill very effectively. Hi, Do you use the bf_continue flag? http://slurm.schedmd.com/sched_config.html Yes and no. I implemented the first version of bf_continue, but it was while debugging some strange behaviour of bf_continue that I started looking more closely into what exactly caused last_job_update to be updated all the time. I will return with more information about my bf_continue findings. /Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Killing the backfill...
Hi! While investigating another matter I found that if you have lots of jobs running with short job steps, they kill the backfill very effectively. Since every action on a job step modifies the last_job_update global variable, this effectively stops the backfill loop. This can be demonstrated very simply with this batch script on a system with some jobs in the queue:

8<
#!/bin/bash
for n in `seq 120`; do
    srun sleep 1
done
8<

In the 2.6.7 version I can only find a few places where last_job_update is used, and only one that is directly related to job steps. Is there a need to update last_job_update for every action on a job step? Should there be a last_job_step_update as well? Are there job step actions that affect the queue? Could there be another variable that could be used to trigger a rescheduling of the queue, based on events that actually affect the scheduling of the queue? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Change node weight based on partition? QOS? other?
Hi! We have a scenario where we would like to use the node weight feature in Slurm to pack one group of jobs onto one half of the machine and other jobs onto the other part, though some degree of overlap is OK. Is there a way of altering the node weight for one job, via a partition or via a QOS? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
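For reference, the node weight feature referred to here is a static per-node setting today; something like the following, with hypothetical node names (lower-weight nodes are allocated first):

-8<-
# slurm.conf fragment
NodeName=node[01-10] Weight=10    # preferred half for the packed group
NodeName=node[11-20] Weight=100   # used once the first half is busy
-8<-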
[slurm-dev] Added spank_item.
I have made a patch for SPANK to allow fetching the SLURM_RESTART_COUNT in my SPANK plugin. The patch is attached (against 2.6.6). Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff a/slurm/spank.h b/slurm/spank.h
--- a/slurm/spank.h
+++ b/slurm/spank.h
@@ -169,7 +169,8 @@ enum spank_item {
     S_JOB_ALLOC_CORES,    /* Job allocated cores in list format (char **) */
     S_JOB_ALLOC_MEM,      /* Job allocated memory in MB (uint32_t *) */
     S_STEP_ALLOC_CORES,   /* Step alloc'd cores in list format (char **) */
-    S_STEP_ALLOC_MEM      /* Step alloc'd memory in MB (uint32_t *) */
+    S_STEP_ALLOC_MEM,     /* Step alloc'd memory in MB (uint32_t *) */
+    S_SLURM_RESTART_COUNT /* Job restart count (uint32_t *) */
 };
 typedef enum spank_item spank_item_t;
diff a/src/common/plugstack.c b/src/common/plugstack.c
--- a/src/common/plugstack.c
+++ b/src/common/plugstack.c
@@ -2133,6 +2133,13 @@ spank_err_t spank_get_item(spank_t spank, spank_item_t item, ...)
 		else
 			*p2uint32 = 0;
 		break;
+	case S_SLURM_RESTART_COUNT:
+		p2uint32 = va_arg(vargs, uint32_t *);
+		if (slurmd_job)
+			*p2uint32 = slurmd_job->restart_cnt;
+		else
+			*p2uint32 = 0;
+		break;
 	case S_SLURM_VERSION:
 		p2vers = va_arg(vargs, char **);
 		*p2vers = SLURM_VERSION_STRING;
[slurm-dev] RE: --exclusive together with --ntasks-per-node not working as expected.
Yes... but I fail to see the absence: I say that I want 8 tasks per node, not 16. Saying that I want the node exclusively should not invalidate that, in my opinion. As a basic rule we tell our users not to think in terms of nodes but in terms of tasks, and Slurm will give them the number of nodes they need. This might not be true for more advanced users/use cases, but Best regards, Magnus On 2014-02-19 15:49, Rod Schultz wrote: In the absence of other directives, slurm tries to use the minimum number of nodes. Instead of -n16, try -N 2. That tells slurm to use two nodes. Here's a demo case:

srun -l -N2 --tasks-per-node=2 hostname
1: trek0
0: trek0
2: trek1
3: trek1

-----Original Message----- From: Magnus Jonsson [mailto:mag...@hpc2n.umu.se] Sent: Wednesday, February 19, 2014 1:28 AM To: slurm-dev Subject: [slurm-dev] --exclusive together with --ntasks-per-node not working as expected. Hi! We have a user that submitted a job that did not start as expected. He was using --exclusive together with --ntasks-per-node but ended up with all tasks on one node anyway.

8<
#SBATCH -n 16
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
8<

See the attached files for more information about how the job was submitted. We are currently running version 2.6.3. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] --exclusive together with --ntasks-per-node not working as expected.
Hi! We have a user that submitted a job that did not start as expected. He was using --exclusive together with --ntasks-per-node but ended up with all tasks on one node anyway.

8<
#SBATCH -n 16
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
8<

See the attached files for more information about how the job was submitted. We are currently running version 2.6.3. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

JobId=1603907 Name=submit_e
UserId=magnus(2066) GroupId=folk(3001)
Priority=658834 Account=sysop QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0 RunTime=00:00:44 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2014-02-19T09:03:10 EligibleTime=2014-02-19T09:03:10
StartTime=2014-02-19T09:05:45 EndTime=2014-02-19T09:35:45
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=devel AllocNode:Sid=t-mn01:14395
ReqNodeList=(null) ExcNodeList=(null)
NodeList=t-cn0304 BatchHost=t-cn0304
NumNodes=6 NumCPUs=48 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=t-cn0304 CPU_IDs=0-47 Mem=127200
MinCPUsNode=8 MinMemoryCPU=2650M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/nobackup/home/m/magnus/y/submit_e
WorkDir=/pfs/nobackup/home/m/magnus/y
BatchScript=
#!/bin/bash
#SBATCH -A sysop
#SBATCH -p devel
#SBATCH -o e.out
#SBATCH -n 16
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
scontrol show job -d -d $SLURM_JOBID
srun hostname

t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se
t-cn0304.hpc2n.umu.se

JobId=1603906 Name=submit
UserId=magnus(2066) GroupId=folk(3001)
Priority=658834 Account=sysop QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0 RunTime=00:00:01 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2014-02-19T09:03:09 EligibleTime=2014-02-19T09:03:09
StartTime=2014-02-19T09:03:44 EndTime=2014-02-19T09:33:44
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=devel AllocNode:Sid=t-mn01:14395
ReqNodeList=(null) ExcNodeList=(null)
NodeList=t-cn[1015,1017] BatchHost=t-cn1015
NumNodes=2 NumCPUs=24 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=t-cn[1015,1017] CPU_IDs=0-11 Mem=31800
MinCPUsNode=8 MinMemoryCPU=2650M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/nobackup/home/m/magnus/y/submit
WorkDir=/pfs/nobackup/home/m/magnus/y
BatchScript=
#!/bin/bash
#SBATCH -A sysop
#SBATCH -p devel
#SBATCH -o o.out
#SBATCH -n 16
#SBATCH --ntasks-per-node=8
scontrol show job -d -d $SLURM_JOBID
srun hostname

t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1015.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
t-cn1017.hpc2n.umu.se
[slurm-dev] Only allow some nodes in an partition to run jobs that stay within one node.
Hi! We have a part of our cluster that has limited interconnect. Is there a way to make part of a partition only allow jobs that stay within one node, without making a new partition? I know I can make a submit plugin that changes the partition if the job seems to fit within the limits, but this might also confuse the users. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: Bad behaviour of slurm with -c
Got the same behaviour with 2.6.3. /Magnus On 2013-10-04 22:51, Moe Jette wrote: There were bug fixes related to socket-based allocations in both version 2.6.2 and 2.6.3. I am not sure if these changes will fix the problem that you report, but it is probably worth a look. Quoting Magnus Jonsson: Hi! I have a case where slurm allocated fewer cores than required. It doesn't seem to happen every time, but (right now) 1 in 10 fails this way, probably because of the layout of the current jobs on the nodes. Here is some information I collected. I also attach our slurm.conf. We are running Slurm 2.6.1. Best regards, Magnus

==> submit <==
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=out.%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00
echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---
srun echo ""

==> out.1313514 <==
---
SLURM_CHECKPOINT_IMAGE_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_JOB_NAME=84212
SLURMD_NODENAME=t-cn0113
SLURM_TOPOLOGY_ADDR=t-isw0501.t-isw0101.t-cn0113
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_MEM_PER_CPU=2500
SLURM_NNODES=5
SLURM_JOBID=1313514
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=3,4(x2),3,2
SLURM_JOB_ID=1313514
SLURM_CPUS_PER_TASK=12
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_TASK_PID=19827
SLURM_NPROCS=16
SLURM_CPUS_ON_NODE=24
SLURM_PROCID=0
SLURM_JOB_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=24,48(x2),36,24
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=t-mn01.hpc2n.umu.se
SLURM_JOB_NUM_NODES=5
---
JobId=1313514 Name=84212
UserId=magnus(2066) GroupId=folk(3001)
Priority=10 Account=default QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0 RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2013-10-04T15:16:43 EligibleTime=2013-10-04T15:16:43
StartTime=2013-10-04T15:23:02 EndTime=2013-10-04T15:28:02
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=batch AllocNode:Sid=t-mn01:8853
ReqNodeList=(null) ExcNodeList=(null)
NodeList=t-cn[0113,0423-0424,0433-0434] BatchHost=t-cn0113
NumNodes=5 NumCPUs=180 CPUs/Task=12 ReqS:C:T=*:*:*
Nodes=t-cn0113 CPU_IDs=12-35 Mem=6
Nodes=t-cn[0423-0424] CPU_IDs=0-47 Mem=12
Nodes=t-cn0433 CPU_IDs=6-41 Mem=9
Nodes=t-cn0434 CPU_IDs=6-11,24-41 Mem=6
MinCPUsNode=12 MinMemoryCPU=2500M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/nobackup/home/m/magnus/84212/submit
WorkDir=/pfs/nobackup/home/m/magnus/84212
BatchScript=
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00
echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---
srun echo ""
---

==> err.1313514 <==
srun: error: Unable to create job step: More processors requested than permitted

==> slurm.log <==
Oct 4 15:23:02 t-mn02 slurmctld[28426]: backfill test for job 1313514
Oct 4 15:23:02 t-mn02 slurmctld[28426]: error: cons_res: _compute_c_b_task_dist oversubscribe for job 1313514
Oct 4 15:23:02 t-mn02 slurmctld[28426]: backfill: Started JobId=1313514 on t-cn[0113,0423-0424,0433-0434]
Oct 4 15:23:03 t-mn02 slurmctld[28426]: _slurm_rpc_job_step_create for job 1313514: More processors requested than permitted
Oct 4 15:23:03 t-mn02 slurmctld[28426]: completing job 1313514
Oct 4 15:23:03 t-mn02 slurmctld[28426]: sched: job_complete for JobId=1313514 successful, exit code=256

-- Magnus Jonsson, Developer, HPC2N, Umeå Universitet -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Bad behaviour of slurm with -c
Hi! I have a case where slurm allocated fewer cores than required (the job requests 16 tasks with 12 CPUs each, i.e. 192 CPUs, but the job record below shows only NumCPUs=180 allocated). It doesn't seem to happen every time, but (right now) 1 in 10 fails this way, probably because of the layout of the current jobs on the nodes. Here is some information I collected. I also attach our slurm.conf. We are running Slurm 2.6.1. Best regards, Magnus

==> submit <==
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=out.%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00
echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---
srun echo ""

==> out.1313514 <==
---
SLURM_CHECKPOINT_IMAGE_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_JOB_NAME=84212
SLURMD_NODENAME=t-cn0113
SLURM_TOPOLOGY_ADDR=t-isw0501.t-isw0101.t-cn0113
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_MEM_PER_CPU=2500
SLURM_NNODES=5
SLURM_JOBID=1313514
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=3,4(x2),3,2
SLURM_JOB_ID=1313514
SLURM_CPUS_PER_TASK=12
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/pfs/nobackup/home/m/magnus/84212
SLURM_TASK_PID=19827
SLURM_NPROCS=16
SLURM_CPUS_ON_NODE=24
SLURM_PROCID=0
SLURM_JOB_NODELIST=t-cn[0113,0423-0424,0433-0434]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=24,48(x2),36,24
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=t-mn01.hpc2n.umu.se
SLURM_JOB_NUM_NODES=5
---
JobId=1313514 Name=84212
UserId=magnus(2066) GroupId=folk(3001)
Priority=10 Account=default QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0 RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2013-10-04T15:16:43 EligibleTime=2013-10-04T15:16:43
StartTime=2013-10-04T15:23:02 EndTime=2013-10-04T15:28:02
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=batch AllocNode:Sid=t-mn01:8853
ReqNodeList=(null) ExcNodeList=(null)
NodeList=t-cn[0113,0423-0424,0433-0434] BatchHost=t-cn0113
NumNodes=5 NumCPUs=180 CPUs/Task=12 ReqS:C:T=*:*:*
Nodes=t-cn0113 CPU_IDs=12-35 Mem=6
Nodes=t-cn[0423-0424] CPU_IDs=0-47 Mem=12
Nodes=t-cn0433 CPU_IDs=6-41 Mem=9
Nodes=t-cn0434 CPU_IDs=6-11,24-41 Mem=6
MinCPUsNode=12 MinMemoryCPU=2500M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/nobackup/home/m/magnus/84212/submit
WorkDir=/pfs/nobackup/home/m/magnus/84212
BatchScript=
#!/bin/bash
#SBATCH -J 84212
#SBATCH --error=err.%J
#SBATCH --output=%J
#SBATCH -n 16
#SBATCH -c 12
#SBATCH -t 00:05:00
echo ---
env | grep ^SLURM
echo ---
scontrol show job -d -d $SLURM_JOBID
echo ---
srun echo ""
---

==> err.1313514 <==
srun: error: Unable to create job step: More processors requested than permitted

==> slurm.log <==
Oct 4 15:23:02 t-mn02 slurmctld[28426]: backfill test for job 1313514
Oct 4 15:23:02 t-mn02 slurmctld[28426]: error: cons_res: _compute_c_b_task_dist oversubscribe for job 1313514
Oct 4 15:23:02 t-mn02 slurmctld[28426]: backfill: Started JobId=1313514 on t-cn[0113,0423-0424,0433-0434]
Oct 4 15:23:03 t-mn02 slurmctld[28426]: _slurm_rpc_job_step_create for job 1313514: More processors requested than permitted
Oct 4 15:23:03 t-mn02 slurmctld[28426]: completing job 1313514
Oct 4 15:23:03 t-mn02 slurmctld[28426]: sched: job_complete for JobId=1313514 successful, exit code=256

-- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

# See the slurm.conf man page for more information.
#
ControlMachine=t-mn02
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=YES
RebootProgram=/sbin/reboot
# our cleanup epilog
Epilog=/var/conf/slurm/hpc2n-epilog
#PrologSlurmctld=
#FirstJobId=1
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
###
#Optimization Toolbox 2
#Partial Differential Equation Toolbox 2
#Statistics Toolbox 2
#Image Processing Toolbox 2
#Curve Fitting Toolbox 5
#Signal Processing Toolbox 5
#Communications Toolbox 5
#Parallel Computing Toolbox 15
# One more then we actually have!
Licenses=matlab*21,matlab-pct*16,matlab-ct*6,matlab-spt*6,matlab-cft*6,matlab-ipt*3,matlab-st*3,matlab-pdet*3,matlab-ot*3
MailProg=/usr/bin/mail
MaxJobCount=2
#MaxTasksPerNode=128
#MpiDefault=none
MpiDefault=openmpi
#MpiParams=ports=#-#
## needs openmpi-1.5+
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#P
[slurm-dev] Expected start time far, far away...
We have a reservation for some nodes in our cluster:

ReservationName=test StartTime=2013-09-20T09:22:50 EndTime=2014-09-20T09:22:50 Duration=365-00:00:00
Nodes=t-cn[0301,0710,0715-0716,0736,0828] NodeCnt=6 CoreCnt=288 Features=(null) PartitionName=(null)
Flags=IGNORE_JOBS,SPEC_NODES Users=xxx Accounts=sysop Licenses=(null) State=ACTIVE

For some users we get output from squeue saying that jobs will start after this reservation ends (2014-09-20), and the users are asking us "Will my jobs not start before that?". Is there a way to prevent this from happening? Maybe expected start times that far in the future should not be allowed to appear at all. If the start time is more than a week or two in the future, it will probably not be that accurate anyway. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Bug in squeue command.
Hi! A user reported a strange behaviour of squeue in our newly installed 2.6.1 version. The following submit file:

-8<-
#!/bin/bash -l
#SBATCH -N 1
#SBATCH -n 12
#SBATCH --time=5-00:00:00
hostname
-8<-

results in the following output from squeue if I use -j or -u:

-8<-
% squeue -u magnus
  JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
1190903     batch  submit2 magnus PD  0:00    12 (Priority)
-8<-

-8<-
% squeue -j 1190903
  JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
1190903     batch  submit2 magnus PD  0:00    12 (Priority)
-8<-

But if I use grep:

-8<-
% squeue | grep 1190903
1190903     batch  submit2 magnus PD  0:00     1 (Priority)
-8<-

I get the expected behaviour. I have tracked this down to a commit ("To minimize overhead") https://github.com/SchedMD/slurm/commit/ac44db862c8d1f460e55ad09017d058942ff6499 on lines 397/416 in src/squeue/opts.c. max_cpus is used in _get_node_cnt() to estimate the number of nodes required. Reverting the params.max_cpus code, I get the expected behaviour. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: cons_res: Can't use Partition SelectType
Hi! To use a per-partition SelectTypeParameters setting, you must use CR_Socket, or CR_Core together with CR_ALLOCATE_FULL_SOCKET, as the default SelectTypeParameters. You are using CR_CPU_Memory. Best regards, Magnus On 2013-08-05 23:13, Eva Hocks wrote: I am getting spam messages in the logs: [2013-08-05T14:04:32.000] cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core and CR_ALLOCATE_FULL_SOCKET The slurm.conf settings are: SelectType=select/cons_res SelectTypeParameters=CR_CPU_Memory and I have set one partition in partitions.conf to SelectTypeParameters=CR_Core Why does slurm complain? I HAVE set CR_Core and I even checked the spelling. Thanks Eva -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
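Concretely, following that advice the global setting would look something like one of these (keeping memory tracking as in the original CR_CPU_Memory; exact flag combinations should be checked against the slurm.conf man page for your version):

-8<-
# slurm.conf fragment
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_ALLOCATE_FULL_SOCKET
# or: SelectTypeParameters=CR_Socket_Memory
# partitions.conf can then override per partition:
PartitionName=special Nodes=node[01-04] SelectTypeParameters=CR_Core
-8<-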
[slurm-dev] Re: strftime issue(s).
Works for me :-) /M On 2013-06-17 10:25, Moe Jette wrote: Hi Magnus, Thank you for reporting this problem and providing a patch. Most places that use strftime already test for a return value of zero, but I see two that do not. Perhaps using a macro that can be called from multiple places will be a better solution than changing the code in various places. See attached variation of your patch. Moe Quoting Magnus Jonsson: Hi! We found an issue in sacct that we pinned down to a strftime call in 'src/common/parse_time.c' (slurm_make_time_str). Reproducible with (in 2.5.{6,7}):

OK:
% sacct -Po end%19 -s failed,completed,timeout,cancelled -S 2013-06-13 | head | cat -vet
2013-06-14T11:40:14$
2013-06-14T11:40:14$
2013-06-14T11:40:11$

Fail:
% sacct -Po end%18 -s failed,completed,timeout,cancelled -S 2013-06-13 | head | cat -vet
End$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$

The problem is that the output of strftime with a buffer shorter than the required length of the output has been undefined in libc since libc 4.4.4. Different libc implementations seem to implement this differently; on Solaris, for example, the buffer is truncated at the given length but strftime still returns 0. From the man page of strftime (Ubuntu Precise): --8< RETURN VALUE The strftime() function returns the number of characters placed in the array s, not including the terminating null byte, provided the string, including the terminating null byte, fits. Otherwise, it returns 0, and the contents of the array is undefined. (This behavior applies since at least libc 4.4.4; very old versions of libc, such as libc 4.4.1, would return max if the array was too small.) Note that the return value 0 does not necessarily indicate an error; for example, in many locales %p yields an empty string. --8< From the man page of strftime (Solaris): --8< If the total number of resulting characters including the terminating null character is more than maxsize, strftime() returns 0 and the contents of the array are indeterminate. --8< The return value of strftime is not checked in slurm_make_time_str, making the returned value undefined. As I see it, the problem can be solved in several ways: 1. Using a "large" temporary buffer for the output; the expected behaviour of sacct end%N will be fine for most normal cases. 2. Only checking the return value and returning an error or a well-defined output. I have attached a patch for case 1 that sets the output to "#" if the output still does not fit into the buffer. There seem to be other places in the slurm code base that use strftime without checking the return code. Some of them might be OK due to the format string and the size of the buffer, but this might need to be looked into in more depth. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] strftime issue(s).
Hi! We found an issue in sacct that we pinned down to a strftime call in 'src/common/parse_time.c' (slurm_make_time_str). Reproducible with (in 2.5.{6,7}):

OK:
% sacct -Po end%19 -s failed,completed,timeout,cancelled -S 2013-06-13 | head | cat -vet
2013-06-14T11:40:14$
2013-06-14T11:40:14$
2013-06-14T11:40:11$

Fail:
% sacct -Po end%18 -s failed,completed,timeout,cancelled -S 2013-06-13 | head | cat -vet
End$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$
2013-06-14TVM-^P^?$

The problem is that the output of strftime with a buffer shorter than the required length of the output has been undefined in libc since libc 4.4.4. Different libc implementations seem to implement this differently; on Solaris, for example, the buffer is truncated at the given length but strftime still returns 0. From the man page of strftime (Ubuntu Precise): --8< RETURN VALUE The strftime() function returns the number of characters placed in the array s, not including the terminating null byte, provided the string, including the terminating null byte, fits. Otherwise, it returns 0, and the contents of the array is undefined. (This behavior applies since at least libc 4.4.4; very old versions of libc, such as libc 4.4.1, would return max if the array was too small.) Note that the return value 0 does not necessarily indicate an error; for example, in many locales %p yields an empty string. --8< From the man page of strftime (Solaris): --8< If the total number of resulting characters including the terminating null character is more than maxsize, strftime() returns 0 and the contents of the array are indeterminate. --8< The return value of strftime is not checked in slurm_make_time_str, making the returned value undefined. As I see it, the problem can be solved in several ways: 1. Using a "large" temporary buffer for the output; the expected behaviour of sacct end%N will be fine for most normal cases. 2. Only checking the return value and returning an error or a well-defined output. I have attached a patch for case 1 that sets the output to "#" if the output still does not fit into the buffer. There seem to be other places in the slurm code base that use strftime without checking the return code. Some of them might be OK due to the format string and the size of the buffer, but this might need to be looked into in more depth. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff -ru site/src/common/parse_time.c amd64_ubuntu1004/src/common/parse_time.c
--- site/src/common/parse_time.c 2013-06-05 21:43:00.0 +0200
+++ amd64_ubuntu1004/src/common/parse_time.c 2013-06-14 16:56:20.0 +0200
@@ -597,6 +597,7 @@
 	static char fmt_buf[32];
 	static const char *display_fmt;
 	static bool use_relative_format;
+	char tmp_string[(size<256?256:size+1)];

 	if (!display_fmt) {
 		char *fmt = getenv("SLURM_TIME_FORMAT");
@@ -626,7 +627,11 @@
 		if (use_relative_format)
 			display_fmt = _relative_date_fmt(&time_tm);

-		strftime(string, size, display_fmt, &time_tm);
+		if (strftime(tmp_string, sizeof(tmp_string), display_fmt, &time_tm) == 0) {
+			memset(tmp_string, '#', size);
+		}
+		strncpy(string, tmp_string, size);
+		string[size-1] = 0;
 	}
 }
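A small standalone program illustrating the libc behaviour quoted above (on glibc the return value is 0 and the buffer contents are undefined):

-8<-
/* strftime_demo.c - too-small-buffer case described above. */
#include <stdio.h>
#include <time.h>

int main(void)
{
	char buf[8];	/* too small: "%Y-%m-%dT%H:%M:%S" needs 19 chars + NUL */
	time_t t = time(NULL);
	struct tm tm;

	localtime_r(&t, &tm);
	if (strftime(buf, sizeof(buf), "%Y-%m-%dT%H:%M:%S", &tm) == 0)
		printf("strftime returned 0; buf must not be used\n");
	return 0;
}
-8<-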
[slurm-dev] sbatch --exclusive --mem-per-cpu
Hi! If a user asks for more than the available memory of a core (in our case 2500M/core) with -N1 --mem-per-cpu, and also adds --exclusive, Slurm will allocate all the cores in the node but only account for the number of cores that fulfil the --mem-per-cpu requirement. For example, if I say --mem-per-cpu=5000, only half of the available cores will be accounted for, but all of them will be blocked. This is the (relevant) output of scontrol show job for a real job on our system:

-8<-
JobId=69211 Name=memory
NumNodes=1 NumCPUs=30 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=t-cn0102 CPU_IDs=0-47 Mem=12
MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
Shared=0 Contiguous=0 Licenses=(null) Network=(null)

#!/bin/bash
#SBATCH --mem-per-cpu=4000
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --exclusive
-8<-

Total memory/node=128000M, 48 cores, default 2500M/core. As I understand it, this will give the wrong input to the fairshare scheduler and result in the wrong (too high) priority for the user. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] issue with task/affinity and srun --exclusive
Hi! We have a problem with task/affinity and srun --exclusive. If I submit a job with sbatch that runs srun --exclusive, it looks like (from the output of hwloc-bind --get) that tasks are allocated (and bound) to cores before task/affinity gets a chance to distribute them according to the cpu_bind setting. In the example below I use 'sbatch --exclusive' and get 48 cores in total. srun -n1 -c6 --cpu_bind=rank_ldom sh -c "hwloc-bind --get | ./hex2bin" results in: 00 00 00 00 00 00 00 11 = 0x003f srun -n1 -c6 --exclusive --cpu_bind=rank_ldom sh -c "hwloc-bind --get | ./hex2bin" results in: 00 00 01 01 01 01 01 01 = 0x41041041 This also looks like the bitmask that task/affinity gets from Slurm. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
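The ./hex2bin helper is local to the site; a stand-in with the same effect could look like this (reads the hex mask that hwloc-bind prints and echoes it in binary; leading zeros are not printed):

-8<-
#!/bin/bash
# hex2bin: read a hex mask such as 0x003f on stdin, print it in binary.
read -r mask
mask=${mask#0x}
echo "obase=2; ibase=16; ${mask^^}" | bc
-8<-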
[slurm-dev] cons_res select_p_select_nodeinfo_set_all problem with multiple partitions.
Hi! I found a bug in cons_res/select_p_select_nodeinfo_set_all. If a node is part of two (or more) partitions, the code will only count the number of cores/CPUs in the partition that has the most running jobs on that node. A patch to fix the problem is attached. I also added a new function to bitstring that counts the number of bits set in a range (bit_set_count_range), and made a minor improvement to bit_set_count while reviewing the range version. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff -ru site/src/common/bitstring.c amd64_ubuntu1004/src/common/bitstring.c
--- site/src/common/bitstring.c 2013-03-08 20:29:51.0 +0100
+++ amd64_ubuntu1004/src/common/bitstring.c 2013-03-12 14:07:20.0 +0100
@@ -69,6 +69,7 @@
 strong_alias(bit_not, slurm_bit_not);
 strong_alias(bit_or, slurm_bit_or);
 strong_alias(bit_set_count, slurm_bit_set_count);
+strong_alias(bit_set_count_range, slurm_bit_set_count_range);
 strong_alias(bit_clear_count, slurm_bit_clear_count);
 strong_alias(bit_nset_max_count, slurm_bit_nset_max_count);
 strong_alias(int_and_set_count, slurm_int_and_set_count);
@@ -662,15 +663,45 @@
 	_assert_bitstr_valid(b);
 	bit_cnt = _bitstr_bits(b);
-	for (bit = 0; bit < bit_cnt; bit += word_size) {
-		if ((bit + word_size - 1) >= bit_cnt)
-			break;
+	for (bit = 0; (bit + word_size) <= bit_cnt; bit += word_size) {
 		count += hweight(b[_bit_word(bit)]);
 	}
 	for ( ; bit < bit_cnt; bit++) {
 		if (bit_test(b, bit))
 			count++;
 	}
 	return count;
 }
+
+/*
+ * Count the number of bits set in a range of bitstring.
+ *   b (IN)     bitstring to check
+ *   start (IN) first bit to check
+ *   end (IN)   last bit to check+1
+ * RETURN count of set bits
+ */
+int
+bit_set_count_range(bitstr_t *b, int start, int end)
+{
+	int count = 0;
+	bitoff_t bit, bit_cnt;
+	const int word_size = sizeof(bitstr_t) * 8;
+
+	_assert_bitstr_valid(b);
+	_assert_bit_valid(b, start);
+
+	end = MIN(end, _bitstr_bits(b));
+	for (bit = start; bit < end && bit < ((start+word_size-1)/word_size) * word_size; bit++) {
+		if (bit_test(b, bit))
+			count++;
+	}
+	for (; (bit + word_size) <= end; bit += word_size) {
+		count += hweight(b[_bit_word(bit)]);
+	}
+	for ( ; bit < end; bit++) {
+		if (bit_test(b, bit))
+			count++;
+	}
+	return count;
+}
diff -ru site/src/common/bitstring.h amd64_ubuntu1004/src/common/bitstring.h
--- site/src/common/bitstring.h 2013-03-08 20:29:51.0 +0100
+++ amd64_ubuntu1004/src/common/bitstring.h 2013-03-12 14:09:18.0 +0100
@@ -172,6 +172,7 @@
 void bit_not(bitstr_t *b);
 void bit_or(bitstr_t *b1, bitstr_t *b2);
 int bit_set_count(bitstr_t *b);
+int bit_set_count_range(bitstr_t *b, int start, int end);
 int bit_clear_count(bitstr_t *b);
 int bit_nset_max_count(bitstr_t *b);
 int int_and_set_count(int *i1, int ilen, bitstr_t *b2);
diff -ru site/src/common/slurm_xlator.h amd64_ubuntu1004/src/common/slurm_xlator.h
--- site/src/common/slurm_xlator.h 2013-03-08 20:29:51.0 +0100
+++ amd64_ubuntu1004/src/common/slurm_xlator.h 2013-03-12 12:32:50.0 +0100
@@ -93,6 +93,7 @@
 #define bit_not slurm_bit_not
 #define bit_or slurm_bit_or
 #define bit_set_count slurm_bit_set_count
+#define bit_set_count_range slurm_bit_set_count_range
 #define bit_clear_count slurm_bit_clear_count
 #define bit_nset_max_count slurm_bit_nset_max_count
 #define bit_and_set_count slurm_bit_and_set_count
diff -ru site/src/plugins/select/cons_res/select_cons_res.c amd64_ubuntu1004/src/plugins/select/cons_res/select_cons_res.c
--- site/src/plugins/select/cons_res/select_cons_res.c 2013-03-11 11:13:31.0 +0100
+++ amd64_ubuntu1004/src/plugins/select/cons_res/select_cons_res.c 2013-03-12 13:30:06.0 +0100
@@ -2230,7 +2230,7 @@
 	struct part_res_record *p_ptr;
 	struct node_record *node_ptr = NULL;
 	int i=0, n=0, c, start, end;
-	uint16_t tmp, tmp_16 = 0;
+	uint16_t tmp, tmp_16 = 0, tmp_part;
 	static time_t last_set_all = 0;
 	uint32_t node_threads, node_cpus;
 	select_nodeinfo_t *nodeinfo = NULL;
@@ -2275,20 +2275,17 @@
 	for (p_ptr = select_part_record; p_ptr; p_ptr = p_ptr->next) {
 		if (!p_ptr->row)
 			continue;
+		tmp_part = 0;
 		for (i = 0; i < p_ptr->num_rows; i++) {
 			if (!p_ptr->row[i].row_bitmap)
 				continue;
-			tmp = 0;
-			for (c = start; c < end; c++) {
-				if (bit_test(p_ptr->row[i].row_bitmap, c))
-					tmp++;
-			}
+			tmp = bit_set_count_range(p_ptr->row[i].row_bitmap,
+						  start, end);
 			/* get the row with the largest cpu count on it. */
-			if (tmp > tmp_16)
-				tmp_16 = tmp;
+			tmp_part = MAX(tmp, tmp_part);
 		}
+		tmp_16 += tmp_part;
 	}
 	/* The minimum allocatable unit may a core, so scale
[slurm-dev] Re: Licenses verification mechanism
Yes, it's possible to check if it's not set... but for us not all users need a license, and it's not as simple as disallowing people from starting the software based on the license information in Slurm. /Magnus On 2013-03-08 12:24, Taras Shapovalov wrote: Hi Magnus, Thanks, this solution will probably work for us as well. Also, when a user does not use the -L option, this could be checked (I believe) in contribs/lua/job_submit.lua in several lines of code (in the slurm_job_submit function). -- Taras On 03/08/2013 09:37 AM, Magnus Jonsson wrote: We have solved this by using the license handler in Slurm and letting our users specify the licenses with -L. Outside of Slurm we have a script that periodically checks our license server (FlexLM) for available licenses and licenses used in Slurm, and blocks a number of licenses with a "licenses" reservation that no one can run in. It also has the ability to make sure that there are available licenses if run in the prolog, and to fail the jobs if there are no licenses left. It's not a perfect solution, but it seems to work fairly well for us. The only problem is that a user can grab a license without specifying the -L option, but this is better than nothing. If anybody is interested in more details, just send me an email and I'll try to answer. Best Regards, Magnus On 2013-03-08 02:58, Taras Shapovalov wrote: Hi all, Recently I was faced with the case where users use software which requires licenses. The license server is running somewhere outside several clusters, and jobs from those clusters should check the availability of the licenses periodically. If there are no free licenses, then the job should be re-queued (so after some time the license availability will be verified again). Does anybody have experience with the case where a job (or some script) checks some condition periodically and stays in the queue if the condition has not been complied with yet? -- Taras -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: Licenses verification mechanism
We have solved this by using the license handler in Slurm and letting our users specify the licenses with -L. Outside of Slurm we have a script that periodically checks our license server (FlexLM) for available licenses and licenses used in Slurm, and blocks a number of licenses with a "licenses" reservation that no one can run in. It also has the ability to make sure that there are available licenses if run in the prolog, and to fail the jobs if there are no licenses left. It's not a perfect solution, but it seems to work fairly well for us. The only problem is that a user can grab a license without specifying the -L option, but this is better than nothing. If anybody is interested in more details, just send me an email and I'll try to answer. Best Regards, Magnus On 2013-03-08 02:58, Taras Shapovalov wrote: Hi all, Recently I was faced with the case where users use software which requires licenses. The license server is running somewhere outside several clusters, and jobs from those clusters should check the availability of the licenses periodically. If there are no free licenses, then the job should be re-queued (so after some time the license availability will be verified again). Does anybody have experience with the case where a job (or some script) checks some condition periodically and stays in the queue if the condition has not been complied with yet? -- Taras -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
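Very roughly, the periodic check could look like this sketch; the lmstat parsing, feature name, and reservation syntax are site-specific assumptions, not the actual HPC2N script:

-8<-
#!/bin/bash
# Count matlab licenses in use outside Slurm (FlexLM prints lines like
# "Users of MATLAB: (Total of 20 licenses issued; Total of 5 licenses
# in use)"; the awk field position is an assumption), then resize the
# blocking "licenses" reservation accordingly.
EXTERNAL=$(lmstat -a -c "$LICENSE_FILE" | awk '/Users of MATLAB:/ {print $11}')
scontrol update reservation ReservationName=licenses \
    licenses="matlab:${EXTERNAL}"
-8<-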
[slurm-dev] Re: Problem with backfill and patch for solution
Hi! We have what seems to be a similar type of load, and have in periods experienced the same problem. There are some parameters that can be used to tune the backfiller. We have had good results with setting bf_max_job_user to a small value (between 5 and 10), and bf_resolution to a large value (around 3600). bf_max_job_user is similar to the Maui MAXIJOB limit; the backfiller will only try this many jobs for each user. This is especially useful if some users have many identical or nearly identical jobs in the queue. I have tried tuning with bf_max_job_user and, as you say, it's especially useful with users having many identical jobs in the queue, but I think it is somewhat bad for the backfiller not to look at the whole queue. Many of our users that have many jobs do have more or less identical jobs, but not all, and then not looking at the complete queue would be bad for the user, especially if you put in small jobs for testing purposes. bf_resolution is the time resolution (in seconds) of the time slots used for estimating when a job can start. The default, 60 seconds, was way too low for us. I will try increasing the resolution value and see if it will pick up speed with that. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
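For reference, a sketch of how those two tunables might be set in slurm.conf with the values discussed above (site-specific; adjust to your own load):

SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_user=10,bf_resolution=3600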
[slurm-dev] Problem with backfill and patch for solution
Hi! We have a problem with backfill. Jobs are not backfilled due to the fact that backfill does not finish the complete backlog of jobs in the queue before it's interrupted and starts all over again. We sometimes have lots of jobs of various sizes and users in the queue, and even with idle nodes short jobs will not start because of this. I have made a patch for backfill with a configuration option (bf_continue) to let backfill continue from the last JobID of the last cycle. This will make backfill look at the whole queue eventually. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff -r -u a/src/plugins/sched/backfill/backfill.c b/src/plugins/sched/backfill/backfill.c
--- a/src/plugins/sched/backfill/backfill.c	2013-02-05 23:59:05.0 +0100
+++ b/src/plugins/sched/backfill/backfill.c	2013-03-01 10:31:24.0 +0100
@@ -125,6 +125,7 @@
 static int backfill_window = BACKFILL_WINDOW;
 static int max_backfill_job_cnt = 50;
 static int max_backfill_job_per_user = 0;
+static bool backfill_continue = false;
 
 /*** local functions */
 static void _add_reservation(uint32_t start_time, uint32_t end_reserve,
@@ -410,6 +411,18 @@
 			max_backfill_job_per_user);
 	}
 
+	/* bf_continue=true makes backfill continue where it was if interrupted */
+	if (sched_params && (strstr(sched_params, "bf_continue="))) {
+		if (strstr(sched_params, "bf_continue=1")) {
+			backfill_continue = true;
+		} else if (strstr(sched_params, "bf_continue=0")) {
+			backfill_continue = false;
+		} else {
+			fatal("Invalid bf_continue (use only 0 or 1)");
+		}
+	}
+
 	xfree(sched_params);
 }
@@ -530,6 +543,8 @@
 	uint32_t *uid = NULL, nuser = 0;
 	uint16_t *njobs = NULL;
 	bool already_counted;
+	static uint32_t last_job_id = 0;
+	bool last_job_id_found = false;
 
 #ifdef HAVE_CRAY
 	/*
@@ -597,12 +612,33 @@
 		uid = xmalloc(BF_MAX_USERS * sizeof(uint32_t));
 		njobs = xmalloc(BF_MAX_USERS * sizeof(uint16_t));
 	}
+	/*
+	 * Reset last_job_id if not using bf_continue
+	 */
+	if (!backfill_continue) {
+		last_job_id = 0;
+	}
+	if (last_job_id == 0) {
+		last_job_id_found = true;
+	}
 	while ((job_queue_rec = (job_queue_rec_t *)
 			list_pop_bottom(job_queue, sort_job_queue2))) {
 		job_test_count++;
 		job_ptr = job_queue_rec->job_ptr;
 		part_ptr = job_queue_rec->part_ptr;
 		xfree(job_queue_rec);
+
+		/*
+		 * Skip job checked last time
+		 */
+		if (backfill_continue && !last_job_id_found) {
+			if (last_job_id == job_ptr->job_id) {
+				last_job_id_found = true;
+				last_job_id = 0;
+			}
+			continue;
+		}
+
 		if (!IS_JOB_PENDING(job_ptr))
 			continue;	/* started in other partition */
 		job_ptr->part_ptr = part_ptr;
@@ -783,6 +819,10 @@
 					"breaking out after testing %d "
 					"jobs", job_test_count);
 			}
+			/*
+			 * Save last JobID for next turn
+			 */
+			last_job_id = job_ptr->job_id;
 			rc = 1;
 			break;
 		}
@@ -865,6 +905,10 @@
 		if (node_space_recs >= max_backfill_job_cnt) {
 			/* Already have too many jobs to deal with */
+			/*
+			 * Save last JobID for next turn
+			 */
+			last_job_id = job_ptr->job_id;
 			break;
 		}
@@ -890,6 +934,15 @@
 		if (debug_flags & DEBUG_FLAG_BACKFILL)
 			_dump_node_space_table(node_space);
 	}
+
+	/*
+	 * Reset last_job_id pointer if reached end of queue
+	 * without finding anything to do
+	 */
+	if (!last_job_id_found) {
+		debug("backfill: last_job_id=%d (reached end of queue without finding old job)", last_job_id);
+		last_job_id = 0;
+	}
 	xfree(uid);
 	xfree(njobs);
 	FREE_NULL_BITMAP(avail_bitmap);
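With the patch applied, the new option would be switched on via SchedulerParameters, following the parser added above (hypothetical until the patch is merged):

SchedulerParameters=bf_continue=1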
[slurm-dev] Buffer overflow bug + patch.
Hi! I just found a bug in slurm that creates a buffer overflow if you run 'scontrol show config'. Patch attached to fix the problem. /Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff --git a/src/common/slurm_protocol_defs.c b/src/common/slurm_protocol_defs.c
index adf48e5..45ed46c 100644
--- a/src/common/slurm_protocol_defs.c
+++ b/src/common/slurm_protocol_defs.c
@@ -1163,7 +1163,7 @@ extern uint16_t log_string2num(char *name)
  * NOTE: Not reentrant */
 extern char *sched_param_type_string(uint16_t select_type_param)
 {
-	static char select_str[64];
+	static char select_str[128];
 
 	select_str[0] = '\0';
 	if ((select_type_param & CR_CPU) &&
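A minimal standalone sketch (not the Slurm code; the flag names and values here are illustrative) of the failure mode and the defensive pattern: appending one string per set flag into a fixed static buffer overflows once enough flags are set, unless the buffer covers the worst case or every append is bounds-checked:

#include <stdio.h>
#include <string.h>

#define CR_CPU  0x0001	/* illustrative flag values only */
#define CR_CORE 0x0002

static const char *param_string(unsigned int flags)
{
	static char buf[128];	/* must hold the worst-case string */

	buf[0] = '\0';
	if (flags & CR_CPU)	/* bounds-checked append */
		strncat(buf, "CR_CPU,", sizeof(buf) - strlen(buf) - 1);
	if (flags & CR_CORE)
		strncat(buf, "CR_CORE,", sizeof(buf) - strlen(buf) - 1);
	return buf;
}

int main(void)
{
	printf("%s\n", param_string(CR_CPU | CR_CORE));
	return 0;
}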
[slurm-dev] Re: task/affinity, --cpu_bind=socket and -c > 1
Hi! For us this is in most cases a bad default behaviour, and I have not found a way to set a default value either (other than changing the code and recompiling). One other thing that I'm still curious about: when will the default: statement of the switch/case in my first mail (see below) happen? task_dist seems to be initialized to SLURM_DIST_CYCLIC (1), and all the other cases of task_dist that come to this point are defined in the switch/case. I have not been able to reach it and activate the _task_layout_lllp_multi() function. /Magnus On 2013-02-15 19:52, martin.pe...@bull.com wrote: Assuming you're using the default allocation and distribution methods, the behavior you describe sounds correct. Available cpus will be selected cyclically across the sockets for allocation to the job. Allocated cpus will be selected cyclically across the sockets for distribution to tasks for binding. And each task will be bound to all of the allocated cpus on each socket from which a cpu was distributed to it. For -n 8 -c 6, I would expect each of the 8 tasks to be bound to 36 cpus (6 cpus on each of 6 sockets). See the CPU Management Guide in the Slurm documentation for more info. Examples 11 thru 13 illustrate socket binding. Martin Perry Bull Phoenix From: Moe Jette To: "slurm-dev" , Date: 02/15/2013 10:33 AM Subject: [slurm-dev] Re: task/affinity, --cpu_bind=socket and -c > 1 Have you tried the --ntasks-per-socket or --ntasks-per-core options? Quoting Magnus Jonsson :
> Hi!
>
> I have noticed strange behaviour in the task/affinity plugin if I
> use --cpu_bind=socket and -c > 1.
>
> My tasks are distributed one on each socket (I have 8) and if I say
> -c 6, six of my sockets are allocated to my first task. If I have 8
> tasks, each task gets 6 of the 8 sockets.
>
> This sounds like bad behaviour, but it might be by design?
>
> I have traced it down to the lllp_distribution() function in
> task/affinity/dist_task.c
>
> In this switch statement:
>
> switch (req->task_dist) {
> case SLURM_DIST_BLOCK_BLOCK:
> case SLURM_DIST_CYCLIC_BLOCK:
> case SLURM_DIST_PLANE:
> 	/* tasks are distributed in blocks within a plane */
> 	rc = _task_layout_lllp_block(req, node_id, &masks);
> 	break;
> case SLURM_DIST_CYCLIC:
> case SLURM_DIST_BLOCK:
> case SLURM_DIST_CYCLIC_CYCLIC:
> case SLURM_DIST_BLOCK_CYCLIC:
> 	rc = _task_layout_lllp_cyclic(req, node_id, &masks);
> 	break;
> default:
> 	if (req->cpus_per_task > 1)
> 		rc = _task_layout_lllp_multi(req, node_id, &masks);
> 	else
> 		rc = _task_layout_lllp_cyclic(req, node_id, &masks);
> 	req->task_dist = SLURM_DIST_BLOCK_CYCLIC;
> 	break;
> }
>
> in the default block a different function is called if
> cpus_per_task > 1. Should the cyclic block be the same as the
> default block?
>
> Or should SLURM_DIST_CYCLIC, SLURM_DIST_BLOCK be the same as default?
>
> Best regards,
> Magnus
>
> --
> Magnus Jonsson, Developer, HPC2N, Umeå Universitet
-- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: slurmctld prolog delays job start
Any news on this? /Magnus On 2013-02-06 02:24, Michael Gutteridge wrote: We have a prolog that the slurm controller runs (pretty straightforward, just sets up some temporary directories). However, since upgrading from 2.3.5 to 2.5.1 we've got a situation where having any slurmctld prolog configured causes long delays (60-120s) between when slurmctld allocates resources and starts the job. It seems to occur in both srun and sbatch submitted jobs, though with different symptoms. I've distilled it to a very generic config, using the FIFO scheduler to eliminate any of that. I've also reduced the prolog to a two-line script:

#!/bin/bash
exit 0

The slurmctld.log has this:

[2013-02-05T15:26:27-08:00] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB from uid=5
[2013-02-05T15:26:27-08:00] debug3: JobDesc: user_id=5 job_id=-1 partition=(null) name=sleeper.sh
[2013-02-05T15:26:27-08:00] debug3: cpus=1-4294967294 pn_min_cpus=-1
snip
[2013-02-05T15:26:27-08:00] debug2: found 5 usable nodes from config containing puck[2-6]
[2013-02-05T15:26:27-08:00] debug3: _pick_best_nodes: job 29 idle_nodes 4 share_nodes 5
[2013-02-05T15:26:27-08:00] debug2: select_p_job_test for job 29
[2013-02-05T15:26:27-08:00] debug2: sched: JobId=29 allocated resources: NodeList=(null)
[2013-02-05T15:26:27-08:00] _slurm_rpc_submit_batch_job JobId=29 usec=1359
[2013-02-05T15:26:27-08:00] debug: sched: Running job scheduler
[2013-02-05T15:26:27-08:00] debug2: found 5 usable nodes from config containing puck[2-6]
[2013-02-05T15:26:27-08:00] debug3: _pick_best_nodes: job 29 idle_nodes 4 share_nodes 5
[2013-02-05T15:26:27-08:00] debug2: select_p_job_test for job 29
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: node[0]: required cpus: 1, min req boards: 1,
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: node[0]: min req sockets: 1, min avail cores: 7
[2013-02-05T15:26:27-08:00] debug3: cons_res: best_fit: using node[0]: board[0]: socket[1]: 3 cores available
[2013-02-05T15:26:27-08:00] debug3: cons_res: _add_job_to_res: job 29 act 0
[2013-02-05T15:26:27-08:00] debug3: cons_res: adding job 29 to part campus row 0
[2013-02-05T15:26:27-08:00] debug3: sched: JobId=29 initiated
[2013-02-05T15:26:27-08:00] sched: Allocate JobId=29 NodeList=puck2 #CPUs=1
[2013-02-05T15:26:27-08:00] debug3: Writing job id 29 to header record of job_state file
[2013-02-05T15:26:27-08:00] debug2: prolog_slurmctld job 29 prolog completed

The job shows running, but there are no processes running on the allocated node (puck2 in this case). In the allocated node's slurmd.log there's nothing (despite running with 3 "v" flags). A little while later:

[2013-02-05T15:27:27-08:00] error: agent waited too long for nodes to respond, sending batch request anyway...
[2013-02-05T15:27:27-08:00] Job 29 launch delayed by 60 secs, updating end_time
[2013-02-05T15:27:27-08:00] debug2: Spawning RPC agent for msg_type 4005
[2013-02-05T15:27:27-08:00] debug2: got 1 threads to send out
[2013-02-05T15:27:27-08:00] debug2: Tree head got back 0 looking for 1
[2013-02-05T15:27:27-08:00] debug3: Tree sending to puck2
[2013-02-05T15:27:27-08:00] debug2: Tree head got back 1
[2013-02-05T15:27:27-08:00] debug2: Tree head got them all
[2013-02-05T15:27:27-08:00] Node puck2 now responding
[2013-02-05T15:27:27-08:00] debug2: node_did_resp puck2

and on the allocated node, slurmd.log comes to life:

[2013-02-05T15:27:27-08:00] debug2: got this type of message 4005
[2013-02-05T15:27:27-08:00] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2013-02-05T15:27:27-08:00] debug: task_slurmd_batch_request: 29
[2013-02-05T15:27:27-08:00] debug: Calling /usr/sbin/slurmstepd spank prolog
[2013-02-05T15:27:27-08:00] Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
[2013-02-05T15:27:27-08:00] Running spank/prolog for jobid [29] uid [34152]
[2013-02-05T15:27:27-08:00] spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
[2013-02-05T15:27:27-08:00] spank: /usr/lib64/slurm-llnl/use-env.so: no callbacks in this context
[2013-02-05T15:27:27-08:00] Launching batch job 29 for UID 34152
[2013-02-05T15:27:27-08:00] debug level is 6.

and the task starts running. Removing "PrologSlurmctld" eliminates this delay, and the job starts immediately. The fact that the delay is exactly 60 seconds is suspicious and makes me suspect a misconfiguration. However, outside of the prolog configuration directive, the config is straight out of the config generator. Any pointers would be greatly appreciated - I'm out of ideas... Thanks Michael -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: task/affinity, --cpu_bind=socket and -c > 1
Hi! This does not make a difference, and judging by the man page I don't think it should either. /Magnus On 2013-02-15 18:32, Moe Jette wrote: Have you tried the --ntasks-per-socket or --ntasks-per-core options? Quoting Magnus Jonsson : Hi! I have noticed strange behaviour in the task/affinity plugin if I use --cpu_bind=socket and -c > 1. My tasks are distributed one on each socket (I have 8) and if I say -c 6, six of my sockets are allocated to my first task. If I have 8 tasks, each task gets 6 of the 8 sockets. This sounds like bad behaviour, but it might be by design? I have traced it down to the lllp_distribution() function in task/affinity/dist_task.c In this switch statement:

switch (req->task_dist) {
case SLURM_DIST_BLOCK_BLOCK:
case SLURM_DIST_CYCLIC_BLOCK:
case SLURM_DIST_PLANE:
	/* tasks are distributed in blocks within a plane */
	rc = _task_layout_lllp_block(req, node_id, &masks);
	break;
case SLURM_DIST_CYCLIC:
case SLURM_DIST_BLOCK:
case SLURM_DIST_CYCLIC_CYCLIC:
case SLURM_DIST_BLOCK_CYCLIC:
	rc = _task_layout_lllp_cyclic(req, node_id, &masks);
	break;
default:
	if (req->cpus_per_task > 1)
		rc = _task_layout_lllp_multi(req, node_id, &masks);
	else
		rc = _task_layout_lllp_cyclic(req, node_id, &masks);
	req->task_dist = SLURM_DIST_BLOCK_CYCLIC;
	break;
}

in the default block a different function is called if cpus_per_task > 1. Should the cyclic block be the same as the default block? Or should SLURM_DIST_CYCLIC, SLURM_DIST_BLOCK be the same as default? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] task/affinity, --cpu_bind=socket and -c > 1
Hi! I have noticed strange behaviour in the task/affinity plugin if I use --cpu_bind=socket and -c > 1. My tasks are distributed one on each socket (I have 8) and if I say -c 6, six of my sockets are allocated to my first task. If I have 8 tasks, each task gets 6 of the 8 sockets. This sounds like bad behaviour, but it might be by design? I have traced it down to the lllp_distribution() function in task/affinity/dist_task.c In this switch statement:

switch (req->task_dist) {
case SLURM_DIST_BLOCK_BLOCK:
case SLURM_DIST_CYCLIC_BLOCK:
case SLURM_DIST_PLANE:
	/* tasks are distributed in blocks within a plane */
	rc = _task_layout_lllp_block(req, node_id, &masks);
	break;
case SLURM_DIST_CYCLIC:
case SLURM_DIST_BLOCK:
case SLURM_DIST_CYCLIC_CYCLIC:
case SLURM_DIST_BLOCK_CYCLIC:
	rc = _task_layout_lllp_cyclic(req, node_id, &masks);
	break;
default:
	if (req->cpus_per_task > 1)
		rc = _task_layout_lllp_multi(req, node_id, &masks);
	else
		rc = _task_layout_lllp_cyclic(req, node_id, &masks);
	req->task_dist = SLURM_DIST_BLOCK_CYCLIC;
	break;
}

in the default block a different function is called if cpus_per_task > 1. Should the cyclic block be the same as the default block? Or should SLURM_DIST_CYCLIC, SLURM_DIST_BLOCK be the same as default? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
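A hedged way to reproduce and inspect the binding being discussed (the executable is a placeholder; the verbose keyword makes each task report the CPU mask it was bound to, and the socket keyword is spelled "sockets" in some Slurm versions):

srun -n 8 -c 6 --cpu_bind=verbose,sockets ./a.out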
[slurm-dev] Preemption bug
es: _can_job_run_on_node: 48 cpus on t-cn1033(0), mem 0/129000
[2013-02-12T14:54:48+01:00] cons_res: eval_nodes:0 consec c=48 n=1 b=0 e=0 r=-1
[2013-02-12T14:54:48+01:00] cons_res: cr_job_test: test 1 pass - idle resources found
[2013-02-12T14:54:48+01:00] no job_resources info for job 241
[2013-02-12T14:54:48+01:00] debug2: Testing job time limits and checkpoints
8<---
-- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

#
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-kvm
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
DisableRootJobs=YES
EnforcePartLimits=YES
MailProg=/usr/bin/mail
MpiDefault=openmpi
MpiParams=ports=12000-12999
ProctrackType=proctrack/cgroup
PropagateResourceLimitsExcept=CPU,MEMLOCK
ReturnToService=1
SlurmctldPort=6817
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup,task/affinity
TmpFs=/scratch
UsePAM=1
HealthCheckInterval=3600
HealthCheckProgram=/var/conf/slurm/hpc2n-healthcheck
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=60
# SCHEDULING
DefMemPerCPU=2500
FastSchedule=2
MaxMemPerCPU=2500
SchedulerType=sched/backfill
SchedulerParameters=max_job_bf=2000,bf_window=20160,default_queue_depth=2000
#
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK
# JOB PRIORITY
PriorityType=priority/multifactor
PriorityDecayHalfLife=50-0
PriorityWeightFairshare=100
PriorityWeightPartition=1
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=slurm-kvm
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=slurmtestcluster
DebugFlags=CPU_Bind
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=7
SlurmdDebug=7
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
# COMPUTE NODES
# DEVEL
NodeName=t-cn[1033-1034] RealMemory=129000 Sockets=8 CoresPerSocket=6
# Partition Configurations
PartitionName=devel Nodes=t-cn103[3,4] Default=YES DefaultTime=30:00 MaxTime=5-0 Priority=30 PreemptMode=OFF
PartitionName=core Nodes=t-cn103[3,4] DefaultTime=30:00 MaxTime=5-0 Priority=20 PreemptMode=OFF
PartitionName=preemp Nodes=t-cn103[3,4] Priority=10 PreemptMode=CANCEL GraceTime=15
PreemptType=preempt/partition_prio
PreemptMode=CANCEL

#!/bin/bash
#SBATCH -p devel
#SBATCH --time=05:00:00
#SBATCH -N1
#SBATCH --exclusive
srun -n1 ./job.pl

#!/bin/bash
#SBATCH -p devel
#SBATCH --time=01:00:00
#SBATCH -N2
#SBATCH --exclusive
srun -n1 ./job.pl

#!/bin/bash
#SBATCH -p preemp
#SBATCH --time=01:00:00
#SBATCH -N1
#SBATCH -n48
srun -n1 ./job.pl

#!/bin/bash
#SBATCH -p devel
#SBATCH --time=04:00:00
#SBATCH --signal USR1@60
#SBATCH -N1
#SBATCH -n48
#
#SBATCH --exclusive
if [ "$SLURM_JOBID" = "" ]; then
	echo "Using sbatch to submit job"
	sbatch $0
	exit 0
fi
srun -n1 ./job.pl
[slurm-dev] Re: Disable black hole nodes automatically
You could put a health check in the epilog of a job so that the node is checked after every job. If it's in bad shape you can down it. For the normal case with long-running jobs this should not be a problem, and only one job will fail. /Magnus On 2013-02-08 10:11, Mario Kadastik wrote: Hi, I'm wondering if there's a way to detect a fast churn rate for a node. Last night we had one node lose the software area, so all jobs that were scheduled failed within a few minutes (the jobs use wrappers that do health checking of the environment, so the job exit code was 0; the wrapper propagated the actual error code to the user's software). We have a self test run by slurm every 5 minutes and it did detect the node failure, but before it could, the node had "failed" hundreds of jobs in that 5-minute window. We assume most jobs would run for at least tens of minutes, so if slurm sees a node churning through jobs in less than a minute it should disable the node. Is there any way to handle this beyond moving self test script execution up from 5 minutes to say every 30 seconds? Thanks, Mario Kadastik, PhD Researcher --- "Physics is like sex, sure it may have practical reasons, but that's not why we do it" -- Richard P. Feynman -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
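A sketch of such an Epilog script (the health-check path is a placeholder, and this assumes the epilog runs with enough privilege to change node state):

#!/bin/bash
# Run a site health check after every job; drain the node on failure
# so no further jobs are scheduled onto it.
if ! /usr/local/sbin/node_healthcheck; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="epilog health check failed"
fi
exit 0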
[slurm-dev] Re: Patch for partition based SelectType (CR_Socket/CR_Core).
That's okay. It will blend on the same node as long as there are cores available. Because of CR_ALLOCATE_FULL_SOCKET there will never be non-allocated cores that are reserved for jobs. Why CR_ALLOCATE_FULL_SOCKET is not the default I don't understand, but I guess there is some good historic reason for that. /Magnus On 2013-02-07 16:42, Aaron Knister wrote: That's awesome! (How) does it handle the case of nodes in multiple partitions? Sent from my iPhone On Feb 7, 2013, at 8:24 AM, Magnus Jonsson wrote: Hi everybody! Attached is a patch that enables partition-based SelectType (currently CR_Socket/CR_Core) in select/cons_res. The patch requires CR_ALLOCATE_FULL_SOCKET to be enabled to work, and also this patch from the master branch: https://github.com/SchedMD/slurm/commit/cdf679d0158a246e7389a15b62f127e5142003fe It should however be easy to change it to use the old #define if you want to. We are currently testing this in our development system but will go into production later this spring based on needs from some of our users. One thing that I noticed during the development of this is that if a new option is added to slurm.conf that is not supported by an earlier version of slurm, programs/libs compiled with the earlier version stop working because they complain about errors in slurm.conf. We have the CR_ALLOCATE_FULL_SOCKET patch in our production system and some programs linked with openmpi stopped working for some of our users. It might be wise to require less reading of slurm.conf from the core parts of slurm and to put more reading/parsing of the config file into the plugins (and other modular parts of slurm). Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Patch for partition based SelectType (CR_Socket/CR_Core).
Hi everybody! Attached is a patch that enables partition-based SelectType (currently CR_Socket/CR_Core) in select/cons_res. The patch requires CR_ALLOCATE_FULL_SOCKET to be enabled to work, and also this patch from the master branch: https://github.com/SchedMD/slurm/commit/cdf679d0158a246e7389a15b62f127e5142003fe It should however be easy to change it to use the old #define if you want to. We are currently testing this in our development system but will go into production later this spring based on needs from some of our users. One thing that I noticed during the development of this is that if a new option is added to slurm.conf that is not supported by an earlier version of slurm, programs/libs compiled with the earlier version stop working because they complain about errors in slurm.conf. We have the CR_ALLOCATE_FULL_SOCKET patch in our production system and some programs linked with openmpi stopped working for some of our users. It might be wise to require less reading of slurm.conf from the core parts of slurm and to put more reading/parsing of the config file into the plugins (and other modular parts of slurm). Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff --git a/src/common/read_config.c b/src/common/read_config.c
index 2a54f69..b8d981b 100644
--- a/src/common/read_config.c
+++ b/src/common/read_config.c
@@ -903,6 +903,7 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 		{"ReqResv", S_P_BOOLEAN}, /* YES or NO */
 		{"Shared", S_P_STRING}, /* YES, NO, or FORCE */
 		{"State", S_P_STRING}, /* UP, DOWN, INACTIVE or DRAIN */
+		{"SelectType", S_P_STRING}, /* CR_Socket, CR_Core */
 		{NULL}
 	};
@@ -1125,6 +1126,22 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 	} else
 		p->state_up = PARTITION_UP;
 
+	if (s_p_get_string(&tmp, "SelectType", tbl)) {
+		if (strncasecmp(tmp, "CR_Socket", 9) == 0)
+			p->cr_type = CR_SOCKET;
+		else if (strncasecmp(tmp, "CR_Core", 7) == 0)
+			p->cr_type = CR_CORE;
+		else {
+			error("Bad value \"%s\" for SelectType", tmp);
+			_destroy_partitionname(p);
+			s_p_hashtbl_destroy(tbl);
+			xfree(tmp);
+			return -1;
+		}
+		xfree(tmp);
+	} else
+		p->cr_type = 0;
+
 	s_p_hashtbl_destroy(tbl);
 
 	*dest = (void *)p;
diff --git a/src/common/read_config.h b/src/common/read_config.h
index 7d017dc..a39a3c9 100644
--- a/src/common/read_config.h
+++ b/src/common/read_config.h
@@ -227,6 +227,7 @@
 	uint16_t state_up;	/* for states see PARTITION_* in slurm.h */
 	uint32_t total_nodes;	/* total number of nodes in the partition */
 	uint32_t total_cpus;	/* total number of cpus in the partition */
+	uint16_t cr_type;	/* Custom CR values for partition (if supported by select plugin) */
 } slurm_conf_partition_t;
 
 typedef struct slurm_conf_downnodes {
diff --git a/src/plugins/select/cons_res/select_cons_res.c b/src/plugins/select/cons_res/select_cons_res.c
index 364a683..142f252 100644
--- a/src/plugins/select/cons_res/select_cons_res.c
+++ b/src/plugins/select/cons_res/select_cons_res.c
@@ -1451,8 +1451,18 @@ static int _test_only(struct job_record *job_ptr, bitstr_t *bitmap,
 {
 	int rc;
 
+	uint16_t tmp_cr_type = cr_type;
+	if (job_ptr->part_ptr->cr_type) {
+		if (((cr_type & CR_SOCKET) || (cr_type & CR_CORE)) &&
+		    (cr_type & CR_ALLOCATE_FULL_SOCKET)) {
+			tmp_cr_type &= ~(CR_SOCKET|CR_CORE);
+			tmp_cr_type |= job_ptr->part_ptr->cr_type;
+		} else {
+			info("cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core and CR_ALLOCATE_FULL_SOCKET");
+		}
+	}
+
 	rc = cr_job_test(job_ptr, bitmap, min_nodes, max_nodes, req_nodes,
-			 SELECT_MODE_TEST_ONLY, cr_type, job_node_req,
+			 SELECT_MODE_TEST_ONLY, tmp_cr_type, job_node_req,
 			 select_node_cnt, select_part_record,
 			 select_node_usage, NULL);
 	return rc;
@@ -1489,14 +1499,24 @@
 	bool remove_some_jobs = false;
 	uint16_t pass_count = 0;
 	uint16_t mode;
+	uint16_t tmp_cr_type = cr_type;
 
 	save_bitmap = bit_copy(bitmap);
 top:	orig_map = bit_copy(save_bitmap);
 	if (!orig_map)
 		fatal("bit_copy: malloc failure");
 
+	if (job_ptr->part_ptr->cr_type) {
+		if (((cr_type & CR_SOCKET) || (cr_type & CR_CORE)) &&
+		    (cr_type & CR_ALLOCATE_FULL_SOCKET)) {
+			tmp_cr_type &= ~(CR_SOCKET|CR_CORE);
+			tmp_cr_type |= job_ptr->part_ptr->cr_type;
+		} else {
+			info("cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core and CR_ALLOCATE_FULL_SOCKET");
+		}
+	}
+
 	rc = cr_job_test(job_ptr, bitmap, min_nodes, max_nodes, req_nodes,
-			 SELECT_MODE_RUN_NOW, cr_type, job_node_req,
+			 SELECT_MODE_RUN_NOW, tmp_cr_type, job_node
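A sketch of the slurm.conf usage the patch enables (hypothetical until the patch is applied; partition and node names are placeholders):

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_ALLOCATE_FULL_SOCKET
PartitionName=bysocket Nodes=t-cn[1033-1034] SelectType=CR_Socket
PartitionName=bycore Nodes=t-cn[1033-1034] SelectType=CR_Core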
[slurm-dev] task_affinity bug in 2.5.1 and after..
Hi! We are in the process of upgrading to slurm 2.5.2, but I just found a bug in the task_affinity plugin in combination with cgroups. The commit https://github.com/SchedMD/slurm/commit/791322349856e14a3d50aadc4869d40b034a2f37 which solves some Power7-specific problems breaks task affinity together with cgroups on x86_64. This code seems to have been introduced into slurm from 2.5.1. From our slurm.conf: TaskPlugin=task/cgroup,task/affinity From slurmd.log:

[2013-02-01T13:39:12+01:00] [57] sched_setaffinity(12516,128,0x0) failed: Invalid argument
[2013-02-01T13:39:12+01:00] [57] sched_getaffinity(12516) = 0xff00

With cgroups activated we get the input cpuset 0xff00, which translates into 0x0 in the reset_cpuset function. If this is a Power7-specific problem, an #ifdef around it might be a good way to solve this problem for other platforms. Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
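A minimal standalone sketch of why the log above says "Invalid argument": on Linux, sched_setaffinity() rejects an empty CPU mask (the 0x0 that reset_cpuset produces) with EINVAL:

#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);	/* empty mask, like the 0x0 in the log */
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
		printf("sched_setaffinity: %s\n", strerror(errno));	/* Invalid argument */
	return 0;
}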
[slurm-dev] Re: Is it possible to get hold of parameters from sbatch/salloc in a spank plugin?
Hi! I have not succeeded in getting the parameters from the sbatch command/submit script into my spank plugin. If you have some example that shows this, it would make my life easier. Best regards, Magnus On 2013-01-30 15:34, Karl Schulz wrote: If you really do want to have control over providing arbitrary strings back to the user, then a spank plugin might also be a possibility. We have used the slurm_spank_init_post_opt() callback as a mechanism to create a custom job submission filter for srun/sbatch. It's nothing fancy, but it gives us a way to do some quick sanity checking and apply some local site requirements like: verifying the user is not over any of their disk quotas, verifying the user provided a max runlimit, additional ACLs for the queues, maximum jobs per user, etc. In this approach, stdout will be seen by the user and you can customize as desired. On Jan 29, 2013, at 11:31 AM, Moe Jette wrote: I would suggest a job_submit plugin: http://www.schedmd.com/slurmdocs/job_submit_plugins.html There is no mechanism to return a string to the user, only an exit code, but adding a few new exit codes would be simple (see slurm/slurm_errno.h and src/common/slurm_errno.c). We have also discussed adding a mechanism to return an arbitrary string to the user, but this is not possible today. Quoting Magnus Jonsson : Hi! I'm looking for a way to look at users' submitted parameters and, if they are using them in a "bad" way, inform them that this might not be a good usage of the system and point them to documentation about how slurm works and how to best use it on our system. I have tried different approaches but failed with every one. Any hints? Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
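For reference, a minimal spank plugin skeleton using the callback Karl mentions (a sketch only; it shows where option processing can be intercepted, not how to read arbitrary sbatch parameters, which is the open question here):

#include <slurm/spank.h>

SPANK_PLUGIN(submit_filter, 1);

int slurm_spank_init_post_opt(spank_t sp, int ac, char **av)
{
	/* Runs after option processing; in the local (srun/sbatch)
	 * context a site could do sanity checks here and print
	 * guidance back to the user. */
	if (spank_context() == S_CTX_LOCAL)
		slurm_info("submit_filter: checking submission options");
	return ESPANK_SUCCESS;
}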
[slurm-dev] Re: Bugs in CR_ALLOCATE_FULL_SOCKET.
I have CR_ALLOCATE_FULL_SOCKET working correctly on block allocation. Will fix cyclic after the weekend and supply a patch. Best regards, Magnus On 2013-01-18 16:00, Magnus Jonsson wrote: This patch fixes the behaviour of allocating 2 cores instead of one with --ntasks-per-socket=1. /Magnus On 2013-01-18 13:59, Magnus Jonsson wrote: Hi! I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird behaviour. Currently running git/master but have seen the same behaviour on 2.4.3 with the #define. My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET

This is my submit script (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job): NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=42-43 Mem=15000 If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job): NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000 I think this is caused by how the ntasks-per-socket code is selecting nodes in job_test.c of the cons_res plugin. I will look into the code and see if I can fix this somehow; otherwise I can test patches. I have a small part of our cluster available for testing right now (2 nodes, 8 sockets/node, 6 cores/socket). Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Re: Bugs in CR_ALLOCATE_FULL_SOCKET.
This patch fixes the behaviour of allocating 2 cores instead of one with --ntasks-per-socket=1. /Magnus On 2013-01-18 13:59, Magnus Jonsson wrote: Hi! I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird behaviour. Currently running git/master but have seen the same behaviour on 2.4.3 with the #define. My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET

This is my submit script (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job): NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=42-43 Mem=15000 If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job): NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000 I think this is caused by how the ntasks-per-socket code is selecting nodes in job_test.c of the cons_res plugin. I will look into the code and see if I can fix this somehow; otherwise I can test patches. I have a small part of our cluster available for testing right now (2 nodes, 8 sockets/node, 6 cores/socket). Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet

diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 60ec0b1..96b0dfa 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -310,7 +310,7 @@ uint16_t _allocate_sockets(struct job_record *job_ptr, bitstr_t *core_map,
 	 * allocating cores */
 	cps = num_tasks;
-	if (ntasks_per_socket > 1) {
+	if (ntasks_per_socket >= 1) {
 		cps = ntasks_per_socket;
 		if (cpus_per_task > 1)
 			cps = ntasks_per_socket * cpus_per_task;
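A standalone sketch of the off-by-one the patch fixes (illustrative values, not the actual Slurm data structures): with --ntasks-per-socket=1 the old test (> 1) never applied the per-socket cap, so cps kept the full task count; the patched test (>= 1) applies it:

#include <stdio.h>

int main(void)
{
	int num_tasks = 6, ntasks_per_socket = 1, cpus_per_task = 1;
	int cps_old = num_tasks, cps_new = num_tasks;

	if (ntasks_per_socket > 1) {		/* old: false for 1, no cap */
		cps_old = ntasks_per_socket;
		if (cpus_per_task > 1)
			cps_old = ntasks_per_socket * cpus_per_task;
	}
	if (ntasks_per_socket >= 1) {		/* new: cap applies */
		cps_new = ntasks_per_socket;
		if (cpus_per_task > 1)
			cps_new = ntasks_per_socket * cpus_per_task;
	}
	printf("cores per socket: old=%d new=%d\n", cps_old, cps_new);	/* old=6 new=1 */
	return 0;
}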
[slurm-dev] Re: Bugs in CR_ALLOCATE_FULL_SOCKET.
Err... Wrong... On 2013-01-18 13:59, Magnus Jonsson wrote: Hi! I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird behaviour. Currently running git/master but have seen the same behaviour on 2.4.3 with the #define. My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET

This is my submit script (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job): NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=42-43 Mem=5000 This is the correct output (but wrong :-) Copy'n'paste is hard sometimes... If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job): NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000 I think this is caused by how the ntasks-per-socket code is selecting nodes in job_test.c of the cons_res plugin. I will look into the code and see if I can fix this somehow; otherwise I can test patches. I have a small part of our cluster available for testing right now (2 nodes, 8 sockets/node, 6 cores/socket). Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet
[slurm-dev] Bugs in CR_ALLOCATE_FULL_SOCKET.
Hi! I'm experimenting with CR_ALLOCATE_FULL_SOCKET and found some weird behaviour. Currently running git/master but have seen the same behaviour on 2.4.3 with the #define. My slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ALLOCATE_FULL_SOCKET

This is my submit script (the important parts):

#SBATCH -n1
#SBATCH --ntasks-per-socket=1

This gives me (from scontrol show job): NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=42-47 Mem=15000 If I submit:

#SBATCH -n6
#SBATCH --ntasks-per-socket=3

it gives me (from scontrol show job): NumNodes=1 NumCPUs=6 CPUs/Task=1 ReqS:C:T=*:*:* Nodes=t-cn1033 CPU_IDs=36-38,42-44 Mem=15000 I think this is caused by how the ntasks-per-socket code is selecting nodes in job_test.c of the cons_res plugin. I will look into the code and see if I can fix this somehow; otherwise I can test patches. I have a small part of our cluster available for testing right now (2 nodes, 8 sockets/node, 6 cores/socket). Best regards, Magnus -- Magnus Jonsson, Developer, HPC2N, Umeå Universitet