[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-17 Thread kesim
Dear All,
Yesterday I did some tests and it seemed that the scheduling was following
the CPU load, but I was wrong.
My configuration is at the moment:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU,CR_LLN

Today I submitted 70 threaded jobs to the queue and here is the CPU_LOAD
info
node1    0.08  7/0/0/7
node2    0.01  7/0/0/7
node3    0.00  7/0/0/7
node4    2.97  7/0/0/7
node5    0.00  7/0/0/7
node6    0.01  7/0/0/7
node7    0.00  7/0/0/7
node8    0.05  7/0/0/7
node9    0.07  7/0/0/7
node10   0.38  7/0/0/7
node11   0.01  0/7/0/7
As you can see, it allocated 7 CPUs on node4 with CPU_LOAD 2.97 and 0 CPUs
on the idling node11. Why is such a simple thing not the default? What am I
missing?
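
For reference, the listing above comes from sinfo; sorting it by load makes the
mismatch easier to see (a sketch - I am assuming sinfo's -S sort option accepts
the same field letters as -o, per its man page):

    sinfo -N -o '%N %O %C' -S O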

On Thu, Mar 16, 2017 at 7:53 PM, kesim  wrote:

> Thank you for the great suggestion. It is working! However, the description of
> CR_LLN is misleading: "Schedule resources to jobs on the least loaded nodes
> (based upon the number of idle CPUs)". I understood it to mean that if two
> nodes do not have all their CPUs allocated, the node with the smaller number of
> allocated CPUs will take precedence. Therefore the bracketed comment should
> be removed from the description.
>
> On Thu, Mar 16, 2017 at 6:24 PM, Paul Edmon 
> wrote:
>
>> You should look at LLN (least loaded nodes):
>>
>> https://slurm.schedmd.com/slurm.conf.html
>>
>> That should do what you want.
>> -Paul Edmon-
>>
>> On 03/16/2017 12:54 PM, kesim wrote:
>>
>>
>> -- Forwarded message --
>> From: kesim 
>> Date: Thu, Mar 16, 2017 at 5:50 PM
>> Subject: Scheduling jobs according to the CPU load
>> To: slurm-dev@schedmd.com
>>
>>
>> Hi all,
>>
>> I am a new user and I created a small network of 11 nodes with 7 CPUs per node
>> out of users' desktops.
>> I configured slurm as:
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>> When I submit a task with srun -n70 task,
>> it will fill 10 nodes with 7 tasks/node. However, I have no clue what the
>> algorithm for choosing the nodes is. Users run programs on the nodes and
>> some nodes are busier than others. It seems logical that the scheduler
>> should submit the tasks to the less busy nodes, but that is not the case.
>> In the output of sinfo -N -o '%N %O %C' I can see that the jobs are allocated
>> to node11 with load 2.06, leaving node4 totally idle.
>> That somehow makes no sense to me.
>> node1    0.00  7/0/0/7
>> node2    0.26  7/0/0/7
>> node3    0.54  7/0/0/7
>> node4    0.07  0/7/0/7
>> node5    0.00  7/0/0/7
>> node6    0.01  7/0/0/7
>> node7    0.00  7/0/0/7
>> node8    0.01  7/0/0/7
>> node9    0.06  7/0/0/7
>> node10   0.11  7/0/0/7
>> node11   2.06  7/0/0/7
>> How can I configure slurm so that it fills the node with the minimum load
>> first?
>>
>>
>>
>>
>


[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
The file is copied fine. It is just the error message that is annoying.



On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist 
wrote:

> On 2017-03-15 17:52, Wensheng Deng wrote:
> > No, it does not help:
> >
> > $ scontrol show config |grep -i jobacct
> >
> > *JobAcct*GatherFrequency  = 30
> >
> > *JobAcct*GatherType   = *jobacct*_gather/cgroup
> >
> > *JobAcct*GatherParams = NoShared
> >
> >
> >
> >
> >
> > On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng  > > wrote:
> >
> > I think I tried that. let me try it again. Thank you!
> >
> > On Wed, Mar 15, 2017 at 11:43 AM, Chris Read  > > wrote:
> >
> >
> > We explicitly exclude shared usage from our measurement:
> >
> >
> > JobAcctGatherType=jobacct_gather/cgroup
> > JobAcctGatherParams=NoShare?
> >
> > Chris
> >
> >
> > 
> > From: Wensheng Deng mailto:w...@nyu.edu>>
> > Sent: 15 March 2017 10:28
> > To: slurm-dev
> > Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
> >
> > It should be (sorry):
> > we 'cp'ed a 5GB file from scratch to node local disk
> >
> >
> > On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng  >  > >> wrote:
> > Hello experts:
> >
> > We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> > 5GB job from scratch to node local disk, declared 5 GB memory
> > for the job, and saw error message as below although the file
> > was copied okay:
> >
> > slurmstepd: error: Exceeded job memory limit at some point.
> >
> > srun: error: [nodenameXXX]: task 0: Out Of Memory
> >
> > srun: Terminating job step 41.0
> >
> > slurmstepd: error: Exceeded job memory limit at some point.
> >
> >
> > From the cgroup document
> > https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> > 
> > Features:
> > - accounting anonymous pages, file caches, swap caches usage and
> > limiting them.
> >
> > It seems that cgroup charges memory "RSS + file caches" to user
> > process like 'cp', in our case, charged to user's jobs. swap is
> > off in this case. The file cache can be small or very big, and
> > it should not be charged to users'  batch jobs in my opinion.
> > How do other sites circumvent this issue? The Slurm version is
> > 16.05.4.
> >
> > Thank you and Best Regards.
> >
> >
> >
> >
>
> Could you set AllowedRamSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf
> to some big number? That way the job memory limit will be the cgroup soft
> limit, and the cgroup hard limit which is when the kernel will OOM kill the
> job would be "job_memory_limit * AllowedRamSpace" that is, some large value?
>
> --
> Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
> Aalto University School of Science, PHYS & NBE
> +358503841576 || janne.blomqv...@aalto.fi
>
>
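
For reference, a cgroup.conf sketch along the lines Janne suggests (the values
are illustrative only, not tested here):

    ConstrainRAMSpace=yes
    AllowedRAMSpace=400
    AllowedSwapSpace=400

With AllowedRAMSpace=400 the cgroup hard limit would be about four times the
job's requested memory, while the request itself acts as the soft limit.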


[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Shenglong Wang
What kind of error information will we get if applications try to use more
memory than declared, as in the test we did before?

Shenglong

> On Mar 17, 2017, at 9:41 AM, Wensheng Deng  wrote:
> 
> The file is copied fine. It is just the message error annoying. 
> 
> 
> 
> On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist  > wrote:
> On 2017-03-15 17:52, Wensheng Deng wrote:
> > No, it does not help:
> >
> > $ scontrol show config |grep -i jobacct
> >
> > *JobAcct*GatherFrequency  = 30
> >
> > *JobAcct*GatherType   = *jobacct*_gather/cgroup
> >
> > *JobAcct*GatherParams = NoShared
> >
> >
> >
> >
> >
> > On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng  > 
> > >> wrote:
> >
> > I think I tried that. let me try it again. Thank you!
> >
> > On Wed, Mar 15, 2017 at 11:43 AM, Chris Read  > 
> > >> wrote:
> >
> >
> > We explicitly exclude shared usage from our measurement:
> >
> >
> > JobAcctGatherType=jobacct_gather/cgroup
> > JobAcctGatherParams=NoShare?
> >
> > Chris
> >
> >
> > 
> > From: Wensheng Deng mailto:w...@nyu.edu> 
> > >>
> > Sent: 15 March 2017 10:28
> > To: slurm-dev
> > Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
> >
> > It should be (sorry):
> > we 'cp'ed a 5GB file from scratch to node local disk
> >
> >
> > On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng  wrote:
> > Hello experts:
> >
> > We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> > 5GB job from scratch to node local disk, declared 5 GB memory
> > for the job, and saw error message as below although the file
> > was copied okay:
> >
> > slurmstepd: error: Exceeded job memory limit at some point.
> >
> > srun: error: [nodenameXXX]: task 0: Out Of Memory
> >
> > srun: Terminating job step 41.0
> >
> > slurmstepd: error: Exceeded job memory limit at some point.
> >
> >
> > From the cgroup document
> > https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt 
> > 
> >  > >
> > Features:
> > - accounting anonymous pages, file caches, swap caches usage and
> > limiting them.
> >
> > It seems that cgroup charges memory "RSS + file caches" to user
> > process like 'cp', in our case, charged to user's jobs. swap is
> > off in this case. The file cache can be small or very big, and
> > it should not be charged to users'  batch jobs in my opinion.
> > How do other sites circumvent this issue? The Slurm version is
> > 16.05.4.
> >
> > Thank you and Best Regards.
> >
> >
> >
> >
> 
> Could you set AllowedRamSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to 
> some big number? That way the job memory limit will be the cgroup soft limit, 
> and the cgroup hard limit which is when the kernel will OOM kill the job 
> would be "job_memory_limit * AllowedRamSpace" that is, some large value?
> 
> --
> Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
> Aalto University School of Science, PHYS & NBE
> +358503841576  || janne.blomqv...@aalto.fi 
> 
> 
> 



[slurm-dev] RE: MaxJobs on association not being respected

2017-03-17 Thread Benjamin Redling

Re hi,

On 2017-03-17 03:01, Will Dennis wrote:
> My slurm.conf:
> https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw
> 
>> Are you sure the current running config is the one in the file?
>> Did you double check via "scontrol show config"
> 
> Yes, all params set in slurm.conf are showing correctly.

the sacctmgr output from your first mail ("ml-cluster") doesn't fit the
slurm.conf you provided ("test-cluster"). Can you clarify that?

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] sreport inconsistency

2017-03-17 Thread Marcin Stolarek
I've observed that the utilization and top users listings look inconsistent
to me.
Do I understand correctly that the percentage used by the users should sum to
the allocated percentage in the cluster utilization report?

cheers,
Marcin

# sreport cluster utilization Start=2017-03-01 -t percent

Cluster Utilization 2017-03-01T00:00:00 - 2017-03-16T23:59:59
Use reported in Percentage of Total

  Cluster  Allocated      Down  PLND Down     Idle  Reserved  Reported
---------  ---------  --------  ---------  -------  --------  --------
slurm_cl+     44.16%    34.18%      0.00%   20.87%     0.80%   100.00%

# sreport user topusage Start=2017-03-01  -t percent

Top 10 Users 2017-03-01T00:00:00 - 2017-03-16T23:59:59 (1382400 secs)
Use reported in Percentage of Total

  Cluster  Login  Proper Name  Account    Used  Energy
---------  -----  -----------  -------  ------  ------
slurm_cl+  dXXX   RXX          root     33.86%   0.00%
slurm_cl+  lXXX   LXX          root      0.44%   0.00%
slurm_cl+  sXXX   BXXl         root      0.20%   0.00%
slurm_cl+  fXXX   NXX          root      0.06%   0.00%
slurm_cl+  qXXX   SXX          root      0.00%   0.00%


[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Sam Gallop (NBI)
Hi,

I believe you can get that message ('Exceeded job memory limit at some point') 
even if the job finishes fine.  When the cgroup is created (by SLURM) it 
updates memory.limit_in_bytes with the job memory request coded in the job.  
During the life of the job the kernel updates a number of files within the 
cgroup, one of which is memory.usage_in_bytes - which is the current memory of 
the cgroup.  Periodically, SLURM will check if the cgroup has exceeded its 
limit (i.e. memory.limit_in_bytes) - the frequency of the check is probably set 
by JobAcctGatherFrequency.  It does this by checking if memory.failcnt is 
greater than one.  The memory.failcnt is incremented by the kernel each time 
memory.usage_in_bytes reaches the value set in memory.limit_in_bytes.

This is the code snippet that produces the error (found in task_cgroup_memory.c) 
…
extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
{
...
else if (failcnt_non_zero(&step_memory_cg,
  "memory.failcnt"))
/* reports the number of times that the
 * memory limit has reached the value set
 * in memory.limit_in_bytes.
 */
error("Exceeded step memory limit at some point.");
...
else if (failcnt_non_zero(&job_memory_cg,
  "memory.failcnt"))
error("Exceeded job memory limit at some point.");
...
}

Anyway, back to the point.  You can see this message even when the job does not
fail because the operating system counter (memory.failcnt) that SLURM checks doesn't
actually mean the memory limit has been exceeded, but that the memory limit has
been reached - a subtle but important difference.  It matters because OOM
doesn't terminate jobs upon reaching the memory limit, only if they exceed the
limit, so the job isn't terminated.  Note: other cgroup files like
memory.memsw.xxx are also in play if you are using swap space.

As to how to manage this: you can either not use cgroup and use an alternative
plugin, you could try the JobAcctGatherParams parameter NoOverMemoryKill
(the documentation says to use this with caution, see
https://slurm.schedmd.com/slurm.conf.html), or you can try to account for the
cache by using jobacct_gather/cgroup.  Unfortunately, because of a bug this
plugin does not report cache usage either.  I've contributed a bug/fix to address
this (https://bugs.schedmd.com/show_bug.cgi?id=3531).
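
For reference, a quick way to eyeball these counters for a running step from the
node itself (a sketch only - the cgroup mount point and the uid/job/step ids
below are illustrative and will differ per site):

    cat /sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0/memory.failcnt
    grep -Ew '^rss|^cache' /sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_0/memory.stat

The first file is the counter the check above reads; the second splits resident
memory (rss) from page cache, which is where a 'cp' of a large file shows up.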

---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development

From: Wensheng Deng [mailto:w...@nyu.edu]
Sent: 17 March 2017 13:42
To: slurm-dev 
Subject: [slurm-dev] Re: Slurm & CGROUP

The file is copied fine. It is just the message error annoying.



On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist 
mailto:janne.blomqv...@aalto.fi>> wrote:
On 2017-03-15 17:52, Wensheng Deng wrote:
> No, it does not help:
>
> $ scontrol show config |grep -i jobacct
>
> *JobAcct*GatherFrequency  = 30
>
> *JobAcct*GatherType   = *jobacct*_gather/cgroup
>
> *JobAcct*GatherParams = NoShared
>
>
>
>
>
> On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng 
> mailto:w...@nyu.edu>
> >> wrote:
>
> I think I tried that. let me try it again. Thank you!
>
> On Wed, Mar 15, 2017 at 11:43 AM, Chris Read 
> mailto:cr...@drw.com>
> >> wrote:
>
>
> We explicitly exclude shared usage from our measurement:
>
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherParams=NoShare?
>
> Chris
>
>
> 
> From: Wensheng Deng mailto:w...@nyu.edu> 
> >>
> Sent: 15 March 2017 10:28
> To: slurm-dev
> Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
> It should be (sorry):
> we 'cp'ed a 5GB file from scratch to node local disk
>
>
> On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:
> Hello experts:
>
> We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> 5GB job from scratch to node local disk, declared 5 GB memory
> for the job, and saw error message as below although the file
> was copied okay:
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
> srun: error: [nodenameXXX]: task 0: Out Of Memory
>
> srun: Terminating job step 41.0
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
>
> From the cgroup document
> https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> 
> Features:
> - accounting anonymous pages, file caches, swap

[slurm-dev] RE: MaxJobs on association not being respected

2017-03-17 Thread Will Dennis
Yes - I anonymize certain details of what I throw up on paste sites... that's 
one of those :)

-Original Message-
From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.de] 
Sent: Friday, March 17, 2017 9:55 AM
To: slurm-dev
Subject: [slurm-dev] RE: MaxJobs on association not being respected


Re hi,

On 2017-03-17 03:01, Will Dennis wrote:
> My slurm.conf:
> https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYh
> yRLivL9gydE=/raw
> 
>> Are you sure the current running config is the one in the file?
>> Did you double check via "scontrol show config"
> 
> Yes, all params set in slurm.conf are showing correctly.

the sacctmgr output from your first mail ("ml-cluster") doesn't fit the 
slurm.conf you provided ("test-cluster"). Can you clarify that?

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] Re: Fwd: Dependency Problem In Full Queue

2017-03-17 Thread Benjamin Redling

Good examples:
https://hpc.nih.gov/docs/job_dependencies.html
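
A minimal sketch of chaining two jobs that way (the script names are
hypothetical):

    jobid=$(sbatch --parsable first_step.sbatch)
    sbatch --dependency=afterany:${jobid} second_step.sbatch

sbatch --parsable prints only the job id, so the dependent job is submitted
exactly once against it.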

BR

On 2017-03-15 17:37, Álvaro pc wrote:
> Hi again!
> 
> I would really like to know about the behaviour of the --dependency argument.
> 
> Does nobody know anything?
> 
> *Álvaro Ponce Cabrera.*
> 
> 
> 2017-03-14 12:31 GMT+01:00 Álvaro pc  >:
> 
> Hi,
> 
> I'm having problems trying to launch jobs with a dependency on another
> one.
> 
> I'm using the '--dependency=afterany:Job_ID' argument.
> 
> The problem happens when the queue is full and the new job, which
> depends on another one (already running), can't enter the queue
> and needs to wait.
> Instead of waiting properly to enter the queue, the job tries to enter
> thousands of times per minute.
> 
> All the tries seem to be waiting to enter the queue... Here is
> a piece of the queue where you can see the problem:
> 
>  20217  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20218  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20219  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20220  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20221  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20222  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20223  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20224  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>  20225  UPO  Macs2_DM  alvaropc  PD         0:00  1  (Dependency)
>   4907  UPO  notebook  panos     R   64-01:48:56  1  nodo01
>   6454  UPO  valinomy  jraviles  R    7-05:45:32  1  nodo10
>   6492  UPO  input_ra  rbueper   R   13-08:44:42  1  nodo01
>   6493  UPO  input_ra  rbueper   R   13-08:44:42  1  nodo05
>   6823  UPO  FELIX-No  said      R   13-09:34:42  1  nodo06
>   7219  UPO  input_ra  rbueper   R   13-08:44:42  1  nodo05
> 
> 
> 
> In addition I'm obtaining this error from the log/out file: 'sbatch:
> error: Slurm temporarily unable to accept job, sleeping and retrying'.
> The error is repeated thousands of times too, obviously, one for
> each attempt of the job to enter the queue...
> 
> I just want to launch ONE job which waits until another one ends...
> 
> Any ideas?
> 
> Thank you so much.
> 
> 
> 
> *Álvaro Ponce Cabrera.*
> 
> 
> 

-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
For the case of the simple 'cp' test job copying a 5 GB file, the underlying
issue is how we distinguish the memory used: which part is RSS and which part
is file cache. cgroup reports them as one sum: memory.memsw.* (we have swap
turned off). The file cache can be small or very big depending on what's
required and what's available at that point in time. The file cache should not
be charged to users' jobs in the batch job context, in my opinion. Thank you!



On Fri, Mar 17, 2017 at 10:47 AM, Sam Gallop (NBI) 
wrote:

> Hi,
>
>
>
> I believe you can get that message ('Exceeded job memory limit at some
> point') even if the job finishes fine.  When the cgroup is created (by
> SLURM) it updates memory.limit_in_bytes with the job memory request coded
> in the job.  During the life of the job the kernel updates a number of
> files within the cgroup, one of which is memory.usage_in_bytes - which is
> the current memory of the cgroup.  Periodically, SLURM will check if the
> cgroup has exceeded its limit (i.e. memory.limit_in_bytes) - the frequency
> of the check is probably set by JobAcctGatherFrequency.  It does this by
> checking if memory.failcnt is greater than one.  The memory.failcnt is
> incremented by the kernel each time memory.usage_in_bytes reaches the value
> set in memory.limit_in_bytes.
>
>
>
> This is the code snippet the produces the error (found in
> task_cgroup_memory.c) …
>
> extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
>
> {
>
> ...
>
> else if (failcnt_non_zero(&step_memory_cg,
>
>   "memory.failcnt"))
>
> /* reports the number of times that the
>
>  * memory limit has reached the value set
>
>  * in memory.limit_in_bytes.
>
>  */
>
> error("Exceeded step memory limit at some point.");
>
> ...
>
> else if (failcnt_non_zero(&job_memory_cg,
>
>   "memory.failcnt"))
>
> error("Exceeded job memory limit at some point.");
>
> ...
>
> }
>
>
>
> Anyway, back to the point.  You can see this message and the job not fail
> because the operating system counter (memory.failcnt) that SLURM checks
> doesn't actually mean the memory limit has been exceeded but means the
> memory limit has been reached - a subtle but an important difference.
> Important because OOM doesn't terminate jobs upon reaching the memory
> limit, only if they exceed the limit, it means the job isn't terminated.
> Note: other cgroup files like memory.memsw.xxx are also in play if you are
> using swap space
>
>
>
> As to how to manage this.  You can either not use cgroup and use an
> alternative plugin, you could also try the JobAcctGatherParams parameter
> NoOverMemoryKill (the documentation say use this with caution, see
> https://slurm.schedmd.com/slurm.conf.html), or you can try and account
> for the cache by using the jobacct_gather/cgroup.  Unfortunately, because
> of a bug this plugin does report cache usage either.  I've contributed a
> bug/fix to address this (https://bugs.schedmd.com/show_bug.cgi?id=3531).
>
>
>
> *---*
>
> *Samuel Gallop*
>
> *Computing infrastructure for Science*
>
> *CiS Support & Development*
>
>
>
> *From:* Wensheng Deng [mailto:w...@nyu.edu]
> *Sent:* 17 March 2017 13:42
> *To:* slurm-dev 
> *Subject:* [slurm-dev] Re: Slurm & CGROUP
>
>
>
> The file is copied fine. It is just the message error annoying.
>
>
>
>
>
>
>
> On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist 
> wrote:
>
> On 2017-03-15 17:52, Wensheng Deng wrote:
> > No, it does not help:
> >
> > $ scontrol show config |grep -i jobacct
> >
> > *JobAcct*GatherFrequency  = 30
> >
> > *JobAcct*GatherType   = *jobacct*_gather/cgroup
> >
> > *JobAcct*GatherParams = NoShared
> >
> >
> >
> >
> >
> > On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng  > > wrote:
> >
> > I think I tried that. let me try it again. Thank you!
> >
> > On Wed, Mar 15, 2017 at 11:43 AM, Chris Read  > > wrote:
> >
> >
> > We explicitly exclude shared usage from our measurement:
> >
> >
> > JobAcctGatherType=jobacct_gather/cgroup
> > JobAcctGatherParams=NoShare?
> >
> > Chris
> >
> >
> > 
> > From: Wensheng Deng mailto:w...@nyu.edu>>
> > Sent: 15 March 2017 10:28
> > To: slurm-dev
> > Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
> >
> > It should be (sorry):
> > we 'cp'ed a 5GB file from scratch to node local disk
> >
> >
> > On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng  >  > >> wrote:
> > Hello experts:
> >
> > We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> > 5GB job from scratch to node local disk, declared 5 GB memory
> > for the job, and saw error message as below alt

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Sam Gallop (NBI)
Yes the memory.usage_in_bytes is one sum, but in memory.stat the two figures 
are split …

# cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat | grep 
-Ew "^rss|^cache"
cache 16758034432
rss 663552

The fix (https://bugs.schedmd.com/show_bug.cgi?id=3531) attempts to address 
this by recording both.

You can argue either way about whether the cache should be charged to a user's 
jobs.  Based on your stance you may wish to try …
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
TaskPluginParam=Sched

I've not tried this myself, and the documentation states proctrack/linuxproc … 
can fail to identify all processes associated with a job since processes can 
become a child of the init process (when the parent process terminates) or 
change their process group.

My personal take is if the user used it, it should be accounted for.

---
Sam Gallop


Have you tried looking through our Documentation 
Portal
Our documentation isn’t all text, check out our Video 
Tutorials
Have you tried looking through CiS Service Desk
Keep up to date with availability at CiS Service 
Status
More information on our HPC, Linux, Storage at HPC 
Support Site


To speak to us about technical issues feel free to call the Computing 
infrastructure for Science team on group phone extension 2003.
If your request is urgent, please contact the NBIP Computing Helpdesk at 
computing.helpd...@nbi.ac.uk or call phone 
extension 1234.

From: Wensheng Deng [mailto:w...@nyu.edu]
Sent: 17 March 2017 15:06
To: slurm-dev 
Subject: [slurm-dev] Re: Slurm & CGROUP

For the case of the simple 'cp' test job which copying a 5 GB file, the issue 
at the bottom is that how do we distinguish memories used: which is from RSS, 
which is from file cache. cgroup reports them as one sum: memory.memsw.* (we 
turn on swap off). The file cache can be small or very big depending on what's 
required and what's available at the time point. The file cache should not be 
charged to users' jobs in the batch job context in my opinion. Thank you!



On Fri, Mar 17, 2017 at 10:47 AM, Sam Gallop (NBI) 
mailto:sam.gal...@nbi.ac.uk>> wrote:
Hi,

I believe you can get that message ('Exceeded job memory limit at some point') 
even if the job finishes fine.  When the cgroup is created (by SLURM) it 
updates memory.limit_in_bytes with the job memory request coded in the job.  
During the life of the job the kernel updates a number of files within the 
cgroup, one of which is memory.usage_in_bytes - which is the current memory of 
the cgroup.  Periodically, SLURM will check if the cgroup has exceeded its 
limit (i.e. memory.limit_in_bytes) - the frequency of the check is probably set 
by JobAcctGatherFrequency.  It does this by checking if memory.failcnt is 
greater than one.  The memory.failcnt is incremented by the kernel each time 
memory.usage_in_bytes reaches the value set in memory.limit_in_bytes.

This is the code snippet the produces the error (found in task_cgroup_memory.c) 
…
extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
{
...
else if (failcnt_non_zero(&step_memory_cg,
  "memory.failcnt"))
/* reports the number of times that the
 * memory limit has reached the value set
 * in memory.limit_in_bytes.
 */
error("Exceeded step memory limit at some point.");
...
else if (failcnt_non_zero(&job_memory_cg,
  "memory.failcnt"))
error("Exceeded job memory limit at some point.");
...
}

Anyway, back to the point.  You can see this message and the job not fail 
because the operating system counter (memory.failcnt) that SLURM checks doesn't 
actually mean the memory limit has been exceeded but means the memory limit has 
been reached - a subtle but an important difference.  Important because OOM 
doesn't terminate jobs upon reaching the memory limit, only if they exceed the 
limit, it means the job isn't terminated.  Note: other cgroup files like 
memory.memsw.xxx are also in play if you are using swap space

As to how to manage this.  You can either not use cgroup and use an alternative 
plugin, you could also try the JobAcctGatherParams parameter NoOverMemoryKill 
(the documentation say use this with caution, see 
https://slurm.schedmd.com/slurm.conf.html), or you can try and account for the 
cache by using the jobacct_gather/cgroup.  Unfortunately, because of a bug this 
plugin does report cache usage either.  I've contributed a bug/fix to address 
this (https://bugs.schedmd.com/show_bug.cgi?id=3531).

--

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
Thank you. I had some doubt about the accuracy of memory.stat. Sam, what
slurm conf parameters do you recommend to try your fix in bug #3531? There
are three places where the cgroup plugin could be used:

JobAcctGatherType   = jobacct_gather/cgroup

ProctrackType   = proctrack/cgroup

TaskPlugin  = task/cgroup



On Fri, Mar 17, 2017 at 11:30 AM, Sam Gallop (NBI) 
wrote:

> Yes the memory.usage_in_bytes is one sum, but in memory.stat the two
> figures are split …
>
>
>
> # cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat |
> grep -Ew "^rss|^cache"
>
> cache 16758034432
>
> rss 663552
>
>
>
> The fix (https://bugs.schedmd.com/show_bug.cgi?id=3531) attempts to
> address this by recording both.
>
>
>
> You can argue either way about whether the cache should be charged to a
> users' jobs.  Based on your stance you may wish to try …
>
> ProctrackType=proctrack/linuxproc
>
> TaskPlugin=task/affinity
>
> TaskPluginParam=Sched
>
>
>
> I've not try this myself, and the documentation states proctrack/linuxproc
> … can fail to identify all processes associated with a job since processes
> can become a child of the init process (when the parent process terminates)
> or change their process group.
>
>
>
> My personal take is if the user used it, it should be accounted for.
>
>
>
> ---
>
> Sam Gallop
>
>
>
>
> Have you tried looking through our *Documentation Portal*
> 
>
> Our documentation isn’t all text, check out our *Video Tutorials*
> 
>
> Have you tried looking through *CiS Service Desk*
> 
>
> Keep up to date with availability at *CiS Service Status*
> 
>
> More information on our HPC, Linux, Storage at *HPC Support*
> * Site*
>
>
>
> To speak to us about technical issues feel free to call the *Computing
> infrastructure for Science* team on *group phone extension * *2003**.*
>
> If your request is urgent, please contact the *NBIP Computing Helpdesk*
> at computing.helpd...@nbi.ac.uk or call *phone extension **1234**.*
>
>
>
> *From:* Wensheng Deng [mailto:w...@nyu.edu]
> *Sent:* 17 March 2017 15:06
>
> *To:* slurm-dev 
> *Subject:* [slurm-dev] Re: Slurm & CGROUP
>
>
>
> For the case of the simple 'cp' test job which copying a 5 GB file, the
> issue at the bottom is that how do we distinguish memories used: which is
> from RSS, which is from file cache. cgroup reports them as one sum:
> memory.memsw.* (we turn on swap off). The file cache can be small or very
> big depending on what's required and what's available at the time point.
> The file cache should not be charged to users' jobs in the batch job
> context in my opinion. Thank you!
>
>
>
>
>
>
>
> On Fri, Mar 17, 2017 at 10:47 AM, Sam Gallop (NBI) 
> wrote:
>
> Hi,
>
>
>
> I believe you can get that message ('Exceeded job memory limit at some
> point') even if the job finishes fine.  When the cgroup is created (by
> SLURM) it updates memory.limit_in_bytes with the job memory request coded
> in the job.  During the life of the job the kernel updates a number of
> files within the cgroup, one of which is memory.usage_in_bytes - which is
> the current memory of the cgroup.  Periodically, SLURM will check if the
> cgroup has exceeded its limit (i.e. memory.limit_in_bytes) - the frequency
> of the check is probably set by JobAcctGatherFrequency.  It does this by
> checking if memory.failcnt is greater than one.  The memory.failcnt is
> incremented by the kernel each time memory.usage_in_bytes reaches the value
> set in memory.limit_in_bytes.
>
>
>
> This is the code snippet the produces the error (found in
> task_cgroup_memory.c) …
>
> extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
>
> {
>
> ...
>
> else if (failcnt_non_zero(&step_memory_cg,
>
>   "memory.failcnt"))
>
> /* reports the number of times that the
>
>  * memory limit has reached the value set
>
>  * in memory.limit_in_bytes.
>
>  */
>
> error("Exceeded step memory limit at some point.");
>
> ...
>
> else if (failcnt_non_zero(&job_memory_cg,
>
>   "memory.failcnt"))
>
> error("Exceeded job memory limit at some point.");
>
> ...
>
> }
>
>
>
> Anyway, back to the point.  You can see this message and the job not fail
> because the operating system counter (memory.failcnt) that SLURM checks
> doesn't actually mean the memory limit has been exceeded but means the
> memory limit has been reached - a subtle but an important difference.
> Important because OOM doesn't terminate jobs upon reaching the memory
> limit, only if they excee

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Sam Gallop (NBI)
Yep, that's it.  While the fix is specific to the 
JobAcctGatherType=jobacct_gather/cgroup plugin, you would need to be using 
ProctrackType=proctrack/cgroup &
TaskPlugin=task/cgroup for SLURM to be using cgroups.

---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development

+44 (0)1603 450818
sam.gal...@nbi.ac.uk

NBI Partnership Ltd.
Norwich Research Park
Colney Lane, Norwich
NR4 7UH

The NBI Partnership Ltd provides non-scientific services to the Earlham 
Institute, the Institute of Food Research, the John Innes Centre and The 
Sainsbury Laboratory

From: Wensheng Deng [mailto:w...@nyu.edu]
Sent: 17 March 2017 15:39
To: slurm-dev 
Subject: [slurm-dev] Re: Slurm & CGROUP

Thank you. I had some doubt about the accuracy of memory.stat. Sam, what slurm 
conf parameters do you recommend to try your fix in bug #3531? There are three 
places where cgroup plugin could be used:

JobAcctGatherType   = jobacct_gather/cgroup

ProctrackType   = proctrack/cgroup

TaskPlugin  = task/cgroup



On Fri, Mar 17, 2017 at 11:30 AM, Sam Gallop (NBI) 
mailto:sam.gal...@nbi.ac.uk>> wrote:
Yes the memory.usage_in_bytes is one sum, but in memory.stat the two figures 
are split …

# cat /sys/fs/cgroup/memory/slurm/uid_11253/job_183/step_0/memory.stat | grep 
-Ew "^rss|^cache"
cache 16758034432
rss 663552

The fix (https://bugs.schedmd.com/show_bug.cgi?id=3531) attempts to address 
this by recording both.

You can argue either way about whether the cache should be charged to a users' 
jobs.  Based on your stance you may wish to try …
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
TaskPluginParam=Sched

I've not try this myself, and the documentation states proctrack/linuxproc … 
can fail to identify all processes associated with a job since processes can 
become a child of the init process (when the parent process terminates) or 
change their process group.

My personal take is if the user used it, it should be accounted for.

---
Sam Gallop


Have you tried looking through our Documentation 
Portal
Our documentation isn’t all text, check out our Video 
Tutorials
Have you tried looking through CiS Service Desk
Keep up to date with availability at CiS Service 
Status
More information on our HPC, Linux, Storage at HPC 
Support Site


To speak to us about technical issues feel free to call the Computing 
infrastructure for Science team on group phone extension 2003.
If your request is urgent, please contact the NBIP Computing Helpdesk at 
computing.helpd...@nbi.ac.uk or call phone 
extension 1234.

From: Wensheng Deng [mailto:w...@nyu.edu]
Sent: 17 March 2017 15:06

To: slurm-dev mailto:slurm-dev@schedmd.com>>
Subject: [slurm-dev] Re: Slurm & CGROUP

For the case of the simple 'cp' test job which copying a 5 GB file, the issue 
at the bottom is that how do we distinguish memories used: which is from RSS, 
which is from file cache. cgroup reports them as one sum: memory.memsw.* (we 
turn on swap off). The file cache can be small or very big depending on what's 
required and what's available at the time point. The file cache should not be 
charged to users' jobs in the batch job context in my opinion. Thank you!



On Fri, Mar 17, 2017 at 10:47 AM, Sam Gallop (NBI) 
mailto:sam.gal...@nbi.ac.uk>> wrote:
Hi,

I believe you can get that message ('Exceeded job memory limit at some point') 
even if the job finishes fine.  When the cgroup is created (by SLURM) it 
updates memory.limit_in_bytes with the job memory request coded in the job.  
During the life of the job the kernel updates a number of files within the 
cgroup, one of which is memory.usage_in_bytes - which is the current memory of 
the cgroup.  Periodically, SLURM will check if the cgroup has exceeded its 
limit (i.e. memory.limit_in_bytes) - the frequency of the check is probably set 
by JobAcctGatherFrequency.  It does this by checking if memory.failcnt is 
greater than one.  The memory.failcnt is incremented by the kernel each time 
memory.usage_in_bytes reaches the value set in memory.limit_in_bytes.

This is the code snippet the produces the error (found in task_cgroup_memory.c) 
…
extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
{
...
else if (failcnt_non_zero(&step_memory_cg,
  "memory.failcnt"))
/* reports the number of times that the
 * memory limit has reached the value set
 * in memory.limit_in_bytes.
 */
error("Exceeded step memory limit at some point.");
...
 

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Ryan Cox
usage_in_bytes is not actually usage in bytes, by the way.  It's often 
close but I have seen wildly different values.  See 
https://lkml.org/lkml/2011/3/28/93 and 
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section 
5.5.  memory.stat is what you want for accurate data.


I wrote the code you referenced below.  Now that I know more about 
failcnt, it does have some corner cases that aren't ideal.  If I were to 
start over I would use cgroup.event_control to get OOM events, such as 
in 
https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oom_notifierd.c 
or https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section 
9.  At the time I didn't really feel like learning how to add and clean 
up a thread or something that would listen for those events.


If someone wants to do the work that would be great :). I have no plans 
to do so myself for the time being.


Ryan

On 03/17/2017 08:46 AM, Sam Gallop (NBI) wrote:

Re: [slurm-dev] Re: Slurm & CGROUP

Hi,

I believe you can get that message ('Exceeded job memory limit at some 
point') even if the job finishes fine.  When the cgroup is created (by 
SLURM) it updates memory.limit_in_bytes with the job memory request 
coded in the job.  During the life of the job the kernel updates a 
number of files within the cgroup, one of which is 
memory.usage_in_bytes - which is the current memory of the cgroup.  
Periodically, SLURM will check if the cgroup has exceeded its limit 
(i.e. memory.limit_in_bytes) - the frequency of the check is probably 
set by JobAcctGatherFrequency.  It does this by checking if 
memory.failcnt is greater than one.  The memory.failcnt is incremented 
by the kernel each time memory.usage_in_bytes reaches the value set in 
memory.limit_in_bytes.


This is the code snippet the produces the error (found in 
task_cgroup_memory.c) …


extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)

{

...

else if (failcnt_non_zero(&step_memory_cg,

"memory.failcnt"))

/* reports the number of times that the

* memory limit has reached the value set

* in memory.limit_in_bytes.

*/

error("Exceeded step memory limit at some point.");

...

else if (failcnt_non_zero(&job_memory_cg,

"memory.failcnt"))

error("Exceeded job memory limit at some point.");

...

}

Anyway, back to the point.  You can see this message and the job not 
fail because the operating system counter (memory.failcnt) that SLURM 
checks doesn't actually mean the memory limit has been exceeded but 
means the memory limit has been reached - a subtle but an important 
difference.  Important because OOM doesn't terminate jobs upon 
reaching the memory limit, only if they exceed the limit, it means the 
job isn't terminated.  Note: other cgroup files like memory.memsw.xxx 
are also in play if you are using swap space


As to how to manage this.  You can either not use cgroup and use an 
alternative plugin, you could also try the JobAcctGatherParams 
parameter NoOverMemoryKill (the documentation say use this with 
caution, see https://slurm.schedmd.com/slurm.conf.html), or you can 
try and account for the cache by using the jobacct_gather/cgroup.  
Unfortunately, because of a bug this plugin does report cache usage 
either.  I've contributed a bug/fix to address this 
(https://bugs.schedmd.com/show_bug.cgi?id=3531).


---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development

*From:*Wensheng Deng [mailto:w...@nyu.edu]
*Sent:* 17 March 2017 13:42
*To:* slurm-dev 
*Subject:* [slurm-dev] Re: Slurm & CGROUP

The file is copied fine. It is just the message error annoying.

On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist 
mailto:janne.blomqv...@aalto.fi>> wrote:


On 2017-03-15 17:52, Wensheng Deng wrote:
> No, it does not help:
>
> $ scontrol show config |grep -i jobacct
>
> *JobAcct*GatherFrequency  = 30
>
> *JobAcct*GatherType   = *jobacct*_gather/cgroup
>
> *JobAcct*GatherParams = NoShared
>
>
>
>
>
> On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng mailto:w...@nyu.edu>
> >> wrote:
>
> I think I tried that. let me try it again. Thank you!
>
> On Wed, Mar 15, 2017 at 11:43 AM, Chris Read mailto:cr...@drw.com>
> >> wrote:
>
>
> We explicitly exclude shared usage from our measurement:
>
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherParams=NoShare?
>
> Chris
>
>
> 
> From: Wensheng Deng mailto:w...@nyu.edu>
>>
> Sent: 15 March 2017 10:28
> To: slurm-dev
> Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
> It should be (sorry):
> we 'cp'ed a 5GB file from scratch to

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Nicholas McCollum
+1 : I tried getting oom_notifierd working in CentOS7 but was
unsuccessful.  I'd be greatly interested if anyone has gotten this to
work.  I've ported some of the other BYU cgroup fencing tools over to
CentOS 7 and added minor functionality improvements if anyone is
interested.

Thank you to Ryan Cox for these excellent tools.


-- 
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Fri, 2017-03-17 at 08:59 -0700, Ryan Cox wrote:
> usage_in_bytes is not actually usage in bytes, by the way.  It's
> often close but I have seen wildly different values.  See https://lkm
> l.org/lkml/2011/3/28/93 and
> https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
> 5.5.  memory.stat is what you want for accurate data.
> 
> I wrote the code you referenced below.  Now that I know more about
> failcnt, it does have some corner cases that aren't ideal.  If I were
> to start over I would use cgroup.event_control to get OOM events,
> such as in https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oo
> m_notifierd.c or https://www.kernel.org/doc/Documentation/cgroup-
> v1/memory.txt section 9.  At the time I didn't really feel like
> learning how to add and clean up a thread or something that that
> would listen for those events.
> 
> If someone wants to do the work that would be great :). I have no
> plans to do so myself for the time being.
> 
> Ryan
> 
> On 03/17/2017 08:46 AM, Sam Gallop (NBI) wrote:
> > Hi,
> >  
> > I believe you can get that message ('Exceeded job memory limit at
> > some point') even if the job finishes fine.  When the cgroup is
> > created (by SLURM) it updates memory.limit_in_bytes with the job
> > memory request coded in the job.  During the life of the job the
> > kernel updates a number of files within the cgroup, one of which is
> > memory.usage_in_bytes - which is the current memory of the cgroup. 
> > Periodically, SLURM will check if the cgroup has exceeded its limit
> > (i.e. memory.limit_in_bytes) - the frequency of the check is
> > probably set by JobAcctGatherFrequency.  It does this by checking
> > if memory.failcnt is greater than one.  The memory.failcnt is
> > incremented by the kernel each time memory.usage_in_bytes reaches
> > the value set in memory.limit_in_bytes.
> >  
> > This is the code snippet the produces the error (found in
> > task_cgroup_memory.c) …
> > extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
> > {
> > ...
> >     else if (failcnt_non_zero(&step_memory_cg,
> >   "memory.failcnt"))
> >     /* reports the number of times that the
> >  * memory limit has reached the value set
> >  * in memory.limit_in_bytes.
> >  */
> >     error("Exceeded step memory limit at some point.");
> > ...
> >     else if (failcnt_non_zero(&job_memory_cg,
> >   "memory.failcnt"))
> >     error("Exceeded job memory limit at some point.");
> > ...
> > }
> >  
> > Anyway, back to the point.  You can see this message and the job
> > not fail because the operating system counter (memory.failcnt) that
> > SLURM checks doesn't actually mean the memory limit has been
> > exceeded but means the memory limit has been reached - a subtle but
> > an important difference.  Important because OOM doesn't terminate
> > jobs upon reaching the memory limit, only if they exceed the limit,
> > it means the job isn't terminated.  Note: other cgroup files like
> > memory.memsw.xxx are also in play if you are using swap space
> >  
> > As to how to manage this.  You can either not use cgroup and use an
> > alternative plugin, you could also try the JobAcctGatherParams
> > parameter NoOverMemoryKill (the documentation say use this with
> > caution, see https://slurm.schedmd.com/slurm.conf.html), or you can
> > try and account for the cache by using the jobacct_gather/cgroup. 
> > Unfortunately, because of a bug this plugin does report cache usage
> > either.  I've contributed a bug/fix to address this (https://bugs.s
> > chedmd.com/show_bug.cgi?id=3531).
> >  
> > ---
> > Samuel Gallop
> > Computing infrastructure for Science
> > CiS Support & Development
> >  
> > From: Wensheng Deng [mailto:w...@nyu.edu] 
> > Sent: 17 March 2017 13:42
> > To: slurm-dev 
> > Subject: [slurm-dev] Re: Slurm & CGROUP
> >  
> > The file is copied fine. It is just the message error annoying. 
> >  
> >  
> >  
> > On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist  > alto.fi> wrote:
> > On 2017-03-15 17:52, Wensheng Deng wrote:
> > > No, it does not help:
> > >
> > > $ scontrol show config |grep -i jobacct
> > >
> > > *JobAcct*GatherFrequency  = 30
> > >
> > > *JobAcct*GatherType       = *jobacct*_gather/cgroup
> > >
> > > *JobAcct*GatherParams     = NoShared
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng  > > > wrote:
> > >
> > >     I think I tried that. let me try it aga

[slurm-dev] Exclusive socket configuration help

2017-03-17 Thread Cyrus Proctor

Hello,

I currently have a small cluster for testing. Each compute node contains 
2 sockets with 14 cores per CPU and a total of 128 GB RAM. I would like 
to set up Slurm such that two jobs can simultaneously share one compute 
node, effectively giving 1 socket (with binding) and half the total 
memory to each job.
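
For illustration only, a sketch of the per-job request this sharing implies
(the values are illustrative, not a verified fix for the behaviour described
below):

    sbatch --sockets-per-node=1 --ntasks=14 --mem=64265 batch.slurm

i.e. one 14-core socket and roughly half of the node's 128 GB per job.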


I've tried several iterations of settings, to no avail. It seems that 
whatever I try, I am still only allowed to run one job per node (blocked 
by "resources" reason). I am running Slurm 17.02.1-2, and I am attaching 
my slurm.conf as well as cgroup.conf files. System information includes:

# uname -r
3.10.0-514.10.2.el7.x86_64
# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)

I am also attaching logs for slurmd (slurmd.d01.log) and slurmctld 
(slurmctld.log) as I submit three jobs (batch.slurm) in rapid 
succession. With two compute nodes available, I would hope that all 
three start together. Instead, two begin and one waits until a node 
becomes idle to start.


There is likely extra "crud" in the config files simply from prior 
failed attempts. I'm happy to take out / reconfigure as necessary but 
not sure what exactly is the right combination of settings to get this 
to work. I'm hoping that's where you all can help.


Thanks,
Cyrus
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=50
MaxRAMPercent=50
TaskAffinity=yes
ClusterName=linux
ControlMachine=cowboy
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
TaskPlugin=task/cgroup,task/affinity
TaskPluginParam=autobind=sockets,Verbose
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
FastSchedule=0
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=10
#PriorityWeightAge=1000
#PriorityWeightPartition=1
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=9
#SlurmctldLogFile=
SlurmdDebug=9
#SlurmdLogFile=
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
##AcctGatherEnergyType=acct_gather_energy/ipmi
##AcctGatherNodeFreq=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=cowboy
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=password
AccountingStorageUser=slurm
#
# COMPUTE NODES
#PropagateResourceLimitsExcept=MEMLOCK
SlurmdLogFile=/var/log/slurm.log
SlurmctldLogFile=/var/log/slurmctld.log
Epilog=/etc/slurm/slurm.epilog.clean
NodeName=d0[1,2] Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 State=UNKNOWN
PartitionName=normal  Nodes=d0[1,2] Default=YES OverSubscribe=FORCE:2 SelectTypeParameters=CR_Socket_Memory QoS=part_shared MaxCPUsPerNode=28 DefMemPerNode=128530 MaxMemPerNode=128530 MaxTime=48:00:00 State=UP
ReturnToService=1
[root@cowboy ~]# tail -f /var/log/slurmctld.log 
[2017-03-17T11:20:08.500] debug:  sched: Running job scheduler
[2017-03-17T11:20:08.571] debug3: Processing RPC: REQUEST_JOB_INFO from uid=0
[2017-03-17T11:20:08.573] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2017-03-17T11:20:08.573] debug2: _slurm_rpc_dump_partitions, size=189 usec=69
[2017-03-17T11:20:10.582] debug3: Processing RPC: REQUEST_JOB_INFO from uid=0
[2017-03-17T11:20:10.584] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2017-03-17T11:20:10.584] debug2: _slurm_rpc_dump_partitions, size=189 usec=59
[2017-03-17T11:20:12.593] debug3: Processing RPC: REQUEST_JOB_INFO from uid=0
[2017-03-17T11:20:12.594] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2017-03-17T11:20:12.595] debug2: _slurm_rpc_dump_partitions, size=189 usec=59
[2017-03-17T11:20:14.603] debug3: Processing RPC: REQUEST_JOB_INFO from uid=0
[2017-03-17T11:20:14.605] debug2: Processing RPC: REQUEST_PARTITION_INFO uid=0
[2017-03-17T11:20:14.605] debug2: _slurm_rpc_dump_partitions, size=189 usec=62
[2017-03-17T11:20:16.614] debug3: Processing RPC: REQUEST_JOB_INFO from uid=0
[2017-03-17T11:20:16.616] debug2: Processing 

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
Thank you for the descriptions, community! When tasks in step_extern and
tasks in step_batch are active at the same time, how is the memory
accounting and summary done? When memory is over the limit, which one will be
killed?


On Fri, Mar 17, 2017 at 12:18 PM, Nicholas McCollum 
wrote:

> +1 : I tried getting oom_notifierd working in CentOS7 but was
> unsuccessful.  I'd be greatly interested if anyone has gotten this to
> work.  I've ported some of the other BYU cgroup fencing tools over to
> CentOS 7 and added minor functionality improvements if anyone is
> interested.
>
> Thank you to Ryan Cox for these excellent tools.
>
>
> --
> Nicholas McCollum
> HPC Systems Administrator
> Alabama Supercomputer Authority
>
> On Fri, 2017-03-17 at 08:59 -0700, Ryan Cox wrote:
> > usage_in_bytes is not actually usage in bytes, by the way.  It's
> > often close but I have seen wildly different values.  See https://lkm
> > l.org/lkml/2011/3/28/93 and
> > https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
> > 5.5.  memory.stat is what you want for accurate data.
> >
> > I wrote the code you referenced below.  Now that I know more about
> > failcnt, it does have some corner cases that aren't ideal.  If I were
> > to start over I would use cgroup.event_control to get OOM events,
> > such as in https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oo
> > m_notifierd.c or https://www.kernel.org/doc/Documentation/cgroup-
> > v1/memory.txt section 9.  At the time I didn't really feel like
> > learning how to add and clean up a thread or something that that
> > would listen for those events.
> >
> > If someone wants to do the work that would be great :). I have no
> > plans to do so myself for the time being.
> >
> > Ryan
> >
> > On 03/17/2017 08:46 AM, Sam Gallop (NBI) wrote:
> > > Hi,
> > >
> > > I believe you can get that message ('Exceeded job memory limit at
> > > some point') even if the job finishes fine.  When the cgroup is
> > > created (by SLURM) it updates memory.limit_in_bytes with the job
> > > memory request coded in the job.  During the life of the job the
> > > kernel updates a number of files within the cgroup, one of which is
> > > memory.usage_in_bytes - which is the current memory of the cgroup.
> > > Periodically, SLURM will check if the cgroup has exceeded its limit
> > > (i.e. memory.limit_in_bytes) - the frequency of the check is
> > > probably set by JobAcctGatherFrequency.  It does this by checking
> > > if memory.failcnt is greater than one.  The memory.failcnt is
> > > incremented by the kernel each time memory.usage_in_bytes reaches
> > > the value set in memory.limit_in_bytes.
> > >
> > > This is the code snippet the produces the error (found in
> > > task_cgroup_memory.c) …
> > > extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
> > > {
> > > ...
> > > else if (failcnt_non_zero(&step_memory_cg,
> > >   "memory.failcnt"))
> > > /* reports the number of times that the
> > >  * memory limit has reached the value set
> > >  * in memory.limit_in_bytes.
> > >  */
> > > error("Exceeded step memory limit at some point.");
> > > ...
> > > else if (failcnt_non_zero(&job_memory_cg,
> > >   "memory.failcnt"))
> > > error("Exceeded job memory limit at some point.");
> > > ...
> > > }
> > >
> > > Anyway, back to the point.  You can see this message and the job
> > > not fail because the operating system counter (memory.failcnt) that
> > > SLURM checks doesn't actually mean the memory limit has been
> > > exceeded but means the memory limit has been reached - a subtle but
> > > an important difference.  Important because OOM doesn't terminate
> > > jobs upon reaching the memory limit, only if they exceed the limit,
> > > it means the job isn't terminated.  Note: other cgroup files like
> > > memory.memsw.xxx are also in play if you are using swap space
> > >
> > > As to how to manage this.  You can either not use cgroup and use an
> > > alternative plugin, you could also try the JobAcctGatherParams
> > > parameter NoOverMemoryKill (the documentation say use this with
> > > caution, see https://slurm.schedmd.com/slurm.conf.html), or you can
> > > try and account for the cache by using the jobacct_gather/cgroup.
> > > Unfortunately, because of a bug this plugin does report cache usage
> > > either.  I've contributed a bug/fix to address this (https://bugs.s
> > > chedmd.com/show_bug.cgi?id=3531).
> > >
> > > ---
> > > Samuel Gallop
> > > Computing infrastructure for Science
> > > CiS Support & Development
> > >
> > > From: Wensheng Deng [mailto:w...@nyu.edu]
> > > Sent: 17 March 2017 13:42
> > > To: slurm-dev 
> > > Subject: [slurm-dev] Re: Slurm & CGROUP
> > >
> > > The file is copied fine. It is just the message error annoying.
> > >
> > >
> > >
> > > On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist  > > alto.fi> w