Re: [slurm-users] [ext] Re: Cleanup of job_container/tmpfs

2023-03-07 Thread Niels Carl W. Hansen

That was exactly the bit I was missing. Thank you very much, Magnus!

Best
Niels Carl



On 3/7/23 3:13 PM, Hagdorn, Magnus Karl Moritz wrote:

I just upgraded Slurm to 23.02 on our test cluster to try out the new job_container/tmpfs functionality. I can confirm it works with autofs (hurrah!), but you need to set the Shared=true option in the job_container.conf file.
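
For anyone else trying this, a minimal job_container.conf along those lines might look like the sketch below (the BasePath and AutoBasePath values are only examples; adjust for your site):

   # job_container.conf -- illustrative sketch
   AutoBasePath=true
   BasePath=/slurm
   # The option mentioned above, needed when home directories are autofs-mounted
   Shared=true

slurm.conf still needs JobContainerType=job_container/tmpfs, as before.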
Cheers
magnus

On Tue, 2023-03-07 at 09:19 +0100, Ole Holm Nielsen wrote:

Hi Brian,

Presumably the users' home directory is NFS automounted using autofs, and therefore it doesn't exist when the job starts.

The job_container/tmpfs plugin ought to work correctly with autofs, but maybe this is still broken in 23.02?

/Ole


On 3/6/23 21:06, Brian Andrus wrote:

That looks like the users' home directory doesn't exist on the node.

If you are not using a shared home for the nodes, your onboarding process should be looked at to ensure it can handle any issues that may arise.

If you are using a shared home, you should do the above and have the node ensure the shared filesystems are mounted before allowing jobs.
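
One way to implement that last check (only a sketch, not necessarily Brian's setup: it assumes HealthCheckProgram=/etc/slurm/healthcheck.sh in slurm.conf and that home directories live under /users, as in the errors quoted below):

   #!/bin/bash
   # Health-check sketch: drain the node if the shared filesystem is missing,
   # so no further jobs are scheduled onto it until it is fixed.
   if ! mountpoint -q /users; then
       scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="/users not mounted"
   fi
   exit 0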

-Brian Andrus

On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:

Hi all

Seems there still are some issues with the autofs - job_container/tmpfs functionality in Slurm 23.02.
If the required directories aren't mounted on the allocated node(s) before job start, we get:

slurmstepd: error: couldn't chdir to `/users/lutest': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/users/lutest': No such file or directory: going to /tmp instead

An easy workaround, however, is to include this line in the Slurm prolog on the slurmd nodes:

/usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true

but perhaps there is a better way to solve the problem?





Re: [slurm-users] Cleanup of job_container/tmpfs

2023-03-06 Thread Niels Carl W. Hansen

Hi all

Seems there still are some issues with the autofs - job_container/tmpfs functionality in Slurm 23.02.
If the required directories aren't mounted on the allocated node(s) before job start, we get:

slurmstepd: error: couldn't chdir to `/users/lutest': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/users/lutest': No such file or directory: going to /tmp instead

An easy workaround, however, is to include this line in the Slurm prolog on the slurmd nodes:

/usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true

but perhaps there is a better way to solve the problem?
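
For reference, the workaround as a complete prolog script might look like the sketch below (the path /etc/slurm/prolog.sh and the Prolog= setting are assumptions; adjust to your site):

   #!/bin/bash
   # Prolog sketch: touch the user's home directory so autofs mounts it
   # before job_container/tmpfs sets up the job's namespace.
   # Assumes slurm.conf contains something like: Prolog=/etc/slurm/prolog.sh
   /usr/bin/su - "$SLURM_JOB_USER" -c /usr/bin/true
   exit 0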

Best
Niels Carl





On 3/2/23 12:27 AM, Jason Ellul wrote:


Thanks so much, Ole, for the info and link.

Your documentation is extremely useful.

Prior to moving to 22.05 we had been using slurm-spank-private-tmpdir with an epilog to clean up the folders on job completion, but we were hoping to move to the built-in functionality to ensure future compatibility and reduce complexity.
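
(For anyone needing a stopgap with job_container/tmpfs in the meantime, a cleanup epilog along these lines is one option. This is only a sketch, not Jason's actual script; it assumes BasePath=/slurm as in the configuration quoted further down and an Epilog= entry in slurm.conf pointing at the script.)

   #!/bin/bash
   # Epilog sketch: remove the per-job directory left under BasePath.
   # Assumes slurm.conf contains something like: Epilog=/etc/slurm/epilog.sh
   BASE=/slurm
   if [ -n "$SLURM_JOB_ID" ] && [ -d "${BASE}/${SLURM_JOB_ID}" ]; then
       rm -rf "${BASE:?}/${SLURM_JOB_ID}"
   fi
   exit 0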


We will try 23.02 and, if that does not resolve our issue, consider moving back to slurm-spank-private-tmpdir or auto_tmpdir.


Thanks again,

Jason

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Center

From: slurm-users on behalf of Ole Holm Nielsen
Date: Wednesday, 1 March 2023 at 8:29 pm
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Cleanup of job_container/tmpfs



Hi Jason,

IMHO, the job_container/tmpfs plugin is not working well in Slurm 22.05, but there may be some significant improvements included in 23.02 (announced yesterday). I've documented our experiences in the Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories

This page contains links to bug reports against the job_container/tmpfs plugin.

We're using the auto_tmpdir SPANK plugin with great success in Slurm 22.05.
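
(For anyone unfamiliar with SPANK plugins: they are enabled via plugstack.conf. The general pattern is a line like the one below; the .so path is an assumption, and any auto_tmpdir-specific options would follow on the same line.)

   # plugstack.conf -- generic SPANK plugin entry; adjust the path to your install
   required /usr/lib64/slurm/spank/auto_tmpdir.so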


Best regards,
Ole


On 01-03-2023 03:27, Jason Ellul wrote:
> We have recently moved to slurm 22.05.8 and have configured
> job_container/tmpfs to allow private tmp folders.
>
> job_container.conf contains:
>
> AutoBasePath=true
>
> BasePath=/slurm
>
> And in slurm.conf we have set
>
> JobContainerType=job_container/tmpfs
>
> I can see the folders being created and they are being used but when a
> job completes the root folder is not being cleaned up.
>
> Example of running job:
>
> [root@papr-res-compute204 ~]# ls -al /slurm/14292874
> total 32
> drwx------   3 root      root        34 Mar  1 13:16 .
> drwxr-xr-x 518 root      root     16384 Mar  1 13:16 ..
> drwx------   2 mzethoven root         6 Mar  1 13:16 .14292874
> -r--r--r--   1 root      root         0 Mar  1 13:16 .ns
>
> Example once job completes /slurm/ remains:
>
> [root@papr-res-compute204 ~]# ls -al /slurm/14292794
> total 32
> drwx------   2 root root     6 Mar  1 09:33 .
> drwxr-xr-x 518 root root 16384 Mar  1 13:16 ..
>
> Is this to be expected or should the folder /slurm/ also be removed?
>
> Do I need to create an epilog script to remove the directory that is left?









Re: [slurm-users] slurm continuously logs _remove_accrue_time_internal and accrue_cnt underflow errors

2022-06-16 Thread Niels Carl W. Hansen
I had that too. It disappeared when I deleted "ACCRUE_ALWAYS" from PriorityFlags.
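
For illustration, that is a slurm.conf change along these lines (FAIR_TREE is only a placeholder for whatever other flags your site actually sets):

   # slurm.conf -- illustrative
   # before: PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
   PriorityFlags=FAIR_TREE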


/Niels Carl


On 6/16/22 1:51 PM, taleinterve...@sjtu.edu.cn wrote:


Hi all:

We found that slurmctld keeps logging error messages such as:

[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal accrue_cnt underflow
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal acct acct-ioomj accrue_cnt underflow
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal user 3901 accrue_cnt underflow
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: assoc_id 2676(acct-ioomj/ioomj-stu3/(null)) accrue_cnt underflow
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: assoc_id 2623(acct-ioomj/(null)/(null)) accrue_cnt underflow
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: assoc_id 1(root/(null)/(null)) accrue_cnt underflow


But Slurm itself seems to work well, and when we query with sacctmgr, the reported users/accounts all look OK.

So what does the underflow mean? Does it imply some kind of mismatch in the slurmdbd database records?

How can we fix the problem and stop Slurm from reporting these messages?





Re: [slurm-users] getting started with job_submit_lua

2020-09-16 Thread Niels Carl W. Hansen
If you explicitly specify the account, e.g. 'sbatch -A myaccount',
then 'slurm.log_info("submit -- account %s", job_desc.account)'
works.
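
In other words, fields the user did not set come back as nil, and passing nil through to the format string is what produces the "bad argument #2 to 'format'" error quoted below. A guarded version of the logging (just a sketch) would be:

   function slurm_job_submit(job_desc, part_list, submit_uid)
      -- Fields not set by the user are nil; substitute a placeholder so
      -- the string.format call inside slurm.log_info does not fail.
      slurm.log_info("submit -- account %s", job_desc.account or "(not set)")
      slurm.log_info("submit -- gres %s", job_desc.gres or "(not set)")
      return slurm.SUCCESS
   end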


/Niels Carl


On Wed, 16 Sep 2020, Mark Dixon wrote:

> On Wed, 16 Sep 2020, Diego Zuccato wrote:
> ...
> > From the source it seems these fields are available:
> >account
> >comment
> >direct_set_prio
> >gres
> >job_id  Always nil ? Maybe no JobID yet?
> >job_state
> >licenses
> >max_cpus
> >max_nodes
> >min_cpus
> >min_nodes
> >nice
> >partition
> >priority
> >req_switch
> >time_limit
> >time_min
> >wait4switch
> >wckey
> >
> > If you access 'em directly, you'll find that some are actually populated.
>
> Hi Diego, thanks for replying :)
>
> I gave this alternative a go:
>
>   function slurm_job_submit( job_desc, part_list, submit_uid )
>
>  slurm.log_info("submit called lua plugin")
>  slurm.log_info("submit -- account %s", job_desc.account)
>  slurm.log_info("submit -- gres %s", job_desc.gres)
>  slurm.log_info("submit completed lua plugin")
>
>  return slurm.SUCCESS
>   end
>
>   function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
>  slurm.log_info("submit called lua plugin2")
>  return slurm.SUCCESS
>   end
>
> And got:
>
>   Sep 16 09:36:58 quack1 slurmctld[9617]: job_submit.lua: submit called lua
> plugin
>   Sep 16 09:36:58 quack1 slurmctld[9617]: error: job_submit/lua:
> /usr/local/slurm/19.05.7-1/etc/job_submit.lua: [string "slurm.log (0,
> string.format(unpack({...})))"]:1: bad argument #2 to 'format' (no value)
>
> It seems "pairs" wasn't lying, job_desc really is empty?
>
> A job_submit function isn't much use without any information about the job!
>
> Please help!
>
> Mark
>