[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
Chris Samuel wrote:

> Hi Robert,
>
> On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
>
> > We switched over from using systemctl for tmp.mount and changed to zram,
> > e.g.,
> > modprobe zram
> > echo 20GB > /sys/block/zram0/disksize
> > mkfs.xfs /dev/zram0
> > mount -o discard /dev/zram0 /tmp
> [...]
> > [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
>
> Where do you set the permissions on /tmp? What do you set them to?
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Chris Samuel via slurm-users

On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:

> For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
> there a better option?


Traditionally /tmp and /var/tmp have been 1777 (that "1" being the 
sticky bit, originally invented to indicate that the OS should attempt 
to keep a frequently used binary in memory but then adopted to indicate 
special handling of a world writeable directory so users can only unlink 
objects they own and not others).
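
As a minimal sketch of the difference between 777 and 1777, the following
shell snippet sets both modes on a scratch directory rather than on /tmp
itself. The scratch directory and the variable names (scratch, mode_plain,
mode_sticky) are assumptions for illustration, and `stat -c` assumes GNU
coreutils. On the zram-backed /tmp described earlier in the thread, the
equivalent step would presumably be to run `chmod 1777 /tmp` as root after
each mount, because the root directory of a newly created filesystem carries
its own permissions rather than those of the old /tmp.

```shell
# Create a throwaway directory so the real /tmp is left untouched.
scratch=$(mktemp -d)

# 0777: world-writable, but any user may unlink other users' files.
chmod 0777 "$scratch"
mode_plain=$(stat -c '%a' "$scratch")

# 1777: world-writable plus the sticky bit, the traditional /tmp mode.
chmod 1777 "$scratch"
mode_sticky=$(stat -c '%a' "$scratch")

echo "without sticky bit: $mode_plain"
echo "with sticky bit:    $mode_sticky"
ls -ld "$scratch"   # the permission string ends in 't' when the sticky bit is set

# Clean up the scratch directory.
rmdir "$scratch"
```

With the sticky bit set, every user can still create files in the directory,
which is what the XAUTHORITY setup needs, but users can only remove their own.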


Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
Chris Samuel wrote:

> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
>
> > For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
> > there a better option?
>
> Traditionally /tmp and /var/tmp have been 1777 (that "1" being the
> sticky bit, originally invented to indicate that the OS should attempt
> to keep a frequently used binary in memory but then adopted to indicate
> special handling of a world writeable directory so users can only unlink
> objects they own and not others).
>
> Hope that helps!
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


[slurm-users] FAQ describing how to hold a job ignores scontrol subcommands specifically for that purpose

2024-02-24 Thread urbanjost via slurm-users
The scontrol subcommands hold/uhold/release/requeuehold are not mentioned in
FAQ 21, which describes how to place a job on hold, and the FAQ never explains
why the method it describes is the best one; it simply states that it is. Does
anyone know why the FAQ method is better than using the subcommands? Is it
perhaps because the PRIORITY and/or NICE values are not altered? The FAQ
question also asks about preventing a job from "running", but the answer only
covers preventing a job from starting, not suspending a running job, so the
wording is not quite clear (I think "running" should be "starting", and/or how
to suspend a job should be described as well).

If nobody has a clear answer, I might file this in the Slurm bugzilla as a
documentation change request, but I first wanted to see whether this is already
clear to anyone and I am missing something.

From FAQ:

21. How can I temporarily prevent a job from running (e.g. place it into a hold 
state)?

The easiest way to do this is to change a job's earliest begin time
(optionally set at job submit time using the --begin option). The example
below places a job into hold state (preventing its initiation for 30 days)
and later permitting it to start now.


$ scontrol update JobId=1234 StartTime=now+30days
... later ...
$ scontrol update JobId=1234 StartTime=now

Note: Empirically in METHOD I the JobId can be a  , which I
initially thought required single JobIDs.

No explanation is given of why METHOD I is best, and there are other methods
that seem more intuitive. I wonder what is undesirable about the following
method, which I have been using: the scontrol(1) subcommands
hold/uhold/release/requeuehold.


$ scontrol hold  # advantage to administrator as user cannot change
$ scontrol uhold 
$ scontrol release 

Examples:
$ scontrol uhold jobname=JOB_NAME
$ scontrol uhold '[100-200],300,500'

With uhold, the "Reason" changes to something that easily identifies the
job as being held: "Reason=None" became "Reason=JobHeldUser", which
seems better than METHOD I in that regard.

The downside might be that PRIORITY is changed to zero on hold and then
goes to a very large value when released?

Another method appears to be that setting PRIORITY to zero also
places jobs in hold.


$ scontrol update jobid=373 Priority=0
$ scontrol release jobid=373 # sets to a very high value
$ scontrol update jobid=373 Priority=1 # put back to lower desired value

Once the priority has been lowered, is there an optional setting that prevents
a user from raising PRIORITY again? The manual says

Only the Slurm administrator or root can increase job's priority.

At least on my machine, "release" puts the priority to a very high value,
and a regular user can then lower it back to the (probably lower) original
value.

I did not see it happening but there are some statements in the documentation 
that make me think not only PRIORITY but perhaps the NICE value might be 
changed by METHOD II and METHOD III, although I could not get the NICE value to 
be inadvertently changed.



[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
Now what would be causing this? The srun just hangs and these are the only
logs from slurmctld:
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node007
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node006
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node005
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node009
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node008

[2024-02-24T23:43:21.183] _slurm_rpc_complete_job_allocation: JobId=563 error Job/step already completing or completed

[465.extern] error: common_file_write_content: unable to open '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_463/step_extern/user/cgroup.freeze' for writing: Permission denied

On Sat, Feb 24, 2024 at 12:09 PM Robert Kudyba wrote:

> Ah yes thanks for pointing that out. Hope this helps someone down the
> line...perhaps the error detection could be more explicit in slurmctld?
>
> On Sat, Feb 24, 2024, 12:07 PM Chris Samuel via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
>>
>> > For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
>> > there a better option?
>>
>> Traditionally /tmp and /var/tmp have been 1777 (that "1" being the
>> sticky bit, originally invented to indicate that the OS should attempt
>> to keep a frequently used binary in memory but then adopted to indicate
>> special handling of a world writeable directory so users can only unlink
>> objects they own and not others).
>>
>> Hope that helps!
>>
>> All the best,
>> Chris
>> --
>> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA