[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-24 Thread Robert Kudyba via slurm-users
Now what would be causing this? The srun just hangs and these are the only
logs from slurmctld:
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node007
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node006
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node005
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node009
[2024-02-24T23:23:26.003] error: Orphan StepId=463.extern reported on node node008

[2024-02-24T23:43:21.183] _slurm_rpc_complete_job_allocation: JobId=563 error Job/step already completing or completed

[465.extern] error: common_file_write_content: unable to open '/sys/fs/cgroup/system.slice/slurmstepd.scope/job_463/step_extern/user/cgroup.freeze' for writing: Permission denied

On Sat, Feb 24, 2024 at 12:09 PM Robert Kudyba wrote:

> Ah yes thanks for pointing that out. Hope this helps someone down the
> line...perhaps the error detection could be more explicit in slurmctld?
>
> On Sat, Feb 24, 2024, 12:07 PM Chris Samuel via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
>>
>> > For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
>> > there a better option?
>>
>> Traditionally /tmp and /var/tmp have been 1777 (that "1" being the
>> sticky bit, originally invented to indicate that the OS should attempt
>> to keep a frequently used binary in memory but then adopted to indicate
>> special handling of a world writeable directory so users can only unlink
>> objects they own and not others).
>>
>> Hope that helps!
>>
>> All the best,
>> Chris
>> --
>> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>
>>
>> --
>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>>
>



2024-02-24 Thread Robert Kudyba via slurm-users
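Ah yes, thanks for pointing that out. Hope this helps someone down the
line...perhaps the error detection could be more explicit in slurmctld?

On Sat, Feb 24, 2024, 12:07 PM Chris Samuel via slurm-users <
slurm-users@lists.schedmd.com> wrote: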

> On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:
>
> > For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
> > there a better option?
>
> Traditionally /tmp and /var/tmp have been 1777 (that "1" being the
> sticky bit, originally invented to indicate that the OS should attempt
> to keep a frequently used binary in memory but then adopted to indicate
> special handling of a world writeable directory so users can only unlink
> objects they own and not others).
>
> Hope that helps!
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


2024-02-24 Thread Chris Samuel via slurm-users

On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:

> For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
> there a better option?


Traditionally /tmp and /var/tmp have been 1777 (that "1" being the 
sticky bit, originally invented to indicate that the OS should attempt 
to keep a frequently used binary in memory but then adopted to indicate 
special handling of a world writeable directory so users can only unlink 
objects they own and not others).
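
To make the mode concrete, here is a minimal sketch that applies 1777 to a
scratch directory (not the real /tmp) and reads the mode back. It assumes
GNU coreutils for `stat -c`; the names `demo_dir` and `mode` are
illustrative only and do not come from the thread.

```shell
#!/bin/sh
# Work on a throwaway directory so nothing on the system is changed.
demo_dir=$(mktemp -d)

# 1777 = sticky bit (1) plus read/write/execute for owner, group and others.
chmod 1777 "$demo_dir"

# GNU stat: %a prints the permission bits in octal.
mode=$(stat -c '%a' "$demo_dir")
echo "mode: $mode"

# ls shows the sticky bit as a trailing "t" in the permission string.
ls -ld "$demo_dir"

rmdir "$demo_dir"
```

On the affected nodes the equivalent fix would be `chmod 1777 /tmp` run as
root, in place of `chmod 777 /tmp`.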


Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




2024-02-24 Thread Robert Kudyba via slurm-users
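For now I just set it to chmod 777 on /tmp and that fixed the errors. Is
there a better option?

Christopher Samuel via slurm-users wrote: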

> Hi Robert,
>
> On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
>
> > We switched from using systemctl for tmp.mount to zram, e.g.,
> > modprobe zram
> > echo 20GB > /sys/block/zram0/disksize
> > mkfs.xfs /dev/zram0
> > mount -o discard /dev/zram0 /tmp
> [...]
> > [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied
>
> Where do you set the permissions on /tmp ?  What do you set them to?
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


2024-02-23 Thread Christopher Samuel via slurm-users

Hi Robert,

On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:

> We switched from using systemctl for tmp.mount to zram, e.g.,
>
> modprobe zram
> echo 20GB > /sys/block/zram0/disksize
> mkfs.xfs /dev/zram0
> mount -o discard /dev/zram0 /tmp
>
> [...]
> [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied


Where do you set the permissions on /tmp ?  What do you set them to?
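
The question matters because the root directory of a freshly created
filesystem is typically owned by root and not world-writable, so mounting
it over /tmp replaces the usual /tmp permissions. Below is a sketch of the
quoted mount sequence with a permission step appended. It assumes GNU
`stat`; `ensure_sticky_tmp` and `setup_zram_tmp` are hypothetical helper
names, not part of Slurm or zram, and the zram function is only defined,
not called, since it needs root.

```shell
#!/bin/sh
# Hypothetical helper: set mode 1777 on a directory unless it already has it.
ensure_sticky_tmp() {
    dir=$1
    current=$(stat -c '%a' "$dir") || return 1   # GNU stat, octal mode
    if [ "$current" != "1777" ]; then
        chmod 1777 "$dir" || return 1
    fi
}

# The sequence quoted in the thread, followed by the permission fix.
# Requires root and the zram module, so it is defined here but never run.
setup_zram_tmp() {
    modprobe zram &&
    echo 20GB > /sys/block/zram0/disksize &&
    mkfs.xfs /dev/zram0 &&
    mount -o discard /dev/zram0 /tmp &&
    ensure_sticky_tmp /tmp
}

# Safe demonstration on a scratch directory instead of the real /tmp.
scratch=$(mktemp -d)
chmod 0755 "$scratch"          # mimic a root directory that is not world-writable
ensure_sticky_tmp "$scratch"
result=$(stat -c '%a' "$scratch")
echo "$result"
rmdir "$scratch"
```

Running the permission step immediately after the mount, in whatever
script or unit performs it, keeps the mode from being lost each time the
zram device is recreated.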

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

