PropagateResourceLimitsExcept won't do it?
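
For what it's worth, a minimal slurm.conf sketch of that idea (untested here):

   # stop propagating just the address-space (virtual memory) limit
   PropagateResourceLimitsExcept=AS

   # or stop propagating ulimits from the submission host altogether
   PropagateResourceLimits=NONE

"ulimit -v" corresponds to the AS limit, which sounds like the one biting here.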

________________________________________
From: Dj Merrill via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, 15 May 2024 09:43
To: slurm-users@lists.schedmd.com
Subject: [EXTERNAL] [slurm-users] Re: srun weirdness

Thank you, Hermann and Tom!  That was it.

The new cluster has a virtual memory limit on the login host, and the
old cluster did not.

It doesn't look like there is a way to set a default that overrides srun's
behaviour of passing those resource limits along to the shell, so I may
consider removing those limits on the login host so folks don't have to
specify --propagate=NONE manually every time.
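
If I do go that route, something like the following in /etc/security/limits.d/
on the login host would presumably do it (assuming the limit is coming from
pam_limits; the file name is just an example):

   # 90-no-vmem-limit.conf
   *    soft    as    unlimited
   *    hard    as    unlimited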

I really appreciate the help!

-Dj


On 5/15/24 07:20, greent10--- via slurm-users wrote:
> Hi,
>
> When we first migrated to Slurm from PBS, one of the strangest issues we hit 
> was that ulimit settings are inherited from the submission host, which could 
> explain the difference between ssh'ing into the machine (where the default 
> ulimit is applied) and running a job via srun.
>
> You could use:
>
> srun --propagate=NONE --mem=32G --pty bash
>
> I still find Slurm inheriting ulimit and environment variables from the 
> submission host an odd default behaviour.
>
> Tom
>
> --
> Thomas Green                         Senior Programmer
> ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB
> Tel: +44 (0)29 208 79269             Fax: +44 (0)29 208 70734
> Email: green...@cardiff.ac.uk        Web: http://www.cardiff.ac.uk/arcca
>
> -----Original Message-----
> From: Hermann Schwärzler via slurm-users <slurm-users@lists.schedmd.com>
> Sent: Wednesday, May 15, 2024 9:45 AM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] Re: srun weirdness
>
>
> Hi Dj,
>
> this could be a memory-limits-related problem. What is the output of
>
>    ulimit -l -m -v -s
>
> in both interactive job-shells?
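>
> (For a quick side-by-side check, something like this should work too; ulimit
> is a shell builtin, so it has to go through bash:
>
>    ulimit -v                            # on the login node
>    srun --mem=32G bash -c 'ulimit -v'   # inside a job step
> )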
>
> You are using cgroups-v1 now, right?
> In that case what is the respective content of
>
>    /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes
>
> in both shells?
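>
> (If you end up back on cgroups-v2, the equivalent file is memory.max; with
> the cgroup/v2 plugin the job's directory usually lives somewhere like
>
>    /sys/fs/cgroup/system.slice/slurmstepd.scope/job_*/memory.max
>
> though the exact path depends on how things are set up, so treat that as a
> guess.)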
>
> Regards,
> Hermann
>
>
> On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
>> I'm running into a strange issue and I'm hoping another set of brains
>> looking at this might help.  I would appreciate any feedback.
>>
>> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
>> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
>> 23.11.6 on Rocky Linux 9.4 machines.
>>
>> This works perfectly fine on the first cluster:
>>
>> $ srun --mem=32G --pty /bin/bash
>>
>> srun: job 93911 queued and waiting for resources
>> srun: job 93911 has been allocated resources
>>
>> and on the resulting shell on the compute node:
>>
>> $ /mnt/local/ollama/ollama help
>>
>> and the ollama help message appears as expected.
>>
>> However, on the second cluster:
>>
>> $ srun --mem=32G --pty /bin/bash
>> srun: job 3 queued and waiting for resources
>> srun: job 3 has been allocated resources
>>
>> and on the resulting shell on the compute node:
>>
>> $ /mnt/local/ollama/ollama help
>> fatal error: failed to reserve page summary memory
>>
>> runtime stack:
>> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>>       runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
>> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>>       runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
>> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>>       runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
>> runtime.(*mheap).init(0x127b47e0)
>>       runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
>> runtime.mallocinit()
>>       runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
>> runtime.schedinit()
>>       runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
>> runtime.rt0_go()
>>       runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
>>
>>
>> If I ssh directly to the same node on that second cluster (skipping
>> Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
>> command, it works perfectly fine.
>>
>>
>> My first thought was that it might be related to cgroups.  I switched
>> the second cluster from cgroups v2 to v1 and tried again, no
>> difference.  I tried disabling cgroups on the second cluster by
>> removing all cgroups references in the slurm.conf file but that also
>> made no difference.
>>
>>
>> My guess is that something changed with regard to srun between these two
>> Slurm versions, but I'm not sure what.
>>
>> Any thoughts on what might be happening and/or a way to get this to
>> work on the second cluster?  Essentially I need a way to request an
>> interactive shell through Slurm that is associated with the requested
>> resources.  Should we be using something other than srun for this?
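>>
>> (I did briefly wonder whether plain salloc would be the answer; as I
>> understand it, with LaunchParameters=use_interactive_step set in slurm.conf,
>>
>>    salloc --mem=32G
>>
>> puts you straight into a shell on the allocated node, but I have no idea
>> whether it would behave any differently here.)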
>>
>>
>> Thank you,
>>
>> -Dj
>>
>>
>>


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
