Looks more like a runtime environment issue.

Check the binaries by running:

ldd /mnt/local/ollama/ollama

on both clusters; comparing the output may give some hints.
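
For example (just a rough sketch, with arbitrary file names), capture
the output on a node of each cluster, copy both files to one machine,
and diff them:

ldd /mnt/local/ollama/ollama > ldd-cluster1.txt
ldd /mnt/local/ollama/ollama > ldd-cluster2.txt
diff ldd-cluster1.txt ldd-cluster2.txt

Differences in glibc or other shared-library versions should stand out
in the diff.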

Best,

Feng

On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
<slurm-users@lists.schedmd.com> wrote:
>
> I'm running into a strange issue and I'm hoping another set of brains
> looking at this might help.  I would appreciate any feedback.
>
> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
> 23.11.6 on Rocky Linux 9.4 machines.
>
> This works perfectly fine on the first cluster:
>
> $ srun --mem=32G --pty /bin/bash
>
> srun: job 93911 queued and waiting for resources
> srun: job 93911 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
>
> and the ollama help message appears as expected.
>
> However, on the second cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 3 queued and waiting for resources
> srun: job 3 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
> fatal error: failed to reserve page summary memory
> runtime stack:
> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>      runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>      runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>      runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
> runtime.(*mheap).init(0x127b47e0)
>      runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
> runtime.mallocinit()
>      runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
> runtime.schedinit()
>      runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
> runtime.rt0_go()
>      runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
>
>
> If I ssh directly to the same node on that second cluster (skipping
> Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
> command, it works perfectly fine.
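>
> For what it's worth, one way to compare the two environments (a quick
> diagnostic sketch; the file names are arbitrary) would be to dump the
> shell resource limits in each and diff them, since failures to reserve
> memory like this can be caused by an address-space or memory limit set
> in one environment but not the other:
>
> $ ulimit -a > /tmp/limits-srun.txt   # inside the srun shell
> $ ulimit -a > /tmp/limits-ssh.txt    # inside the plain ssh session
> $ diff /tmp/limits-srun.txt /tmp/limits-ssh.txt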
>
>
> My first thought was that it might be related to cgroups.  I switched
> the second cluster from cgroups v2 to v1 and tried again; it made no
> difference.  I also tried disabling cgroups on the second cluster by
> removing all cgroup references from the slurm.conf file, but that made
> no difference either.
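>
> For reference, the cgroup references in question are settings of this
> sort (a generic sketch of a typical cgroup setup, not necessarily this
> cluster's exact slurm.conf):
>
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup
> JobAcctGatherType=jobacct_gather/cgroup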
>
>
> My guess is that something changed with regard to srun between these
> two Slurm versions, but I'm not sure what.
>
> Any thoughts on what might be happening and/or a way to get this to work
> on the second cluster?  Essentially I need a way to request an
> interactive shell through Slurm that is associated with the requested
> resources.  Should we be using something other than srun for this?
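>
> For example, would something along these lines be the better approach
> on 23.11 (just a sketch; I have not verified whether
> LaunchParameters=use_interactive_step is set here, which salloc would
> need in order to put the shell on the compute node rather than the
> submit host)?
>
> $ salloc --mem=32G
> $ /mnt/local/ollama/ollama help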
>
>
> Thank you,
>
> -Dj
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
