Re: [slurm-users] SLURM Array Job BASH scripting within python subprocess

2022-11-28 Thread Feng Zhang
Not sure if it works, but you can try using "\${SLURM_ARRAY_JOB_ID}".
The "\" escapes the early evaluation of the environment variable.

On Thu, Nov 10, 2022 at 6:53 PM Chase Schuette  wrote:
>
> Due to needing to support existing HPC workflows, I have a need to pass a 
> bash script within a python subprocess. It was working great with OpenPBS; 
> now I need to convert it to SLURM. I have it largely working in SLURM hosted 
> on Ubuntu 20.04, except that the job array is not being populated.
>
> I've read from another user that BASH may try to evaluate variables before 
> they are defined by the SLURM job. I've also seen that errors in SBATCH 
> directives, such as a non-alphanumeric job name, can cause SLURM to not 
> evaluate the following directives. Can someone advise me on when SLURM 
> populates variables?
>
> I have a StackOverflow post here 
> https://stackoverflow.com/questions/74323372/slurm-array-job-bash-scripting-within-python-subprocess
>
> Regards,
> --
>
> Chase Schuette Pronouns: He/Him/His | Caterpillar
>
> Autonomy High Performance Computing | Iowa State University Relations
>
> Mobile: 507-475-1949 | Email: chase.schue...@gmail.com | LinkedIn
> Schedule 15mins here: https://calendly.com/chaseschuette



[slurm-users] run issue

2022-11-30 Thread Feng Zhang
hello all,

I am doing some tests with Slurm.  I just found that when I run the
srun command with the -n and -c options, and both -n and -c are odd
numbers, the srun job hangs and no shell is given to me.  When I check
with "squeue", it reports that the job is actually running.

 When -c is an even number, it works fine (like -n 3 -c 2).

Did any of you see this strange behavior? Or did I do something wrong?

$ srun -n 3 -c 1  --mem-per-cpu=2000 -p test --pty bash -i
^Csrun: Cancelled pending job step with signal 2
srun: error: Unable to create step for job 67: Job/step already
completing or completed


$ srun -n 3 -c 3  --mem-per-cpu=2000 -p test --pty bash -i
^Csrun: Cancelled pending job step with signal 2
srun: error: Unable to create step for job 68: Job/step already
completing or completed

Also, -n 5 -c 1 and -n 5 -c 3 do not work.

My node has 16 cores.

4.18.0-372.26.1.el8_6.x86_64
slurm 22.05.5

Best,

Feng



Re: [slurm-users] Problem with Cuda program in multi-cluster

2023-07-05 Thread Feng Zhang
Mohamad,

It seems you need to upgrade glibc on the GPU nodes of clusters A and C.
The error messages say that srun needs newer glibc versions. Or you can
downgrade your Slurm build (e.g. recompile it against glibc 2.27 or older) on
clusters A/C.
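
To confirm which versions are involved, something like this may help (the
srun path is an assumption based on the shared install path in the errors):

ldd --version    # glibc available on the node, e.g. 2.27 on the GPU nodes
objdump -T /hpcshared/slurm_vm/usr/bin/srun | grep -o 'GLIBC_[0-9.]*' | sort -u
objdump -T /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so | grep -o 'GLIBC_[0-9.]*' | sort -u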

Best,

Feng


On Tue, Jul 4, 2023 at 2:46 PM mohammed shambakey 
wrote:

> Hi
>
> I work on 3 clusters: A, B, C. Clusters A and C each have 3 compute
> nodes and a head node, and in each of them one of the 3 compute nodes has an
> old GPU. All nodes, on all clusters, have Ubuntu 22.04 except
> for the 2 nodes with GPUs (both of them have Ubuntu 18.04 to suit the old
> GPU card). The installed Slurm version (on all clusters) is slurm
> 23.11.0-0rc1.
>
> Cluster B has only 2 compute nodes and the head node. I tried to submit an
> sbatch script from cluster B (with a CUDA program) to be executed on either
> cluster A or C (where a GPU node resides). Previously, this used to work,
> but after updating the system, I get the following error:
>
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found
> (required by srun)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found
> (required by srun)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found
> (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found
> (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
> srun: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found
> (required by /hpcshared/slurm_vm/usr/lib/slurm/libslurmfull.so)
>
> The installed glibc is 2.35 on all nodes, except for the 2 GPU nodes
> (glibc version 2.27). I tried to run the same sbatch script on each of
> clusters A and C, and it works fine. The problem happens only when trying
> to use "sbatch -Mall" from cluster B. Just to be sure, I tried to run
> another sbatch program (with the multi-cluster option) that does NOT involve
> a CUDA program, and it worked fine.
>
> Should I install the same glibc6 on all nodes (2.33 or 2.34), or
> what?
>
> Regards
>
> --
> Mohammed
>


Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-14 Thread Feng Zhang
Very interesting issue.

I am guessing there might be a workaround: since oryx actually has 2 GPUs,
you could define both of them in the configuration but disable the GT 710?
Does Slurm support this?
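
If it does, a rough sketch of the "define both GPUs" half (device paths are an
assumption based on the nvidia-smi output; how to then keep the GT 710 away
from jobs is the open question):

# gres.conf
Nodename=oryx Name=gpu Type=GT710 File=/dev/nvidia0
Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
# slurm.conf
NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:GT710:1,gpu:RTX2080TI:1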

Best,

Feng


On Tue, Jun 27, 2023 at 9:54 AM Wilson, Steven M  wrote:
>
> Hi,
>
> I manually configure the GPUs in our Slurm configuration (AutoDetect=off in 
> gres.conf) and everything works fine when all the GPUs in a node are 
> configured in gres.conf and available to Slurm.  But we have some nodes where 
> a GPU is reserved for running the display and is specifically not configured 
> in gres.conf.  In these cases, Slurm includes this unconfigured GPU and makes 
> it available to Slurm jobs.  Using a simple Slurm job that executes 
> "nvidia-smi -L", it will display the unconfigured GPU along with as many 
> configured GPUs as requested by the job.
>
> For example, in a node configured with this line in slurm.conf:
> NodeName=oryx CoreSpecCount=2 CPUs=8 RealMemory=64000 Gres=gpu:RTX2080TI:1
> and this line in gres.conf:
> Nodename=oryx Name=gpu Type=RTX2080TI File=/dev/nvidia1
> I will get the following results from a job running "nvidia-smi -L" that 
> requested a single GPU:
> GPU 0: NVIDIA GeForce GT 710 (UUID: 
> GPU-21fe15f0-d8b9-b39e-8ada-8c1c8fba8a1e)
> GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: 
> GPU-0dc4da58-5026-6173-1156-c4559a268bf5)
>
> But in another node that has all GPUs configured in Slurm like this in 
> slurm.conf:
> NodeName=beluga CoreSpecCount=1 CPUs=16 RealMemory=128500 
> Gres=gpu:TITANX:2
> and this line in gres.conf:
> Nodename=beluga Name=gpu Type=TITANX File=/dev/nvidia[0-1]
> I get the expected results from the job running "nvidia-smi -L" that 
> requested a single GPU:
> GPU 0: NVIDIA RTX A5500 (UUID: GPU-3754c069-799e-2027-9fbb-ff90e2e8e459)
>
> I'm running Slurm 22.05.5.
>
> Thanks in advance for any suggestions to help correct this problem!
>
> Steve



Re: [slurm-users] slurm sinfo format memory

2023-07-20 Thread Feng Zhang
Looks like Slurm itself only supports that format (in MB units). Slurm
commands' output format is not very user friendly to me. It would be nice if
it added some easy options for cases like the sinfo output in this email
thread; how about supporting lazy options, like sinfo -ABC, etc.?

For the desired format in GB, one workaround may be to prepare a wrapper
shell script that reads the sinfo output, converts the MB values to GB, and
prints the result to the screen.
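
A rough sketch of such a wrapper (assumes the same format string as in the
question; divides by 1000 to match the desired output, use 1024 for GiB):

#!/bin/bash
sinfo -h -o "%n %m %e %C" | awk '
  BEGIN { printf "%-10s %-8s %-9s %s\n", "HOSTNAME", "MEMORY", "FREE_MEM", "CPUS(A/I/O/T)" }
  { printf "%-10s %-8s %-9s %s\n", $1, int($2/1000)"GB", int($3/1000)"GB", $4 }'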


On Thu, Jul 20, 2023 at 12:28 PM Arsene Marian Alain 
wrote:

>
>
> Dear slurm users,
>
>
>
> I would like to see the following information of my nodes "hostname, total
> mem, free mem and cpus". So, I used  ‘sinfo -o "%8n %8m %8e %C"’ but in the
> output it shows me the memory in MB like "190560" and I need it in GB
> (without decimals if possible) like "190GB". Any ideas or suggestions on
> how I can do that?
>
>
>
> current output:
>
>
>
> HOSTNAME MEMORY   FREE_MEM CPUS(A/I/O/T)
>
> node01   190560   125249   60/4/0/64
>
> node02   190560   171944   40/24/0/64
>
> node05   93280 91584 0/40/0/40
>
> node06   513120   509448   0/96/0/96
>
> node07   513120   512086   0/96/0/96
>
> node08   513120   512328   0/96/0/96
>
> node09   513120   512304   0/96/0/96
>
>
>
> desired output:
>
>
>
> HOSTNAME MEMORY   FREE_MEM CPUS(A/I/O/T)
>
> node01   190GB   125GB   60/4/0/64
>
> node02   190GB   171GB   40/24/0/64
>
> node05   93GB    91GB    0/40/0/40
>
> node06   512GB   500GB   0/96/0/96
>
> node07   512GB   512GB   0/96/0/96
>
> node08   512GB   512GB   0/96/0/96
>
> node09   512GB   512GB   0/96/0/96
>
>
>
>
>
>
>
> I would appreciate any help.
>
>
>
> Thank you.
>
>
>
> Best Regards,
>
>
>
> Alain
>


Re: [slurm-users] Granular or dynamic control of partitions?

2023-08-04 Thread Feng Zhang
You can try a command like:

scontrol update PartitionName=mypart Nodes=node[1-90],ab,ac  # exclude the
node you want to remove

"Changing the Nodes in a partition has no effect upon jobs that have
already begun execution."


Best,

Feng

On Fri, Aug 4, 2023 at 10:47 AM Pacey, Mike  wrote:
>
> Hi folks,
>
>
>
> We’re currently moving our cluster from Grid Engine to SLURM, and I’m having 
> trouble finding the best way to perform a specific bit of partition 
> maintenance. I’m not sure if I’m simply missing something in the manual or if 
> I need to be thinking in a more SLURM-centric way. My basic question: is it 
> possible to ‘disable’ specific partition/node combinations rather than whole 
> nodes or whole partitions? Here’s an example of the sort of thing I’m looking 
> to do:
>
>
>
> I have node ‘node1’ with two partitions ‘x’ and ‘y’. I’d like to remove 
> partition ‘y’, but there are currently user jobs in that partition on that 
> node. With Grid Engine, I could disable specific queue instances (ie, I could 
> just run “qmod -d y@node1’ to disable queue/partition y on node1 and wait for 
> the jobs to complete and then remove the partition. That would be the least 
> disruptive option because:
>
> Queue/partition ‘y’ on other nodes would be unaffected
> User jobs for queue/partition ‘x’ would still be able to launch on node1 the 
> whole time
>
>
>
> I can’t seem to find a functional equivalent of this in SLURM:
>
> I can set the whole node to Drain
> I can set the whole partition to Inactive
>
>
>
> Is there some way to ‘disable’ partition y just on node1?
>
>
>
> Regards,
>
> Mike



Re: [slurm-users] help with canceling or deleteing a job

2023-09-19 Thread Feng Zhang
Restarting the slurmd daemon on the compute node should work, if the
node is still online and otherwise healthy.
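
For example (assuming slurmd on that node is managed by systemd):

systemctl restart slurmd      # run on the compute node, e.g. awn-047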

Best,

Feng

On Tue, Sep 19, 2023 at 8:03 AM Felix  wrote:
>
> Hello
>
> I have a job on my system which is running more than its time, more than
> 4 days.
>
> 1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047
>
> I'm trying to cancel it
>
> [@arc7-node ~]# scancel 1808851
>
> I get no message as if the job was canceled but when getting information
> about the job, the job is still there
>
> [@arc7-node ~]# squeue | grep awn-047
> 1808851 debug  gridjob  atlas01 CG 4-00:00:19 1 awn-047
>
> Can I do any other things to kill/end the job?
>
> Thank you
>
> Felix
>
>
> --
> Dr. Eng. Farcas Felix
> National Institute of Research and Development of Isotopic and Molecular 
> Technology,
> IT - Department - Cluj-Napoca, Romania
> Mobile: +40742195323
>



Re: [slurm-users] help with canceling or deleteing a job

2023-09-20 Thread Feng Zhang
👍

Best,

Feng


On Wed, Sep 20, 2023 at 7:29 AM Wagner, Marcus 
wrote:

> Even after rebooting, sometimes nodes are stuck because of "completing
> jobs".
>
> What helps then is to set the node down and resume it afterwards:
>
> scontrol update nodename=<nodename> state=drain reason=stuck; scontrol
> update nodename=<nodename> state=resume
>
>
> Best
> Marcus
>
> Am 20.09.2023 um 09:11 schrieb Ole Holm Nielsen:
> > On 9/20/23 01:39, Feng Zhang wrote:
> >> Restarting the slurmd dameon of the compute node should work, if the
> >> node is still online and normal.
> >
> > Probably not.  If the filesystem used by the job is hung, the node
> > must probably be rebooted, and the filesystem must be checked.
> >
> > /Ole
> >
> >> On Tue, Sep 19, 2023 at 8:03 AM Felix  wrote:
> >>>
> >>> Hello
> >>>
> >>> I have a job on my system which is running more than its time, more
> >>> than
> >>> 4 days.
> >>>
> >>> 1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047
> >>>
> >>> I'm trying to cancel it
> >>>
> >>> [@arc7-node ~]# scancel 1808851
> >>>
> >>> I get no message as if the job was canceled but when getting
> >>> information
> >>> about the job, the job is still there
> >>>
> >>> [@arc7-node ~]# squeue | grep awn-047
> >>>  1808851 debug  gridjob  atlas01 CG 4-00:00:19 1
> >>> awn-047
> >>>
> >>> Can I do any other things to kill/end the job?
> >
>


Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Feng Zhang
Setting the slurm.conf parameter EnforcePartLimits to ANY or NO may help with this; not sure.

Best,

Feng


On Thu, Sep 21, 2023 at 11:27 AM Jason Simms  wrote:
>
> I personally don't think that we should assume users will always know which 
> partitions are available to them. Ideally, of course, they would, but I think 
> it's fine to assume users should be able to submit a list of partitions that 
> they would be fine running their jobs on, and if one is forbidden for 
> whatever reason, Slurm just selects another one of the choices. I'd expect 
> similar behavior if a particular partition were down or had been removed; as 
> long as there is an acceptable specified partition available, run it there, 
> and don't kill the job. Seems really reasonable to me.
>
> Jason
>
> On Thu, Sep 21, 2023 at 10:40 AM David  wrote:
>>
>> That's not at all how I interpreted this man page description.  By "If the 
>> job can use more than..." I thought it was completely obvious (although 
>> perhaps wrong, if your interpretation is correct, but it never crossed my 
>> mind) that it referred to whether the _submitting user_ is OK with it using 
>> more than one partition. The partition where the user is forbidden (because 
>> of the partition's allowed account) should just be _not_ the earliest 
>> initiation (because it'll never initiate there), and therefore not run 
>> there, but still be able to run on the other partitions listed in the batch 
>> script.
>>
>> > that's fair. I was considering this only given the fact that we know the 
>> > user doesn't have access to a partition (this isn't the surprise here) and 
>> > that slurm communicates that as the reason pretty clearly. I can see how 
>> > if a user is submitting against multiple partitions they might hope that 
>> > if a job couldn't run in a given partition, given the number of others 
>> > provided, the scheduler might consider all of those *before* dying 
>> > outright at the first rejection.
>>
>> On Thu, Sep 21, 2023 at 10:28 AM Bernstein, Noam CIV USN NRL (6393) 
>> Washington DC (USA)  wrote:
>>>
>>> On Sep 21, 2023, at 9:46 AM, David  wrote:
>>>
>>> Slurm is working as it should. From your own examples you proved that; by 
>>> not submitting to b4 the job works. However, looking at man sbatch:
>>>
>>>-p, --partition=
>>>   Request  a  specific partition for the resource allocation.  
>>> If not specified, the default behavior is to allow the slurm controller to 
>>> select
>>>   the default partition as designated by the system 
>>> administrator. If the job can use more than one partition, specify their 
>>> names  in  a  comma
>>>   separate  list and the one offering earliest initiation will 
>>> be used with no regard given to the partition name ordering (although 
> >>> higher priority partitions will be considered first).  When the job is 
>>> initiated, the name of the partition used will be placed first in the job  
>>> record
>>>   partition string.
>>>
>>> In your example, the job can NOT use more than one partition (given the 
>>> restrictions defined on the partition itself precluding certain accounts 
>>> from using it). This, to me, seems either like a user education issue (i.e. 
>>> don't have them submit to every partition), or you can try the job submit 
>>> lua route - or perhaps the hidden partition route (which I've not tested).
>>>
>>>
>>> That's not at all how I interpreted this man page description.  By "If the 
>>> job can use more than..." I thought it was completely obvious (although 
>>> perhaps wrong, if your interpretation is correct, but it never crossed my 
>>> mind) that it referred to whether the _submitting user_ is OK with it using 
>>> more than one partition. The partition where the user is forbidden (because 
>>> of the partition's allowed account) should just be _not_ the earliest 
>>> initiation (because it'll never initiate there), and therefore not run 
>>> there, but still be able to run on the other partitions listed in the batch 
>>> script.
>>>
>>> I think it's completely counter-intuitive that submitting saying it's OK to 
>>> run on one of a few partitions, and one partition happening to be forbidden 
>>> to the submitting user, means that it won't run at all.  What if you list 
>>> multiple partitions, and increase the number of nodes so that there aren't 
>>> enough in one of the partitions, but not realize this problem?  Would you 
>>> expect that to prevent the job from ever running on any partition?
>>>
>>> Noam
>>
>>
>>
>> --
>> David Rhey
>> ---
>> Advanced Research Computing
>> University of Michigan
>
>
>
> --
> Jason L. Simms, Ph.D., M.P.H.
> Manager of Research Computing
> Swarthmore College
> Information Technology Services
> (610) 328-8102
> Schedule a meeting: https://calendly.com/jlsimms



Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Feng Zhang
As I said, I am not sure; it depends on the algorithm and the code
structure of Slurm (no chance to dig into it...). My guess at the way
Slurm works is:

Check limits on b1, OK; b2, OK; b3, OK; then b4, not OK... (or in whatever order Slurm checks)

If it works with EnforcePartLimits=ANY or NO, yeah, that would be a surprise...

(This use case might not have been included in the original design of Slurm, I guess.)

"NOTE: The partition limits being considered are its configured
MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes,
AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold."

Best,

Feng

On Thu, Sep 21, 2023 at 11:48 AM Bernstein, Noam CIV USN NRL (6393)
Washington DC (USA)  wrote:
>
> On Sep 21, 2023, at 11:37 AM, Feng Zhang  wrote:
>
> Set slurm.conf parameter: EnforcePartLimits=ANY or NO may help this, not sure.
>
>
> Hmm, interesting, but it looks like this is just a check at submission time. 
> The slurm.conf web page doesn't indicate that it affects the actual queuing 
> decision, just whether or not a job that will never run (at all, or just on 
> some of the listed partitions) can be submitted.  If it does help then I 
> think that the slurm.conf description is misleading.
>
> Noam



Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread Feng Zhang
As I read the pasted slurm.conf info again, it includes
"AllowAccounts, AllowGroups", so it seems Slurm actually takes these
into account. So I think it should work...

Best,

Feng

On Thu, Sep 21, 2023 at 2:33 PM Feng Zhang  wrote:
>
> As I said I am not sure, but it depends on the algorithm and the code
> structure of the slurm(no chance to dig into...). My imagination
> is(for the way slurm works...):
>
> Check limits on b1, ok,b2: ok: b3,ok; then b4, nook...(or any order by slurm)
>
> If it works with the EnforcePartLimits=ANY or NO,  yeah it's a surprise...
>
> (This use case might not be included in the original design of slurm, I guess)
>
> "NOTE: The partition limits being considered are its configured
> MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes,
> AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold."
>
> Best,
>
> Feng
>
> On Thu, Sep 21, 2023 at 11:48 AM Bernstein, Noam CIV USN NRL (6393)
> Washington DC (USA)  wrote:
> >
> > On Sep 21, 2023, at 11:37 AM, Feng Zhang  wrote:
> >
> > Set slurm.conf parameter: EnforcePartLimits=ANY or NO may help this, not 
> > sure.
> >
> >
> > Hmm, interesting, but it looks like this is just a check at submission 
> > time. The slurm.conf web page doesn't indicate that it affects the actual 
> > queuing decision, just whether or not a job that will never run (at all, or 
> > just on some of the listed partitions) can be submitted.  If it does help 
> > then I think that the slurm.conf description is misleading.
> >
> > Noam



Re: [slurm-users] Two gpu types on one node: gres/gpu count reported lower than configured (1 < 5)

2023-10-16 Thread Feng Zhang
Try

scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"

and then

scontrol update NodeName=heimdall state=RESUME

to see if it will work. Probably it is just the SLURM daemon having a
hiccup after you made the changes.

Best,

Feng

On Mon, Oct 16, 2023 at 10:43 AM Gregor Hagelueken
 wrote:
>
> Hi,
>
> We have an Ubuntu server (22.04) with currently 5 GPUs (1 x l40 and 4 x 
> rtx_a5000).
> I am trying to configure Slurm so that a user can select either the l40 or 
> the a5000 GPUs for a particular job.
> I have configured my slurm.conf and gres.conf files similar as in this old 
> thread:
> https://groups.google.com/g/slurm-users/c/fc-eoHpTNwU
> I have pasted the contents of the two files below.
>
> Unfortunately, my node is always on “drain” and scontrol shows this error:
> Reason=gres/gpu count reported lower than configured (1 < 5)
>
> Any idea what I am doing wrong?
> Cheers and thanks for your help!
> Gregor
>
> Here are my slurm.conf and gres.conf files.
>
> AutoDetect=off
> NodeName=heimdall Name=gpu Type=l40  File=/dev/nvidia0
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia1
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia2
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia3
> NodeName=heimdall Name=gpu Type=a5000  File=/dev/nvidia4
>
>
> # slurm.conf file generated by configurator.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> SlurmdDebug=debug2
> #
> ClusterName=heimdall
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> GresTypes=gpu
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm/slurmd.log
> #
> # COMPUTE NODES
> NodeName=heimdall CPUs=128 Gres=gpu:l40:1,gpu:a5000:4 Boards=1 
> SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=773635 
> State=UNKNOWN
> PartitionName=heimdall Nodes=ALL Default=YES MaxTime=INFINITE State=UP 
> DefMemPerCPU=8000 DefCpuPerGPU=16
>
>



[slurm-users] Re: srun weirdness

2024-05-14 Thread Feng Zhang via slurm-users
Looks more like a runtime environment issue.

Check the binaries:

ldd  /mnt/local/ollama/ollama

on both clusters; comparing the output may give some hints.

Best,

Feng

On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
 wrote:
>
> I'm running into a strange issue and I'm hoping another set of brains
> looking at this might help.  I would appreciate any feedback.
>
> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
> 23.11.6 on Rocky Linux 9.4 machines.
>
> This works perfectly fine on the first cluster:
>
> $ srun --mem=32G --pty /bin/bash
>
> srun: job 93911 queued and waiting for resources
> srun: job 93911 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
>
> and the ollama help message appears as expected.
>
> However, on the second cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 3 queued and waiting for resources
> srun: job 3 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
> fatal error: failed to reserve page summary memory
> runtime stack:
> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>  runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618
> pc=0x4605dc
> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>  runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8
> sp=0x7ffe6be32648 pc=0x456b7c
> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>  runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8
> pc=0x454565
> runtime.(*mheap).init(0x127b47e0)
>  runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8
> pc=0x451885
> runtime.mallocinit()
>  runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720
> pc=0x434f97
> runtime.schedinit()
>  runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758
> pc=0x464397
> runtime.rt0_go()
>  runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0
> pc=0x49421c
>
>
> If I ssh directly to the same node on that second cluster (skipping
> Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
> command, it works perfectly fine.
>
>
> My first thought was that it might be related to cgroups.  I switched
> the second cluster from cgroups v2 to v1 and tried again, no
> difference.  I tried disabling cgroups on the second cluster by removing
> all cgroups references in the slurm.conf file but that also made no
> difference.
>
>
> My guess is something changed with regards to srun between these two
> Slurm versions, but I'm not sure what.
>
> Any thoughts on what might be happening and/or a way to get this to work
> on the second cluster?  Essentially I need a way to request an
> interactive shell through Slurm that is associated with the requested
> resources.  Should we be using something other than srun for this?
>
>
> Thank you,
>
> -Dj
>
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun weirdness

2024-05-14 Thread Feng Zhang via slurm-users
Not sure, very strange, though the two linux-vdso.so.1 addresses look different:

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
 linux-vdso.so.1 (0x7ffde81ee000)


[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
 linux-vdso.so.1 (0x7fffa66ff000)

Best,

Feng

On Tue, May 14, 2024 at 3:43 PM Dj Merrill via slurm-users
 wrote:
>
> Hi Feng,
> Thank you for replying.
>
> It is the same binary on the same machine that fails.
>
> If I ssh to a compute node on the second cluster, it works fine.
>
> It fails when running in an interactive shell obtained with srun on that
> same compute node.
>
> I agree that it seems like a runtime environment difference between the
> SSH shell and the srun obtained shell.
>
> This is the ldd from within the srun obtained shell (and gives the error
> when run):
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>  linux-vdso.so.1 (0x7ffde81ee000)
>  libresolv.so.2 => /lib64/libresolv.so.2 (0x154f732cc000)
>  libpthread.so.0 => /lib64/libpthread.so.0 (0x154f732c7000)
>  libstdc++.so.6 => /lib64/libstdc++.so.6 (0x154f7300)
>  librt.so.1 => /lib64/librt.so.1 (0x154f732c2000)
>  libdl.so.2 => /lib64/libdl.so.2 (0x154f732bb000)
>  libm.so.6 => /lib64/libm.so.6 (0x154f72f25000)
>  libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x154f732a)
>  libc.so.6 => /lib64/libc.so.6 (0x154f72c0)
>  /lib64/ld-linux-x86-64.so.2 (0x154f732f8000)
>
> This is the ldd from the same exact node within an SSH shell which runs
> fine:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>  linux-vdso.so.1 (0x7fffa66ff000)
>  libresolv.so.2 => /lib64/libresolv.so.2 (0x14a9d82da000)
>  libpthread.so.0 => /lib64/libpthread.so.0 (0x14a9d82d5000)
>  libstdc++.so.6 => /lib64/libstdc++.so.6 (0x14a9d800)
>  librt.so.1 => /lib64/librt.so.1 (0x14a9d82d)
>  libdl.so.2 => /lib64/libdl.so.2 (0x14a9d82c9000)
>  libm.so.6 => /lib64/libm.so.6 (0x14a9d7f25000)
>  libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x14a9d82ae000)
>  libc.so.6 => /lib64/libc.so.6 (0x14a9d7c0)
>  /lib64/ld-linux-x86-64.so.2 (0x14a9d8306000)
>
>
> -Dj
>
>
>
> On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
> > Looks more like a runtime environment issue.
> >
> > Check the binaries:
> >
> > ldd  /mnt/local/ollama/ollama
> >
> > on both clusters and comparing the output may give some hints.
> >
> > Best,
> >
> > Feng
> >
> > On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
> >  wrote:
> >> I'm running into a strange issue and I'm hoping another set of brains
> >> looking at this might help.  I would appreciate any feedback.
> >>
> >> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
> >> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
> >> 23.11.6 on Rocky Linux 9.4 machines.
> >>
> >> This works perfectly fine on the first cluster:
> >>
> >> $ srun --mem=32G --pty /bin/bash
> >>
> >> srun: job 93911 queued and waiting for resources
> >> srun: job 93911 has been allocated resources
> >>
> >> and on the resulting shell on the compute node:
> >>
> >> $ /mnt/local/ollama/ollama help
> >>
> >> and the ollama help message appears as expected.
> >>
> >> However, on the second cluster:
> >>
> >> $ srun --mem=32G --pty /bin/bash
> >> srun: job 3 queued and waiting for resources
> >> srun: job 3 has been allocated resources
> >>
> >> and on the resulting shell on the compute node:
> >>
> >> $ /mnt/local/ollama/ollama help
> >> fatal error: failed to reserve page summary memory
> >> runtime stack:
> >> runtime.throw({0x1240c66?, 0x154fa39a1008?})
> >>   runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618
> >> pc=0x4605dc
> >> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
> >>   runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8
> >> sp=0x7ffe6be32648 pc=0x456b7c
> >> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
> >>   runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8
> >> pc=0x454565
> >> runtime.(*mheap).init(0x127b47e0)
> >>   runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8
> >> pc=0x451885
> >> runtime.mallocinit()
> >>   runtime/malloc.go:454 +0xd7 fp=0x7ffe

[slurm-users] Re: srun weirdness

2024-05-14 Thread Feng Zhang via slurm-users
Do you have any container settings configured?

On Tue, May 14, 2024 at 3:57 PM Feng Zhang  wrote:
>
> Not sure, very strange, while the two linux-vdso.so.1 looks different:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>  linux-vdso.so.1 (0x7ffde81ee000)
>
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>  linux-vdso.so.1 (0x7fffa66ff000)
>
> Best,
>
> Feng
>
> On Tue, May 14, 2024 at 3:43 PM Dj Merrill via slurm-users
>  wrote:
> >
> > Hi Feng,
> > Thank you for replying.
> >
> > It is the same binary on the same machine that fails.
> >
> > If I ssh to a compute node on the second cluster, it works fine.
> >
> > It fails when running in an interactive shell obtained with srun on that
> > same compute node.
> >
> > I agree that it seems like a runtime environment difference between the
> > SSH shell and the srun obtained shell.
> >
> > This is the ldd from within the srun obtained shell (and gives the error
> > when run):
> >
> > [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> >  linux-vdso.so.1 (0x7ffde81ee000)
> >  libresolv.so.2 => /lib64/libresolv.so.2 (0x154f732cc000)
> >  libpthread.so.0 => /lib64/libpthread.so.0 (0x154f732c7000)
> >  libstdc++.so.6 => /lib64/libstdc++.so.6 (0x154f7300)
> >  librt.so.1 => /lib64/librt.so.1 (0x154f732c2000)
> >  libdl.so.2 => /lib64/libdl.so.2 (0x154f732bb000)
> >  libm.so.6 => /lib64/libm.so.6 (0x154f72f25000)
> >  libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x154f732a)
> >  libc.so.6 => /lib64/libc.so.6 (0x154f72c0)
> >  /lib64/ld-linux-x86-64.so.2 (0x154f732f8000)
> >
> > This is the ldd from the same exact node within an SSH shell which runs
> > fine:
> >
> > [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> >  linux-vdso.so.1 (0x7fffa66ff000)
> >  libresolv.so.2 => /lib64/libresolv.so.2 (0x14a9d82da000)
> >  libpthread.so.0 => /lib64/libpthread.so.0 (0x14a9d82d5000)
> >  libstdc++.so.6 => /lib64/libstdc++.so.6 (0x14a9d800)
> >  librt.so.1 => /lib64/librt.so.1 (0x14a9d82d)
> >  libdl.so.2 => /lib64/libdl.so.2 (0x000014a9d82c9000)
> >  libm.so.6 => /lib64/libm.so.6 (0x14a9d7f25000)
> >  libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x14a9d82ae000)
> >  libc.so.6 => /lib64/libc.so.6 (0x14a9d7c0)
> >  /lib64/ld-linux-x86-64.so.2 (0x14a9d8306000)
> >
> >
> > -Dj
> >
> >
> >
> > On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
> > > Looks more like a runtime environment issue.
> > >
> > > Check the binaries:
> > >
> > > ldd  /mnt/local/ollama/ollama
> > >
> > > on both clusters and comparing the output may give some hints.
> > >
> > > Best,
> > >
> > > Feng
> > >
> > > On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
> > >  wrote:
> > >> I'm running into a strange issue and I'm hoping another set of brains
> > >> looking at this might help.  I would appreciate any feedback.
> > >>
> > >> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
> > >> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
> > >> 23.11.6 on Rocky Linux 9.4 machines.
> > >>
> > >> This works perfectly fine on the first cluster:
> > >>
> > >> $ srun --mem=32G --pty /bin/bash
> > >>
> > >> srun: job 93911 queued and waiting for resources
> > >> srun: job 93911 has been allocated resources
> > >>
> > >> and on the resulting shell on the compute node:
> > >>
> > >> $ /mnt/local/ollama/ollama help
> > >>
> > >> and the ollama help message appears as expected.
> > >>
> > >> However, on the second cluster:
> > >>
> > >> $ srun --mem=32G --pty /bin/bash
> > >> srun: job 3 queued and waiting for resources
> > >> srun: job 3 has been allocated resources
> > >>
> > >> and on the resulting shell on the compute node:
> > >>
> > >> $ /mnt/local/ollama/ollama help
> > >> fatal error: failed to reserve page summary memory
> > >> runtime stack:
> > >> runtime.throw({0x1240c66?, 0x154fa39a1008?})
> > >>   runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618
> >

[slurm-users] maxrss reported by sacct is wrong

2024-06-07 Thread Feng Zhang via slurm-users
Hi All,

I am having trouble calculating the real RSS memory usage of a certain kind
of users' jobs, for which sacct returns wrong numbers.

Rocky Linux release 8.5, Slurm 21.08

(slurm.conf)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux

The troublesome jobs look like this:

1. Python spawns 96 threads via multithreading;

2. Each thread uses sklearn, which again spawns 96 threads using OpenMP.

This obviously overruns the node, and I want to address it.

The node has 300GB RAM, but sacct (and seff) reports 1.2TB
MaxRSS (and also AveRSS). This does not look correct.


I suspect that Slurm with jobacct_gather/linux repeatedly sums up the
memory used by all these threads, counting the same thing many times.

For the OpenMP part, maybe it is fine for Slurm; but for Python
multithreading, maybe it does not work well with Slurm's memory
accounting?

So, if this is the case, maybe the real value is 1.2TB/96 ≈ 12GB MaxRSS?

I want to get the right MaxRSS to report to users.

Thanks!

Best,

Feng

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-21 Thread Feng Zhang via slurm-users
Yes, the algorithm works like that: 1 CPU (core) per job (task).
As someone mentioned already, you need to enable oversubscription of the CPU
cores in slurm.conf, meaning 10 jobs on each core in your case.
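
A hedged slurm.conf sketch of that idea (partition line taken from the quoted
config; the FORCE:10 count is an assumption matching 10 jobs per core):

PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=FORCE:10
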
Best,

Feng

On Fri, Jun 21, 2024 at 6:52 AM Arnuld via slurm-users
 wrote:
>
> > Every job will need at least 1 core just to run
> > and if there are only 4 cores on the machine,
> > one would expect a max of 4 jobs to run.
>
> I have 3500+ GPU cores available. You mean each GPU job requires at least one 
> CPU? Can't we run a job with just GPU without any CPUs? This sbatch script 
> requires 100 GPU cores; can't we run 35 in parallel?
>
> #! /usr/bin/env bash
>
> #SBATCH --output="%j.out"
> #SBATCH --error="%j.error"
> #SBATCH --partition=pgpu
> #SBATCH --gres=shard:100
>
> sleep 10
> echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
> echo "Running..."
> sleep 10
>
>
>
>
>
>
> On Thu, Jun 20, 2024 at 11:23 PM Brian Andrus via slurm-users 
>  wrote:
>>
>> Well, if I am reading this right, it makes sense.
>>
>> Every job will need at least 1 core just to run and if there are only 4
>> cores on the machine, one would expect a max of 4 jobs to run.
>>
>> Brian Andrus
>>
>> On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
>> > I have a machine with a quad-core CPU and an Nvidia GPU with 3500+
>> > cores.  I want to run around 10 jobs in parallel on the GPU (mostly
>> > are CUDA based jobs).
>> >
>> > PROBLEM: Each job asks for only 100 shards (runs usually for a minute
>> > or so), then I should be able to run 3500/100 = 35 jobs in
>> > parallel but slurm runs only 4 jobs in parallel keeping the rest in
>> > the queue.
>> >
>> > I have this in slurm.conf and gres.conf:
>> >
>> > # GPU
>> > GresTypes=gpu,shard
>> > # COMPUTE NODES
>> > PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>> > PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
>> > NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500
>> > CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1
>> > RealMemory=64255 State=UNKNOWN
>> > --
>> > Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
>> > Name=shard Count=3500  File=/dev/nvidia0
>> >
>> >
>> >
>>
>> --
>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Print Slurm Stats on Login

2024-08-28 Thread Feng Zhang via slurm-users
You can also check https://github.com/prod-feng/slurm_tools

slurm_job_perf_show.py may be helpful.

I used to use slurm_job_perf_show_email.py to send users emails
summarizing their usage, e.g. monthly, but some users seemed to get
confused, so I stopped.
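
A generic sketch of the print-at-login idea (script location, time window, and
flags are assumptions; adjust to taste):

# e.g. /etc/profile.d/slurm_stats.sh
sreport -t Hours cluster AccountUtilizationByUser Users=$USER Start=$(date -d '-30 days' +%F) 2>/dev/null
sshare -U -u $USER 2>/dev/null    # fairshare for the logging-in user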

Best,

Feng

On Fri, Aug 9, 2024 at 11:13 AM Paul Edmon via slurm-users
 wrote:
>
> We are working to make our users more aware of their usage. One of the
> ideas we came up with was to have some basic usage stats printed at 
> login (usage over past day, fairshare, job efficiency, etc). Does anyone
> have any scripts or methods that they use to do this? Before baking my
> own I was curious what other sites do and if they would be willing to
> share their scripts and methodology.
>
> -Paul Edmon-
>
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com