Re: [slurm-users] Slurm and shared file systems

2020-06-19 Thread Alex Chekholko
Hi David, There are several approaches to have a shared filesystem namespace without an actual shared filesystem. One issue you will have to contend with is how to handle any kind of filesystem caching (how much room to allocate for local cache, how to handle cache inconsistencies). examples: gcs

Re: [slurm-users] How to queue jobs based on non-existent features

2020-07-10 Thread Alex Chekholko
Hey Raj, To me this all sounds, at a high level, like a job for some kind of lightweight middleware on top of SLURM, e.g. makefiles or something like that, where each pipeline would be managed outside of slurm and would maybe submit a job to install some software, then submit a job to run something o
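A minimal sketch of that idea in plain shell, using sbatch job dependencies (the script names are made up):
  # submit the software-install step, then make the pipeline step wait for it
  install_jid=$(sbatch --parsable install_software.sh)
  sbatch --dependency=afterok:${install_jid} run_pipeline.sh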

Re: [slurm-users] Simultaneously running multiple jobs on same node

2020-11-23 Thread Alex Chekholko
Hi, Your job does not request any specific amount of memory, so it gets the default request. I believe the default request is all the RAM in the node. Try something like: $ scontrol show config | grep -i defmem DefMemPerNode = 64000 Regards, Alex On Mon, Nov 23, 2020 at 12:33 PM Jan
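For example, to let several jobs share a node, have them request an explicit amount of memory (numbers here are just illustrative):
  sbatch --mem=4G --cpus-per-task=1 job.sh
  # or set a sane cluster-wide default in slurm.conf, e.g.:
  # DefMemPerCPU=2000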

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Alex Chekholko
This may be more "cargo cult" than science, but I've advised users to add a "sleep 60" to the end of their job scripts if they are "I/O intensive". Sometimes they are somehow able to generate I/O in such a way that slurm thinks the job is finished, but the OS is still catching up on the I/O, and then slurm tries to
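Roughly what I mean, as a sketch (the program name is a placeholder and the 60 seconds is arbitrary):
  #!/bin/bash
  #SBATCH --job-name=io_heavy
  ./my_io_heavy_program   # the actual workload
  sync                    # ask the OS to flush dirty pages to disk
  sleep 60                # give the I/O time to settle before slurm reaps the job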

Re: [slurm-users] How to assign temporary priority bonuses or penalties?

2020-12-10 Thread Alex Chekholko
Hi Luke, Yes, I think your request is unusual. I believe in the past there have been a number of middleware tools that helped with this kind of bureaucracy, things like http://docs.adaptivecomputing.com/gold/ Regards, Alex On Thu, Dec 10, 2020 at 9:23 AM Luke Yeager wrote: > (originally posted at

Re: [slurm-users] Burst to AWS cloud

2020-12-15 Thread Alex Chekholko
Hey Sajesh, Each public cloud vendor provides a standard way to create a virtual private network in their infrastructure and connect that private network to your existing private network for your cluster. The devil is in the networking details. So in that case, you can just treat it as a new rac

Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-18 Thread Alex Chekholko
Hi Jason, Ultimately each site decides how/why to do it; in my case I tend to do big "forklift upgrades", so I'm running 18.08 on the current cluster and will go to the latest SLURM for my next cluster build. But you may have good reasons to upgrade slurm more often on your existing cluster. I don't

Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Alex Chekholko
In my most recent experience, I have some SSDs in compute nodes that occasionally just drop off the bus, so the compute node loses its OS disk. I haven't thought about it too hard, but the default NHC scripts do not notice that. Similarly, Paul's proposed script might need to also check that the s
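A plain-shell sketch of the kind of extra check I mean (not NHC syntax, just the idea of verifying the OS disk is still present and writable):
  # fail the node if the root filesystem has gone read-only or disappeared
  if ! touch /var/tmp/.disk_check 2>/dev/null; then
      echo "root filesystem not writable" >&2
      exit 1
  fi
  rm -f /var/tmp/.disk_check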

Re: [slurm-users] Picocluster 5H Jetson Nano - SLURM A57+GPU

2021-08-02 Thread Alex Chekholko
I don't have specific answers to your questions but one thing you can do is run the slurmd on one of your "nodes" and see what hardware specs SLURM auto-detects. Run "slurmd -C"; from the man page: -C Print actual hardware configuration and exit. The format of output is the same as used
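On a healthy node the output looks roughly like this (values below are made up; yours will differ):
  $ slurmd -C
  NodeName=nano01 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=3956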

Re: [slurm-users] Available gpus ?

2018-03-16 Thread Alex Chekholko
There was a previous thread where someone recommended a third-party script: "pestat -G" that will parse the outputs of 'scontrol show node' and 'scontrol show job' and perhaps add up the used GPUs? https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat On Fri, Mar 16, 2018 at 11:44 AM,

Re: [slurm-users] Running Slurm on a single host?

2018-04-06 Thread Alex Chekholko
The thing you are describing is possible in both theory and practice. Plenty of people use a scheduler on a single large host. The challenge will be enforcing user practices so that they don't run commands directly but instead go through the scheduler. On Fri, Apr 6, 2018 at 10:00 AM, Patrick Goetz wrot

Re: [slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

2018-05-04 Thread Alex Chekholko
Hey Will, It may be just as easy in your case to build it directly; it's just one C file and a makefile: https://github.com/SchedMD/slurm/tree/master/contribs/pam Regards, Alex On Fri, May 4, 2018 at 2:11 PM, Will Dennis wrote: > I just tried unpacking the original archive, and running “./co
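Roughly (untested on 16.04, so treat this as a sketch): run configure at the top of the slurm source tree, then build just that directory:
  ./configure --sysconfdir=/etc/slurm
  make -C contribs/pam
  # then copy the resulting pam_slurm.so into your PAM module directory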

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-16 Thread Alex Chekholko
Add a logging rule to your iptables and look at what traffic is actually being blocked? On Wed, May 16, 2018 at 11:11 AM Sean Caron wrote: > Hi all, > > Does anyone use SLURM in a scenario where there is an iptables firewall on > the compute nodes on the same network it uses to communicate with
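For example, insert a LOG rule immediately before the rule that drops the traffic, so blocked packets show up in the kernel log (the prefix is arbitrary):
  # find the position N of your DROP/REJECT rule first:
  iptables -L INPUT --line-numbers
  iptables -I INPUT N -j LOG --log-prefix "slurm-blocked: " --log-level 4
  # then watch the kernel log, e.g.: journalctl -k -f | grep slurm-blocked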

[slurm-users] "fatal: can't stat gres.conf"

2018-07-23 Thread Alex Chekholko
Hi all, I have a few working GPU compute nodes. I bought a couple more identical nodes. They are all diskless, so they all boot from the same disk image. For some reason slurmd refuses to start on the new nodes, and I'm not able to find any differences in hardware or software. Google search

Re: [slurm-users] "fatal: can't stat gres.conf"

2018-07-23 Thread Alex Chekholko
:41 PM Bill wrote: > Hi Alex, > > Try run nvidia-smi before start slurmd, I also found this issue. I have to > run nvidia-smi before slurmd when I reboot system. > Regards, > Bill > > > -- Original -- > *From:* Alex Chekholko > *D

Re: [slurm-users] "fatal: can't stat gres.conf"

2018-07-26 Thread Alex Chekholko
Hello all, My error was indeed just the comma in my gres.conf. I was confused because I had the same file on my running nodes, but that's just because slurmd had started before the erroneous comma was added to the config. So the error message was in fact correct: it could not find the device
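For reference, a working gres.conf for a node with four GPUs looks something like this (note: no stray comma on the File= line):
  # /etc/slurm/gres.conf
  Name=gpu File=/dev/nvidia[0-3]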

Re: [slurm-users] Unable to contact slurm controller

2018-07-31 Thread Alex Chekholko
Seems like your slurmctld is not running. Have you checked its log to see why? On Tue, Jul 31, 2018 at 8:35 AM Mahmood Naderan wrote: > Hi, > It seems that squeue is broken due to the following error: > > [root@rocks7 ~]# squeue > slurm_load_jobs error: Unable to contact slurm controller (conne

[slurm-users] changing JobAcctGatherType on busy cluster?

2018-08-14 Thread Alex Chekholko
Hi, Right now I have a cluster running SLURM v17.02.7 with: JobAcctGatherType = jobacct_gather/none The documentation says "NOTE: Changing this configuration parameter changes the contents of the messages between Slurm daemons. Any previously running job steps are managed by a slurmstepd d
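For context, the change I'm considering is just this in slurm.conf (the plugin chosen below is only an example), plus restarting the daemons:
  # slurm.conf
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherFrequency=30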

Re: [slurm-users] RFC: Slurm Tool to Automate and Track Large Job Arrays

2019-01-18 Thread Alex Chekholko
Almost every place I've worked has built some site-specific tools for managing jobs that some people found very useful, e.g. https://github.com/StanfordBioinformatics/SJM or http://clusterjob.org/ There have also been some efforts to standardize this sort of thing: https://www.commonwl.org/ I have not use

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Alex Chekholko
Hi Will, You have bumped into the old adage: "HPC is just about moving the bottlenecks around". If your bottleneck is now your network, you may want to upgrade the network. Then the disks will become your bottleneck :) For GPU training-type jobs that load the same set of data over and over agai
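One common workaround, sketched here with made-up paths (and assuming your site sets a per-job TMPDIR), is to stage the data to node-local scratch once, so repeated epochs read from local disk instead of the network:
  #SBATCH --tmp=200G                            # ask for nodes with enough local scratch
  cp -r /shared/datasets/my_dataset "$TMPDIR"/  # one slow copy over the network
  python train.py --data "$TMPDIR"/my_dataset   # many fast local reads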

Re: [slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

2019-03-21 Thread Alex Chekholko
Hey Graziano, To make your decision more "data-driven", you can pipe your SLURM accounting logs into a tool like XDMOD which will make you pie charts of usage by user, group, job, gres, etc. https://open.xdmod.org/8.0/index.html You may also consider assigning this task to one of your "machine
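Even before standing up XDMoD you can pull the raw numbers out of the accounting database with sacct, e.g. (date range and field list chosen arbitrarily):
  sacct -a -P -S 2019-01-01 -E 2019-03-21 \
        -o JobID,User,Account,Partition,AllocCPUS,Elapsed,State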

Re: [slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

2019-03-21 Thread Alex Chekholko
any millions of jobs you want to process. I'm not aware of command-line tools that produce pretty graphs suitable for consumption by upper management :) Regards, Alex On Thu, Mar 21, 2019 at 10:03 AM Noam Bernstein wrote: > On Mar 21, 2019, at 12:38 PM, Alex Chekholko wrote: > >

Re: [slurm-users] Slurm 1 CPU

2019-04-04 Thread Alex Chekholko
Hi Chris, re: "can't run more than 1 job per node at a time": try "scontrol show config" and grep for defmem. IIRC, by default the memory request for any job is all the memory in a node. Regards, Alex On Thu, Apr 4, 2019 at 4:01 PM Andy Riebs wrote: > in slurm.conf, on the line(s) starting "

[slurm-users] sstat: symbol lookup error: /usr/lib/slurm/auth_munge.so: undefined symbol: slurm_debug

2019-04-12 Thread Alex Chekholko
Hi all, I'm running on Ubuntu 18.04.2 LTS munge is from the Ubuntu package slurm v18.08.7 I compile myself with ./configure --prefix=/tmp/slurm-build-7 --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm Then I make a deb with fpm and ins
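The fpm step is along these lines (this is just the general shape, with a placeholder staging directory, not my exact command):
  fpm -s dir -t deb -n slurm -v 18.08.7 -C <staging-dir> .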

Re: [slurm-users] combine RAM between different nodes

2019-04-17 Thread Alex Chekholko
Hey Suzanne, In order to "combine" RAM between different systems, you will need a hardware/software solution like ScaleMP, or you need a software framework like OpenMPI. If your software is already written to use MPI then, in a sense, it is "combining" the memory. SLURM is a resource manager and
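For example, an MPI job spread across several nodes effectively uses the RAM of all of them (the program name is a placeholder):
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=1
  #SBATCH --mem=60G        # per-node memory, so ~240G aggregate across the whole job
  srun ./my_mpi_program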

[slurm-users] user-provided epilog does not always run

2019-04-25 Thread Alex Chekholko
Hi all, My expectation is that the epilog script gets run no matter what happens to the job (fails, canceled, timeout, etc). Is that true, or are there corner cases? I hope I correctly understand the intended behavior. My OS is Ubuntu 18.04.2 LTS and my SLURM is 18.08.7 built from source. The e

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Alex Chekholko
I think this error usually means that on your node cn7 it has either the wrong /etc/hosts or the wrong /etc/slurm/slurm.conf E.g. try 'srun --nodelist=cn7 ping -c 1 cn7' On Wed, May 29, 2019 at 6:00 AM Alexander Åhman wrote: > Hi, > Have a very strange problem. The cluster has been working just

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Alex Chekholko
Hey Samuel, Can't you just adjust the existing "cpu" limit numbers using those same multipliers? Someone bought 100 CPUs 5 years ago, now that's ~70 CPUs. Or vice versa, someone buys 100 CPUs today, they get a setting of 130 CPUs because the CPUs are normalized to the old performance. Since it
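Concretely, it's just arithmetic on the existing limit (100 old CPUs x 0.7 = 70 normalized CPUs), applied with sacctmgr; the account name below is made up:
  sacctmgr modify account name=old_group set GrpTRES=cpu=70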

Re: [slurm-users] Question About Restarting Slurmctld and Slurmd

2019-07-24 Thread Alex Chekholko
Hi Chad, Here is the most generally useful process I ended up with, implemented in a local custom utility script. #Update slurm.conf everywhere #Stop slurmctld #Restart all slurmd processes #Start slurmctld per: https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes I think you only will
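In command form, the script boils down to roughly this (pdsh and the node list are just stand-ins for however you run commands cluster-wide):
  # 1. push the new slurm.conf to every node (however you normally distribute files)
  # 2. stop the controller
  systemctl stop slurmctld
  # 3. restart slurmd everywhere
  pdsh -w node[001-100] systemctl restart slurmd
  # 4. start the controller again
  systemctl start slurmctld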

Re: [slurm-users] Fwd: Slurm/cgroups on a single head/compute node

2019-08-21 Thread Alex Chekholko
Hey David, Which distro? Which kernel version? Which systemd version? Which SLURM version? Based on some paths in your variables, I'm guessing Ubuntu distro with Debian SLURM packages? Regards, Alex On Wed, Aug 21, 2019 at 5:24 AM David da Silva Pires < david.pi...@butantan.gov.br> wrote: >

Re: [slurm-users] Fwd: Slurm/cgroups on a single head/compute node

2019-08-22 Thread Alex Chekholko
Hi David, I actually don't know much about cgroups, and I don't have a single-node cluster. Here are some cgroup-related settings from my regular Ubuntu 18.04 cluster, running SLURM 18.08.7 root@cb-admin:~# cat /etc/slurm/slurm.conf | grep -i cgr ProctrackType=proctrack/cgroup TaskPlugin=task/cg
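A minimal cgroup.conf to go with those slurm.conf lines looks something like this (illustrative, not copied verbatim from my cluster):
  # /etc/slurm/cgroup.conf
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes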

Re: [slurm-users] sbatch tasks stuck in queue when a job is hung

2019-08-29 Thread Alex Chekholko
Sounds like maybe you didn't correctly roll out / update your slurm.conf everywhere as your RealMemory value is back to your large wrong number. You need to update your slurm.conf everywhere and restart all the slurm daemons. I recommend the "safe procedure" from here: https://wiki.fysik.dtu.dk/ni

Re: [slurm-users] Scheduling GPUS

2019-11-07 Thread Alex Chekholko
Hi Mike, IIRC if you have the default config, jobs get all the memory in the node, thus you can only run one job at a time. Check: root@admin:~# scontrol show config | grep DefMemPerNode DefMemPerNode = 64000 Regards, Alex On Thu, Nov 7, 2019 at 1:21 PM Mike Mosley wrote: > Greetings

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Alex Chekholko
Hi, I had asked a similar question recently (maybe a year ago) and also got crickets. I think in our case we were not able to ensure that the epilog always ran for different types of job failures, so we just had the users add some more cleanup code to the end of their jobs _and_ also run separate
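The "cleanup code in the job itself" part can be as simple as a bash trap, which fires on normal exit and on SIGTERM (e.g. scancel or timeout), though nothing survives a hard SIGKILL; the paths below are placeholders:
  #!/bin/bash
  SCRATCH_DIR=/local/scratch/$SLURM_JOB_ID
  mkdir -p "$SCRATCH_DIR"
  trap 'rm -rf "$SCRATCH_DIR"' EXIT TERM   # runs on normal exit or SIGTERM, not SIGKILL
  ./do_work "$SCRATCH_DIR"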

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Alex Chekholko
ion J > > > > *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com > ] *On Behalf Of *Alex Chekholko > *Sent:* Monday, December 9, 2019 12:53 PM > *To:* Slurm User Community List > > *Subject:* Re: [slurm-users] Timeout and Epilogue > > > > H

Re: [slurm-users] sched

2019-12-12 Thread Alex Chekholko
Hey Steve, I think it doesn't just "power down" the nodes but deletes the instances. So then when you need a new node, it creates one, then provisions the config, then updates the slurm cluster config... That's how I understand it, but I haven't tried running it myself. Regards, Alex On Thu, De
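The relevant knobs are the standard power-saving hooks in slurm.conf; the Resume/Suspend programs are whatever the cloud tooling provides (paths and timings below are placeholders):
  # slurm.conf
  SuspendProgram=/opt/cloud/suspend.sh   # deletes the instance
  ResumeProgram=/opt/cloud/resume.sh     # creates and provisions a fresh one
  SuspendTime=300
  ResumeTimeout=600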

Re: [slurm-users] Can't get node out of drain state

2020-01-23 Thread Alex Chekholko
Hey Dean, Does 'scontrol show node' give a Reason for the drain? Check the firewall configuration for the slurm daemons: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons Also check that slurmd daemons on the compute nodes can talk to each other (not just to the master), e.g. the bottom of https://slurm.schedmd.com/big_sys.html Regards, Alex

Re: [slurm-users] Why does the make install path get hard coded into the slurmd binary?

2020-02-18 Thread Alex Chekholko
Hey Dean, Here is what I found in my build notes which are now outdated by 1 year at least, but probably there are some more configure parameters you want to specify with relevant directories: ./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-li

Re: [slurm-users] How to get the Average number of CPU cores used by jobs per day?

2020-04-03 Thread Alex Chekholko
Hey Sudeep, Which flags to sreport have you tried? Which information was missing? Regards, Alex On Thu, Apr 2, 2020 at 10:29 PM Sudeep Narayan Banerjee < snbaner...@iitgn.ac.in> wrote: > Dear Steven: Yes, but am unable to get the desired data. Not sure which > flags to use. > > Thanks & Regard
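A couple of sreport invocations that usually cover this kind of question (date range is arbitrary):
  sreport cluster Utilization start=2020-03-01 end=2020-04-01 -t hours
  sreport cluster UserUtilizationByAccount start=2020-03-01 end=2020-04-01 -t hours
  # allocated core-hours divided by the hours in the period gives the average cores in use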

Re: [slurm-users] Slurm queue seems to be completely blocked

2020-05-11 Thread Alex Chekholko
You will want to look at the output of 'sinfo' and 'scontrol show node' to see what slurmctld thinks about your compute nodes; then on the compute nodes you will want to check the status of the slurmd service ('systemctl status -l slurmd') and possibly read through the slurmd logs as well. On Mon,

Re: [slurm-users] [External] Re: Slurm queue seems to be completely blocked

2020-05-11 Thread Alex Chekholko
Any time a node goes into DRAIN state you need to manually intervene and put it back into service. scontrol update nodename=ip-172-31-80-232 state=resume On Mon, May 11, 2020 at 11:40 AM Joakim Hove wrote: > > You’re on the right track with the DRAIN state. The more specific answer >> is in the

Re: [slurm-users] Gres GPU Resource Issue

2020-05-17 Thread Alex Chekholko
Hi Andrew, I think maybe something is wrong with your slurmd, maybe something missing from your install? On the node (where slurmd is running), you should see a message similar to this in slurmd.log [2020-05-11T14:29:17.766] Gres Name=gpu Type=titanrtx Count=4 ID=7696487 File=/dev/nvidia[0-3] (n

[slurm-users] trying to figure out how to troubleshoot cloud node resume/suspend

2024-08-23 Thread Alex Chekholko via slurm-users
Hi all, I have a cloud cluster running in GCP that seems to have gotten stuck in a state where the slurmctld will not start/stop compute nodes, it just sits there with thousands of jobs in the queue and only a few compute nodes up and running (out of thousands). I can try to kick it by setting no
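The kind of manual kick I mean is along these lines (node names are just examples):
  scontrol update nodename=gpu-node-[001-010] state=power_down
  # or, for a node slurmctld already thinks is powering up:
  scontrol update nodename=gpu-node-001 state=resume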