Re: [slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED

2022-12-02 Thread Michael Robbert
ensorsTemp=n/s > > Whereas this command shows only one node on which job is running: > > *(base) [nousheen@nousheen slurm]$ squeue -j* > JOBID PARTITION NAME USER ST TIME NODES > NODELIST(REASON) > 109 debug SRBD-4 nousheen

Re: [slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED

2022-12-01 Thread Michael Robbert
I believe that the error you need to pay attention to for this issue is this line: Dec 01 16:17:19 nousheen slurmctld[1631]: slurmctld: error: Check for out of sync clocks It looks like your compute node's clock is a full day ahead of your controller node: Dec. 2 instead of Dec. 1. The
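A quick way to confirm the skew (a sketch, assuming chrony is the time service and ssh to the node works; the node name is a placeholder):
    date; ssh <compute-node> date   # compare the two timestamps side by side
    chronyc tracking                # shows this host's offset from its NTP sources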

Re: [slurm-users] [External] Re: Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread Michael Robbert
Have you looked at the High Throughput Computing Administration Guide: https://slurm.schedmd.com/high_throughput.html In particular, for this problem it may help to look at the SchedulerParameters. I believe that the scheduler's defaults are very conservative and it will stop looking for jobs to run pretty
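As a sketch only (values are illustrative, not from this thread), the relevant slurm.conf line looks like:
    SchedulerParameters=default_queue_depth=1000,bf_max_job_test=1000,sched_min_interval=2000000
See the high_throughput guide above for which parameters actually matter for your workload.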

Re: [slurm-users] [External] Re: [EXT] Software and Config for Job submission host only

2022-05-12 Thread Michael Robbert
Don’t forget about munge. You need to have munged running with the same key as the rest of the cluster in order to authenticate. Mike Robbert, Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing, Information and Technology Solutions (ITS), 303-273-3786 | 
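A minimal sketch of setting that up on a submit host (host names are placeholders):
    scp controller:/etc/munge/munge.key /etc/munge/munge.key
    chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
    systemctl enable --now munge
    munge -n | ssh <any-cluster-node> unmunge    # round-trip test of the shared key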

Re: [slurm-users] [EXTERNAL] slurm-users Digest, Vol 55, Issue 5

2022-05-04 Thread Michael Robbert
Jim, I’m glad you got your problem solved. Here is an additional tip that will make it easier to fix in the future. You don’t need to put scontrol into a loop; the NodeName parameter will take a node range expression. So, you can use NodeName=sjc01enadsapp[01-08]. A SysAdmin in training saw me
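For example (State=RESUME is only an assumed action here; the point is the bracketed range):
    scontrol update NodeName=sjc01enadsapp[01-08] State=RESUME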

Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2022-02-07 Thread Michael Robbert
They moved Arbiter2 to GitHub. Here is the new official repo: https://github.com/CHPC-UofU/arbiter2 Mike On 2/7/22, 06:51, "slurm-users" wrote: Hi, I've just noticed that the repository https://gitlab.chpc.utah.edu/arbiter2 seems to be down. Does someone know more? Thank you! Best, Stefan Am

Re: [slurm-users] [External] Re: srun : Communication connection failure

2022-01-20 Thread Michael Robbert
It looks like it could be some kind of network problem but could be DNS. Can you ping and do DNS resolution for the host involved? What does slurmctld.log say? How about slurmd.log on the node in question? Mike From: slurm-users on behalf of Durai Arasan Date: Thursday, January 20, 2022 at
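A few checks along those lines (the node name is a placeholder):
    ping -c1 <node>
    getent hosts <node>                      # does name resolution work from here?
    tail -f /var/log/slurm/slurmctld.log     # actual path depends on your SlurmctldLogFile setting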

Re: [slurm-users] EXTERNAL-Re: [External] scancel gpu jobs when gpu is not requested

2021-08-26 Thread Michael Robbert
e could look at the number of gpu passed?), but where do i set up that function and where do i call it? Thanks, Fritz Ratnasamy Data Scientist Information Technology The University of Chicago Booth School of Business 5807 S. Woodlawn Chicago, Illinois 60637 Phone: +(1) 773-834-4556 O

Re: [slurm-users] [External] scancel gpu jobs when gpu is not requested

2021-08-25 Thread Michael Robbert
I doubt that it is a problem with your script and suspect that there is some weird interaction with scancel on interactive jobs. If you wanted to get to the bottom of that I’d suggest disabling the prolog and test by manually cancelling some interactive jobs. Another suggestion is to try a

Re: [slurm-users] [External] Re: Preemption not working for jobs in higher priority partition

2021-08-24 Thread Michael Robbert
I can confirm that we do preemption based on partition for one of our clusters. I will say that we are not using time-based partitions, ours are always up and they are based on group node ownership. I wonder if Slurm is refusing to preempt a job in a DOWN partition. Maybe try leaving the
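For reference, a partition-based preemption setup is sketched below (partition and node names are made up):
    PreemptType=preempt/partition_prio
    PartitionName=owner    Nodes=node[01-10] PriorityTier=10 PreemptMode=OFF
    PartitionName=scavenge Nodes=node[01-10] PriorityTier=1  PreemptMode=REQUEUE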

[slurm-users] Testing Lua job submit plugins

2021-05-06 Thread Michael Robbert
I’m wondering if others in the Slurm community have any tips or best practices for the development and testing of Lua job submit plugins. Is there anything that can be done prior to deployment on a production cluster that will help to ensure the code is going to do what you think it does or at

Re: [slurm-users] [External] slurmd -C vs lscpu - which do I use to populate slurm.conf?

2021-04-28 Thread Michael Robbert
I think that you want to use the output of slurmd -C, but if that isn’t telling you the truth then you may not have built slurm with the correct libraries. I believe that you need to build with hwloc in order to get the most accurate details of the CPU topology. Make sure you have hwloc-devel
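For reference, slurmd -C prints the topology it detects in slurm.conf syntax, roughly like this (numbers here are illustrative):
    $ slurmd -C
    NodeName=node01 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191879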

Re: [slurm-users] [External] srun at front-end nodes with --enable_configless fails with "Can't find an address, check slurm.conf"

2021-03-22 Thread Michael Robbert
I haven't tried configless setup yet, but the problem you're hitting looks like it could be a DNS issue. Can you do a dns lookup of n26 from the login node? The way that non-interactive batch jobs are started may not require that, but I believe that it is required for interactive jobs. Mike
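For example, from the login node:
    getent hosts n26                        # does the name resolve at all?
    scontrol show node n26 | grep NodeAddr  # the address Slurm itself has for the node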

Re: [slurm-users] [External] Preemption not working in 20.11

2021-02-26 Thread Michael Robbert
We saw something that sounds similar to this. See this bug report: https://bugs.schedmd.com/show_bug.cgi?id=10196 SchedMD never found the root cause. They thought it might have something to do with a timing problem on Prolog scripts, but the thing that fixed it for us was to set GraceTime=0 on
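A sketch of that setting (the partition/QOS name is a placeholder; GraceTime can be set in either place):
    PartitionName=preemptable Nodes=node[01-10] GraceTime=0    # in slurm.conf
    sacctmgr modify qos preemptable set GraceTime=0            # if preemption is QOS-based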

Re: [slurm-users] [External] Fwd: Slurm MySQL database configuration

2020-07-23 Thread Michael Robbert
Peter, I believe that the answer to your database question is that you don't have two MySQL/MariaDB servers running at the same time. The only way that I know of to run MySQL/MariaDB in an active-active setup, which is what you appear to be describing, is with replication. The other setup is to

Re: [slurm-users] [External] Re: Problem with permisions. CentOS 7.8

2020-06-02 Thread Michael Robbert
Those files in /run/systemd/generator.late/ look like they came from older SystemV init scripts. Can you check to make sure you don't have a slurm service script in /etc/init.d/? Also, note that there is a difference between the "slurm" service and the "slurmd" service. The former was the older
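Quick checks along those lines:
    ls -l /etc/init.d/slurm*          # any leftover SysV script?
    systemctl status slurm slurmd     # the generated "slurm" unit vs. the packaged "slurmd" unit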

Re: [slurm-users] [External] slurm only looking in "default" partition during scheduling

2020-05-12 Thread Michael Robbert
You have defined both of your partitions with “Default=YES”, but Slurm can have only one default partition. You can see from the * on the compute partition in your sinfo output that Slurm selected that one as the default. When you use srun or sbatch it will only look at the default partition unless
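A sketch with placeholder names (only one partition may carry Default=YES; jobs for the other must name it explicitly):
    PartitionName=compute Default=YES Nodes=node[01-10]
    PartitionName=bigmem  Default=NO  Nodes=node[11-12]
    sbatch --partition=bigmem job.sh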

Re: [slurm-users] [External] Re: Slurm queue seems to be completely blocked

2020-05-11 Thread Michael Robbert
You’re on the right track with the DRAIN state. The more specific answer is in the “Reason=” description on the last line. It looks like your node has less memory than what you’ve defined for the node in slurm.conf Mike From: slurm-users on behalf of Joakim Hove Reply-To: Slurm User
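To confirm and clear it (the node name is a placeholder):
    scontrol show node <node> | grep -E 'RealMemory|Reason'
    # after correcting RealMemory in slurm.conf (or the hardware) and reconfiguring:
    scontrol update NodeName=<node> State=RESUME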

Re: [slurm-users] [External] Defining a default --nodes=1

2020-05-08 Thread Michael Robbert
Manuel, You may want to instruct your users to use ‘-c’ or ‘--cpus-per-task’ to define the number of CPUs that they need. Please correct me if I’m wrong, but I believe that will restrict the jobs to a single node, whereas ‘-n’ or ‘--ntasks’ is really for multi-process jobs which can be spread
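For example, in a batch script:
    #SBATCH --cpus-per-task=8   # 8 CPUs on a single node, for a threaded program
    #SBATCH --ntasks=8          # 8 tasks, which Slurm may spread across nodes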

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Michael Robbert
at 1:43 PM Michael Robbert wrote: It looks like you have hyper-threading turned on, but haven’t defined ThreadsPerCore=2. You either need to turn off Hyper-threading in the BIOS or change the definition of ThreadsPerCore in slurm.conf. Nice find. node003 has hyper threading enabled

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Michael Robbert
It looks like you have hyper-threading turned on, but haven’t defined ThreadsPerCore=2. You either need to turn off Hyper-threading in the BIOS or change the definition of ThreadsPerCore in slurm.conf. Mike From: slurm-users on behalf of Robert Kudyba Reply-To: Slurm User
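A node definition consistent with 2 sockets x 12 cores x 2 threads would look roughly like this (RealMemory is illustrative):
    NodeName=node003 CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191879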

Re: [slurm-users] [External] Another question about partition and node allocation

2020-04-15 Thread Michael Robbert
The more flexible way to do this is with QoS (PreemptType=preempt/qos). You'll need to have Accounting enabled and you'll probably want qos listed in AccountingStorageEnforce. Once you do that, you create a "shared" QoS for the scavenger jobs and a QoS for each group that buys into resources. Assign
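A rough sketch of the pieces (QoS names are examples):
    # slurm.conf
    PreemptType=preempt/qos
    AccountingStorageEnforce=associations,limits,qos
    # sacctmgr
    sacctmgr add qos shared
    sacctmgr add qos group_a
    sacctmgr modify qos group_a set Preempt=shared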

Re: [slurm-users] Intel MPI startup

2019-04-30 Thread Michael Robbert
Samuel wrote: On Monday, 29 April 2019 8:47:49 AM PDT Michael Robbert wrote: Intel has supposedly supported PMI-2 since their 2017 release and that is what SchedMD suggested we use in a recent bug report to them, but I found that it no longer works in Intel MPI 2019. I opened a bug report

[slurm-users] Intel MPI startup

2019-04-29 Thread Michael Robbert
I was curious what startup method other sites are using with Intel MPI? According to the documentation srun with Slurm's PMI is the recommended way ( https://slurm.schedmd.com/mpi_guide.html#intel_srun ). Intel has supposedly supported PMI-2 since their 2017 release and that is what SchedMD
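A PMI-2 launch with Intel MPI looks something like this (the libpmi2.so path varies by installation):
    export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
    srun --mpi=pmi2 ./my_mpi_app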

Re: [slurm-users] Does latest slurm version still work on CentOS 6?

2019-02-11 Thread Michael Robbert
Colas, You need to use the legacy spec file from the contribs directory: ls -l slurm-18.08.5/contribs/slurm.spec-legacy -rw-r--r-- 1 mrobbert mrobbert 38574 Jan 30 11:59 slurm-18.08.5/contribs/slurm.spec-legacy Mike On 2/11/19 9:26 AM, Colas Rivière wrote: > Hello, > > I'm trying to update
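One way to build with it (an assumption on my part, not a documented procedure) is to drop the legacy spec in place of slurm.spec before running rpmbuild against the tarball:
    tar xjf slurm-18.08.5.tar.bz2
    cp slurm-18.08.5/contribs/slurm.spec-legacy slurm-18.08.5/slurm.spec
    tar cjf slurm-18.08.5.tar.bz2 slurm-18.08.5
    rpmbuild -ta slurm-18.08.5.tar.bz2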

Re: [slurm-users] [Slurm 18.08.4] sacct/seff Inaccurate usercpu values

2019-01-16 Thread Michael Robbert
Andreas, Look again. I just looked and a commit to the source code was posted to the bug yesterday afternoon. It looks like that patch applies to the cgroup plugin. It won't show up until the next release, but at least there is a fix available. Mike Robbert On 1/15/19 11:43 PM, Henkel,

Re: [slurm-users] possible to set memory slack space before killing jobs?

2018-12-10 Thread Michael Robbert
If you want to detect lost DIMMs or anything like that use a Node Health Check script. I recommend and use this one: https://github.com/mej/nhc It has an option to generate a configuration file that will watch way more than you probably need, but if you want to know if something on your nodes
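A minimal hookup sketch (paths are the package defaults and may differ on your system):
    nhc-genconf                        # generates a starting NHC config from the current node
    # slurm.conf:
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300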

Re: [slurm-users] Having a possible cgroup issue?

2018-12-06 Thread Michael Robbert
Wes, You didn't list the Slurm command that you used to get your interactive session. In particular did you ask Slurm for access to all 14 cores? Also note that since Matlab is using threads to distribute work among cores you don't want to ask for multiple tasks (-n or --ntasks) as that will
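For a threaded job on all 14 cores, something like this (one task, many CPUs) rather than -n 14:
    srun -N 1 -n 1 -c 14 --pty bash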

Re: [slurm-users] Slurm / OpenHPC socket timeout errors

2018-11-26 Thread Michael Robbert
I believe that fragmentation only happens on routers when passing traffic from one subnet to another. Since this traffic was all on a single subnet there was no router involved to fragment the packets. Mike On 11/26/18 1:49 PM, Kenneth Roberts wrote: D’oh! The compute nodes had different MTU

Re: [slurm-users] srun: error: Unable to allocate resources: Invalid partition name specified

2018-07-26 Thread Michael Robbert
The line that you list from your slurm.conf shows the "course" partition being set as the default partition, but on our system the sinfo command shows our default partition with a * at the end, and your output doesn't show that, so I'm wondering if you've got another partition that is getting

Re: [slurm-users] Issue with salloc

2018-05-14 Thread Michael Robbert
Mahmood, You need to put all the options to srun before the executable that you want to run, which in this case is /bin/bash. So, it should look more like: srun -l -a em1 -p IACTIVE --mem=4GB --pty -u /bin/bash The way you have it, most of your srun options are being interpreted as bash

Re: [slurm-users] execute job regardless the exit status of dependent jobs

2018-01-19 Thread Michael Robbert
George, I haven't tested or used this, but why won't afterany do what you want?   afterany:job_id[:jobid...] This job can begin execution after the specified jobs have terminated. Mike On 1/19/18 11:09 AM, Hwa, George wrote: I have a “reaper” job that
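For example (script names are placeholders):
    jobid=$(sbatch --parsable work.sh)
    sbatch --dependency=afterany:$jobid reaper.sh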