Re: [slurm-users] Array jobs vs Fairshare

2020-10-21 Thread Riebs, Andy
JobCompType=jobcomp/none JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/cgroup Any ideas? Cheers, On Wed., Oct. 21, 2020 at 15:17, Riebs, Andy (mailto:andy.ri...@hpe.com) wrote: Also, of course, any of the information that you can provide about how the system is configured:

Re: [slurm-users] Array jobs vs Fairshare

2020-10-21 Thread Riebs, Andy
Also, of course, any of the information that you can provide about how the system is configured: scheduler choices, QOS options, and the like, would help in answering your question. From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Riebs, Andy Sent: Wednesday
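For anyone wondering how to gather that kind of information, a minimal sketch of commands that typically show the scheduler and QOS configuration, assuming the standard Slurm command-line tools are installed:

    # Scheduler and priority plugin choices
    $ scontrol show config | grep -i -e SchedulerType -e PriorityType
    # QOS definitions known to the accounting database
    $ sacctmgr show qos
    # Partition layout
    $ scontrol show partition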

Re: [slurm-users] Array jobs vs Fairshare

2020-10-21 Thread Riebs, Andy
Stephan (et al.), There are probably 6 versions of Slurm in common use today, across multiple versions each of Debian/Ubuntu, SuSE/SLES, and RedHat/CentOS/Fedora. You are more likely to get a good answer if you offer some hints about what you are running! Regards, Andy From: slurm-users [mail
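A quick sketch of commands that answer those questions (Slurm version and distribution), assuming the standard tools are on the PATH:

    # Slurm version, from a client command and from the controller binary
    $ sinfo --version
    $ slurmctld -V
    # Distribution and release
    $ cat /etc/os-release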

Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-06 Thread Riebs, Andy
nibo.it] Sent: Tuesday, October 6, 2020 3:13 AM To: Riebs, Andy ; Slurm User Community List Subject: Re: [slurm-users] Segfault with 32 processes, OK with 30 ??? On 05/10/20 14:18, Riebs, Andy wrote: Thanks for considering my query. > You need to provide some hints! What we know so far: > 1.

Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-05 Thread Riebs, Andy
You need to provide some hints! What we know so far: 1. What we see here is a backtrace from (what looks like) an Open MPI/PMI-x backtrace. 2. Your decision to address this to the Slurm mailing list suggests that you think that Slurm might be involved. 3. You have something (a job? a program?) t

Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Riebs, Andy
Relu, There are a number of ways to run an open source project. In the case of Slurm, the code is managed by SchedMD. As a rule, one presumes that they have plenty on their plate, and little time to respond to the mailing list. Hence the suggestion that one get a support contract to get their a

Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread Riebs, Andy
Check for Ethernet problems. This happens often enough that I have the following definition in my .bashrc file to help track these down: alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"' Andy From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.co
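Spelled out, the alias quoted above looks roughly like this; the controller hostname ("slurmctld-node") and the log path are site-specific placeholders and will differ on other installations:

    # Search the controller log for nodes reported as "not responding",
    # a frequent symptom of flaky Ethernet links
    alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'

    # Typical use:
    $ flaky_eth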

Re: [slurm-users] slurm & rstudio

2020-07-20 Thread Riebs, Andy
Frankly, it's hard to tell what you might be doing wrong if you don't tell us what you're doing! That notwithstanding, the "--uid" message suggests that something in your process is trying to submit a job with the "--uid" option, but you don't have sufficient privs to use it. Andy From: slurm

Re: [slurm-users] Meaning of "defunct" in description of Slurm parameters

2020-07-20 Thread Riebs, Andy
Ummm... unless I'm missing something obvious, though "defunct" is not the term I would have chosen (I would have expected "deprecated"), it seems quite clear that the new "SlurmctldHost" parameter has subsumed the 4 that you've listed. I wasn't privy to the decision or the discussion a
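For context, a minimal before-and-after slurm.conf sketch with hypothetical host names; it assumes the four defunct parameters in question are ControlMachine, ControlAddr, BackupController, and BackupAddr, which the excerpt above does not actually list:

    # Older style (now defunct):
    #   ControlMachine=head1
    #   ControlAddr=10.0.0.1
    #   BackupController=head2
    #   BackupAddr=10.0.0.2
    # Current style, one line per controller, primary first:
    SlurmctldHost=head1(10.0.0.1)
    SlurmctldHost=head2(10.0.0.2)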

Re: [slurm-users] How to exclude nodes in sbatch/srun?

2020-06-22 Thread Riebs, Andy
In fairness to our friends at SchedMD, this was filed as an enhancement request, not a bug. Since this is an open source project, there are 2 good ways to make it happen: 1. Fund someone, like SchedMD, to make the change. 2. Make the changes yourself, and submit the changes. Alter

Re: [slurm-users] Slurm and shared file systems

2020-06-19 Thread Riebs, Andy
David, I've been using Slurm for nearly 20 years, and while I can imagine some clever work-arounds, like staging your job in /var/tmp on all of the nodes before trying to run it, it's hard to imagine a cluster serving a useful purpose without a shared user file system, whether or not Slurm is i
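A hedged sketch of what such a work-around might look like inside a batch script, using sbcast to copy an executable into node-local /var/tmp on every allocated node before running it (the program name is hypothetical):

    #!/bin/bash
    #SBATCH --nodes=4
    # Copy the executable to node-local storage on all allocated nodes
    sbcast --force ./my_app /var/tmp/my_app
    # Run the local copy
    srun /var/tmp/my_app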

Re: [slurm-users] unable to start slurmd process.

2020-06-13 Thread Riebs, Andy
processes. slurmd started without any issues. Regards Navin. On Thu, Jun 11, 2020 at 9:23 PM Riebs, Andy (mailto:andy.ri...@hpe.com) wrote: Short of getting on the system and kicking the tires myself, I’m fresh out of ideas. Does “sinfo -R” offer any hints? From: slurm-users [mailto

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
:38 PM Riebs, Andy (mailto:andy.ri...@hpe.com) wrote: So there seems to be a failure to communicate between slurmctld and the oled3 slurmd. From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon. From the head node, try “scontrol show node oled3”, and then pi
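A short sketch of those connectivity checks (the node name oled3 comes from the thread above; "head" is a placeholder for the controller node):

    # From the compute node: can it reach the slurmctld daemon?
    oled3$ scontrol ping
    # From the head node: what does the controller think of the node?
    head$ scontrol show node oled3
    # And basic network reachability
    head$ ping -c 3 oled3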

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
OLED* up infinite 1 drain* oled3 While checking the node, I feel the node is healthy. Regards Navin On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy (mailto:andy.ri...@hpe.com) wrote: Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to interpret

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
ping, but the IP is pingable. Could that be one of the reasons? But the other nodes have the same config, and there I am able to start slurmd, so I am a bit confused. Regards Navin. On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy (mailto:andy.ri...@hpe.com) wrote: If you omitt

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
6:06 PM Riebs, Andy (mailto:andy.ri...@hpe.com) wrote: Navin, As you can see, systemd provides very little service-specific information. For slurm, you really need to go to the slurm logs to find out what happened. Hint: A quick way to identify problems like this with slurmd and slurmctld

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Riebs, Andy
Navin, As you can see, systemd provides very little service-specific information. For slurm, you really need to go to the slurm logs to find out what happened. Hint: A quick way to identify problems like this with slurmd and slurmctld is to run them with the “-Dvvv” option, causing them to log
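The hint above, spelled out: running the daemons in the foreground with extra verbosity sends the log to the terminal, which is often the fastest way to see why they refuse to start (root privileges are typically required):

    # On the compute node:
    # -D = stay in the foreground, -vvv = very verbose logging to stderr
    $ sudo slurmd -Dvvv
    # On the controller node:
    $ sudo slurmctld -Dvvv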

Re: [slurm-users] Intermittent problem at 32 CPUs

2020-06-05 Thread Riebs, Andy
Diego, I'm *guessing* that you are tripping over the use of "--tasks 32" on a heterogeneous cluster, though your comment about the node without InfiniBand troubles me. If you drain that node, or exclude it in your command line, that might correct the problem. I wonder if OMPI and PMIx have deci
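Two ways to take the suspect node out of the picture, as suggested above; "badnode" and "./my_mpi_app" are placeholders:

    # Exclude the node for a single run
    $ srun --ntasks 32 --exclude=badnode ./my_mpi_app
    # Or drain it administratively until the question is settled
    $ scontrol update NodeName=badnode State=DRAIN Reason="no InfiniBand"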

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-04 Thread Riebs, Andy
Geoffrey, A lot depends on what you mean by “failure on the current machine”. If it’s a failure that Slurm recognizes as a failure, Slurm can be configured to remove the node from the partition, and you can follow Rodrigo’s suggestions for the requeue options. If the user job simply decides it

Re: [slurm-users] Node suspend / Power saving - for *idle* nodes only?

2020-05-15 Thread Riebs, Andy
And if you're willing to buy a support contract with SchedMD, and/or provide a fix, it will be fixed. Otherwise, you'll have to accept that you've got a large group of users, just like you, who are willing to share their expertise and experience, even if it's not our "day job" -- or even our "ni

Re: [slurm-users] Do not upgrade mysql to 5.7.30!

2020-05-07 Thread Riebs, Andy
Alternatively, you could switch to MariaDB; I've been using that for years. Andy -Original Message- From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Marcus Wagner Sent: Thursday, May 7, 2020 8:55 AM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users]

Re: [slurm-users] Munge decode failing on new node

2020-04-17 Thread Riebs, Andy
A couple of quick checks to see if the problem is munge: 1. On the problem node, try $ echo foo | munge | unmunge 2. If (1) works, try this from the node running slurmctld to the problem node slurm-node$ echo foo | ssh node munge | unmunge From: slurm-users [mailto:slurm-users-boun.
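The two checks quoted above, written out one per line (node names are placeholders):

    # 1. Local round trip on the problem node: encode and decode a credential
    problem-node$ echo foo | munge | unmunge
    # 2. Cross-node check, run from the node hosting slurmctld:
    #    encode on the problem node, decode locally
    slurmctld-node$ echo foo | ssh problem-node munge | unmunge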

Re: [slurm-users] Munge decode failing on new node

2020-04-15 Thread Riebs, Andy
Two trivial things to check: 1. Permissions on /etc/munge and /etc/munge/munge.key 2. Is munged running on the problem node? Andy From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Dean Schulze Sent: Wednesday, April 15, 2020 1:57 PM To: Slurm User Community Li
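A minimal sketch of those two checks; the ownership and mode values noted in the comments are what munged typically insists on:

    # 1. Permissions: /etc/munge is normally 0700 and munge.key 0400,
    #    both owned by the munge user
    $ ls -ld /etc/munge
    $ ls -l /etc/munge/munge.key
    # 2. Is munged running on the problem node?
    $ systemctl status munge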

Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread Riebs, Andy
When you say “distinct compute nodes,” are they at least on the same network fabric? If so, the first thing I’d try would be to create a new partition that encompasses all of the nodes of the other two partitions. Andy From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf
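A hedged slurm.conf sketch of that approach, with hypothetical node and partition names; the two existing partitions stay as they are, and a third one spans both sets of nodes:

    PartitionName=part_a Nodes=nodeA[01-16] State=UP
    PartitionName=part_b Nodes=nodeB[01-16] State=UP
    # New partition spanning both, for jobs that need nodes from each set
    PartitionName=combined Nodes=nodeA[01-16],nodeB[01-16] State=UP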

Re: [slurm-users] SLURM with OpenMPI

2019-12-15 Thread Riebs, Andy
Agreed -- I do this frequently. (Be sure you've exported those variables, though!) Andy -Original Message- From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Paul Edmon Sent: Sunday, December 15, 2019 2:05 PM To: slurm-users@lists.schedmd.com Subject: Re: [slu

Re: [slurm-users] Timeout and Epilogue

2019-12-09 Thread Riebs, Andy
At the risk of stating the obvious… these seem like the sort of questions that could be answered with a 2 minute test. Better yet, not just answered, but with answers specific to your configuration ☺ From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Alex Chekholko Se

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-23 Thread Riebs, Andy
oing a daily patrol for them to clean them up.  > Most of the time you can just reopen the node but sometimes this indicates > something is wedged. > > -Paul Edmon- > > On 10/22/2019 5:22 PM, Riebs, Andy wrote: > > A common reason for seeing this is if a process is

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-22 Thread Riebs, Andy
A common reason for seeing this is if a process is dropping core -- the kernel will ignore job kill requests until that is complete, so the job isn't being killed as quickly as Slurm would like. I typically recommend increasing the UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avo
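The corresponding slurm.conf change, as a sketch (the 120-second value is the lower of the two suggestions above):

    # Give the kernel more time to finish a core dump before Slurm
    # declares the task unkillable and drains the node
    UnkillableStepTimeout=120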

[slurm-users] Using the OpenSHMEM reference implementation with Slurm

2019-09-09 Thread Riebs, Andy
Has anyone tried to use the Open SHMEM 1.4 reference implementation (see https://github.com/openshmem-org/osss-ucx) with Slurm? It appears to me that the Slurm PMI-x implementation needs a few more calls ("publish" and "lookup"), but I'd be delighted to be proven wrong! Andy -- Andy Riebs and

Re: [slurm-users] sacct thinks slurmctld is not up

2019-07-18 Thread Riebs, Andy
Brian, FWIW, we just restart slurmctld when this happens. I’ll be interested to hear if there’s a proper fix. Andy From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Brian Andrus Sent: Thursday, July 18, 2019 11:01 AM To: Slurm User Community List Subject: [slurm-use

Re: [slurm-users] Counting total number of cores specified in the sbatch file

2019-06-08 Thread Riebs, Andy
A quick & easy way to see what your options might be for Slurm environment variables is to try a job like this: $ srun --nodes 2 --ntasks-per-node 6 --pty env | grep SLURM Or, perhaps, use the “env | grep SLURM” in your batch script. Andy From: slurm-users [mailto:slurm-users-boun...@lists.sch
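Both variants of the suggestion above, as a sketch:

    # Interactive: print the SLURM_* environment seen by a 2-node, 12-task allocation
    $ srun --nodes 2 --ntasks-per-node 6 --pty env | grep SLURM

    # Or inside a batch script:
    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=6
    env | grep SLURM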

Re: [slurm-users] final stages of cloud infrastructure set up

2019-05-19 Thread Riebs, Andy
Just looking at this quickly, have you tried specifying “hint=multithread” as an sbatch parameter? From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of nathan norton Sent: Saturday, May 18, 2019 6:03 PM To: slurm-users@lists.schedmd.com Subject: [slurm-users] final stage
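In a batch script, that suggestion would look something like this sketch ("./my_app" is a placeholder):

    #!/bin/bash
    #SBATCH --hint=multithread   # use all hardware threads (SMT) on each core
    #SBATCH --nodes=1
    srun ./my_app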

Re: [slurm-users] job startup timeouts?

2019-05-02 Thread Riebs, Andy
This proved to be a scaling problem in PMIX; thanks to Artem Polyakov for tracking this down (and submitting a fix: https://bugs.schedmd.com/show_bug.cgi?id=6932). Thanks for all the suggestions folks! Andy From: Riebs, Andy Sent: Friday, April 26, 2019 11:24 AM To: slurm

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Riebs, Andy
Thanks for the quick response Doug! Unfortunately, I can't be specific about the cluster size, other than to say it's got more than a thousand nodes. In a separate test that I had missed, even "srun hostname" took 5 minutes to run. So there was no remote file system or MPI involvement. Andy -

Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-02-01 Thread Riebs, Andy
Given the extreme amount of output that will be generated for potentially a couple hundred job runs, I was hoping that someone would say “Seen it, here’s how to fix it.” Guess I’ll have to go with the “high output” route. Thanks Doug! Andy From: slurm-users [mailto:slurm-users-boun...@lists.sc

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Riebs, Andy
The /etc/munge/munge.key is different on the systems. Try md5sum /etc/munge/munge.key on both systems to see if they are the same... -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise +1 404 648 9024 From: slurm-users on behalf of Eric F. Alemany Sent
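The comparison spelled out, assuming root access to both machines (hostnames are placeholders):

    # Run on each system and compare the hashes; they must match
    node1$ sudo md5sum /etc/munge/munge.key
    node2$ sudo md5sum /etc/munge/munge.key
    # Or compare in one step from one of the nodes
    node1$ sudo md5sum /etc/munge/munge.key; ssh node2 sudo md5sum /etc/munge/munge.key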