JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
Any ideas?
Cheers,
On Wed, Oct 21, 2020 at 15:17, Riebs, Andy
(mailto:andy.ri...@hpe.com) wrote:
Also, of course, any information you can provide about how the system is
configured (scheduler choices, QOS options, and the like) would also help in
answering your question.
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Riebs, Andy
Sent: Wednesday
Stephan (et al.),
There are probably 6 versions of Slurm in common use today, across multiple
versions each of Debian/Ubuntu, SuSE/SLES, and RedHat/CentOS/Fedora. You are
more likely to get a good answer if you offer some hints about what you are
running!
Regards,
Andy
From: slurm-users [mail...nibo.it]
Sent: Tuesday, October 6, 2020 3:13 AM
To: Riebs, Andy; Slurm User Community List
Subject: Re: [slurm-users] Segfault with 32 processes, OK with 30 ???
On 05/10/20 14:18, Riebs, Andy wrote:
Thanks for considering my query.
You need to provide some hints! What we know so far:
1. What we see here is (what looks like) an Open MPI/PMIx backtrace.
2. Your decision to address this to the Slurm mailing list suggests that you
think that Slurm might be involved.
3. You have something (a job? a program?) that segfaults with 32 processes but
runs OK with 30.
Relu,
There are a number of ways to run an open source project. In the case of Slurm,
the code is managed by SchedMD. As a rule, one presumes that they have plenty
on their plate, and little time to respond to the mailing list. Hence the
suggestion that one get a support contract to get their attention.
Check for Ethernet problems. This happens often enough that I have the
following definition in my .bashrc file to help track these down:
alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com]
Frankly, it's hard to tell what you might be doing wrong if you don't tell us
what you're doing!
That notwithstanding, the "--uid" message suggests that something in your
process is trying to submit a job with the "--uid" option, but you don't have
sufficient privs to use it.
Andy
From: slurm
Ummm... unless I'm missing something obvious, though "defunct" would not have
been my choice of term (I would have expected "deprecated"), it seems quite
clear that the new "SlurmctldHost" parameter has subsumed the 4 that you've
listed. I wasn't privy to the discussion behind the decision...
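For reference, assuming the four parameters in question are the usual
ControlMachine/ControlAddr/BackupController/BackupAddr set, the mapping looks
something like this (hostnames and addresses made up):

# Old style, now subsumed:
#ControlMachine=ctl1
#ControlAddr=10.0.0.1
#BackupController=ctl2
#BackupAddr=10.0.0.2
# New style -- one SlurmctldHost line per controller, primary listed first:
SlurmctldHost=ctl1(10.0.0.1)
SlurmctldHost=ctl2(10.0.0.2)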
In fairness to our friends at SchedMD, this was filed as an enhancement
request, not a bug.
Since this is an open source project, there are 2 good ways to make it happen:
1. Fund someone, like SchedMD, to make the change.
2. Make the changes yourself, and submit the changes.
Alternatively...
David,
I've been using Slurm for nearly 20 years, and while I can imagine some clever
work-arounds, like staging your job in /var/tmp on all of the nodes before
trying to run it, it's hard to imagine a cluster serving a useful purpose
without a shared user file system, whether or not Slurm is involved.
processes. slurmd started
without any issues.
Regards
Navin.
On Thu, Jun 11, 2020 at 9:23 PM Riebs, Andy
(mailto:andy.ri...@hpe.com) wrote:
Short of getting on the system and kicking the tires myself, I’m fresh out of
ideas. Does “sinfo -R” offer any hints?
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com]
:38 PM Riebs, Andy
(mailto:andy.ri...@hpe.com) wrote:
So there seems to be a failure to communicate between slurmctld and the oled3
slurmd.
From oled3, try “scontrol ping” to confirm that it can see the slurmctld daemon.
From the head node, try “scontrol show node oled3”, and then pi
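Spelled out, those first checks would look something like this (the prompts
just indicate where each command runs):

oled3$ scontrol ping              # can the compute node reach slurmctld?
head$  scontrol show node oled3   # what State/Reason does slurmctld report?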
OLED* up infinite 1 drain* oled3
While checking the node, I feel the node is healthy.
Regards
Navin
On Thu, Jun 11, 2020 at 7:21 PM Riebs, Andy
(mailto:andy.ri...@hpe.com) wrote:
Weird. “slurmd -Dvvv” ought to report a whole lot of data; I can’t guess how to
interpret
ping, but the IP is pingable. Could that be one of the reasons?
But other nodes have the same config, and there I am able to start slurmd,
so I am a bit confused.
Regards
Navin.
On Thu, Jun 11, 2020 at 6:44 PM Riebs, Andy
(mailto:andy.ri...@hpe.com) wrote:
If you omit...
6:06 PM Riebs, Andy
(mailto:andy.ri...@hpe.com) wrote:
Navin,
As you can see, systemd provides very little service-specific information. For
slurm, you really need to go to the slurm logs to find out what happened.
Hint: A quick way to identify problems like this with slurmd and slurmctld is
to run them with the "-Dvvv" option, causing them to log verbosely to the terminal.
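For example (assuming systemd manages the daemons; run as root):

node$ systemctl stop slurmd
node$ slurmd -Dvvv        # foreground, verbose logging to the terminal
head$ systemctl stop slurmctld
head$ slurmctld -Dvvv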
Diego,
I'm *guessing* that you are tripping over the use of "--tasks 32" on a
heterogeneous cluster, though your comment about the node without InfiniBand
troubles me. If you drain that node, or exclude it in your command line, that
might correct the problem. I wonder if OMPI and PMIx have deci
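Either experiment would look roughly like this, with "nodeXX" standing in for
the node without InfiniBand (untested, names made up):

head$ scontrol update NodeName=nodeXX State=DRAIN Reason="no IB"
$ srun --ntasks=32 --exclude=nodeXX ./my_app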
Geoffrey,
A lot depends on what you mean by “failure on the current machine”. If it’s a
failure that Slurm recognizes as a failure, Slurm can be configured to remove
the node from the partition, and you can follow Rodrigo’s suggestions for the
requeue options.
If the user job simply decides it
And if you're willing to buy a support contract with SchedMD, and/or provide a
fix, it will be fixed. Otherwise, you'll have to accept that you've got a large
group of users, just like you, who are willing to share their expertise and
experience, even if it's not our "day job" -- or even our "ni
Alternatively, you could switch to MariaDB; I've been using that for years.
Andy
-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Marcus Wagner
Sent: Thursday, May 7, 2020 8:55 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users]
A couple of quick checks to see if the problem is munge:
1. On the problem node, try
$ echo foo | munge | unmunge
2. If (1) works, try this from the node running slurmctld to the problem node:
slurm-node$ echo foo | ssh node munge | unmunge
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com]
Two trivial things to check:
1. Permissions on /etc/munge and /etc/munge/munge.key
2. Is munged running on the problem node?
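Concretely, something like this on the problem node (munge is fussy about
ownership and modes; typically munge:munge, 0700 on the directory and 0400 on
the key, though distros vary):

$ ls -ld /etc/munge /etc/munge/munge.key
$ systemctl status munge     # or: pgrep -a munged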
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Dean Schulze
Sent: Wednesday, April 15, 2020 1:57 PM
To: Slurm User Community Li
When you say “distinct compute nodes,” are they at least on the same network
fabric?
If so, the first thing I’d try would be to create a new partition that
encompasses all of the nodes of the other two partitions.
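In slurm.conf, that would be something along these lines (node lists made up;
substitute the nodes from your two partitions):

PartitionName=all Nodes=nodea[01-16],nodeb[01-16] State=UP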
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf
Agreed -- I do this frequently. (Be sure you've exported those variables,
though!)
Andy
-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Paul Edmon
Sent: Sunday, December 15, 2019 2:05 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slu
At the risk of stating the obvious… these seem like the sort of questions that
could be answered with a 2 minute test. Better yet, not just answered, but with
answers specific to your configuration ☺
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Alex Chekholko
Se
> ...doing a daily patrol for them to clean them up.
> Most of the time you can just reopen the node, but sometimes this indicates
> something is wedged.
>
> -Paul Edmon-
>
> On 10/22/2019 5:22 PM, Riebs, Andy wrote:
A common reason for seeing this is a process dropping core -- the kernel
will ignore job kill requests until that is complete, so the job isn't being
killed as quickly as Slurm would like. I typically recommend increasing the
UnkillableStepTimeout from 60 seconds to 120 or 180 seconds to avoid this.
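That is, something like this in slurm.conf (180 being the upper value
suggested above):

# Give core dumps more time to finish before the step is declared unkillable
UnkillableStepTimeout=180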
Has anyone tried to use the Open SHMEM 1.4 reference implementation (see
https://github.com/openshmem-org/osss-ucx) with Slurm? It appears to me that
the Slurm PMIx implementation needs a few more calls ("publish" and "lookup"),
but I'd be delighted to be proven wrong!
Andy
--
Andy Riebs
andy.ri...@hpe.com
Brian, FWIW, we just restart slurmctld when this happens. I’ll be interested to
hear if there’s a proper fix.
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Brian Andrus
Sent: Thursday, July 18, 2019 11:01 AM
To: Slurm User Community List
Subject: [slurm-use
A quick & easy way to see what your options might be for Slurm environment
variables is to try a job like this:
$ srun --nodes 2 --ntasks-per-node 6 --pty env | grep SLURM
Or, perhaps, use the “env | grep SLURM” in your batch script.
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.sch
Just looking at this quickly, have you tried specifying “hint=multithread” as
an sbatch parameter?
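That is, either on the command line or in the batch script:

$ sbatch --hint=multithread job.sh
or
#SBATCH --hint=multithread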
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
nathan norton
Sent: Saturday, May 18, 2019 6:03 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] final stage
This proved to be a scaling problem in PMIx; thanks to Artem Polyakov for
tracking this down (and submitting a fix:
https://bugs.schedmd.com/show_bug.cgi?id=6932).
Thanks for all the suggestions, folks!
Andy
From: Riebs, Andy
Sent: Friday, April 26, 2019 11:24 AM
To: slurm
Thanks for the quick response Doug!
Unfortunately, I can't be specific about the cluster size, other than to say
it's got more than a thousand nodes.
In a separate test that I had missed, even "srun hostname" took 5 minutes to
run. So there was no remote file system or MPI involvement.
Andy
Given the extreme amount of output that will be generated for potentially a
couple hundred job runs, I was hoping that someone would say “Seen it, here’s
how to fix it.” Guess I’ll have to go with the “high output” route.
Thanks Doug!
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.sc
The /etc/munge/munge.key is different on the systems. Try
md5sum /etc/munge/munge.key on both systems to see if they are the same...
--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
+1 404 648 9024
From: slurm-users on behalf of Eric F. Alemany
Sent