[slurm-users] Recommended amount of memory for the database server

2022-09-25 Thread byron
Hi

Does anyone know the recommended amount of memory to give Slurm's
MariaDB database server?

I seem to remember reading a simple estimate based on the size of certain
tables (or something along those lines) but I can't find it now.
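
A rough yardstick, offered as a hedged sketch rather than the estimate recalled above: measure the accounting database's on-disk size and give MariaDB an InnoDB buffer pool at least that large ('slurm_acct_db' is the default database name; adjust for your site):

```shell
# Total data + index size (MB) of the Slurm accounting database; aim for
# innodb_buffer_pool_size in my.cnf to comfortably exceed this figure.
mysql -u root -p -e "
  SELECT ROUND(SUM(data_length + index_length)/1024/1024, 1) AS size_mb
  FROM information_schema.tables
  WHERE table_schema = 'slurm_acct_db';"
```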

Thanks


[slurm-users] what's the simplest way to set the site_factor for all jobs?

2022-09-13 Thread byron
I want to allow users to lower the priority of their jobs to let other
people's jobs go first, and am thinking the easiest way would be for them
to use the sbatch nice option.  However, all of our jobs currently run with
a priority of 1, as all of the priority weights are set to zero, meaning
setting a nice value has no effect.

Having looked at how the job priority is calculated, I thought the best fix
would be to add a site_factor to every job to increase the default priority
and allow nice to lower it when required.

Can anyone tell me the simplest way to apply a site_factor to all jobs?
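
Two hedged sketches of mechanisms that exist for this (verify both against the scontrol and slurm.conf man pages for your Slurm version; the plugin name and values shown are hypothetical examples):

```shell
# Per job, by hand: an administrator can adjust one job's site factor
scontrol update JobId=12345 SiteFactor=1000

# For all jobs: slurm.conf can load a site_factor plugin that assigns a
# site factor at submit time (a small custom plugin is needed; the
# parameter string is passed through to that plugin)
#   PrioritySiteFactorPlugin=site_factor/mine
#   PrioritySiteFactorParameters=default=1000
```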

Thanks


Re: [slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-08-03 Thread byron
Thanks for everyone's help.  All I needed to do was compile a new version of
pam_slurm.so.  I'm aware there's a newer pam_slurm_adopt but everything was
already set up for pam_slurm.so, so I just went with that.

Regards
Lloyd

On Wed, Jul 27, 2022 at 9:45 PM Bernd Melchers 
wrote:

> >This happens on all our compute nodes.
> >I can't find any mention of pam_slurm_adopt in /etc/pam.d.  All I
> have
> >is in sshd: account required pam_slurm.so.
>
> We had a similar problem, caused by wrong access bits for
> ssh host key files in /etc/ssh/
> now we have
> -rw-r--r-- root root for public host keys and
> -rw-r----- root ssh_keys for the private part of host keys
>
> Mit freundlichen Grüßen
> Bernd Melchers
>
> --
> Archiv- und Backup-Service | fab-serv...@zedat.fu-berlin.de
> Freie Universität Berlin   | Tel. +49-30-838-55905
>
>


Re: [slurm-users] slurmctld hanging

2022-07-29 Thread byron
Yep, the question of how he has the job set up is an ongoing conversation,
but for now it is staying like this and I have to make do.

Even with all the traffic he is generating, though (at worst one job a second
over the course of a day), I would still have thought that Slurm was capable
of managing that.  And it was, until I did the upgrade.


On Fri, Jul 29, 2022 at 7:00 AM Loris Bennett 
wrote:

> Hi Byron,
>
> byron  writes:
>
> > Hi Loris - about a second
>
> What is the use-case for that?  Are these individual jobs or is it a job
> array?  Either way it sounds to me like a very bad idea.  On our system,
> jobs which can start immediately because resources are available still
> take a few seconds to start running (I'm looking at the values for
> 'submit' and 'start' from 'sacct').  If a one-second job has to wait for
> just a minute, the ratio of wait-time to run-time is already
> disproportionately large.
>
> Why doesn't the user bundle these individual jobs together?  Depending
> on your maximum run-time and to what degree jobs can make use of
> backfill, I would tell the user something between a single job and
> maybe 100 jobs.  I certainly wouldn't allow one-second jobs in any
> significant numbers on our system.
>
> I think having a job starting every second is causing your slurmdbd to
> timeout and that is the error you are seeing.
>
> Regards
>
> Loris
>
> > On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett <
> loris.benn...@fu-berlin.de> wrote:
> >
> >  Hi Byron,
> >
> >  byron  writes:
> >
> >  > Hi
> >  >
> >  > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> occasionally (3 times in 2 months) have slurmctld hanging so we get the
> following message when running sinfo
> >  >
> >  > “slurm_load_jobs error: Socket timed out on send/recv operation”
> >  >
> >  > It only seems to happen when one of our users runs a job that submits
> a short-lived job every second for 5 days (up to 90,000 in a day).
> Although that could be a red herring.
> >
> >  What's your definition of a 'short lived job'?
> >
> >  > There is nothing to be found in the slurmctld log.
> >  >
> >  > Can anyone suggest how to even start troubleshooting this?  Without
> anything in the logs I don't know where to start.
> >  >
> >  > Thanks
> >
> >  Cheers,
> >
> >  Loris
> >
> >  --
> >  Dr. Loris Bennett (Herr/Mr)
> >  ZEDAT, Freie Universität Berlin Email
> loris.benn...@fu-berlin.de
>
>


Re: [slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi Loris - about a second

On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett 
wrote:

> Hi Byron,
>
> byron  writes:
>
> > Hi
> >
> > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> occasionally (3 times in 2 months) have slurmctld hanging so we get the
> following message when running sinfo
> >
> > “slurm_load_jobs error: Socket timed out on send/recv operation”
> >
> > It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day).  Although
> that could be a red herring.
>
> What's your definition of a 'short lived job'?
>
> > There is nothing to be found in the slurmctld log.
> >
> > Can anyone suggest how to even start troubleshooting this?  Without
> anything in the logs I don't know where to start.
> >
> > Thanks
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>


[slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi

We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally
(3 times in 2 months) have slurmctld hanging so we get the following
message when running sinfo

“slurm_load_jobs error: Socket timed out on send/recv operation”

It only seems to happen when one of our users runs a job that submits a
short-lived job every second for 5 days (up to 90,000 in a day).  Although
that could be a red herring.

There is nothing to be found in the slurmctld log.

Can anyone suggest how to even start troubleshooting this?  Without
anything in the logs I don't know where to start.
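
Some generic first steps when slurmctld stops answering, as a hedged sketch (all standard Slurm tools, but confirm their behaviour on 20.11):

```shell
sdiag                        # scheduler and RPC statistics from slurmctld
scontrol setdebug debug2     # temporarily raise slurmctld log verbosity
squeue --states=all | wc -l  # how large a queue is slurmctld juggling?
sacctmgr show stats          # slurmdbd-side RPC statistics
```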

Thanks


Re: [slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-07-27 Thread byron
This happens on all our compute nodes.

I can't find any mention of pam_slurm_adopt in /etc/pam.d.  All I have is
in sshd: account required pam_slurm.so.
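
For reference, the relevant stanza is a single account line; a hedged sketch of what /etc/pam.d/sshd would look like with the newer module instead (module path, options, and placement vary by distro and PAM stack):

```
# /etc/pam.d/sshd (fragment, sketch only)
account    required     pam_slurm_adopt.so
```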

On Wed, Jul 27, 2022 at 5:52 PM Brian Andrus  wrote:

> Lloyd,
>
> You could  check out the order of entries in your pam.d/ssh (and
> related/included) files
>
> See where pam_slurm_adopt is, how it is being called, and if there are
> settings that are interfering.
>
> Does this occur only on a single node, or all of them?
>
> Brian Andrus
> On 7/27/2022 9:29 AM, Lloyd Goodman wrote:
>
> I don't think that's the source of the problem.  All our user accounts are
> centrally managed using sssd.
>
> And just to be sure I ran "getent passwd " on the management,
> head and compute nodes, and they all returned the same values
>
> On Wed, 27 Jul 2022 at 17:22, Brian Andrus  wrote:
>
>> Verify that their uid on the node is the same as the uid your master sees
>>
>> Brian Andrus
>>
>>
>> On 7/27/2022 8:53 AM, byron wrote:
>> > Hi
>> >
>> > When a user tries to login into a compute node on which they have a
>> > running job they get the error
>> >
>> > Access denied: user blahblah (uid=) has no active jobs on this node.
>> > Authentication failed.
>> >
>> > I recently upgraded slurm to 20.11.9 and was under the impression that
>> > prior to the upgrade they were able to ssh into nodes where they had
>> > running jobs, but it's entirely possible that I'm mistaken.
>> >
>> > Either way, can someone explain how to enable that behaviour, please?
>> >
>> > Thanks
>> >
>> >
>> >
>> >
>>
>>
>
> --
> *Lloyd Goodman* // HPC Systems Administrator
>
> *e: *lloyd.good...@cfms.org.uk *w: *www.cfms.org.uk
> CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent // Emersons
> Green // Bristol // BS16 7FR
>
>
>


[slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-07-27 Thread byron
Hi

When a user tries to login into a compute node on which they have a running
job they get the error

Access denied: user blahblah (uid=) has no active jobs on this node.
Authentication failed.

I recently upgraded Slurm to 20.11.9 and was under the impression that
prior to the upgrade they were able to ssh into nodes where they had
running jobs, but it's entirely possible that I'm mistaken.

Either way, can someone explain how to enable that behaviour, please?

Thanks


Re: [slurm-users] Rolling upgrade of compute nodes

2022-05-30 Thread byron
Thanks for the feedback.

I've done the database dryrun on a clone of our database / slurmdbd and
that is all good.

We have a reboot program defined.

The one thing I'm unsure about is as much a Linux / NFS issue as a Slurm
one.  When I change the soft link for "default" to point to the new 20.11
Slurm install while all the compute nodes are still running the old 19.05
version because they haven't been restarted yet, will that cause any
problems?  Or will they just keep seeing the old 19.05 version of Slurm
that they are running until they are restarted?

thanks

On Mon, May 30, 2022 at 8:18 AM Ole Holm Nielsen 
wrote:

> Hi Byron,
>
> Adding to Stephan's note, it's strongly recommended to make a database
> dry-run upgrade test before upgrading the production slurmdbd.  Many
> details are in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
> If you have separate slurmdbd and slurmctld machines (recommended), the
> next step is to upgrade the slurmctld.
>
> Finally you can upgrade the slurmd's while the cluster is running in
> production mode.  Since you have Slurm om NFS, following Chris'
> recommendation of rebooting the nodes may be the safest approach.
>
> After upgrading everything to 20.11, you should next upgrade to 21.08.
> Upgrade to the latest 22.05 should probably wait for a few minor releases.
>
> /Ole
>
> On 5/30/22 08:54, Stephan Roth wrote:
> > If you have the means to set up a test environment to try the upgrade
> > first, I recommend to do it.
> >
> > The upgrade from 19.05 to 20.11 worked for two clusters I maintain with
> a
> > similar NFS based setup, except we keep the Slurm configuration
> separated
> > from the Slurm software accessible through NFS.
> >
> > For updates staying between 2 major releases this should work well by
> > restarting the Slurm daemons in the recommended order (see
> > https://slurm.schedmd.com/SLUG19/Field_Notes_3.pdf) after switching the
> > soft link to 20.11:
> >
> > 1. slurmdbd
> > 2. slurmctld
> > 3. individual slurmd on your nodes
> >
> > To be able to revert back to 19.05 you should dump the database between
> > stopping and starting slurmdbd as well as backing up StateSaveLocation
> > between stopping/restarting slurmctld.
> >
> > slurmstepd's of running jobs will continue to run on 19.05 after
> > restarting the slurmd's.
> >
> > Check individual slurmd.log files for problems.
> >
> > Cheers,
> > Stephan
> >
> > On 30.05.22 00:09, byron wrote:
> >> Hi
> >>
> >> I'm currently doing an upgrade from 19.05 to 20.11.
> >>
> >> All of our compute nodes have the same install of slurm NFS mounted.
> The
> >> system has been setup so that all the start scripts and configuration
> >> files point to the default installation which is a soft link to the
> most
> >> recent installation of slurm.
> >>
> >>   This is the first time I've done an upgrade of slurm and I had been
> >> hoping to do a rolling upgrade as opposed to waiting for all the jobs
> to
> >> finish on all the compute nodes and then switching across but I don't
> see
> >> how I can do it with this setup.  Does anyone have any experience of
> >> this?
>
>
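
Stephan's restart order above can be sketched as follows (hedged: the systemd unit names, database name, backup paths, and the use of pdsh are assumptions that vary by site):

```shell
mysqldump slurm_acct_db > slurm_acct_db.pre-upgrade.sql   # revert point
systemctl restart slurmdbd                 # 1. accounting daemon first
cp -a /var/spool/slurmctld /var/spool/slurmctld.bak       # StateSaveLocation
systemctl restart slurmctld                # 2. controller next
pdsh -a systemctl restart slurmd           # 3. node daemons, rolling
```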


[slurm-users] Rolling upgrade of compute nodes

2022-05-29 Thread byron
Hi

I'm currently doing an upgrade from 19.05 to 20.11.

All of our compute nodes have the same install of Slurm NFS-mounted.  The
system has been set up so that all the start scripts and configuration files
point to the default installation, which is a soft link to the most recent
installation of Slurm.

This is the first time I've done an upgrade of Slurm, and I had been hoping
to do a rolling upgrade as opposed to waiting for all the jobs to finish on
all the compute nodes and then switching across, but I don't see how I can
do it with this setup.  Does anyone have any experience of this?

Thanks


Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread byron
Sorry, I should have been clearer.  I understand that with regards to
slurmd / slurmctld you can skip a major release without impacting running
jobs etc.  My question was about upgrading slurmdbd and whether it was
necessary to upgrade through the intermediate major releases (which I now
understand is necessary).

Thanks


On Tue, May 17, 2022 at 4:49 PM Paul Edmon  wrote:

> The slurm docs say you can do two major releases at a time (
> https://slurm.schedmd.com/quickstart_admin.html):
>
> "Almost every new major release of Slurm (e.g. 20.02.x to 20.11.x)
> involves changes to the state files with new data structures, new options,
> etc. Slurm permits upgrades to a new major release from the past two major
> releases, which happen every nine months (e.g. 20.02.x or 20.11.x to
> 21.08.x) without loss of jobs or other state information."
>
> As for old versions of slurm I think at this point you would need to
> contact SchedMD.  I'm sure they have past releases they can hand out if you
> are bootstrapping to a newer release.
>
> -Paul Edmon-
> On 5/17/22 11:42 AM, byron wrote:
>
> Thanks Brian for the speedy response.
>
> Am I not correct in thinking that if I just go from 19.05 to 20.11 then
> there is the advantage that I can upgrade slurmd and slurmctld in one go
> and it won't affect the running jobs since upgrading to a new major release
> from the past two major releases doesn't affect the state information.  Or
> are you saying that  in this case (19.05  direct to 21.08) there isn't any
> impact to running jobs either.  Or did you step through all the versions
> when upgrading slurmd and slurmctld also?
>
> Also, where do I get a copy of 20.02 from if SchedMD aren't providing it as
> a download?
>
> Thanks
>
>
>
>
> On Tue, May 17, 2022 at 4:05 PM Brian Andrus  wrote:
>
>> You need to step upgrade through major versions (not minor).
>>
>> So 19.05=>20.x
>>
>> I would highly recommend going to 21.08 while you are at it.
>> I just did the same migration (although they started at 18.x) with no
>> issues. Running jobs were not impacted and users didn't even notice.
>>
>> Brian Andrus
>>
>>
>> On 5/17/2022 7:35 AM, byron wrote:
>> > Hi
>> >
>> > I'm looking at upgrading our install of slurm from 19.05 to 20.11 in
>> > response to the recently announced security vulnerabilities.
>> >
>> > I've been through the documentation / forums and have managed to find
>> > the answers to most of my questions but am still unclear about the
>> > following
>> >
>> >  - In upgrading the slurmdbd from 19.05 to 20.11 do I need to go
>> > through all the versions (19.05 => 20.02 => 20.11)?  From reading the
>> > forums it looks as though it is necessary:
>> > https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ
>> > https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ
>> > However, if that is the case it would seem strange that SchedMD have
>> > removed 20.02 from the downloads page (I understand the reason is that
>> > it contains the exploit) if it is still required for the upgrade.
>> >
>> > - We are running version 5.5.68 of the MariaDB, the version that comes
>> > with centos7.9.   I've seen a few references to upgrading v5.5 but
>> > they were in the context of upgrading from slurm 17 to 18.  I'm
>> > wondering if it's OK to stick with this version since we're already on
>> > slurm 19.05.
>> >
>> > Any help much appreciated.
>> >
>> >
>> >
>> >
>>
>>


Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread byron
Thanks Brian for the speedy response.

Am I not correct in thinking that if I just go from 19.05 to 20.11 then
there is the advantage that I can upgrade slurmd and slurmctld in one go,
and it won't affect the running jobs, since upgrading to a new major release
from the past two major releases doesn't affect the state information?  Or
are you saying that in this case (19.05 direct to 21.08) there isn't any
impact to running jobs either?  Or did you step through all the versions
when upgrading slurmd and slurmctld as well?

Also, where do I get a copy of 20.02 from if SchedMD aren't providing it as
a download?

Thanks




On Tue, May 17, 2022 at 4:05 PM Brian Andrus  wrote:

> You need to step upgrade through major versions (not minor).
>
> So 19.05=>20.x
>
> I would highly recommend going to 21.08 while you are at it.
> I just did the same migration (although they started at 18.x) with no
> issues. Running jobs were not impacted and users didn't even notice.
>
> Brian Andrus
>
>
> On 5/17/2022 7:35 AM, byron wrote:
> > Hi
> >
> > I'm looking at upgrading our install of slurm from 19.05 to 20.11 in
> > response to the recently announced security vulnerabilities.
> >
> > I've been through the documentation / forums and have managed to find
> > the answers to most of my questions but am still unclear about the
> > following
> >
> >  - In upgrading the slurmdbd from 19.05 to 20.11 do I need to go
> > through all the versions (19.05 => 20.02 => 20.11)?  From reading the
> > forums it looks as though it is necessary:
> > https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ
> > https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ
> > However, if that is the case it would seem strange that SchedMD have
> > removed 20.02 from the downloads page (I understand the reason is that
> > it contains the exploit) if it is still required for the upgrade.
> >
> > - We are running version 5.5.68 of the MariaDB, the version that comes
> > with centos7.9.   I've seen a few references to upgrading v5.5 but
> > they were in the context of upgrading from slurm 17 to 18.  I'm
> > wondering if it's OK to stick with this version since we're already on
> > slurm 19.05.
> >
> > Any help much appreciated.
> >
> >
> >
> >
>
>


[slurm-users] upgrading slurm to 20.11

2022-05-17 Thread byron
Hi

I'm looking at upgrading our install of Slurm from 19.05 to 20.11 in
response to the recently announced security vulnerabilities.

I've been through the documentation / forums and have managed to find the
answers to most of my questions but am still unclear about the following

 - In upgrading the slurmdbd from 19.05 to 20.11, do I need to go through
all the versions (19.05 => 20.02 => 20.11)?  From reading the forums it
looks as though it is necessary:
   https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ
   https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ
   However, if that is the case it would seem strange that SchedMD have
removed 20.02 from the downloads page (I understand the reason is that it
contains the exploit) if it is still required for the upgrade.

- We are running version 5.5.68 of MariaDB, the version that comes with
CentOS 7.9.  I've seen a few references to upgrading v5.5, but they were in
the context of upgrading from Slurm 17 to 18.  I'm wondering if it's OK to
stick with this version since we're already on Slurm 19.05.

Any help much appreciated.


[slurm-users] incorrectly added account and now get "AssocGrpCPUMinutesLimit" when trying to run job

2021-11-29 Thread byron
I'm trying to replicate the setup of an existing account pair: a "grouping"
account and, below it, a new account that will actually be used; something
like this when you run
sacctmgr show assoc tree

mycluster   account1    (which is just being used to group accounts
and so has no GrpTRESMins associated)
mycluster     account2    (which is used as an account for running
jobs and does have GrpTRESMins associated)

I’ve tried setting it up like this

I've added a new project with sbank using

sbank project create -c mycluster -a mynewaccount1

Moved it to where I want it

sacctmgr modify account mynewaccount1 set parent=projects

And then created the account to sit below it

sbank project create -c mycluster -a mynewaccount2

and moved that

sacctmgr modify account mynewaccount2 set parent=mynewaccount1

Added a user

sbank project useradd -c mycluster -a mynewaccount2 -u user1

added hours to the account
sbank deposit -c mycluster -a mynewaccount2 -t 5

The problem is that when the user tries to run a job from the account
mynewaccount2 it gets held in the queue with the reason
“AssocGrpCPUMinutesLimit”

and when I run
sacctmgr show assoc tree

mycluster   mynewaccount1    cpu=0
mycluster     mynewaccount2    cpu=300

Now I'm guessing that the problem is the cpu=0, because the account /
sub-account setup created by my predecessor, which works, doesn't have this.

And I’ve tried getting rid of the cpu=0 by using

sacctmgr modify account mynewaccount1 set=GrpTRESMins=cpu=-1

But the command just hangs and does nothing.

I'm guessing I've set up mynewaccount1 incorrectly?  Any help appreciated.
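
One hedged observation on the hang: sacctmgr asks for interactive confirmation before committing a change, which can look like a hang when no prompt gets through; note also the stray '=' after 'set' in the command above. A sketch of the usual form:

```shell
# -i (--immediate) commits without the interactive confirmation prompt;
# cpu=-1 is the conventional way to clear a TRES limit
sacctmgr -i modify account mynewaccount1 set GrpTRESMins=cpu=-1
```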

Thanks


[slurm-users] Strigger, why not always use "--flags=perm" rather than running the command again each time?

2021-10-11 Thread byron
Hi

I've been looking at using strigger for some simple cases, such as when a
node drains or goes down.  Most of the examples I've seen use the format
whereby the trigger calls a script which re-runs the strigger command to
register for the next event.
event.

However, there is also the "--flags=perm" approach; is there any
disadvantage to using this method?  It seems the more obvious way to
do things, but I'm wondering if I'm missing something, since most examples
demonstrate the first method.
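
For concreteness, a hedged sketch of the permanent-trigger form being asked about (the handler script path is hypothetical; check the strigger man page on your version for the exact flag spelling):

```shell
# Register once; with PERM the trigger stays armed after it fires, so the
# handler script does not need to re-register itself each time.
strigger --set --node --down --flags=PERM \
         --program=/usr/local/sbin/node_down_handler.sh
```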

Thanks


Re: [slurm-users] is there a way to temporarily freeze an account?

2021-10-08 Thread byron
Thanks for all the feedback, am going with Juergens MaxSubmitJobs approach.
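
For anyone finding this later, a hedged sketch of that approach (the account name is hypothetical; -1 is the conventional way to clear a limit):

```shell
sacctmgr -i modify account someaccount set MaxSubmitJobs=0   # freeze
sacctmgr -i modify account someaccount set MaxSubmitJobs=-1  # unfreeze
```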

On Thu, Oct 7, 2021 at 2:55 AM Chris Samuel  wrote:

> On 6/10/21 6:21 am, byron wrote:
>
> > We have some accounts that we would like to suspend / freeze for the
> > time being that have unused hours associated with them.  Is there any way
> > of doing this without removing the users associated with the accounts or
> > zeroing their hours?
>
> We have a QOS called "batchdisable" which has MaxJobs=0 and
> MaxSubmitJobs=0 and then we just set the user's list of QOS's to that.
>
> sacctmgr update user where name=bar set qos=batchdisable
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>


[slurm-users] is there a way to temporarily freeze an account?

2021-10-06 Thread byron
We have some accounts that we would like to suspend / freeze for the time
being that have unused hours associated with them.  Is there any way of
doing this without removing the users associated with the accounts or
zeroing their hours?

We are using slurm version 19.05.7

Thanks


Re: [slurm-users] job stuck as pending - reason "PartitionConfig"

2021-09-30 Thread byron
Bingo!

You were right, I was asking for more cores than were available (our highmem
nodes have fewer than our standard nodes).  I was so convinced that the
problem was related to my upgrading the OS on those nodes that it never
crossed my mind that it was something as straightforward as that.

Thanks for your help.
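
The mismatch can be confirmed with a couple of standard commands, sketched here (job id and partition name taken from the earlier post):

```shell
scontrol show job 10860160 | grep -E 'NumNodes|NumCPUs|TRES'
sinfo -p highmem -o '%n %c %m'   # CPUs and memory per highmem node
```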



On Wed, Sep 29, 2021 at 7:49 PM Paul Brunk  wrote:

> Hello Byron:
>
>
>
> I’m guessing that your job is asking for more HW than the highmem_p
>
> has in it, or more cores or RAM within a node than any of the nodes
>
> have, or something like that.  'scontrol show job 10860160' might
>
> help.  You can also look in slurmctld.log for that jobid.
>
>
>
> --
>
> Paul Brunk, system administrator
>
> Georgia Advanced Computing Resource Center
>
> Enterprise IT Svcs, the University of Georgia
>
>
>
> *From:* slurm-users  *On Behalf Of
> *byron
> *Sent:* Wednesday, September 29, 2021 10:35
> *To:* Slurm User Community List 
> *Subject:* [slurm-users] job stuck as pending - reason "PartitionConfig"
>
>
>
> [EXTERNAL SENDER - PROCEED CAUTIOUSLY]
>
> Hi
>
>
>
> When I try to submit a job to one of our partitions it just stays pending
> with the reason "PartitionConfig".  Can someone point me in
> the right direction for how to troubleshoot this?  I'm a bit stumped.
>
>
>
> Some details of the setup
>
>
>
> The version is 19.05.7
>
>
>
> This is the job that is stuck in state pending
>
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
>   10860160   highmem MooseBen byron PD   0:00 16
> (PartitionConfig)
>
>
>
> $ sinfo -p highmem
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> highmem  up   infinite  1  drain intel-0012
> highmem  up   infinite 19   idle intel-[0001-0011,0013-0020]
>
>
>
> The output from  scontrol show part
>
> PartitionName=highmem
>AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>AllocNodes=ALL Default=NO QoS=N/A
>DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
> Hidden=NO
>MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
> MaxCPUsPerNode=UNLIMITED
>Nodes=intel-00[01-20]
>PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> OverSubscribe=EXCLUSIVE
>OverTimeLimit=NONE PreemptMode=REQUEUE
>State=UP TotalCPUs=320 TotalNodes=20 SelectTypeParameters=NONE
>JobDefaults=(null)
>DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
>
>


[slurm-users] job stuck as pending - reason "PartitionConfig"

2021-09-29 Thread byron
Hi

When I try to submit a job to one of our partitions it just stays pending
with the reason "PartitionConfig".  Can someone point me in
the right direction for how to troubleshoot this?  I'm a bit stumped.

Some details of the setup

The version is 19.05.7

This is the job that is stuck in state pending
     JOBID PARTITION     NAME  USER ST  TIME  NODES NODELIST(REASON)
  10860160   highmem MooseBen byron PD  0:00     16 (PartitionConfig)

$ sinfo -p highmem
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
highmem  up   infinite  1  drain intel-0012
highmem  up   infinite 19   idle intel-[0001-0011,0013-0020]

The output from  scontrol show part
PartitionName=highmem
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED
   Nodes=intel-00[01-20]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=320 TotalNodes=20 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED


Re: [slurm-users] using sacctmgr to change the parent of an account

2021-09-09 Thread byron
Great, thanks for that.

On Wed, Sep 8, 2021 at 5:32 PM Brian Andrus  wrote:

> Yep. I do it all the time when I forget to add a parent. Also when a
> project/account changes who owns it.
>
> sacctmgr will also tell you what it is going to change and gives you 30
> seconds to say yes, else it doesn't make the change.
>
> Brian Andrus
>
> On 9/8/2021 3:41 AM, byron wrote:
> > Hi
> >
> > I've added a new account using sbank and have now discovered it should
> > have been added with the parent set. We've already accumulated a
> > couple of months of user data so I don't just want to delete it and
> > recreate it in the correct location.  I've had a read of the sacctmgr
> > command and think I may be able to modify the account to set a parent
> > but am a little unsure as I haven't seen any example of people doing
> > it online
> >
> > Is there any reason why the following wouldn't work?  I'm using
> > version 19.05.7 of slurm.
> >
> > sacctmgr modify account blah set parent=blahparent
> >
> > Also would it be ok to make this change while the system is running?
> >
> > Thanks
>
>


[slurm-users] using sacctmgr to change the parent of an account

2021-09-08 Thread byron
Hi

I've added a new account using sbank and have now discovered it should have
been added with the parent set.  We've already accumulated a couple of
months of user data, so I don't just want to delete it and recreate it in
the correct location.  I've had a read of the sacctmgr documentation and
think I may be able to modify the account to set a parent, but am a little
unsure as I haven't seen any examples of people doing it online.

Is there any reason why the following wouldn't work?  I'm using version
19.05.7 of slurm.

sacctmgr modify account blah set parent=blahparent

Also would it be ok to make this change while the system is running?

Thanks