[slurm-users] Configuring slurm.conf and using subpartitions

2023-10-03 Thread Kratz, Zach
I am a systems administrator for a computing cluster.

We have around 24 nodes available, and we recently added a whole new cluster 
with upgraded nodes.

We use an interactive partition that randomly selects from our list of compute 
nodes to run each job. We would like it to select from our old nodes first, 
before using the newer ones. We tried using weights and assigned each of the 
old nodes a lower weight than the new nodes, but in testing the new nodes were 
still assigned, even when old nodes were available.

Is there any way to configure this on the line that defines the interactive 
partition in slurm.conf? For example:

PartitionName=interactive-cpu   Nodes=node[1-17] Weight=10 node[18-24] Weight=50

Or is there a way to create subpartitions where we could put the older nodes 
into a partition within this one?
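
For reference, the weight attempt looked roughly like this on the node 
definitions (a sketch; the real NodeName lines also carry CPU and memory 
attributes, and my understanding from the slurm.conf man page is that, all 
else being equal, lower-weight nodes should be allocated first):

NodeName=node[1-17]   Weight=10
NodeName=node[18-24]  Weight=50
PartitionName=interactive-cpu  Nodes=node[1-24]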

Thank you for any feedback.




Re: [slurm-users] enabling job script archival

2023-10-03 Thread Davide DelVento
For others who find this via the mailing list archives: yes, I needed
that, which of course required creating an account to charge jobs to, which
I wasn't otherwise using. So I ran

sacctmgr add account default_account
sacctmgr add -i user $user Accounts=default_account

with an appropriate loop over $user, and everything is working fine
now.
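
In case it helps others, the loop was essentially this (a sketch; we take 
regular users from getent passwd, but the list could equally come straight 
from LDAP, and the UID cutoff is site-specific):

for user in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
    sacctmgr -i add user "$user" Accounts=default_account
done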

Thanks everybody!

On Tue, Oct 3, 2023 at 7:44 AM Paul Edmon  wrote:

> You will probably need to.
>
> The way we handle it is that we add users when they first submit a job via
> the job_submit.lua script. This way the database autopopulates with active
> users.
>
> -Paul Edmon-
> On 10/3/23 9:01 AM, Davide DelVento wrote:
>
> By increasing the slurmdbd verbosity level, I got additional information,
> namely the following:
>
> slurmdbd: error: couldn't get information for this user (null)(xx)
> slurmdbd: debug: accounting_storage/as_mysql:
> as_mysql_jobacct_process_get_jobs: User  xx  has no associations, and
> is not admin, so not returning any jobs.
>
> again, where xx is the POSIX ID of the user who's running the query in
> the slurmdbd logs.
>
> I suspect this is due to the fact that our userbase is small enough (we
> are a department HPC) that we don't need to use allocations and the like, so
> I have not configured any association (and not even studied its
> configuration, since when I was at another place which did use
> associations, someone else took care of slurm administration).
>
> Anyway, I read the fantastic document by our own member at
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations
> and in fact I have not even configured slurm users:
>
> # sacctmgr show user
>       User   Def Acct     Admin
> ---------- ---------- ---------
>       root       root Administ+
> #
>
> So is that the issue? Should I just add all users? Any suggestions on the
> minimal (but robust) way to do that?
>
> Thanks!
>
>
> On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento 
> wrote:
>
>> Thanks Paul, this helps.
>>
>> I don't have any PrivateData line in either config file. According to the
>> docs, "By default, all information is visible to all users" so this should
>> not be an issue. I tried to add a line with "PrivateData=jobs" to the conf
>> files, just in case, but that didn't change the behavior.
>>
>> On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon  wrote:
>>
>>> At least in our setup, users can see their own scripts by doing sacct -B
>>> -j JOBID
>>>
>>> I would check that the scripts are being stored and how you have
>>> PrivateData set.
>>>
>>> -Paul Edmon-
>>> On 10/2/2023 10:57 AM, Davide DelVento wrote:
>>>
>>> I deployed the job_script archival and it is working; however, it can be
>>> queried only by root.
>>>
>>> A regular user can run sacct -lj against any job (even those by other
>>> users, and that's okay in our setup) with no problem. However, if they run
>>> sacct -j job_id --batch-script even against a job they own themselves,
>>> nothing is returned and I get a
>>>
>>> slurmdbd: error: couldn't get information for this user (null)(xx)
>>>
>>> where xx is the POSIX ID of the user who's running the query in the
>>> slurmdbd logs.
>>>
>>> Neither config file (slurmdbd.conf nor slurm.conf) has any
>>> "permission" setting. FWIW, we use LDAP.
>>>
>>> Is that the expected behavior, in that by default only root can see the
>>> job scripts? I was assuming the users themselves should be able to debug
>>> their own jobs... Any hint on what could be changed to achieve this?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento <
>>> davide.quan...@gmail.com> wrote:
>>>
 Fantastic, this is really helpful, thanks!

 On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon 
 wrote:

> Yes it was later than that. If you are 23.02 you are good.  We've been
> running with storing job_scripts on for years at this point and that part
> of the database only uses up 8.4G.  Our entire database takes up 29G on
> disk. So it's about 1/3 of the database.  We also have database compression
> which helps with the on disk size. Raw uncompressed our database is about
> 90G.  We keep 6 months of data in our active database.
>
> -Paul Edmon-
> On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>
> Sorry for the duplicate e-mail in a short time: do you know (or
> anyone) when the hashing was added? Was planning to enable this on 21.08,
> but we then had to delay our upgrade to it. I’m assuming later than that,
> as I believe that’s when the feature was added.
>
> On Sep 28, 2023, at 13:55, Ryan Novosielski 
>  wrote:
>
> Thank you; we’ll put in a feature request for improvements in that
> area, and also thanks for the warning? I thought of that in passing, but
> the real world experience is really useful. I could easily see wanting 
> that
> stuff to be retained less often than the main records, which is what I’d
> ask for.
>

Re: [slurm-users] enabling job script archival

2023-10-03 Thread Paul Edmon

You will probably need to.

The way we handle it is that we add users when they first submit a job 
via the job_submit.lua script. This way the database autopopulates with 
active users.
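
For sites that would rather not hook job_submit.lua, a rough equivalent is 
a periodic sync along these lines (a sketch only; the catch-all account 
name and the UID cutoff are placeholders):

# add any passwd/LDAP user not yet known to the Slurm database
for user in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
    sacctmgr -n show user "$user" | grep -q . || \
        sacctmgr -i add user "$user" Accounts=default_account
done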


-Paul Edmon-

On 10/3/23 9:01 AM, Davide DelVento wrote:
By increasing the slurmdbd verbosity level, I got additional 
information, namely the following:


slurmdbd: error: couldn't get information for this user (null)(xx)
slurmdbd: debug: accounting_storage/as_mysql: 
as_mysql_jobacct_process_get_jobs: User xx  has no associations, 
and is not admin, so not returning any jobs.


again, where xx is the POSIX ID of the user who's running the query 
in the slurmdbd logs.


I suspect this is due to the fact that our userbase is small enough 
(we are a department HPC) that we don't need to use allocations and the 
like, so I have not configured any association (and not even studied 
its configuration, since when I was at another place which did use 
associations, someone else took care of slurm administration).


Anyway, I read the fantastic document by our own member at 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations 
and in fact I have not even configured slurm users:


# sacctmgr show user
      User   Def Acct     Admin
---------- ---------- ---------
      root       root Administ+
#

So is that the issue? Should I just add all users? Any suggestions on 
the minimal (but robust) way to do that?


Thanks!


On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento 
 wrote:


Thanks Paul, this helps.

I don't have any PrivateData line in either config file. According
to the docs, "By default, all information is visible to all users"
so this should not be an issue. I tried to add a line with
"PrivateData=jobs" to the conf files, just in case, but that
didn't change the behavior.

On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon 
wrote:

At least in our setup, users can see their own scripts by
doing sacct -B -j JOBID

I would check that the scripts are being stored and how
you have PrivateData set.

-Paul Edmon-

On 10/2/2023 10:57 AM, Davide DelVento wrote:

I deployed the job_script archival and it is working; however,
it can be queried only by root.

A regular user can run sacct -lj against any job (even those
by other users, and that's okay in our setup) with no
problem. However, if they run sacct -j job_id --batch-script
even against a job they own themselves, nothing is returned
and I get a

slurmdbd: error: couldn't get information for this user
(null)(xx)

where xx is the POSIX ID of the user who's running the
query in the slurmdbd logs.

Neither config file (slurmdbd.conf nor slurm.conf) has any
"permission" setting. FWIW, we use LDAP.

Is that the expected behavior, in that by default only root
can see the job scripts? I was assuming the users themselves
should be able to debug their own jobs... Any hint on what
could be changed to achieve this?

Thanks!



On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento
 wrote:

Fantastic, this is really helpful, thanks!

On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon
 wrote:

Yes it was later than that. If you are 23.02 you are
good.  We've been running with storing job_scripts on
for years at this point and that part of the database
only uses up 8.4G.  Our entire database takes up 29G
on disk. So it's about 1/3 of the database.  We also
have database compression which helps with the on
disk size. Raw uncompressed our database is about
90G.  We keep 6 months of data in our active database.

-Paul Edmon-

On 9/28/2023 1:57 PM, Ryan Novosielski wrote:

Sorry for the duplicate e-mail in a short time: do
you know (or anyone) when the hashing was added? Was
planning to enable this on 21.08, but we then had to
delay our upgrade to it. I’m assuming later than
that, as I believe that’s when the feature was added.


On Sep 28, 2023, at 13:55, Ryan Novosielski

 wrote:

Thank you; we’ll put in a feature request for
improvements in that area, and also thanks for the
warning? I thought of that in passing, but the real
world experience is really useful. I could easily
see wanting that stuff to be retained less often
than the main records, which is what I’d ask for.

I assume that archiving, in general, would also
remove this stuff, since old jobs themselves will be removed?

Re: [slurm-users] Guidance on which HPC to try out "OpenHPC or TrinityX" for novice

2023-10-03 Thread Renfro, Michael
I’d probably default to OpenHPC just for the community around it, but I’ll also 
note that TrinityX might not have had any commits in their GitHub for an 
18-month period (unless I’m reading something wrong).

On Oct 3, 2023, at 5:51 AM, John Joseph  wrote:


Dear All,
Good afternoon
I would like to install, study, and administer an HPC system, and as a first 
step I am planning to install one of the HPC stacks. When I check the docs I 
can see that OpenHPC and TrinityX both have Slurm built in.

I'd like to get advice on which one would be better for me (I have knowledge 
of the Linux command line and administration). Which will be easier for me to 
install, OpenHPC or TrinityX?

Your guidance would help me choose my path and is much appreciated.
Thanks,
Joseph John



Re: [slurm-users] enabling job script archival

2023-10-03 Thread Davide DelVento
By increasing the slurmdbd verbosity level, I got additional information,
namely the following:

slurmdbd: error: couldn't get information for this user (null)(xx)
slurmdbd: debug: accounting_storage/as_mysql:
as_mysql_jobacct_process_get_jobs: User  xx  has no associations, and
is not admin, so not returning any jobs.

again, where xx is the POSIX ID of the user who's running the query in
the slurmdbd logs.

I suspect this is due to the fact that our userbase is small enough (we are
a department HPC) that we don't need to use allocations and the like, so I
have not configured any association (and not even studied its
configuration, since when I was at another place which did use
associations, someone else took care of slurm administration).

Anyway, I read the fantastic document by our own member at
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations
and in fact I have not even configured slurm users:

# sacctmgr show user
      User   Def Acct     Admin
---------- ---------- ---------
      root       root Administ+
#

So is that the issue? Should I just add all users? Any suggestions on the
minimal (but robust) way to do that?

Thanks!


On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento 
wrote:

> Thanks Paul, this helps.
>
> I don't have any PrivateData line in either config file. According to the
> docs, "By default, all information is visible to all users" so this should
> not be an issue. I tried to add a line with "PrivateData=jobs" to the conf
> files, just in case, but that didn't change the behavior.
>
> On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon  wrote:
>
>> At least in our setup, users can see their own scripts by doing sacct -B
>> -j JOBID
>>
>> I would check that the scripts are being stored and how you have
>> PrivateData set.
>>
>> -Paul Edmon-
>> On 10/2/2023 10:57 AM, Davide DelVento wrote:
>>
>> I deployed the job_script archival and it is working; however, it can be
>> queried only by root.
>>
>> A regular user can run sacct -lj against any job (even those by other
>> users, and that's okay in our setup) with no problem. However, if they run
>> sacct -j job_id --batch-script even against a job they own themselves,
>> nothing is returned and I get a
>>
>> slurmdbd: error: couldn't get information for this user (null)(xx)
>>
>> where xx is the POSIX ID of the user who's running the query in the
>> slurmdbd logs.
>>
>> Neither config file (slurmdbd.conf nor slurm.conf) has any
>> "permission" setting. FWIW, we use LDAP.
>>
>> Is that the expected behavior, in that by default only root can see the
>> job scripts? I was assuming the users themselves should be able to debug
>> their own jobs... Any hint on what could be changed to achieve this?
>>
>> Thanks!
>>
>>
>>
>> On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento 
>> wrote:
>>
>>> Fantastic, this is really helpful, thanks!
>>>
>>> On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon 
>>> wrote:
>>>
 Yes it was later than that. If you are 23.02 you are good.  We've been
 running with storing job_scripts on for years at this point and that part
 of the database only uses up 8.4G.  Our entire database takes up 29G on
 disk. So its about 1/3 of the database.  We also have database compression
 which helps with the on disk size. Raw uncompressed our database is about
 90G.  We keep 6 months of data in our active database.

 -Paul Edmon-
 On 9/28/2023 1:57 PM, Ryan Novosielski wrote:

 Sorry for the duplicate e-mail in a short time: do you know (or anyone)
 when the hashing was added? Was planning to enable this on 21.08, but we
 then had to delay our upgrade to it. I’m assuming later than that, as I
 believe that’s when the feature was added.

 On Sep 28, 2023, at 13:55, Ryan Novosielski 
  wrote:

 Thank you; we’ll put in a feature request for improvements in that
 area, and also thanks for the warning? I thought of that in passing, but
 the real world experience is really useful. I could easily see wanting that
 stuff to be retained less often than the main records, which is what I’d
 ask for.

 I assume that archiving, in general, would also remove this stuff,
 since old jobs themselves will be removed?

 --
 #BlackLivesMatter
 || \\UTGERS      |---------------------------*O*---------------------------
 ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
      `'

 On Sep 28, 2023, at 13:48, Paul Edmon 
  wrote:

 Slurm should take care of it when you add it.

 So far as horror stories, under previous versions our database size
 ballooned to be so massive that it actually prevented us from upgrading and
 we had to drop the columns containing the job_

[slurm-users] Guidance on which HPC to try out "OpenHPC or TrinityX" for novice

2023-10-03 Thread John Joseph
Dear All, 
Good afternoon 
I would like to install, study, and administer an HPC system, and as a first 
step I am planning to install one of the HPC stacks. When I check the docs I 
can see that OpenHPC and TrinityX both have Slurm built in.

I'd like to get advice on which one would be better for me (I have knowledge 
of the Linux command line and administration). Which will be easier for me to 
install, OpenHPC or TrinityX?

Your guidance would help me choose my path and is much appreciated.
Thanks,
Joseph John



Re: [slurm-users] job not running if partition MaxCPUsPerNode < actual max

2023-10-03 Thread Diego Zuccato

I've recently been hit by

EnforcePartLimits=ALL

which refused jobs that couldn't run in all given partitions. Solved by 
setting it to

EnforcePartLimits=ANY

so that the job gets queued if it can run in ANY given partition (very 
useful if you also use JobSubmitPlugins=all_partitions).
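
i.e. something like this in slurm.conf (a sketch):

# queue a job if it fits at least one of its listed partitions
EnforcePartLimits=ANY
# jobs submitted without an explicit partition are routed to all partitions
JobSubmitPlugins=all_partitions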


Diego


On 15/08/2023 18:43, Bernstein, Noam CIV USN NRL (6393) Washington DC 
(USA) wrote:
We have a heterogeneous mix of nodes, most with 32 cores but one group 
with 36, grouped into homogeneous partitions.  We like to be able to 
specify multiple partitions so that a job can run on any homogeneous 
group.  It would be nice if we could run on all such nodes using 32 
cores per node.  To try to do this, I created a partition for the 
36-core nodes (call them n2019) which specifies a maximum of 64 CPUs 
per node:

PartitionName=n2019     DefMemPerCPU=2631 Nodes=compute-4-[0-47]
PartitionName=n2019_32  DefMemPerCPU=2631 Nodes=compute-4-[0-47] MaxCPUsPerNode=64
PartitionName=n2021     DefMemPerCPU=2960 Nodes=compute-7-[0-18]


However, if I try to run a 128-task, 1-task-per-core job on n2019_32, 
the sbatch fails with


> sbatch --ntasks=128 --exclusive --partition=n2019_32 --ntasks-per-core=1 job.pbs

sbatch: error: Batch job submission failed: Requested node configuration 
is not available

(please ignore the ".pbs" - it's a relic, and the job script works with 
slurm). The identical command but with "n2019" or "n2021" for the 
partition works (but the former uses 36 cores per node). If I specify 
multiple partitions it will only actually run when the non-n2019 (same 
node set as n2019_32) nodes are available.


The job header includes only walltime, job name and stdout/stderr files, 
shell, and a job array range.


I tried to add "-v" to the sbatch to see if that gives more useful info, 
but I couldn't get any more insight.  Does anyone have any idea why it's 
rejecting my job?


thanks,
Noam


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786