Re: [slurm-users] enabling job script archival

2023-10-02 Thread Davide DelVento
Thanks Paul, this helps.

I don't have any PrivateData line in either config file. According to the
docs, "By default, all information is visible to all users" so this should
not be an issue. I tried to add a line with "PrivateData=jobs" to the conf
files, just in case, but that didn't change the behavior.
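For reference, here is the minimal set of checks I am running while poking at this (plain sacct/sacctmgr/scontrol invocations; reading the "(null)" in the error as slurmdbd failing to resolve the requesting user is just my guess):

# What the controller thinks PrivateData is set to
scontrol show config | grep -i PrivateData

# Whether the requesting user is known to the accounting database (we are LDAP-backed)
sacctmgr show user $USER

# The query that works as root but returns nothing for regular users
sacct -j <jobid> --batch-script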

On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon  wrote:

> At least in our setup, users can see their own scripts by running sacct -B
> -j JOBID
>
> I would check that the scripts are actually being stored, and how you have
> PrivateData set.
>
> -Paul Edmon-
> On 10/2/2023 10:57 AM, Davide DelVento wrote:
>
> I deployed job_script archival and it is working; however, it can be
> queried only by root.
>
> A regular user can run sacct -lj against any job (even jobs owned by other
> users, which is okay in our setup) with no problem. However, if they run
> sacct -j job_id --batch-script, even against a job they own themselves,
> nothing is returned and I get
>
> slurmdbd: error: couldn't get information for this user (null)(xx)
>
> in the slurmdbd logs, where xx is the POSIX UID of the user running the
> query.
>
> Neither config file (slurmdbd.conf or slurm.conf) has any
> "permission"-related setting. FWIW, we use LDAP.
>
> Is that the expected behavior, in that by default only root can see the
> job scripts? I was assuming the users themselves should be able to debug
> their own jobs... Any hint on what could be changed to achieve this?
>
> Thanks!
>
>
>
> On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento 
> wrote:
>
>> Fantastic, this is really helpful, thanks!
>>
>> On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon 
>> wrote:
>>
>>> Yes, it was later than that. If you are on 23.02 you are good.  We've been
>>> running with job_script storage on for years at this point, and that part
>>> of the database only uses up 8.4G.  Our entire database takes up 29G on
>>> disk, so it's about 1/3 of the database.  We also have database compression,
>>> which helps with the on-disk size. Raw and uncompressed, our database is
>>> about 90G.  We keep 6 months of data in our active database.
>>>
>>> -Paul Edmon-
>>> On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>>>
>>> Sorry for the duplicate e-mail in a short time: do you (or anyone) know
>>> when the hashing was added? We were planning to enable this on 21.08, but
>>> we then had to delay our upgrade to it. I'm assuming it was later than
>>> that, as I believe that's when the feature was added.
>>>
>>> On Sep 28, 2023, at 13:55, Ryan Novosielski 
>>>  wrote:
>>>
>>> Thank you; we'll put in a feature request for improvements in that area,
>>> and also thanks for the warning. I had thought of that in passing, but the
>>> real-world experience is really useful. I could easily see wanting that
>>> stuff to be retained for a shorter time than the main records, which is
>>> what I'd ask for.
>>>
>>> I assume that archiving, in general, would also remove this stuff, since
>>> old jobs themselves will be removed?
>>>
>>> --
>>> #BlackLivesMatter
>>>
>>> || \\UTGERS,     |---*O*---
>>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
>>>      `'
>>>
>>> On Sep 28, 2023, at 13:48, Paul Edmon 
>>>  wrote:
>>>
>>> Slurm should take care of it when you add it.
>>>
>>> So far as horror stories, under previous versions our database size
>>> ballooned to be so massive that it actually prevented us from upgrading and
>>> we had to drop the columns containing the job_script and job_env.  This was
>>> back before slurm started hashing the scripts so that it would only store
>>> one copy of duplicate scripts.  After this point we found that the
>>> job_script database stayed at a fairly reasonable size as most users use
>>> functionally the same script each time. However the job_env continued to
>>> grow like crazy as there are variables in our environment that change
>>> fairly consistently depending on where the user is. Thus job_envs ended up
>>> being too massive to keep around and so we had to drop them. Frankly we
>>> never really used them for debugging. The job_scripts though are super
>>> useful and not that much overhead.
>>>
>>> In summary my recommendation is to only store job_scripts. job_envs add
>>> too much storage for little gain, unless your job_envs are basically the
>>> same for each user in each location.
>>>
>>> Also it should be noted that there is no way to prune out job_scripts or
>>> job_envs right now. So the only way to get rid of them if they get large is
>>> to 0 out the column in the table. You can ask SchedMD for the mysql command
>>> to do this as we had to do it here to our job_envs.
>>>
>>> -Paul Edmon-
>>>
>>> On 9/28/2023 1:40 PM, Davide DelVento wrote:
>>>
>>> In my current Slurm installation (recently upgraded to Slurm v23.02.3),
>>> I only have
>>>
>>> 

Re: [slurm-users] enabling job script archival

2023-10-02 Thread Paul Edmon
At least in our setup, users can see their own scripts by running sacct -B
-j JOBID


I would check that the scripts are actually being stored, and how you have
PrivateData set.
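
Something like this is what I mean (a rough sketch; both are standard commands, and sacct -B is the same option as --batch-script):

# Confirm script storage is actually enabled on the controller
scontrol show config | grep -i AccountingStoreFlags

# As root (or the SlurmUser), confirm a recent job's script made it into the database
sacct -B -j JOBID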


-Paul Edmon-

On 10/2/2023 10:57 AM, Davide DelVento wrote:
I deployed job_script archival and it is working; however, it can
be queried only by root.


A regular user can run sacct -lj against any job (even jobs owned by other
users, which is okay in our setup) with no problem. However, if they
run sacct -j job_id --batch-script, even against a job they own
themselves, nothing is returned and I get


slurmdbd: error: couldn't get information for this user (null)(xx)

in the slurmdbd logs, where xx is the POSIX UID of the user running
the query.


Neither config file (slurmdbd.conf or slurm.conf) has any
"permission"-related setting. FWIW, we use LDAP.


Is that the expected behavior, in that by default only root can see 
the job scripts? I was assuming the users themselves should be able to 
debug their own jobs... Any hint on what could be changed to achieve this?


Thanks!



On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento 
 wrote:


Fantastic, this is really helpful, thanks!

On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon
 wrote:

Yes, it was later than that. If you are on 23.02 you are good.
We've been running with job_script storage on for years at
this point, and that part of the database only uses up 8.4G.
Our entire database takes up 29G on disk, so it's about 1/3 of
the database.  We also have database compression, which helps
with the on-disk size. Raw and uncompressed, our database is
about 90G.  We keep 6 months of data in our active database.

-Paul Edmon-

On 9/28/2023 1:57 PM, Ryan Novosielski wrote:

Sorry for the duplicate e-mail in a short time: do you (or
anyone) know when the hashing was added? We were planning to
enable this on 21.08, but we then had to delay our upgrade to
it. I'm assuming it was later than that, as I believe that's
when the feature was added.


On Sep 28, 2023, at 13:55, Ryan Novosielski
  wrote:

Thank you; we'll put in a feature request for improvements
in that area, and also thanks for the warning. I had thought of
that in passing, but the real-world experience is really
useful. I could easily see wanting that stuff to be retained
for a shorter time than the main records, which is what I'd ask for.

I assume that archiving, in general, would also remove this
stuff, since old jobs themselves will be removed?

--
#BlackLivesMatter

|| \\UTGERS,     |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'


On Sep 28, 2023, at 13:48, Paul Edmon
  wrote:

Slurm should take care of it when you add it.

So far as horror stories, under previous versions our
database size ballooned to be so massive that it actually
prevented us from upgrading and we had to drop the columns
containing the job_script and job_env.  This was back
before slurm started hashing the scripts so that it would
only store one copy of duplicate scripts.  After this point
we found that the job_script database stayed at a fairly
reasonable size as most users use functionally the same
script each time. However the job_env continued to grow
like crazy as there are variables in our environment that
change fairly consistently depending on where the user is.
Thus job_envs ended up being too massive to keep around and
so we had to drop them. Frankly we never really used them
for debugging. The job_scripts though are super useful and
not that much overhead.

In summary my recommendation is to only store job_scripts.
job_envs add too much storage for little gain, unless your
job_envs are basically the same for each user in each location.

Also it should be noted that there is no way to prune out
job_scripts or job_envs right now. So the only way to get
rid of them if they get large is to 0 out the column in the
table. You can ask SchedMD for the mysql command to do this
as we had to do it here to our job_envs.

-Paul Edmon-

On 9/28/2023 1:40 PM, Davide DelVento wrote:

In my current Slurm installation (recently upgraded to
Slurm v23.02.3), I only have

AccountingStoreFlags=job_comment

I now intend to add both


Re: [slurm-users] enabling job script archival

2023-10-02 Thread Davide DelVento
I deployed job_script archival and it is working; however, it can be
queried only by root.

A regular user can run sacct -lj against any job (even jobs owned by other
users, which is okay in our setup) with no problem. However, if they run
sacct -j job_id --batch-script, even against a job they own themselves,
nothing is returned and I get

slurmdbd: error: couldn't get information for this user (null)(xx)

in the slurmdbd logs, where xx is the POSIX UID of the user running the
query.

Neither config file (slurmdbd.conf or slurm.conf) has any
"permission"-related setting. FWIW, we use LDAP.

Is that the expected behavior, in that by default only root can see the job
scripts? I was assuming the users themselves should be able to debug their
own jobs... Any hint on what could be changed to achieve this?

Thanks!



On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento 
wrote:

> Fantastic, this is really helpful, thanks!
>
> On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon 
> wrote:
>
>> Yes, it was later than that. If you are on 23.02 you are good.  We've been
>> running with job_script storage on for years at this point, and that part
>> of the database only uses up 8.4G.  Our entire database takes up 29G on
>> disk, so it's about 1/3 of the database.  We also have database compression,
>> which helps with the on-disk size. Raw and uncompressed, our database is
>> about 90G.  We keep 6 months of data in our active database.
>>
>> -Paul Edmon-
>> On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
>>
>> Sorry for the duplicate e-mail in a short time: do you (or anyone) know
>> when the hashing was added? We were planning to enable this on 21.08, but
>> we then had to delay our upgrade to it. I'm assuming it was later than
>> that, as I believe that's when the feature was added.
>>
>> On Sep 28, 2023, at 13:55, Ryan Novosielski 
>>  wrote:
>>
>> Thank you; we'll put in a feature request for improvements in that area,
>> and also thanks for the warning. I had thought of that in passing, but the
>> real-world experience is really useful. I could easily see wanting that
>> stuff to be retained for a shorter time than the main records, which is
>> what I'd ask for.
>>
>> I assume that archiving, in general, would also remove this stuff, since
>> old jobs themselves will be removed?
>>
>> --
>> #BlackLivesMatter
>>
>> || \\UTGERS,     |---*O*---
>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
>>      `'
>>
>> On Sep 28, 2023, at 13:48, Paul Edmon 
>>  wrote:
>>
>> Slurm should take care of it when you add it.
>>
>> So far as horror stories, under previous versions our database size
>> ballooned to be so massive that it actually prevented us from upgrading and
>> we had to drop the columns containing the job_script and job_env.  This was
>> back before slurm started hashing the scripts so that it would only store
>> one copy of duplicate scripts.  After this point we found that the
>> job_script database stayed at a fairly reasonable size as most users use
>> functionally the same script each time. However the job_env continued to
>> grow like crazy as there are variables in our environment that change
>> fairly consistently depending on where the user is. Thus job_envs ended up
>> being too massive to keep around and so we had to drop them. Frankly we
>> never really used them for debugging. The job_scripts though are super
>> useful and not that much overhead.
>>
>> In summary my recommendation is to only store job_scripts. job_envs add
>> too much storage for little gain, unless your job_envs are basically the
>> same for each user in each location.
>>
>> Also it should be noted that there is no way to prune out job_scripts or
>> job_envs right now. So the only way to get rid of them if they get large is
>> to 0 out the column in the table. You can ask SchedMD for the mysql command
>> to do this as we had to do it here to our job_envs.
>>
>> -Paul Edmon-
>>
>> On 9/28/2023 1:40 PM, Davide DelVento wrote:
>>
>> In my current Slurm installation (recently upgraded to Slurm v23.02.3),
>> I only have
>>
>> AccountingStoreFlags=job_comment
>>
>> I now intend to add both
>>
>> AccountingStoreFlags=job_script
>> AccountingStoreFlags=job_env
>>
>> leaving the default 4MB value for max_script_size
>>
>> Do I need to do anything on the DB myself, or will slurm take care of the
>> additional tables if needed?
>>
>> Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I know
>> about the additional diskspace and potentially load needed, and with our
>> resources and typical workload I should be okay with that.
>>
>> Thanks!
>>
>>
>>
>>
>>
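
For reference, combining the settings discussed in this thread, the slurm.conf side would look roughly like the sketch below (my own summary, not an authoritative snippet; AccountingStoreFlags takes a comma-separated list rather than repeated lines):

# slurm.conf sketch: store batch scripts (and, if you accept the growth, environments) in the database
AccountingStoreFlags=job_comment,job_script,job_env
# max_script_size left at its 4MB default, as in the original message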


[slurm-users] directory permissions modified during aws-parallelcluster-slurm::finalize_compute

2023-10-02 Thread Jake Jellinek
This is a strange one.

I have built a Slurm cluster using AWS ParallelCluster and noticed that the
permissions of my /etc/sysconfig directory are broken!
I have found the log entries that support this finding, but I have no idea why
this is happening, nor where to find the script/config file I would need to fix it.

Anyone come across this?

Thank you all in advance
Jake



Oct  2 13:47:52 ip-172-25-35-65 user-data:  * directory[/etc/sysconfig] action create[2023-10-02T13:47:52+00:00] INFO: Processing directory[/etc/sysconfig] action create (aws-parallelcluster-slurm::finalize_compute line 24)

Oct  2 13:47:52 ip-172-25-35-65 user-data: [2023-10-02T13:47:52+00:00] INFO: directory[/etc/sysconfig] mode changed to 644

Oct  2 13:47:52 ip-172-25-35-65 user-data:

Oct  2 13:47:52 ip-172-25-35-65 user-data:- change mode from '0755' to '0644'
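
In case it helps anyone hitting the same thing, a minimal workaround sketch while the root cause is tracked down (assuming the only damage is the mode change shown in the log; 0755 is the value the log says it changed from):

# Put the directory mode back to what it was before the recipe ran
sudo chmod 0755 /etc/sysconfig

# Verify
stat -c '%a %n' /etc/sysconfig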