You will probably need to.

The way we handle it is that we add users when they first submit a job, via the job_submit.lua script. This way the database auto-populates with active users.
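
Roughly, the hook just shells out to sacctmgr the first time it
sees a user; the equivalent command would be something along these
lines (a sketch: the account name "dept" is only a placeholder for
whatever default account you use):

    # -i (--immediate) skips sacctmgr's confirmation prompt
    sacctmgr -i add user name=$USER account=dept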

-Paul Edmon-

On 10/3/23 9:01 AM, Davide DelVento wrote:
By increasing the slurmdbd verbosity level, I got additional information, namely the following:

slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
slurmdbd: debug: accounting_storage/as_mysql: as_mysql_jobacct_process_get_jobs: User xxxxxx  has no associations, and is not admin, so not returning any jobs.

again, in the slurmdbd logs, where xxxxxx is the POSIX ID of the user running the query.
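
For reference, the extra verbosity came from raising the debug level
in slurmdbd.conf and restarting slurmdbd, i.e. something like:

    DebugLevel=debug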

I suspect this is due to the fact that our user base is small enough (we are a departmental HPC) that we don't need allocations and the like, so I have not configured any associations (and have not even studied their configuration: at my previous site, which did use associations, someone else took care of Slurm administration).

Anyway, I read the fantastic document by our own member at https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations and indeed I have not even configured any Slurm users:

# sacctmgr show user
      User   Def Acct     Admin
---------- ---------- ---------
      root       root Administ+
#

So is that the issue? Should I just add all users? Any suggestions on the minimal (but robust) way to do that?
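
For example, would a naive loop along these lines be reasonable? Just
a sketch: it assumes a single catch-all account, hypothetically named
"dept", and that regular users have UIDs >= 1000.

    # one catch-all account, then every regular user added to it
    sacctmgr -i add account dept Description="department users"
    getent passwd | awk -F: '$3 >= 1000 {print $1}' | while read u; do
        sacctmgr -i add user name="$u" account=dept
    done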

Thanks!


On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento <davide.quan...@gmail.com> wrote:

    Thanks Paul, this helps.

    I don't have any PrivateData line in either config file. According
    to the docs, "By default, all information is visible to all users"
    so this should not be an issue. I tried to add a line with
    "PrivateData=jobs" to the conf files, just in case, but that
    didn't change the behavior.

    On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon <ped...@cfa.harvard.edu>
    wrote:

        At least in our setup, users can see their own scripts by
        doing sacct -B -j JOBID

        I would make sure that the scripts are actually being
        stored, and check how you have PrivateData set.
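
        A quick way to check both (exact output varies by version):

            scontrol show config | grep -E 'AccountingStoreFlags|PrivateData'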

        -Paul Edmon-

        On 10/2/2023 10:57 AM, Davide DelVento wrote:
        I deployed the job_script archival and it is working;
        however, the scripts can be queried only by root.

        A regular user can run sacct -lj against any job (even jobs
        owned by other users, and that's okay in our setup) with no
        problem. However, if they run sacct -j job_id --batch-script,
        even against a job they own, nothing is returned and I get a

        slurmdbd: error: couldn't get information for this user
        (null)(xxxxxx)

        in the slurmdbd logs, where xxxxxx is the POSIX ID of the
        user running the query.

        Neither of the config files, slurmdbd.conf and slurm.conf,
        has any "permission" setting. FWIW, we use LDAP.

        Is that the expected behavior, in that by default only root
        can see the job scripts? I was assuming the users themselves
        should be able to debug their own jobs... Any hint on what
        could be changed to achieve this?

        Thanks!



        On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento
        <davide.quan...@gmail.com> wrote:

            Fantastic, this is really helpful, thanks!

            On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon
            <ped...@cfa.harvard.edu> wrote:

                Yes, it was later than that. If you are on 23.02 you
                are good. We've been running with job_script storage
                on for years at this point, and that part of the
                database only uses up 8.4G. Our entire database takes
                up 29G on disk, so it's about 1/3 of the database. We
                also use database compression, which helps with the
                on-disk size; raw and uncompressed, our database is
                about 90G. We keep 6 months of data in our active
                database.

                -Paul Edmon-

                On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
                Sorry for the duplicate e-mail in such a short time:
                do you (or anyone) know when the hashing was added?
                We were planning to enable this on 21.08, but then
                had to delay our upgrade to it. I'm assuming the
                hashing came later than that, as I believe 21.08 is
                when the feature itself was added.

                On Sep 28, 2023, at 13:55, Ryan Novosielski
                <novos...@rutgers.edu> wrote:

                Thank you; we'll put in a feature request for
                improvements in that area, and also thanks for the
                warning! I had thought of that in passing, but the
                real-world experience is really useful. I could
                easily see wanting that stuff to be retained for a
                shorter time than the main records, which is what
                I'd ask for.

                I assume that archiving, in general, would also
                remove this stuff, since old jobs themselves will
                be removed?

                --
                #BlackLivesMatter
                    ____
                || \\UTGERS,     |---------------------------*O*---------------------------
                ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
                || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
                ||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
                     `'

                On Sep 28, 2023, at 13:48, Paul Edmon
                <ped...@cfa.harvard.edu> wrote:

                Slurm should take care of it when you add it.

                So far as horror stories go: under previous versions,
                our database size ballooned to be so massive that it
                actually prevented us from upgrading, and we had to
                drop the columns containing the job_script and
                job_env. This was back before Slurm started hashing
                the scripts so that it would only store one copy of
                duplicate scripts. After that point we found that the
                job_script part of the database stayed at a fairly
                reasonable size, as most users use functionally the
                same script each time. However, the job_envs
                continued to grow like crazy, as there are variables
                in our environment that change fairly consistently
                depending on where the user is. Thus the job_envs
                ended up being too massive to keep around, and we had
                to drop them. Frankly, we never really used them for
                debugging. The job_scripts, though, are super useful
                and not that much overhead.

                In summary, my recommendation is to store only
                job_scripts. job_envs add too much storage for little
                gain, unless your job_envs are basically the same for
                each user in each location.
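
                In slurm.conf terms that would look something like
                the line below (a sketch; combine it, comma-
                separated, with any other flags you already set):

                    AccountingStoreFlags=job_script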

                Also, it should be noted that there is currently no
                way to prune out job_scripts or job_envs. So the only
                way to get rid of them, if they get large, is to zero
                out the relevant column in the table. You can ask
                SchedMD for the MySQL command to do this, as we had
                to do it here for our job_envs.

                -Paul Edmon-

                On 9/28/2023 1:40 PM, Davide DelVento wrote:
                In my current Slurm installation (recently upgraded
                to Slurm v23.02.3), I only have

                AccountingStoreFlags=job_comment

                I now intend to also store job scripts and job
                environments, i.e. to change that line to

                AccountingStoreFlags=job_comment,job_script,job_env

                leaving the default 4MB value for max_script_size.

                Do I need to do anything on the DB myself, or will
                Slurm take care of the additional tables if needed?

                Any comments/suggestions/gotchas/pitfalls/horror
                stories to share? I know about the additional disk
                space and potential load needed, and with our
                resources and typical workload I should be okay with
                that.

                Thanks!

