[slurm-users] how do slurm schedule health check when setting "HealthCheckNodeState=CYCLE"

2020-12-01 Thread taleintervenor
Hello, our Slurm cluster manages 600+ nodes, and I tested setting HealthCheckNodeState=CYCLE in slurm.conf. According to the slurm.conf manual, setting this to CYCLE causes Slurm to "cycle through running on all compute nodes through the course of the HealthCheckInterval". So I set "HealthCheckI

[slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread taleintervenor
Hello, the background to the question is: from a query command such as 'sacct -j 123456' I can see a series of jobs named 123456_1, 123456_2, etc., and I need to delete these job records from the MySQL database for some reason. But in the job_table of slurmdb, there is only one record with id_job=123456. n

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread taleintervenor
Thanks for the help. The doc page is useful and we can get the actual job ID now. The reason we need to delete job records from the database is that our billing system calculates user costs from these historical records. But after a Slurm system fault there will be some specific jobs which should not

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-29 Thread taleintervenor
Well, maybe the example in my first mail caused some misunderstanding. We just use sacct to check some job records manually during maintenance after the system fault. Our accounting and billing system is a commercial product which unfortunately also does not provide the ability to adjust billing ra

[slurm-users] how to print all the key-values of "job_desc" in job_submit.lua?

2021-03-29 Thread taleintervenor
Hello, because I'm not sure how the fields of the job_desc structure relate to sbatch parameters, I want to print all the fields and their values in job_desc when testing job_submit.lua. But the following code added to job_submit.lua failed to iterate through job_desc; the for loop print

[slurm-users] how to check what slurm is doing when job pending with reason=none?

2021-06-16 Thread taleintervenor
Hello, recently we noticed a strange delay between job submission and job start, even though the partition certainly has enough idle nodes to meet the job's demand. To avoid interference, we used the 4-node debug partition for the test, which has no other jobs to run. And the test job script is also

[slurm-users] Re: how to check what slurm is doing when job pending with reason=none?

2021-06-17 Thread taleintervenor
Thanks for the help. We tried reducing sched_interval and the pending time decreased as expected. But 'sched_interval' applies globally, and setting it too small may put pressure on the slurmctld server. Since we only want a quick response on the debug partition (which is designed to let user fre

[slurm-users] Is there bug in PrivateData=jobs option of slurmdbd?

2021-06-30 Thread taleintervenor
Hello, we found strange behavior involving sacct and the PrivateData option of slurmdbd. Our original configuration set "PrivateData = accounts,jobs,usage,users,reservations" in slurm.conf and did not set "PrivateData" in slurmdbd.conf. At this point, an ordinary user can see all others' job informat

[slurm-users] Re: Is there bug in PrivateData=jobs option of slurmdbd?

2021-07-01 Thread taleintervenor
I can confirm the test job was running (within the default time window, of course) when I ran the sacct query. Here is a new test record that shows it more clearly:
[2021-07-01T16:02:42+0800][hpczty@cas013] ~/downloads> sbatch testjob.sh
Submitted batch job 6955371
[2021-07-01T16:02:48+0

[slurm-users] Re: Re: Is there bug in PrivateData=jobs option of slurmdbd?

2021-07-02 Thread taleintervenor
Well, you got the point. We hadn't configured LDAP on the Slurm database node. After configuring LDAP authorization, the PrivateData option finally worked as expected. Thanks for the assistance.
From: Brian Andrus Sent: 2021-07-01 21:57 To: taleinterve...@sjtu.edu.cn Cc: slurm-users@lists.schedmd

[slurm-users] What is the 'Root/Cluster association' level in Resource Limits document mean?

2022-02-07 Thread taleintervenor
Hi all, according to the Resource Limits page ( https://slurm.schedmd.com/resource_limits.html ), there is a Root/Cluster association level under the account level that provides default limits. But how can we check or modify this "cluster association"? Using the command sacctmgr show association, I can only lis

[slurm-users] Re: What is the 'Root/Cluster association' level in Resource Limits document mean?

2022-02-10 Thread taleintervenor
Well, ‘sacctmgr modify cluster name=***’ is exactly what we want, and inspired by this command, we found that ‘sacctmgr show cluster’ clearly lists all the cluster associations. But during testing we found another problem. When a limit is defined at both the cluster level and the user level, the sma

[slurm-users] why sacct display wrong username while the UID is right?

2022-03-12 Thread taleintervenor
Hi all: We encountered a strange bug when querying job history using sacct. As shown below, we tried to list user hpczbzt's jobs, and sacct does filter the right jobs belonging to this user. But their username is displayed as phywht.
> sacct -X --user=hpczbzt --format=jobid%16,jobidraw,user,uid,partiti

[slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs

2022-03-23 Thread taleintervenor
Hi all: We found that Slurm jobs submitted with arguments such as --gres gpu:1 are not restricted in their GPU usage; users can still see all GPU cards on the allocated nodes. Our GPU nodes have 4 cards, and their gres.conf is:
> cat /etc/slurm/gres.conf
Name=gpu Type=NVlink_A100_40GB File=/dev/nvid

[slurm-users] Re: how to locate the problem when slurm failed to restrict gpu usage of user jobs

2022-03-24 Thread taleintervenor
Well, that was indeed the problem. We hadn't set ConstrainDevices=yes in cgroup.conf. After adding it, GPU restriction works as expected. But what is the relation between GPU restriction and cgroups? I had never heard that cgroups can limit GPU card usage. Isn't that a feature of CUDA or the NVIDIA driver?
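
For readers following this thread, here is a minimal sketch of the configuration involved. The slurm.conf line is an assumption, not taken from the post. With ConstrainDevices=yes, Slurm relies on the kernel's cgroup device controls to deny a job access to device files it was not allocated, such as those listed under File= in gres.conf. That is why the restriction comes from cgroups and not from CUDA or the driver.

```ini
# cgroup.conf: the setting the poster reports adding
ConstrainDevices=yes

# slurm.conf: assumed here, not shown in the post.
# cgroup.conf constraints are enforced by the task/cgroup plugin.
TaskPlugin=task/cgroup
```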

[slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread taleintervenor
Hi all: We need to detect certain problems when a job ends, so we wrote a detection script in the Slurm epilog, which should drain the node if a check fails. I know that exiting the epilog with a non-zero code makes Slurm drain the node automatically. But that way, the drain reason will all be mar
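
One common approach to this kind of question, sketched here under stated assumptions and not taken from the thread, is to have the epilog call scontrol update itself with a custom Reason and then exit 0. The helper name drain_with_reason and the SCONTROL override are hypothetical, introduced only so the function can be dry-run. SLURMD_NODENAME is assumed to be set in the epilog environment, with the short hostname as a fallback.

```shell
#!/bin/bash
# Hypothetical epilog helper: drain the local node with a self-defined reason.
# SCONTROL is an illustrative override (e.g. SCONTROL=echo for a dry run).
drain_with_reason() {
    local reason="$1"
    # SLURMD_NODENAME is assumed to be provided by slurmd to the epilog;
    # the fallback assumes the short hostname matches the Slurm NodeName.
    local node="${SLURMD_NODENAME:-$(hostname -s)}"
    "${SCONTROL:-scontrol}" update NodeName="$node" State=DRAIN Reason="$reason"
}

# Hypothetical usage inside the epilog:
#   if ! some_site_specific_check; then
#       drain_with_reason "epilog: site check failed"
#   fi
#   exit 0   # exit 0 so the node is not also drained with the generic epilog-failure reason
```

A dry run such as `SCONTROL=echo SLURMD_NODENAME=node001 drain_with_reason "epilog: check failed"` prints the command that would be issued without touching the cluster.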

[slurm-users] what is the possible reason for secondary slurmctld node not allocate job after takeover?

2022-06-03 Thread taleintervenor
Hi all: Our cluster has 2 Slurm control nodes, and scontrol show config reports the following:
> scontrol show config
.
SlurmctldHost[0]= slurm1
SlurmctldHost[1]= slurm2
StateSaveLocation = /etc/slurm/state
.
Of course we have made sure both nodes have the same slurm conf and mo

[slurm-users] Re: what is the possible reason for secondary slurmctld node not allocate job after takeover?

2022-06-04 Thread taleintervenor
Well, after increasing the slurmctld log level to debug, we did find some errors related to munge, such as:
[2022-06-04T15:17:21.258] debug: auth/munge: _decode_cred: Munge decode failed: Failed to connect to "/run/munge/munge.socket.2": Resource temporarily unavailable (retrying ...)
But when test m

[slurm-users] slurm continously log _remove_accrue_time_internal and something underflow error

2022-06-16 Thread taleintervenor
Hi all: We found that slurmctld keeps logging error messages such as:
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal accrue_cnt underflow
[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal acct acct-ioomj accrue_cnt underflow
[2022-06-16T04:01:20.219] err

[slurm-users] Is there split-brain danger when using backup slurmdbd?

2022-06-27 Thread taleintervenor
Hi all: We noticed that slurmdbd provides the configuration option DbdBackupHost for setting a secondary slurmdbd node. Since slurmdbd is closely tied to the database, we wonder whether running multiple slurmdbd instances introduces the danger of split-brain, which is a common topic in database high-availability discussions. Wi

[slurm-users] how do slurmctld determine whether a compute node is not responding?

2022-07-11 Thread taleintervenor
Hi all: Recently we found some strange entries in slurmctld.log about nodes not responding, such as:
[2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding
[2022-07-09T03:23:58.098] Node node171 now responding
[2022-07-09T03:23:58.099] Node node165 now responding
[2022-07-0

[slurm-users] Re: how do slurmctld determine whether a compute node is not responding?

2022-07-11 Thread taleintervenor
Hello, Kamil Wilczek: Well, I agree that the non-responding cases may be caused by an unstable network, since our Slurm cluster has 2 groups of nodes that are geographically distant, with only an Ethernet link between them. The reported nodes are all in one building while the slurmctld node is in another building. But

[slurm-users] Can slurm be configured to count CG job into max_job or max_submit limitation?

2022-07-18 Thread taleintervenor
Hi all, recently we found a problem caused by too many CG (completing) jobs. When a user continuously submits small jobs that complete quickly, the number of RUNNING and PENDING jobs is indeed restricted by MaxJob and MaxSubmit in the user's association. But Slurm does not count the CG jobs. Because we set an epilog to collect s

[slurm-users] What is the complete logic to calculate node number in job_submit.lua

2022-09-25 Thread taleintervenor
Hi all: When designing restrictions in job_submit.lua, I found that no member of the job_desc struct can be used directly to determine the number of nodes finally allocated to a job. job_desc.min_nodes seems to be a close answer, but it will be 0xFFFE when the user does not specify the -node option. The