Sorry to spam the list, but we wanted to keep the updates coming as we work through this.

We managed to find the issue in the MySQL database we use for job accounting: the column holding that value was defined as smallint(5), so uid numbers above 65535 were being capped. With a bit of SQL we now have the appropriate uid's showing up. A new monkey wrench, though: some test jobs submitted by user3 (below) get their fairshare value of 5000 as expected, but user2's do not. We just cleared his jobs from the queue and submitted another 100 jobs for testing, and none of them got a fairshare value...
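For anyone who hits the same thing, the fix amounted to widening that column so it can hold uid numbers above 65535. A rough sketch only -- the database, table, and column names below are placeholders, not necessarily what your slurmdbd schema actually uses:

# widen the uid column so values above 65535 are no longer capped
# (slurm_acct_db / cluster_job_table / id_user are placeholder names -- check your own schema first)
mysql slurm_acct_db -e "ALTER TABLE cluster_job_table MODIFY COLUMN id_user INT UNSIGNED NOT NULL"

That only stops the problem going forward; the rows already capped at 65535 had to be repaired separately (the SQL part above), since the original values can't be recovered from the capped column alone.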

In his entire history of using our cluster he hasn't submitted over 5000 jobs; in fact:

[root@slurm-master ~]# sacct -c --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep user2 | wc -l
2573

So we can't figure out why he's being overlooked.

AC


On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
We think we may be onto something. Looking in sacct at the jobs submitted by our users, we found that many users share the same uid number in the slurm database. It seems to correlate with the size of the user's uid number in our LDAP directory: users whose uid number is greater than 65535 get truncated to that value, while users with uid numbers below it keep their correct uid numbers (user2 in the sample output below).




[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user2|head
user2 27545 30548 bwa node01-1 2013-07-08T13:04:25 00:00:48 0:0 COMPLETED
user2 27545 30571 bwa node01-1 2013-07-08T15:18:00 00:00:48 0:0 COMPLETED
user2 27545 30573 bwa node01-1 2013-07-09T09:40:59 00:00:48 0:0 COMPLETED
user2 27545 30618 grep node01-1 2013-07-09T11:57:12 00:00:48 0:0 COMPLETED
user2 27545 30619 bc node01-1 2013-07-09T11:58:08 00:00:48 0:0 CANCELLED
user2 27545 30620 du node01-1 2013-07-09T11:58:19 00:00:48 0:0 COMPLETED
user2 27545 30621 wc node01-1 2013-07-09T11:58:43 00:00:48 0:0 COMPLETED
user2 27545 30622 zcat node01-1 2013-07-09T11:58:54 00:00:48 0:0 COMPLETED
user2 27545 30623 zcat node01-1 2013-07-09T12:12:56 00:00:48 0:0 COMPLETED
user2 27545 30624 zcat node01-1 2013-07-09T12:26:37 00:00:48 0:0 CANCELLED

[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user1|head
user1 65535 83 impute2_w+ node01-1 2013-04-17T09:29:47 00:00:48 0:0 FAILED
user1 65535 84 impute2_w+ node01-1 2013-04-17T09:30:17 00:00:48 0:0 FAILED
user1 65535 85 impute2_w+ node01-1 2013-04-17T09:30:40 00:00:48 0:0 FAILED
user1 65535 86 impute2_w+ node01-1 2013-04-17T09:40:45 00:00:48 0:0 FAILED
user1 65535 87 date node01-1 2013-04-17T09:42:36 00:00:48 0:0 COMPLETED
user1 65535 88 hostname node01-1 2013-04-17T09:42:37 00:00:48 0:0 COMPLETED
user1 65535 89 impute2_w+ node01-1 2013-04-17T09:48:50 00:00:48 0:0 FAILED
user1 65535 90 impute2_w+ node01-1 2013-04-17T09:48:56 00:00:48 0:0 FAILED
user1 65535 91 impute2_w+ node01-1 2013-04-17T09:49:56 00:00:48 0:0 FAILED
user1 65535 92 impute2_w+ node01-1 2013-04-17T09:50:06 00:00:48 0:0 FAILED

[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user3|head
user3 65535 5 script.sh node09-1 2013-04-09T15:55:07 00:00:48 0:0 FAILED
user3 65535 6 script.sh node09-1 2013-04-09T15:55:13 INVALID 0:0 COMPLETED
user3 65535 8 bash node09-1 2013-04-09T15:57:34 00:00:48 0:0 COMPLETED
user3 65535 7 bash node09-1 2013-04-09T15:57:21 00:00:48 0:0 COMPLETED
user3 65535 23 script.sh node09-1 2013-04-09T16:10:02 00:00:48 0:0 COMPLETED
user3 65535 27 script.sh node09-+ 2013-04-09T16:18:33 00:00:48 0:0 CANCELLED
user3 65535 28 script.sh node01-+ 2013-04-09T16:18:55 00:00:48 0:0 CANCELLED
user3 65535 30 script.sh node01-+ 2013-04-09T16:34:12 00:00:48 0:0 CANCELLED
user3 65535 31 script.sh node01-+ 2013-04-09T16:34:17 00:00:48 0:0 CANCELLED
user3 65535 32 script.sh node01-+ 2013-04-09T16:34:21 00:00:48 0:0 CANCELLED
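A quick way to see the mismatch for any one account is to compare what the directory reports with what the accounting records show (the user names here are just the anonymized ones from the output above):

# uid number according to LDAP/NSS
id -u user1
# uid number as recorded in the accounting database
sacct -u user1 -X --format=User,uid | tail -n 1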

We are thinking this could be the root of our larger issues with the system and priority factoring.

AC

On 08/23/2013 07:56 AM, Alan V. Cowles wrote:
Hey guys,

So, in the past we had three prioritization factors in effect: partition, age, and fairshare, and they were working wonderfully. Currently partition has no effect for us, since it's all one large shared partition and everyone gets the same value there, so everything comes down to age and fairshare. Those two worked splendidly in the past, and as I understand it we have the usage counters set to refresh every 2 weeks, so basically everyone had a blank slate this past weekend. Our current issue is as follows...
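For reference, the settings in play here can be read straight off the live config:

# dump the live priority configuration; the relevant knobs are
# PriorityType=priority/multifactor, PriorityDecayHalfLife (and/or
# PriorityUsageResetPeriod), and the PriorityWeight* values
scontrol show config | grep -i "^Priority"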

A problematic user has submitted 70k jobs to a partition with 512 slots, and she is currently consuming all of them, basically locking up the queue for anybody else who wants to get work done.

Normally fairshare kicks in and jumps other users to the top of the queue, but when a new user submitted 25 jobs (vs. the 70k) he didn't get any fairshare weighting at all...

JOBID   USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
162986  uid1  8371      371  0          0        8000       0    0
162987  uid1  8371      371  0          0        8000       0    0
162988  uid1  8371      371  0          0        8000       0    0
180698  uid2  8320      321  0          0        8000       0    0
180699  uid2  8320      321  0          0        8000       0    0
180700  uid2  8320      321  0          0        8000       0    0
180701  uid2  8320      321  0          0        8000       0    0
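For reference, these are the sorts of commands that expose those numbers (the user name below is just the anonymized placeholder from the table above):

# per-job priority breakdown for one user's pending jobs
sprio -l -u uid2
# per-association usage and fairshare factors across all users
sshare -a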


I'm used to seeing a user like that get 5000 fairshare to start out with... Thoughts?

AC



