Hey guys,

We're still hung up on our priority scheduling here, but I had a thought while reading some other postings to the list this morning. The syntax others are using for user creation is not as basic as ours and includes additional options such as "fairshare=parent". Are these things we need to specify when creating an account, or are they defaults? We're wondering if this is why our newer users aren't showing up with the correct accounts in squeue.
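
For comparison, the fuller form we've seen in other postings looks roughly like this; the names here are placeholders, and we haven't confirmed which of these options are actually required versus defaulted:

sacctmgr add account accountname Description="test account" Organization="ourgroup"
sacctmgr add user username Account=accountname DefaultAccount=accountname Fairshare=parent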

It still bugs us that the accounts show up correctly in sacctmgr, just not in squeue, where they matter for enforcing priority. Could this be a bug that was corrected in later releases?

AC

On 08/23/2013 03:13 PM, Alan V. Cowles wrote:
Final update for the day: we have found what is causing priority to be overlooked; we just don't know what is causing that in turn...

[root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" |grep user1
(null)  181378    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181379    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181380    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181381    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181382    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181383    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181384    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181385    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181386    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)
(null)  181387    lowmem testbatc    user1  PENDING       0:00 UNLIMITED      1 (Priority)

Compared to:

[root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" |grep user2
account  181378    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181379    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181380    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181381    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181382    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181383    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181384    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181385    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181386    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)
account  181387    lowmem testbatc    user2  PENDING       0:00 UNLIMITED      1 (Priority)


We have tried creating new users and new accounts this afternoon, and all of them show (null) as their account when we break out the formatting options on sacct.

sacctmgr add account accountname
sacctmgr add user username defaultaccount accountname
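
For the record, a couple of ways we can check whether the association actually took (the names and job id here are placeholders):

sacctmgr show assoc where user=username format=cluster,account,user,fairshare
scontrol show job <jobid> | grep -i account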

We even have one case where all users under an account are working fine except a user we added yesterday... so at some point in the past (the logs aren't helping us thus far) the ability to actually sync up a user and an account for accounting purposes has left us. Also, I have failed to mention until now that we are still running Slurm 2.5.4; my apologies for that.

AC


On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
Sorry to spam the list, but we wanted to keep the updates flowing.

We managed to find the issue in the MySQL database we are using for job accounting: the uid column was set to smallint(5), so any uid above 65535 was being truncated. Some SQL magic later, and we now have the appropriate uids showing up. A new monkey wrench, though: some test jobs submitted by user3 (below) get their fairshare value of 5000 as expected, but not user2... we just cleared his jobs from the queue and submitted another 100 jobs for testing, and none of them got a fairshare value...
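
To close the loop on the smallint issue, in case anyone else runs into it: the fix was along these lines. The database, table, and column names here are illustrative and may differ on your install, so check your own schema before altering anything.

mysql slurm_acct_db -e "DESCRIBE clustername_job_table;"
mysql slurm_acct_db -e "ALTER TABLE clustername_job_table MODIFY id_user INT UNSIGNED NOT NULL;"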

In his entire history of using our cluster he hasn't submitted over 5000 jobs, in fact:

[root@slurm-master ~]# sacct -c --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep user2 | wc -l
2573

So we can't figure out why he's being overlooked.

AC


On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
We think we may be onto something. Looking in sacct at the jobs submitted by the users, we found that many users share the same uid number in the Slurm database. It seems to correlate with the size of the user's uid number in our LDAP directory... users whose uid numbers are greater than 65535 get truncated to that number, while users with uid numbers below it keep their correct uid numbers (user2 in the sample output below).
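
For what it's worth, 65535 is the maximum value of an unsigned 16-bit integer, which makes us suspect something in the storage path rather than LDAP itself. A quick way to compare the two sides (the user name here is a placeholder):

getent passwd username | cut -d: -f3
sacct -u username --format=user,uid | head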




[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user2|head
user2 27545 30548 bwa        node01-1 2013-07-08T13:04:25 00:00:48 0:0 COMPLETED
user2 27545 30571 bwa        node01-1 2013-07-08T15:18:00 00:00:48 0:0 COMPLETED
user2 27545 30573 bwa        node01-1 2013-07-09T09:40:59 00:00:48 0:0 COMPLETED
user2 27545 30618 grep       node01-1 2013-07-09T11:57:12 00:00:48 0:0 COMPLETED
user2 27545 30619 bc         node01-1 2013-07-09T11:58:08 00:00:48 0:0 CANCELLED
user2 27545 30620 du         node01-1 2013-07-09T11:58:19 00:00:48 0:0 COMPLETED
user2 27545 30621 wc         node01-1 2013-07-09T11:58:43 00:00:48 0:0 COMPLETED
user2 27545 30622 zcat       node01-1 2013-07-09T11:58:54 00:00:48 0:0 COMPLETED
user2 27545 30623 zcat       node01-1 2013-07-09T12:12:56 00:00:48 0:0 COMPLETED
user2 27545 30624 zcat       node01-1 2013-07-09T12:26:37 00:00:48 0:0 CANCELLED

[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user1|head
user1 65535 83    impute2_w+ node01-1 2013-04-17T09:29:47 00:00:48 0:0 FAILED
user1 65535 84    impute2_w+ node01-1 2013-04-17T09:30:17 00:00:48 0:0 FAILED
user1 65535 85    impute2_w+ node01-1 2013-04-17T09:30:40 00:00:48 0:0 FAILED
user1 65535 86    impute2_w+ node01-1 2013-04-17T09:40:45 00:00:48 0:0 FAILED
user1 65535 87    date       node01-1 2013-04-17T09:42:36 00:00:48 0:0 COMPLETED
user1 65535 88    hostname   node01-1 2013-04-17T09:42:37 00:00:48 0:0 COMPLETED
user1 65535 89    impute2_w+ node01-1 2013-04-17T09:48:50 00:00:48 0:0 FAILED
user1 65535 90    impute2_w+ node01-1 2013-04-17T09:48:56 00:00:48 0:0 FAILED
user1 65535 91    impute2_w+ node01-1 2013-04-17T09:49:56 00:00:48 0:0 FAILED
user1 65535 92    impute2_w+ node01-1 2013-04-17T09:50:06 00:00:48 0:0 FAILED

[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user3|head
user3 65535 5     script.sh  node09-1 2013-04-09T15:55:07 00:00:48 0:0 FAILED
user3 65535 6     script.sh  node09-1 2013-04-09T15:55:13 INVALID  0:0 COMPLETED
user3 65535 8     bash       node09-1 2013-04-09T15:57:34 00:00:48 0:0 COMPLETED
user3 65535 7     bash       node09-1 2013-04-09T15:57:21 00:00:48 0:0 COMPLETED
user3 65535 23    script.sh  node09-1 2013-04-09T16:10:02 00:00:48 0:0 COMPLETED
user3 65535 27    script.sh  node09-+ 2013-04-09T16:18:33 00:00:48 0:0 CANCELLED
user3 65535 28    script.sh  node01-+ 2013-04-09T16:18:55 00:00:48 0:0 CANCELLED
user3 65535 30    script.sh  node01-+ 2013-04-09T16:34:12 00:00:48 0:0 CANCELLED
user3 65535 31    script.sh  node01-+ 2013-04-09T16:34:17 00:00:48 0:0 CANCELLED
user3 65535 32    script.sh  node01-+ 2013-04-09T16:34:21 00:00:48 0:0 CANCELLED

We are thinking this could be behind our larger issues with the system and priority factoring.

AC

On 08/23/2013 07:56 AM, Alan V. Cowles wrote:
Hey guys,

So in the past we had three prioritization factors in effect: partition, age, and fairshare, and they were working wonderfully. Currently partition has no effect for us, since it's all one large shared partition and everyone gets the same value there, so everything comes down to age and fairshare. In the past age and fairshare worked splendidly, and as I understand it the counters are set to refresh every two weeks, so basically everyone had a blank slate this past weekend.
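
For reference, we're on the standard multifactor priority plugin. The slurm.conf sketch below is roughly the shape of what we run, but the weights are illustrative rather than our exact values, and the two-week refresh presumably maps onto the decay/reset settings:

PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightPartition=8000
PriorityWeightQOS=0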

Our current issue is as follows: a problematic user has submitted 70k jobs to a partition with 512 slots, and she is currently consuming all of them... basically locking up the queue for anybody else who wants to get work done.

Normally fairshare kicks in and jumps other users to the top of the queue, but when a new user submitted 25 jobs (vs. the 70k) he didn't get any fairshare weighting at all...

JOBID   USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
162986  uid1      8371  371          0        0       8000    0     0
162987  uid1      8371  371          0        0       8000    0     0
162988  uid1      8371  371          0        0       8000    0     0
180698  uid2      8320  321          0        0       8000    0     0
180699  uid2      8320  321          0        0       8000    0     0
180700  uid2      8320  321          0        0       8000    0     0
180701  uid2      8320  321          0        0       8000    0     0


I'm used to seeing a user like that get 5000 fairshare to start out with... Thoughts?
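
In case it helps with diagnosis, here is roughly how the underlying fairshare numbers can be pulled; sshare shows the raw and effective usage behind the factor (uid2 stands in for the real user name):

sshare -a | grep uid2
sprio -l | grep uid2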

AC





