Perhaps it is a copy/paste error, but those two tables are identical.

On Aug 23, 2013, at 12:14 PM, Alan V. Cowles <alan.cow...@duke.edu> wrote:
> Final update for the day: we have found what is causing priority to be
> overlooked, we just don't know what is causing that in turn...
>
> [root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" | grep user1
> (null) 181378 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181379 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181380 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181381 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181382 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181383 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181384 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181385 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181386 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
> (null) 181387 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>
> Compared to:
>
> [root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" | grep user2
> account 181378 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181379 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181380 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181381 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181382 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181383 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181384 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181385 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181386 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
> account 181387 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>
> We have tried to create new users and new accounts this afternoon, and all
> of them show (null) as their account when we break the account field out
> with the format options. We created them with:
>
> sacctmgr add account accountname
> sacctmgr add user username defaultaccount accountname
>
> We even have one case where all of the users under an account are working
> fine except a user we added yesterday... so at some point in the past (the
> logs aren't helping us thus far) the ability to actually sync up a user
> and an account for accounting purposes has left us. Also, I have failed to
> mention up to this point that we are still running Slurm 2.5.4; my
> apologies for that.
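>
> For anyone following along, this is roughly how we are double-checking the
> result (user and account names below are placeholders); we are also
> re-trying the add with the key=value spelling the sacctmgr man page
> examples use:
>
> # list the stored associations; a working user should show a real account
> sacctmgr show associations format=Cluster,Account,User,Fairshare
> # key=value form for attaching a new user to a default account
> sacctmgr add user username DefaultAccount=accountname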
>
> AC
>
> On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
>> Sorry to spam the list, but we wanted to keep the updates coming.
>>
>> We managed to find the issue in the MySQL database we are using for job
>> accounting: the uid column was defined as smallint(5), so it was capping
>> the values. Some SQL magic later, we now have the appropriate uids
>> showing up. A new monkey wrench: some test jobs submitted by user3 below
>> get their fairshare value of 5000 as expected, just not user2... we just
>> cleared his jobs from the queue and submitted another 100 jobs for
>> testing, and none of them got a fairshare value...
>>
>> In his entire history of using our cluster he hasn't submitted over 5000
>> jobs; in fact:
>>
>> [root@slurm-master ~]# sacct -c --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep user2 | wc -l
>> 2573
>>
>> So we can't figure out why he's being overlooked.
>>
>> AC
>>
>> On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
>>> We think we may be onto something: looking in sacct at the jobs
>>> submitted by our users, we found that many users share the same uid
>>> number in the Slurm database. It seems to correlate with the size of the
>>> user's uid number in our LDAP directory... users whose uid number is
>>> greater than 65535 get truncated to that number, while users with uid
>>> numbers below that keep their correct uid numbers (user2 in the sample
>>> output below).
>>>
>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user2 | head
>>> user2 27545 30548 bwa        node01-1 2013-07-08T13:04:25 00:00:48 0:0 COMPLETED
>>> user2 27545 30571 bwa        node01-1 2013-07-08T15:18:00 00:00:48 0:0 COMPLETED
>>> user2 27545 30573 bwa        node01-1 2013-07-09T09:40:59 00:00:48 0:0 COMPLETED
>>> user2 27545 30618 grep       node01-1 2013-07-09T11:57:12 00:00:48 0:0 COMPLETED
>>> user2 27545 30619 bc         node01-1 2013-07-09T11:58:08 00:00:48 0:0 CANCELLED
>>> user2 27545 30620 du         node01-1 2013-07-09T11:58:19 00:00:48 0:0 COMPLETED
>>> user2 27545 30621 wc         node01-1 2013-07-09T11:58:43 00:00:48 0:0 COMPLETED
>>> user2 27545 30622 zcat       node01-1 2013-07-09T11:58:54 00:00:48 0:0 COMPLETED
>>> user2 27545 30623 zcat       node01-1 2013-07-09T12:12:56 00:00:48 0:0 COMPLETED
>>> user2 27545 30624 zcat       node01-1 2013-07-09T12:26:37 00:00:48 0:0 CANCELLED
>>>
>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user1 | head
>>> user1 65535 83    impute2_w+ node01-1 2013-04-17T09:29:47 00:00:48 0:0 FAILED
>>> user1 65535 84    impute2_w+ node01-1 2013-04-17T09:30:17 00:00:48 0:0 FAILED
>>> user1 65535 85    impute2_w+ node01-1 2013-04-17T09:30:40 00:00:48 0:0 FAILED
>>> user1 65535 86    impute2_w+ node01-1 2013-04-17T09:40:45 00:00:48 0:0 FAILED
>>> user1 65535 87    date       node01-1 2013-04-17T09:42:36 00:00:48 0:0 COMPLETED
>>> user1 65535 88    hostname   node01-1 2013-04-17T09:42:37 00:00:48 0:0 COMPLETED
>>> user1 65535 89    impute2_w+ node01-1 2013-04-17T09:48:50 00:00:48 0:0 FAILED
>>> user1 65535 90    impute2_w+ node01-1 2013-04-17T09:48:56 00:00:48 0:0 FAILED
>>> user1 65535 91    impute2_w+ node01-1 2013-04-17T09:49:56 00:00:48 0:0 FAILED
>>> user1 65535 92    impute2_w+ node01-1 2013-04-17T09:50:06 00:00:48 0:0 FAILED
>>>
>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user3 | head
>>> user3 65535 5     script.sh  node09-1 2013-04-09T15:55:07 00:00:48 0:0 FAILED
>>> user3 65535 6     script.sh  node09-1 2013-04-09T15:55:13 INVALID  0:0 COMPLETED
>>> user3 65535 8     bash       node09-1 2013-04-09T15:57:34 00:00:48 0:0 COMPLETED
>>> user3 65535 7     bash       node09-1 2013-04-09T15:57:21 00:00:48 0:0 COMPLETED
>>> user3 65535 23    script.sh  node09-1 2013-04-09T16:10:02 00:00:48 0:0 COMPLETED
>>> user3 65535 27    script.sh  node09-+ 2013-04-09T16:18:33 00:00:48 0:0 CANCELLED
>>> user3 65535 28    script.sh  node01-+ 2013-04-09T16:18:55 00:00:48 0:0 CANCELLED
>>> user3 65535 30    script.sh  node01-+ 2013-04-09T16:34:12 00:00:48 0:0 CANCELLED
>>> user3 65535 31    script.sh  node01-+ 2013-04-09T16:34:17 00:00:48 0:0 CANCELLED
>>> user3 65535 32    script.sh  node01-+ 2013-04-09T16:34:21 00:00:48 0:0 CANCELLED
>>>
>>> We are thinking perhaps this could be behind our major issues with the
>>> system and priority factoring.
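>>>
>>> Notably, 65535 is exactly the ceiling of an unsigned 16-bit integer, so
>>> it looks as though the uid is landing in a 16-bit field somewhere along
>>> the way. A quick way to see which of our directory users would be
>>> affected:
>>>
>>> # list passwd/LDAP users whose uidNumber exceeds the 16-bit ceiling
>>> getent passwd | awk -F: '$3 > 65535 {print $1, $3}'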
>>>
>>> AC
>>>
>>> On 08/23/2013 07:56 AM, Alan V. Cowles wrote:
>>>> Hey guys,
>>>>
>>>> In the past we had three prioritization factors in effect: partition,
>>>> age, and fairshare, and they were working wonderfully. Partition
>>>> currently has no effect for us, since it's all one large shared
>>>> partition and everyone gets the same value there, so everything is
>>>> balanced between age and fairshare. In the past those two worked
>>>> splendidly, and as I understand it we have the counters set to refresh
>>>> every 2 weeks... so basically everyone had a blank slate this past
>>>> weekend. Our current issue is as follows.
>>>>
>>>> A problematic user has submitted 70k jobs to a partition with 512
>>>> slots, and she is currently consuming all of the slots... basically
>>>> locking up the queue for anybody else who wants to try and work.
>>>>
>>>> Normally fairshare kicks in and jumps other users to the top of the
>>>> queue, but when a new user submitted 25 jobs (vs. the 70k) he didn't
>>>> get any fairshare weighting at all:
>>>>
>>>>  JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE
>>>> 162986 uid1     8371 371         0       0      8000   0    0
>>>> 162987 uid1     8371 371         0       0      8000   0    0
>>>> 162988 uid1     8371 371         0       0      8000   0    0
>>>> 180698 uid2     8320 321         0       0      8000   0    0
>>>> 180699 uid2     8320 321         0       0      8000   0    0
>>>> 180700 uid2     8320 321         0       0      8000   0    0
>>>> 180701 uid2     8320 321         0       0      8000   0    0
>>>>
>>>> I'm used to seeing a user like that get 5000 fairshare to start out
>>>> with... Thoughts?
>>>>
>>>> AC
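>>>>
>>>> P.S. For completeness, the table above matches sprio -l's long format;
>>>> the configured weights and each user's fairshare standing can be
>>>> cross-checked with:
>>>>
>>>> # the per-factor priority weights currently in effect (from slurm.conf)
>>>> scontrol show config | grep -i PriorityWeight
>>>> # fairshare usage and standing for every association
>>>> sshare -a
>>>> # per-job priority factor breakdown
>>>> sprio -l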