Yes, it's all us just running and killing test jobs as the various users now.
Ralph Castain <[email protected]> wrote: > >Ah, never mind - I see the difference now. Was looking for some info to be >different > > >On Aug 23, 2013, at 12:17 PM, Ralph Castain <[email protected]> wrote: > >> Perhaps it is a copy/paste error - but those two tables are identical >> >> On Aug 23, 2013, at 12:14 PM, Alan V. Cowles <[email protected]> wrote: >> >>> >>> Final update for the day, we have found what is causing priority to be >>> overlooked we just don't know what is causing it... >>> >>> [root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M >>> %.9l %.6D %R" |grep user1 >>> (null) 181378 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181379 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181380 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181381 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181382 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181383 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181384 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181385 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181386 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> (null) 181387 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> >>> Compared to: >>> >>> [root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M >>> %.9l %.6D %R" |grep user2 >>> account 181378 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181379 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181380 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181381 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181382 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181383 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181384 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181385 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181386 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> account 181387 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 >>> (Priority) >>> >>> >>> We have tried to create new users and new accounts this afternoon and all >>> of them show (null) as their account when we break out the formatting rules >>> on sacct. >>> >>> sacctmgr add account accountname >>> sacctmgr add user username defaultaccount accountname >>> >>> We have even one case where all users under and account are working fine >>> except a user we added yesterday... so at some point in the past (logs >>> aren't helping us thus far) the ability to actually sync up a user and an >>> account for accounting purposes has left us. Also I have failed to mention >>> to this point that we are still running Slurm 2.5.4, my apologies for that. >>> >>> AC >>> >>> >>> On 08/23/2013 11:22 AM, Alan V. Cowles wrote: >>>> Sorry to spam the list, but we wanted to keep updates in flux. >>>> >>>> We managed to find the issue in our mysqldb we are using for job >>>> accounting which had the column value set to smallint (5) for that value, >>>> so it was rounding things off, some SQL magic and we now have appropriate >>>> uid's showing up. A new monkey wrench, some test jobs submitted by user3 >>>> below get their fairshare value of 5000 as expected, just not user2... 
>>>> In his entire history of using our cluster he hasn't submitted over 5000
>>>> jobs; in fact:
>>>>
>>>> [root@slurm-master ~]# sacct -c --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep user2 | wc -l
>>>> 2573
>>>>
>>>> So we can't figure out why he's being overlooked.
>>>>
>>>> AC
>>>>
>>>> On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
>>>>> We think we may be onto something. Looking in sacct at the jobs
>>>>> submitted by the users, we found that many users share the same uid
>>>>> number in the Slurm database. It seems to correlate with the size of
>>>>> the user's uid number in our LDAP directory... users whose uid numbers
>>>>> are greater than 65535 get truncated to that number, while users with
>>>>> uid numbers below that keep their correct uid numbers (user2 in the
>>>>> sample output below):
>>>>>
>>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user2 | head
>>>>> user2 27545 30548 bwa node01-1 2013-07-08T13:04:25 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30571 bwa node01-1 2013-07-08T15:18:00 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30573 bwa node01-1 2013-07-09T09:40:59 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30618 grep node01-1 2013-07-09T11:57:12 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30619 bc node01-1 2013-07-09T11:58:08 00:00:48 0:0 CANCELLED
>>>>> user2 27545 30620 du node01-1 2013-07-09T11:58:19 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30621 wc node01-1 2013-07-09T11:58:43 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30622 zcat node01-1 2013-07-09T11:58:54 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30623 zcat node01-1 2013-07-09T12:12:56 00:00:48 0:0 COMPLETED
>>>>> user2 27545 30624 zcat node01-1 2013-07-09T12:26:37 00:00:48 0:0 CANCELLED
>>>>>
>>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user1 | head
>>>>> user1 65535 83 impute2_w+ node01-1 2013-04-17T09:29:47 00:00:48 0:0 FAILED
>>>>> user1 65535 84 impute2_w+ node01-1 2013-04-17T09:30:17 00:00:48 0:0 FAILED
>>>>> user1 65535 85 impute2_w+ node01-1 2013-04-17T09:30:40 00:00:48 0:0 FAILED
>>>>> user1 65535 86 impute2_w+ node01-1 2013-04-17T09:40:45 00:00:48 0:0 FAILED
>>>>> user1 65535 87 date node01-1 2013-04-17T09:42:36 00:00:48 0:0 COMPLETED
>>>>> user1 65535 88 hostname node01-1 2013-04-17T09:42:37 00:00:48 0:0 COMPLETED
>>>>> user1 65535 89 impute2_w+ node01-1 2013-04-17T09:48:50 00:00:48 0:0 FAILED
>>>>> user1 65535 90 impute2_w+ node01-1 2013-04-17T09:48:56 00:00:48 0:0 FAILED
>>>>> user1 65535 91 impute2_w+ node01-1 2013-04-17T09:49:56 00:00:48 0:0 FAILED
>>>>> user1 65535 92 impute2_w+ node01-1 2013-04-17T09:50:06 00:00:48 0:0 FAILED
>>>>>
>>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user3 | head
>>>>> user3 65535 5 script.sh node09-1 2013-04-09T15:55:07 00:00:48 0:0 FAILED
>>>>> user3 65535 6 script.sh node09-1 2013-04-09T15:55:13 INVALID 0:0 COMPLETED
>>>>> user3 65535 8 bash node09-1 2013-04-09T15:57:34 00:00:48 0:0 COMPLETED
>>>>> user3 65535 7 bash node09-1 2013-04-09T15:57:21 00:00:48 0:0 COMPLETED
>>>>> user3 65535 23 script.sh node09-1 2013-04-09T16:10:02 00:00:48 0:0 COMPLETED
>>>>> user3 65535 27 script.sh node09-+ 2013-04-09T16:18:33 00:00:48 0:0 CANCELLED
>>>>> user3 65535 28 script.sh node01-+ 2013-04-09T16:18:55 00:00:48 0:0 CANCELLED
>>>>> user3 65535 30 script.sh node01-+ 2013-04-09T16:34:12 00:00:48 0:0 CANCELLED
>>>>> user3 65535 31 script.sh node01-+ 2013-04-09T16:34:17 00:00:48 0:0 CANCELLED
>>>>> user3 65535 32 script.sh node01-+ 2013-04-09T16:34:21 00:00:48 0:0 CANCELLED
>>>>>
>>>>> We are thinking perhaps this could lead to our major issues with the
>>>>> system and priority factoring.
>>>>>
>>>>> AC
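[A quick shell-level check for this kind of clamping, as a diagnostic sketch
rather than something from the thread; user1 stands in for any account with a
large LDAP uid:

# What NSS/LDAP says the uid really is (anything above 65535 is at risk):
[root@slurm-master ~]# id -u user1
# What the accounting database recorded for that user's jobs:
[root@slurm-master ~]# sacct -u user1 --format=User,uid | head
# If every real uid above 65535 comes back from sacct as exactly 65535,
# the column is clamping values at its maximum, as in the output above.]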
>>>>>
>>>>> On 08/23/2013 07:56 AM, Alan V. Cowles wrote:
>>>>>> Hey guys,
>>>>>>
>>>>>> In the past we had three prioritization factors in effect: partition,
>>>>>> age, and fairshare, and they were working wonderfully. Currently the
>>>>>> partition factor does nothing for us, since it's all one large shared
>>>>>> partition and everyone gets the same value there, so everything comes
>>>>>> down to age and fairshare. Both worked splendidly in the past, and as
>>>>>> I understand it we have the counters set to reset every two weeks...
>>>>>> so basically everyone had a blank slate this past weekend. Our current
>>>>>> issue is as follows...
>>>>>>
>>>>>> A problematic user has submitted 70k jobs to a partition with 512
>>>>>> slots and she is currently consuming all of them... basically locking
>>>>>> up the queue for anybody else who wants to try to work.
>>>>>>
>>>>>> Normally fairshare kicks in and jumps other users to the top of the
>>>>>> queue, but when a new user submitted 25 jobs (vs. her 70k) he didn't
>>>>>> get any fairshare weighting at all...
>>>>>>
>>>>>>  JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE
>>>>>> 162986 uid1     8371 371         0       0      8000   0    0
>>>>>> 162987 uid1     8371 371         0       0      8000   0    0
>>>>>> 162988 uid1     8371 371         0       0      8000   0    0
>>>>>> 180698 uid2     8320 321         0       0      8000   0    0
>>>>>> 180699 uid2     8320 321         0       0      8000   0    0
>>>>>> 180700 uid2     8320 321         0       0      8000   0    0
>>>>>> 180701 uid2     8320 321         0       0      8000   0    0
>>>>>>
>>>>>> I'm used to seeing a user like that get 5000 fairshare to start out
>>>>>> with... Thoughts?
>>>>>>
>>>>>> AC
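[For anyone chasing the same symptom: the factor weights and the live per-job
numbers can be inspected directly. A sketch, assuming the multifactor
priority setup described above:

# Per-job priority broken out by factor (the table above appears to be
# this kind of output):
[root@cluster-login ~]# sprio -l

# The fairshare tree: accounts, users, shares, and usage. A user whose
# jobs show a (null) account, i.e. no association as seen earlier in
# this thread, would likely contribute nothing here, matching the
# FAIRSHARE column of 0:
[root@cluster-login ~]# sshare -a

# The configured factor weights themselves:
[root@cluster-login ~]# scontrol show config | grep -i ^Priority]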
