Hi,

On 10.05.2013 at 00:35, Chris Paciorek wrote:
> For the (default) queue [called low.q] that these jobs are going to, we have
> the time limit set to 28 days (see below). Users are not explicitly
> requesting h_rt/s_rt. The jobs that are slipping ahead of the reserved job
> are not actually jobs that are short in time, and SGE shouldn't have any way
> of thinking that they are.

https://arc.liv.ac.uk/trac/SGE/ticket/388
http://gridengine.org/pipermail/users/2012-July/004104.html

Without an explicit request, the default runtime will be assumed for all jobs. The jobs 34195-34198 weren't started at once, but one after the other. I would say the jobs running before them on nodes scf-sm01 and scf-sm03 respectively were shorter than the estimated 7200 hrs.

Could you please try submitting a shorter job with an explicitly requested h_rt and check whether it changes anything.

-- Reuti

> I'm starting to suspect that the issue may be that the reservation seems to
> be hard-wired to individual nodes, and in our case it is being hard-wired to
> a node with the longest-running job, while other jobs on other nodes are
> finishing more quickly. I suppose this makes sense - in order to collect
> sufficient cores for a reservation, it needs to do so on a single node, so at
> some point it needs to decide which node that will be. Unfortunately, in
> this case it's immediately choosing the node with the long-running job as
> soon as the reservation is requested, but that long-running job is likely to
> continue to run for a while. Can anyone weigh in on whether this sounds right
> and, if so, any ideas to deal with this?
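A resubmission along the lines Reuti suggests might look like the following sketch; the 24-hour limit is an assumed value for illustration, not one given in the thread:

```shell
# Resubmit the reserving job with an explicit hard runtime limit (h_rt),
# so the scheduler does not fall back to the 7200-hour default_duration
# when computing the reservation.
# NOTE: 24:00:00 is an assumed runtime, not a value from this thread.
qsub -pe smp 16 -R y -l h_rt=24:00:00 -b y "R CMD BATCH --no-save tmp.R tmp.out"
```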
>
> beren:~$ qconf -sq low.q
> qname                 low.q
> hostlist              @sm0
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              19
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH
> ckpt_list             NONE
> pe_list               smp smpcontrol
> rerun                 FALSE
> slots                 32
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            sm0users
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  671:00:00
> h_rt                  672:00:00
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
> beren:~$ qconf -sc
> #name               shortcut  type      relop  requestable  consumable  default  urgency
> #----------------------------------------------------------------------------------------
> arch                a         RESTRING  ==     YES          NO          NONE     0
> calendar            c         RESTRING  ==     YES          NO          NONE     0
> cpu                 cpu       DOUBLE    >=     YES          NO          0        0
> display_win_gui     dwg       BOOL      ==     YES          NO          0        0
> h_core              h_core    MEMORY    <=     YES          NO          0        0
> h_cpu               h_cpu     TIME      <=     YES          NO          0:0:0    0
> h_data              h_data    MEMORY    <=     YES          NO          0        0
> h_fsize             h_fsize   MEMORY    <=     YES          NO          0        0
> h_rss               h_rss     MEMORY    <=     YES          NO          0        0
> h_rt                h_rt      TIME      <=     YES          NO          0:0:0    0
> h_stack             h_stack   MEMORY    <=     YES          NO          0        0
> h_vmem              h_vmem    MEMORY    <=     YES          NO          0        0
> hostname            h         HOST      ==     YES          NO          NONE     0
> load_avg            la        DOUBLE    >=     NO           NO          0        0
> load_long           ll        DOUBLE    >=     NO           NO          0        0
> load_medium         lm        DOUBLE    >=     NO           NO          0        0
> load_short          ls        DOUBLE    >=     NO           NO          0        0
> m_core              core      INT       <=     YES          NO          0        0
> m_socket            socket    INT       <=     YES          NO          0        0
> m_topology          topo      RESTRING  ==     YES          NO          NONE     0
> m_topology_inuse    utopo     RESTRING  ==     YES          NO          NONE     0
> mem_free            mf        MEMORY    <=     YES          NO          0        0
> mem_total           mt        MEMORY    <=     YES          NO          0        0
> mem_used            mu        MEMORY    >=     YES          NO          0        0
> min_cpu_interval    mci       TIME      <=     NO           NO          0:0:0    0
> np_load_avg         nla       DOUBLE    >=     NO           NO          0        0
> np_load_long        nll       DOUBLE    >=     NO           NO          0        0
> np_load_medium      nlm       DOUBLE    >=     NO           NO          0        0
> np_load_short       nls       DOUBLE    >=     NO           NO          0        0
> num_proc            p         INT       ==     YES          NO          0        0
> qname               q         RESTRING  ==     YES          NO          NONE     0
> rerun               re        BOOL      ==     NO           NO          0        0
> s_core              s_core    MEMORY    <=     YES          NO          0        0
> s_cpu               s_cpu     TIME      <=     YES          NO          0:0:0    0
> s_data              s_data    MEMORY    <=     YES          NO          0        0
> s_fsize             s_fsize   MEMORY    <=     YES          NO          0        0
> s_rss               s_rss     MEMORY    <=     YES          NO          0        0
> s_rt                s_rt      TIME      <=     YES          NO          0:0:0    0
> s_stack             s_stack   MEMORY    <=     YES          NO          0        0
> s_vmem              s_vmem    MEMORY    <=     YES          NO          0        0
> seq_no              seq       INT       ==     NO           NO          0        0
> slots               s         INT       <=     YES          YES         1        1000
> swap_free           sf        MEMORY    <=     YES          NO          0        0
> swap_rate           sr        MEMORY    >=     YES          NO          0        0
> swap_rsvd           srsv      MEMORY    >=     YES          NO          0        0
> swap_total          st        MEMORY    <=     YES          NO          0        0
> swap_used           su        MEMORY    >=     YES          NO          0        0
> tmpdir              tmp       RESTRING  ==     NO           NO          NONE     0
> virtual_free        vf        MEMORY    <=     YES          YES         0        0
> virtual_total       vt        MEMORY    <=     YES          NO          0        0
> virtual_used        vu        MEMORY    >=     YES          NO          0        0
>
> On Thu, May 9, 2013 at 10:43 AM, Reuti <[email protected]> wrote:
> On 09.05.2013 at 18:51, Chris Paciorek wrote:
>
> > We're having a problem similar to that described in this thread:
> > http://www.mentby.com/Group/grid-engine/62u4-resource-reservation-not-working-for-some-jobs.html
> >
> > We're running Grid Engine 6.2u5 for a cluster of 4 Linux nodes (32 cores
> > each) running Ubuntu 12.04 (Precise).
> >
> > We're seeing that jobs that request a reservation and are at the top of the
> > queue are not starting, with lower-priority jobs that are requesting fewer
> > cores slipping ahead of the higher-priority job. An example of this is at
> > the bottom of this posting.
>
> Besides the defined "default_duration 7200:00:00": what h_rt/s_rt request was
> supplied to the short jobs?
> -- Reuti
>
> > Here are the results of "qconf -ssconf":
> > algorithm                         default
> > schedule_interval                 0:0:15
> > maxujobs                          0
> > queue_sort_method                 load
> > job_load_adjustments              np_load_avg=0.50
> > load_adjustment_decay_time        0:7:30
> > load_formula                      np_load_avg
> > schedd_job_info                   true
> > flush_submit_sec                  0
> > flush_finish_sec                  0
> > params                            MONITOR=1
> > reprioritize_interval             0:0:0
> > halftime                          720
> > usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> > compensation_factor               5.000000
> > weight_user                       0.250000
> > weight_project                    0.250000
> > weight_department                 0.250000
> > weight_job                        0.250000
> > weight_tickets_functional         0
> > weight_tickets_share              100000
> > share_override_tickets            TRUE
> > share_functional_shares           TRUE
> > max_functional_jobs_to_schedule   200
> > report_pjob_tickets               TRUE
> > max_pending_tasks_per_job         50
> > halflife_decay_list               none
> > policy_hierarchy                  SOF
> > weight_ticket                     1.000000
> > weight_waiting_time               0.278000
> > weight_deadline                   3600000.000000
> > weight_urgency                    0.000000
> > weight_priority                   0.000000
> > max_reservation                   10
> > default_duration                  7200:00:00
> >
> > Here's the example:
> >
> > Job #34378 was submitted as:
> > qsub -pe smp 16 -R y -b y "R CMD BATCH --no-save tmp.R tmp.out"
> >
> > Soon after submitting #34378, we see that job #34378 is next in line:
> >
> > job-ID  prior    name        user      state  submit/start at      queue                        slots  ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> >  33004  0.11762  tophat.sh   seqc      r      04/24/2013 07:14:20  [email protected]  32
> >  33718  0.12405  fooSU_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33719  0.12405  fooSV_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33720  0.12405  fooWV_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33721  0.12405  fooWU_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33745  0.06583  toy.sh      yjhuoh    r      05/07/2013 22:29:28  [email protected]   1
> >  33758  0.06583  toy.sh      yjhuoh    r      05/07/2013 22:30:28  [email protected]   1
> >  33763  0.06583  toy.sh      yjhuoh    r      05/07/2013 22:33:58  [email protected]   1
> >  33787  0.06583  toy.sh      yjhuoh    r      05/08/2013 00:15:58  [email protected]   1
> >  33794  0.06583  toy.sh      yjhuoh    r      05/08/2013 01:45:58  [email protected]   1
> >  34183  0.00570  SubSampleF  isoform   r      05/09/2013 03:29:32  [email protected]   8
> >  34185  0.00570  SubSampleF  isoform   r      05/09/2013 04:27:47  [email protected]   8
> >  34186  0.00570  SubSampleF  isoform   r      05/09/2013 04:36:47  [email protected]   8
> >  34187  0.00570  SubSampleF  isoform   r      05/09/2013 05:05:02  [email protected]   8
> >  34188  0.00570  SubSampleF  isoform   r      05/09/2013 05:42:17  [email protected]   8
> >  34189  0.00570  SubSampleF  isoform   r      05/09/2013 06:12:47  [email protected]   8
> >  34190  0.00570  SubSampleF  isoform   r      05/09/2013 06:14:17  [email protected]   8
> >  34191  0.00570  SubSampleF  isoform   r      05/09/2013 07:07:32  [email protected]   8
> >  34192  0.00570  SubSampleF  isoform   r      05/09/2013 07:24:02  [email protected]   8
> >  34194  0.00570  SubSampleF  isoform   r      05/09/2013 07:37:17  [email protected]   8
> >  34378  1.00000  R CMD BATC  paciorek  qw     05/09/2013 08:14:31                              16
> >  34195  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34196  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34197  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34198  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34199  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34200  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34201  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34202  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34203  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34204  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34205  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34206  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34207  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34208  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34209  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34210  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >
> > A little while later, we see that jobs 34195-34198 have slipped ahead of
> > 34378:
> >
> > job-ID  prior    name        user      state  submit/start at      queue                        slots  ja-task-ID
> > -----------------------------------------------------------------------------------------------------------------
> >  33004  0.11790  tophat.sh   seqc      r      04/24/2013 07:14:20  [email protected]  32
> >  33718  0.12398  fooSU_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33719  0.12398  fooSV_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33720  0.12398  fooWV_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33721  0.12398  fooWU_long  lwtai     r      05/06/2013 17:01:58  [email protected]   1
> >  33745  0.08234  toy.sh      yjhuoh    r      05/07/2013 22:29:28  [email protected]   1
> >  33758  0.08234  toy.sh      yjhuoh    r      05/07/2013 22:30:28  [email protected]   1
> >  33763  0.08234  toy.sh      yjhuoh    r      05/07/2013 22:33:58  [email protected]   1
> >  33787  0.08234  toy.sh      yjhuoh    r      05/08/2013 00:15:58  [email protected]   1
> >  34188  0.00568  SubSampleF  isoform   r      05/09/2013 05:42:17  [email protected]   8
> >  34189  0.00568  SubSampleF  isoform   r      05/09/2013 06:12:47  [email protected]   8
> >  34190  0.00568  SubSampleF  isoform   r      05/09/2013 06:14:17  [email protected]   8
> >  34191  0.00568  SubSampleF  isoform   r      05/09/2013 07:07:32  [email protected]   8
> >  34192  0.00568  SubSampleF  isoform   r      05/09/2013 07:24:02  [email protected]   8
> >  34194  0.00568  SubSampleF  isoform   r      05/09/2013 07:37:17  [email protected]   8
> >  34195  0.00568  SubSampleF  isoform   r      05/09/2013 08:16:47  [email protected]   8
> >  34196  0.00568  SubSampleF  isoform   r      05/09/2013 08:47:32  [email protected]   8
> >  34197  0.00568  SubSampleF  isoform   r      05/09/2013 09:11:02  [email protected]   8
> >  34198  0.00568  SubSampleF  isoform   r      05/09/2013 09:16:32  [email protected]   8
> >  34378  1.00000  R CMD BATC  paciorek  qw     05/09/2013 08:14:31                              16
> >  34199  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34200  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34201  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34202  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34203  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34204  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34205  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34206  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34207  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34208  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34209  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34210  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:51                               8
> >  34211  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >  34212  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >  34213  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >  34214  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >  34215  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >  34216  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >  34217  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >  34218  0.00000  SubSampleF  isoform   qw     05/08/2013 19:30:52                               8
> >
> > The schedule file shows that there are RESERVING statements for #34378:
> > 34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000
> > 34378:1:RESERVING:1369228520:25920060:Q:[email protected]:slots:16.000000
> >
> > Perhaps the issue is that the reservation seems specific to the cluster
> > node "scf-sm02.Berkeley.EDU", and that specific node is occupied by a
> > long-running job (#33004). If so, is there any way to have the reservation
> > not tied to a node?
> >
> > -Chris
> >
> > ----------------------------------------------------------------------------------------------
> > Chris Paciorek
> >
> > Statistical Computing Consultant, Associate Research Statistician, Lecturer
> >
> > Office: 495 Evans Hall                 Email: [email protected]
> > Mailing Address:                       Voice: 510-842-6670
> > Department of Statistics               Fax: 510-642-7892
> > 367 Evans Hall                         Skype: cjpaciorek
> > University of California, Berkeley     WWW: www.stat.berkeley.edu/~paciorek
> > Berkeley, CA 94720 USA                 Permanent forward: [email protected]
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
