We're having a problem similar to that described in this thread: http://www.mentby.com/Group/grid-engine/62u4-resource-reservation-not-working-for-some-jobs.html
We're running Grid Engine 6.2u5 for a cluster of 4 Linux nodes (32 cores each) running Ubuntu 12.04 (Precise). We're seeing that jobs that request a reservation and are at the top of the queue are not starting, with lower-priority jobs that are requesting fewer cores slipping ahead of the higher priority job. An example of this is at the bottom of this posting.

Here's the results of "qconf -ssconf":

algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            MONITOR=1
reprioritize_interval             0:0:0
halftime                          720
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              100000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  SOF
weight_ticket                     1.000000
weight_waiting_time               0.278000
weight_deadline                   3600000.000000
weight_urgency                    0.000000
weight_priority                   0.000000
max_reservation                   10
default_duration                  7200:00:00

Here's the example:

Job #34378 was submitted as:

qsub -pe smp 16 -R y -b y "R CMD BATCH --no-save tmp.R tmp.out"

Soon after submitting #34378, we see that the job #34378 is next in line:

job-ID  prior   name       user         state submit/start at     queue             slots ja-task-ID
----------------------------------------------------------------------------------------------------
  33004 0.11762 tophat.sh  seqc         r     04/24/2013 07:14:20 [email protected]    32
  33718 0.12405 fooSU_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33719 0.12405 fooSV_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33720 0.12405 fooWV_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33721 0.12405 fooWU_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33745 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:29:28 [email protected]     1
  33758 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:30:28 [email protected]     1
  33763 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:33:58 [email protected]     1
  33787 0.06583 toy.sh     yjhuoh       r     05/08/2013 00:15:58 [email protected]     1
  33794 0.06583 toy.sh     yjhuoh       r     05/08/2013 01:45:58 [email protected]     1
  34183 0.00570 SubSampleF isoform      r     05/09/2013 03:29:32 [email protected]     8
  34185 0.00570 SubSampleF isoform      r     05/09/2013 04:27:47 [email protected]     8
  34186 0.00570 SubSampleF isoform      r     05/09/2013 04:36:47 [email protected]     8
  34187 0.00570 SubSampleF isoform      r     05/09/2013 05:05:02 [email protected]     8
  34188 0.00570 SubSampleF isoform      r     05/09/2013 05:42:17 [email protected]     8
  34189 0.00570 SubSampleF isoform      r     05/09/2013 06:12:47 [email protected]     8
  34190 0.00570 SubSampleF isoform      r     05/09/2013 06:14:17 [email protected]     8
  34191 0.00570 SubSampleF isoform      r     05/09/2013 07:07:32 [email protected]     8
  34192 0.00570 SubSampleF isoform      r     05/09/2013 07:24:02 [email protected]     8
  34194 0.00570 SubSampleF isoform      r     05/09/2013 07:37:17 [email protected]     8
  34378 1.00000 R CMD BATC paciorek     qw    05/09/2013 08:14:31                      16
  34195 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34196 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34197 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34198 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34199 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34200 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34201 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34202 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34203 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34204 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34205 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34206 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34207 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34208 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34209 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34210 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8

A little while later, we see that jobs 34195-34198 have slipped ahead of 34378:

job-ID  prior   name       user         state submit/start at     queue             slots ja-task-ID
----------------------------------------------------------------------------------------------------
  33004 0.11790 tophat.sh  seqc         r     04/24/2013 07:14:20 [email protected]    32
  33718 0.12398 fooSU_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33719 0.12398 fooSV_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33720 0.12398 fooWV_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33721 0.12398 fooWU_long lwtai        r     05/06/2013 17:01:58 [email protected]     1
  33745 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:29:28 [email protected]     1
  33758 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:30:28 [email protected]     1
  33763 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:33:58 [email protected]     1
  33787 0.08234 toy.sh     yjhuoh       r     05/08/2013 00:15:58 [email protected]     1
  34188 0.00568 SubSampleF isoform      r     05/09/2013 05:42:17 [email protected]     8
  34189 0.00568 SubSampleF isoform      r     05/09/2013 06:12:47 [email protected]     8
  34190 0.00568 SubSampleF isoform      r     05/09/2013 06:14:17 [email protected]     8
  34191 0.00568 SubSampleF isoform      r     05/09/2013 07:07:32 [email protected]     8
  34192 0.00568 SubSampleF isoform      r     05/09/2013 07:24:02 [email protected]     8
  34194 0.00568 SubSampleF isoform      r     05/09/2013 07:37:17 [email protected]     8
  34195 0.00568 SubSampleF isoform      r     05/09/2013 08:16:47 [email protected]     8
  34196 0.00568 SubSampleF isoform      r     05/09/2013 08:47:32 [email protected]     8
  34197 0.00568 SubSampleF isoform      r     05/09/2013 09:11:02 [email protected]     8
  34198 0.00568 SubSampleF isoform      r     05/09/2013 09:16:32 [email protected]     8
  34378 1.00000 R CMD BATC paciorek     qw    05/09/2013 08:14:31                      16
  34199 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34200 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34201 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34202 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34203 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34204 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34205 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34206 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34207 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34208 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34209 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34210 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51                       8
  34211 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8
  34212 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8
  34213 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8
  34214 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8
  34215 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8
  34216 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8
  34217 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8
  34218 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52                       8

The schedule file shows that there are RESERVING statements for #34378:

34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000
34378:1:RESERVING:1369228520:25920060:Q:[email protected]:slots:16.000000

Perhaps the issue is that the reservation seems specific to the cluster node "scf-sm02.Berkeley.EDU", and that specific node is occupied by a long-running job (#33004). If so, is there any way to have the reservation not tied to a node?

-Chris

----------------------------------------------------------------------------------------------
Chris Paciorek

Statistical Computing Consultant, Associate Research Statistician, Lecturer

Office: 495 Evans Hall                      Email: [email protected]
Mailing Address:                            Voice: 510-842-6670
Department of Statistics                    Fax: 510-642-7892
367 Evans Hall                              Skype: cjpaciorek
University of California, Berkeley          WWW: www.stat.berkeley.edu/~paciorek
Berkeley, CA 94720 USA                      Permanent forward: [email protected]
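
For anyone reading the RESERVING records from the schedule file quoted in this post, here is a minimal Python sketch that splits one record into named fields. It assumes the field order is job ID, task ID, state, start time (Unix epoch seconds), duration (seconds), level letter, object name, resource name, utilization, which is how the MONITOR output is described in sched_conf(5) as far as I understand it. The names ScheduleRecord, parse_schedule_line, start_utc and duration_days are made up for this illustration and are not part of Grid Engine.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ScheduleRecord(NamedTuple):
    """One record from the schedule file (field order is an assumption)."""
    job_id: int
    task_id: int
    state: str          # e.g. RESERVING
    start_time: int     # assumed: seconds since the Unix epoch
    duration: int       # assumed: seconds
    level: str          # single letter, e.g. P or Q
    object_name: str    # e.g. the PE name or the queue instance
    resource: str       # e.g. slots
    utilization: float


def parse_schedule_line(line: str) -> ScheduleRecord:
    # Split off the first six fields from the left, then take the last two
    # fields from the right, so the object name in the middle stays whole.
    head = line.strip().split(":", 6)
    if len(head) != 7:
        raise ValueError(f"unexpected schedule record: {line!r}")
    tail = head[6].rsplit(":", 2)
    if len(tail) != 3:
        raise ValueError(f"unexpected schedule record: {line!r}")
    object_name, resource, utilization = (part.strip() for part in tail)
    return ScheduleRecord(
        job_id=int(head[0]),
        task_id=int(head[1]),
        state=head[2],
        start_time=int(head[3]),
        duration=int(head[4]),
        level=head[5],
        object_name=object_name,
        resource=resource,
        utilization=float(utilization),
    )


def start_utc(rec: ScheduleRecord) -> datetime:
    # Interpret the start time as epoch seconds and report it in UTC.
    return datetime.fromtimestamp(rec.start_time, tz=timezone.utc)


def duration_days(rec: ScheduleRecord) -> float:
    return rec.duration / 86400


if __name__ == "__main__":
    rec = parse_schedule_line(
        "34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000"
    )
    print(rec.state, rec.level, rec.object_name, rec.utilization)
    print(start_utc(rec).isoformat(), duration_days(rec))
```

Under that reading, the duration of 25920060 seconds in the records above is 300 days plus 60 seconds, and 300 days is the same length as the default_duration of 7200:00:00 in the scheduler configuration.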
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
