Hi, this is a recurring issue for us. Maui fails to schedule all eligible jobs, and the point at which it stops is not fixed: the number of running jobs peaks sometimes at 3200, sometimes at 3700, 3900 or 4100, with no correlation we have found yet. A typical situation:
[root@torque-v-1 ~]# qstat -q

server: torque-v-1.local

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
test               --   01:00:00 02:00:00   --     0   0 --   E R
long               --   48:00:00 72:00:00   --  3249 756 --   E R
short              --   01:00:00 02:00:00   --     0   0 --   E R
                                                ----- -----
                                                 3249   756

[root@torque-v-1 ~]# diagnose -t
DEFAULT  [test 4122:4122]

[root@torque-v-1 ~]# pbsnodes -l free|wc -l
120

So as you can see there are more free cores than queued jobs. All our jobs are single-core jobs with no requirements that would prevent them from running (only the defaults are used; the Grid does not attach job requirements that could conflict). The main reason seems to be this:

01/10 14:36:20 MPBSWorkloadQuery(base,JCount,SC)
01/10 14:36:20 INFO: job '2081730' changed states from Running to Hold
01/10 14:36:20 INFO: job '2081809' changed states from Running to Hold
01/10 14:36:20 INFO: job '2081810' changed states from Running to Hold
01/10 14:36:29 INFO: 3916 PBS jobs detected on RM base
01/10 14:36:29 INFO: jobs detected: 3916
01/10 14:36:30 INFO: total jobs selected (ALL): 647/3916 [State: 3269]
01/10 14:36:30 INFO: total jobs selected (ALL): 647/3916 [State: 3269]
01/10 14:36:30 INFO: total jobs selected in partition ALL: 647/647
01/10 14:36:30 INFO: total jobs selected in partition ALL: 647/647
01/10 14:36:30 INFO: total jobs selected in partition DEFAULT: 647/647
01/10 14:36:30 MRMJobStart(2081811,Msg,SC)
01/10 14:36:30 MPBSJobStart(2081811,base,Msg,SC)
01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,wn-v-2936.local)
01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,1)
01/10 14:36:30 INFO: job '2081811' successfully started
01/10 14:36:30 MRMJobStart(2081735,Msg,SC)
01/10 14:36:30 MPBSJobStart(2081735,base,Msg,SC)
01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,wn-v-4556.local)
01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,1)
01/10 14:36:30 INFO: job '2081735' successfully started
01/10 14:36:30 ERROR: cannot create reservation for job '2081735'
01/10 14:36:30 ERROR: cannot start job '2081735' in partition DEFAULT
01/10 14:36:30 MJobPReserve(2081735,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
01/10 14:36:30 MJobPReserve(2081736,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
01/10 14:36:30 MJobPReserve(2081815,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
01/10 14:36:30 MJobPReserve(2081738,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
...

The "cannot create reservation" message then repeats hundreds of times, after which the whole scheduling pass restarts on the next cycle. As you can see, Maui did manage to start two jobs, but I assume those merely refilled slots vacated by recently finished jobs. We have not been able to figure out what causes this, and any ideas on how to debug it would be welcome. If we force a job to run, it runs fine; Maui itself just won't start it. As mentioned, the level at which the scheduler gets into this state varies -- we have even seen it almost fill the whole cluster once.

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons,
   but that's not why we do it"
      -- Richard P. Feynman
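
P.S. For concreteness, this is how we force one of the stuck jobs through and what we look at afterwards (the job ID is just an example taken from the log above; runjob and checkjob are the stock Maui client commands):

[root@torque-v-1 ~]# runjob 2081735        # force-start the job: this always works
[root@torque-v-1 ~]# checkjob -v 2081735   # per-job view of what blocks it when Maui won't start it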
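
P.P.S. Our next step, unless someone has a better idea, is to raise the log verbosity around MJobReserve and dump the reservation state during a failing cycle. A rough sketch (LOGLEVEL is a standard maui.cfg parameter; 7 is only a guess at a useful level, and the config path depends on the install):

# in maui.cfg, then restart maui to pick it up
LOGLEVEL        7

[root@torque-v-1 ~]# diagnose -r | head    # reservation diagnostics for the current cycle
[root@torque-v-1 ~]# showres | wc -l       # rough count of reservations Maui is tracking (includes header lines)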