Hi,

this is a recurring issue for us: Maui is unable to schedule all of the jobs, 
and the level at which it stops is not fixed, it varies. Sometimes the number 
of running jobs peaks at 3200, sometimes at 3700, 3900 or 4100; we have found 
no correlation yet. A typical situation:

[root@torque-v-1 ~]# qstat -q

server: torque-v-1.local

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
test               --   01:00:00 02:00:00   --    0   0 --   E R
long               --   48:00:00 72:00:00   --  3249 756 --   E R
short              --   01:00:00 02:00:00   --    0   0 --   E R
                                               ----- -----
                                                3249   756

[root@torque-v-1 ~]# diagnose -t
     DEFAULT [test 4122:4122]

[root@torque-v-1 ~]# pbsnodes -l free|wc -l
120

So as you can see there are more free cores than queued jobs: if I read the 
diagnose -t output right, the partition has 4122 processors in total, so with 
3249 jobs running there should be roughly 870 free slots against only 756 
queued jobs. All our jobs are single-core jobs with no requirements that would 
prevent them from starting (only the defaults are used, and the Grid middleware 
does not add any conflicting job requirements). A rough way to count the free 
slots is sketched below.
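
For reference, a rough way to count the free processor slots on the Torque side 
is something like the following. This is only a sketch: it assumes the usual 
pbsnodes output layout ("state = ..." listed before "np = ..." for each node) 
and it does not subtract slots already in use on partially filled nodes, so it 
is an upper bound rather than an exact count.

[root@torque-v-1 ~]# pbsnodes -a | awk '
  /^ +state = / { isfree = ($3 == "free") }   # remember whether this node is free
  /^ +np = /    { if (isfree) slots += $3 }   # add its configured processor count
  END           { print slots, "processor slots on free nodes" }'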

The main reason seems to be this:

01/10 14:36:20 MPBSWorkloadQuery(base,JCount,SC)
01/10 14:36:20 INFO:     job '2081730' changed states from Running to Hold
01/10 14:36:20 INFO:     job '2081809' changed states from Running to Hold
01/10 14:36:20 INFO:     job '2081810' changed states from Running to Hold
01/10 14:36:29 INFO:     3916 PBS jobs detected on RM base
01/10 14:36:29 INFO:     jobs detected: 3916
01/10 14:36:30 INFO:     total jobs selected (ALL): 647/3916 [State: 3269]
01/10 14:36:30 INFO:     total jobs selected (ALL): 647/3916 [State: 3269]
01/10 14:36:30 INFO:     total jobs selected in partition ALL: 647/647 
01/10 14:36:30 INFO:     total jobs selected in partition ALL: 647/647 
01/10 14:36:30 INFO:     total jobs selected in partition DEFAULT: 647/647 
01/10 14:36:30 MRMJobStart(2081811,Msg,SC)
01/10 14:36:30 MPBSJobStart(2081811,base,Msg,SC)
01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,wn-v-2936.local)
01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,1)
01/10 14:36:30 INFO:     job '2081811' successfully started
01/10 14:36:30 MRMJobStart(2081735,Msg,SC)
01/10 14:36:30 MPBSJobStart(2081735,base,Msg,SC)
01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,wn-v-4556.local)
01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,1)
01/10 14:36:30 INFO:     job '2081735' successfully started
01/10 14:36:30 ERROR:    cannot create reservation for job '2081735'
01/10 14:36:30 ERROR:    cannot start job '2081735' in partition DEFAULT
01/10 14:36:30 MJobPReserve(2081735,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
01/10 14:36:30 MJobPReserve(2081736,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
01/10 14:36:30 MJobPReserve(2081815,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
01/10 14:36:30 MJobPReserve(2081738,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
...

This "cannot create reservation" message then repeats hundreds of times, after 
which the whole scheduling cycle starts over. As you can see Maui was able to 
start two jobs, but I assume those only filled the slots freed by jobs that had 
recently finished. We have not been able to figure out what causes this, so any 
ideas on how to debug it would be welcome. If we force a job to run it runs 
fine, but Maui itself won't start the jobs. As I mentioned, the level at which 
it gets into this state varies; once we even saw it almost fill the whole 
cluster.
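
For completeness: by "force" I mean something like qrun on the Torque side, 
using one of the queued job ids from the log above (e.g. 2081736) purely as an 
example:

[root@torque-v-1 ~]# qrun 2081736

and on the Maui side the obvious things to inspect would seem to be the job 
itself and the reservation table (if I read the man pages right, diagnose -r 
and showres cover the reservations):

[root@torque-v-1 ~]# checkjob -v 2081736
[root@torque-v-1 ~]# diagnose -r
[root@torque-v-1 ~]# showres

If anyone can say what to look for in that output, or suggest better 
diagnostics, that would already help a lot.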

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
