Hi all,

we're using:

  maui-3.2.6p19_20.snap.1182974819-4.slc3
  maui-server-3.2.6p19_20.snap.1182974819-4.slc3
  maui-client-3.2.6p19_20.snap.1182974819-4.slc3
  torque-client-2.1.9-4cri.slc3
  torque-server-2.1.9-4cri.slc3
  torque-2.1.9-4cri.slc3
Last week we noticed some strange behaviour in Maui, and we are now able to reproduce it:

1) We submit a job requesting a special resource, for example a node running Scientific Linux 3:

  $ qsub -q slc3 job.sh
  3445312.pbs01.pic.es

Our slc3 queue asks for that resource:

  # qmgr -c "p s" | grep slc3
  [...]
  set queue slc3 resources_default.neednodes = slc3
  [...]

2) We take offline the only node with that resource, so no WN will fit the job:

  # pbsnodes td248.pic.es
  td248.pic.es
       state = offline
       np = 10
       properties = slc3
       ntype = cluster
  [...]

3) Our job reaches the first position in the queue, and Maui sees that it cannot find a WN for it:

  [...]
  3440629     nsidro  Running  1  3:00:00:00  Thu Dec 13 12:54:10

  196 Active Jobs    196 of 249 Processors Active (78.71%)
                      57 of  62 Nodes Active      (91.94%)

  IDLE JOBS----------------------
  JOBNAME    USERNAME  STATE  PROC  WCLIMIT     QUEUETIME

  3445312  arnaubria   Idle   1     3:00:00:00  Thu Dec 13 12:30:43
  3440631     nsidro   Idle   1     3:00:00:00  Wed Dec 12 20:58:15
  [...]

  # checkjob 3445312
  checking job 3445312

  State: Idle
  Creds:  user:arnaubria  group:grid  class:slc3  qos:DEFAULT
  WallTime: 00:00:00 of 3:00:00:00
  SubmitTime: Thu Dec 13 12:30:43
    (Time Queued  Total: 00:25:29  Eligible: 00:24:28)

  Total Tasks: 1

  Req[0]  TaskCount: 1
  Partition: ALL
  Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
  Opsys: [NONE]  Arch: [NONE]  Features: [slc3]

  IWD: [NONE]  Executable:  [NONE]
  Bypass: 10  StartCount: 0
  PartitionMask: [ALL]
  Flags:       RESTARTABLE

  PE:  1.00  StartPriority:  1000001000
  SystemPriority: 1000
  job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
  idle procs: 151  feasible procs:   0

  Rejection Reasons: [Features : 64][State : 1]

4) Maui does not schedule any other job, so the farm drains empty. Here is checkjob for the second job in the queue:
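As a side note, the node-feasibility test behind the "Rejection Reasons: [Features : 64][State : 1]" line above can be sketched roughly as follows. This is only a simplified illustration in Python of how feature/state filtering leaves 0 feasible procs in our situation; it is not Maui's actual code, and all names are made up:

```python
# Hypothetical sketch of feature/state node filtering -- NOT Maui's
# real implementation, just an illustration of the rejection counts.

def feasible_nodes(nodes, required_features):
    """Return nodes that are free and advertise every required feature."""
    feasible = []
    rejections = {"Features": 0, "State": 0}
    for node in nodes:
        if not required_features.issubset(node["properties"]):
            rejections["Features"] += 1   # e.g. node lacks the 'slc3' property
            continue
        if node["state"] != "free":
            rejections["State"] += 1      # e.g. td248 marked 'offline'
            continue
        feasible.append(node["name"])
    return feasible, rejections

# Our situation: the only node with the slc3 property is offline.
nodes = [
    {"name": "td248.pic.es", "state": "offline", "properties": {"slc3"}},
    {"name": "td001.pic.es", "state": "free",    "properties": {"slc4"}},
]
print(feasible_nodes(nodes, {"slc3"}))
# -> ([], {'Features': 1, 'State': 1}): no feasible node at all
```

In this toy version, as in our checkjob output, every node is rejected on either Features or State, so the job cannot start anywhere; the surprising part is that Maui then stops scheduling the jobs behind it as well.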
  # checkjob 3440631
  checking job 3440631

  State: Idle
  Creds:  user:nsidro  group:magic  class:long  qos:lhmagic
  WallTime: 00:00:00 of 3:00:00:00
  SubmitTime: Wed Dec 12 20:58:15
    (Time Queued  Total: 15:57:57  Eligible: 15:51:32)

  Total Tasks: 1

  Req[0]  TaskCount: 1
  Partition: ALL
  Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
  Opsys: [NONE]  Arch: [NONE]  Features: [slc4]

  IWD: [NONE]  Executable:  [NONE]
  Bypass: 10  StartCount: 0
  PartitionMask: [ALL]
  Flags:       RESTARTABLE

  PE:  1.00  StartPriority:  89
  job can run in partition DEFAULT (32 procs available.  1 procs required)

Progress of jobs running/queued in our farm:

  224 2358
  222 2359
  215 2355
  213 2365
  194 2371
  ...

5) We bring the WN back online, and everything works fine again:

  # pbsnodes -c td248.pic.es

Immediately after that, jobs start running again:

  222 2337

Something similar happened when requesting hosts with "slc3 && slc4": no node fits that condition, and Maui hung as well.

So, is this a bug? Is anyone else seeing the same problem? Any workaround?

Cheers,
Arnau
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers