Hi, everyone, I'm working on testing an upgrade from our current torque-2.5.7 + maui-3.3 to torque-4.2.2 + maui-3.3.1 and I'm running into an issue which does not seem to be mentioned in the mailing list archives.

Background:

  1 torque- and maui-server node: RHEL6 with all latest updates, 64-bit
  4 compute nodes: RHEL6 with all latest updates, 64-bit
  SELinux permissive
  iptables off
  Torque compiled with cpuset and munge support.

When I submit a job, doesn't matter how many nodes or ppn, and the job is killed because it overran its cput request, Maui gets very confused. diagnose -n gives:

  WARNING: node 'compute1' has more processors utilized than dedicated (6 > 0)
  WARNING: node 'compute1' state 'Running' does not match expected state 'Idle'. sync deadline in 00:03:54 at Fri May 10 16:02:22

In my test, I've even left things alone (no jobs submitted) overnight, but the situation doesn't clear up.

Jobs which end successfully do not cause this problem. Jobs which are qdel'ed do cause this problem.

pbs_server sees the right thing: all nodes show up as "free" using qstat and pbsnodes.

Looking at the pbs_server and pbs_mom logs, things seem as they should be: "obit sent to server", "removed job script", "scan for terminated ... task N terminated".
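
For anyone who wants to check for the same symptom automatically, here is a minimal Python sketch. It is not part of my setup: the function name find_stale_nodes and both regular expressions are my own assumptions, modelled only on the two WARNING lines from diagnose -n quoted above, so other Maui versions may word these messages differently.

```python
import re

# Patterns modelled on the two WARNING lines quoted from `diagnose -n`.
# Assumption: other Maui versions may phrase these warnings differently.
UTILIZED_RE = re.compile(
    r"WARNING:\s+node '([^']+)' has more processors utilized "
    r"than dedicated \((\d+) > (\d+)\)"
)
STATE_RE = re.compile(
    r"WARNING:\s+node '([^']+)' state '([^']+)' "
    r"does not match expected state '([^']+)'"
)


def find_stale_nodes(diagnose_output):
    """Return {node: {...}} for nodes that Maui flags as inconsistent.

    "procs" holds (utilized, dedicated); "state" holds (actual, expected).
    """
    stale = {}
    for line in diagnose_output.splitlines():
        m = UTILIZED_RE.search(line)
        if m:
            stale.setdefault(m.group(1), {})["procs"] = (
                int(m.group(2)),
                int(m.group(3)),
            )
            continue
        m = STATE_RE.search(line)
        if m:
            stale.setdefault(m.group(1), {})["state"] = (m.group(2), m.group(3))
    return stale


# The warnings from this post, used as sample input.
SAMPLE = (
    "WARNING: node 'compute1' has more processors utilized "
    "than dedicated (6 > 0)\n"
    "WARNING: node 'compute1' state 'Running' does not match "
    "expected state 'Idle'. sync deadline in 00:03:54 at Fri May 10 16:02:22\n"
)

result = find_stale_nodes(SAMPLE)
```

In practice the text passed to find_stale_nodes would be the captured output of diagnose -n; any node it reports while pbsnodes shows that node as "free" is in the inconsistent state described here.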

Maui log does show that the job was destroyed:

  MJobRemove(51)
  MResDestroy(51)
  MResChargeAllocation(51,2)
  INFO: node 'compute1' released from reservation
  INFO: node 'compute1' released from reservation
  INFO: 2 nodes/12 tasks released from reservation
  MJobFind('51',J,O)
  INFO: Job '51' reservation released (tasks requested: 12)
  MResAdjustDRes(51,TRUE)
  MJobRemoveHash(51)
  MJobDestroy(51)

But later, it fails to get some node info:

  MSchedProcessJobs()
  MRMGetInfo()
  MClusterClearUsage()
  MRMClusterQuery()
  MPBSClusterQuery(0,RCount,SC)
  ERROR: cannot get node info: End of File
  ALERT: cannot load cluster resources on RM (RM '0' failed in function 'clusterquery')
  WARNING: no resources detected

Before the job was killed, there were other calls to MPBSClusterQuery() that succeeded:

  05/10 15:49:56 MPBSClusterQuery(0,RCount,SC)
  05/10 15:49:56 __MPBSGetNodeState(Name,State,PNode)
  05/10 15:49:56 INFO: PBS node compute1 set to state Idle (free)
  05/10 15:49:56 MNodeFind(compute1,N)
  05/10 15:49:56 MNodeAdd(compute1,N)
  05/10 15:49:56 MNodeFind(compute1,N)
  05/10 15:49:56 MRMNodePreLoad(compute1,Idle,0)
  05/10 15:49:56 MPBSNodeLoad(compute1,compute1,Idle,0)

Talking to the pbs_server works, and it sees that there are no jobs:

  05/10 16:36:48 INFO: opened service socket on port 15004
  05/10 16:36:48 __MPBSSystemQuery(0,RCount,SC)
  05/10 16:36:48 INFO: connected to PBS server servernode:0 on sd 1
  05/10 16:36:48 INFO: queue is empty
  05/10 16:36:48 INFO: 0 PBS jobs detected on RM 0
  05/10 16:36:48 WARNING: no workload detected
  05/10 16:36:48 MStatClearUsage(node,Active)
  05/10 16:36:48 MClusterUpdateNodeState()
  05/10 16:36:48 MParUpdate(ALL)
  05/10 16:36:48 INFO: P[ALL]: Total 4:24 Up 4:24 Idle 2:12 Active 2:0
  05/10 16:36:48 INFO: MNode[compute1] added to MPar[DEFAULT] (0:6)
  05/10 16:36:48 INFO: MNode[compute2] added to MPar[DEFAULT] (0:6)
  05/10 16:36:48 INFO: MNode[compute3] added to MPar[DEFAULT] (6:6)
  05/10 16:36:48 INFO: MNode[compute4] added to MPar[DEFAULT] (6:6)
  05/10 16:36:48 INFO: P[ALL]: Total 4:24 Up 4:24 Idle 2:12 Active 2:0
  05/10 16:36:48 INFO: jobs in queue

So, it looks like after a job is killed by pbs_mom, Maui cannot then get the node state. I'm not sure I understand where this communication is failing. It succeeds when a job completes successfully. Restarting Maui fixes things.

Have I missed something in my configuration, maybe? Or does this indicate an error in either torque or maui?

Thanks in advance,
Dave

--
David Chin, Ph.D.                            chi...@wfu.edu
High Performance Computing Systems Analyst   Office: +1.336.758.2964
Wake Forest University                       Mobile: +1.336.608.0793
Winston-Salem, NC                            Email-to-txt: 3366080...@mms.att.net
Google Talk: chi...@wfu.edu
Web: http://users.wfu.edu/chindw/
     http://linuxfollies.blogspot.com/
     https://plus.google.com/108169173177119739731/about
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers