Hi, everyone,

I'm working on testing an upgrade from our current torque-2.5.7 + maui-3.3
to torque-4.2.2 + maui-3.3.1 and I'm running into an issue which does not
seem to be mentioned in the mailing list archives.

Background:
1 torque- and maui-server node: RHEL6 with all latest updates, 64-bit
4 compute nodes: RHEL6 with all latest updates, 64-bit
SELinux permissive
iptables off

Torque compiled with cpuset and munge support.

Whenever a job is killed because it overran its cput request, regardless of
how many nodes or ppn it asked for, Maui gets very confused.
diagnose -n gives:

WARNING:  node 'compute1' has more processors utilized than dedicated (6 > 0)
WARNING:  node 'compute1' state 'Running' does not match expected state 'Idle'.  sync deadline in 00:03:54 at Fri May 10 16:02:22
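
For anyone who wants to watch for this state automatically, here is a rough sketch in Python that pulls the node name and the two processor counts out of the first warning. The function name and the regular expression are illustrative assumptions, not part of Maui; the pattern is based only on the warning text quoted above and tolerates the line wrap inside the parentheses.

```python
import re

# Matches the "more processors utilized than dedicated" warning.
# \s+ / \s* are used because the warning may be wrapped across lines.
UTILIZED_RE = re.compile(
    r"WARNING:\s+node\s+'(?P<node>[^']+)'\s+has\s+more\s+processors\s+"
    r"utilized\s+than\s+dedicated\s+"
    r"\((?P<utilized>\d+)\s*>\s*(?P<dedicated>\d+)\)"
)


def find_overutilized_nodes(text):
    """Return (node, utilized, dedicated) tuples found in diagnose -n output."""
    return [
        (m.group("node"), int(m.group("utilized")), int(m.group("dedicated")))
        for m in UTILIZED_RE.finditer(text)
    ]


# Sample text copied from the warning above, including the wrapped line.
sample = """WARNING:  node 'compute1' has more processors utilized than dedicated (6 >
0)
"""

print(find_overutilized_nodes(sample))
```

The output of diagnose -n could be piped into such a script from cron; how to act on a hit (for example, restarting Maui) is left open.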

In my test, I've even left things alone (no jobs submitted) overnight, but
the situation doesn't clear up.

Jobs which end successfully do not cause this problem. Jobs which are
qdel'ed do cause this problem.

pbs_server sees the right thing: all nodes show up as "free" using qstat
and pbsnodes. Looking at the pbs_server and pbs_mom logs, things seem as
they should be: "obit sent to server", "removed job script", "scan for
terminated ... task N terminated".

The Maui log does show that the job was destroyed:
  MJobRemove(51)
  MResDestroy(51)
  MResChargeAllocation(51,2)
  INFO:     node 'compute1' released from reservation
  INFO:     node 'compute1' released from reservation
  INFO:     2 nodes/12 tasks released from reservation
  MJobFind('51',J,O)
  INFO:     Job '51' reservation released (tasks requested: 12)
  MResAdjustDRes(51,TRUE)
  MJobRemoveHash(51)
  MJobDestroy(51)

But later, it fails to get some node info:

  MSchedProcessJobs()
  MRMGetInfo()
  MClusterClearUsage()
  MRMClusterQuery()
  MPBSClusterQuery(0,RCount,SC)
  ERROR:    cannot get node info: End of File
  ALERT:    cannot load cluster resources on RM (RM '0' failed in function 'clusterquery')
  WARNING:  no resources detected

Before the job was killed, there were other calls to MPBSClusterQuery()
that succeeded:

  05/10 15:49:56 MPBSClusterQuery(0,RCount,SC)
  05/10 15:49:56 __MPBSGetNodeState(Name,State,PNode)
  05/10 15:49:56 INFO:     PBS node compute1 set to state Idle (free)
  05/10 15:49:56 MNodeFind(compute1,N)
  05/10 15:49:56 MNodeAdd(compute1,N)
  05/10 15:49:56 MNodeFind(compute1,N)
  05/10 15:49:56 MRMNodePreLoad(compute1,Idle,0)
  05/10 15:49:56 MPBSNodeLoad(compute1,compute1,Idle,0)
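
The contrast between the succeeding and failing queries can be checked mechanically across a whole Maui log. The following is a minimal sketch in Python, under the assumption that a MPBSClusterQuery call has failed if a "cannot get node info" error is logged before the next such call, and has succeeded otherwise. The function name is hypothetical, and the matching relies only on the strings visible in the excerpts above.

```python
def classify_cluster_queries(lines):
    """Count MPBSClusterQuery calls as (succeeded, failed).

    Assumption: a call failed if "cannot get node info" is logged before
    the next MPBSClusterQuery call; otherwise it is counted as succeeded.
    """
    succeeded = 0
    failed = 0
    pending = False  # True while a call has been seen but not yet classified
    for line in lines:
        if "MPBSClusterQuery(" in line:
            if pending:
                succeeded += 1  # previous call ended without an error
            pending = True
        elif pending and "cannot get node info" in line:
            failed += 1
            pending = False
    if pending:
        succeeded += 1  # last call in the log had no error after it
    return succeeded, failed


# Lines copied from the two excerpts above.
working = [
    "  05/10 15:49:56 MPBSClusterQuery(0,RCount,SC)",
    "  05/10 15:49:56 INFO:     PBS node compute1 set to state Idle (free)",
]
failing = [
    "  MPBSClusterQuery(0,RCount,SC)",
    "  ERROR:    cannot get node info: End of File",
]

print(classify_cluster_queries(working + failing))
```

Run against the full log, the position of the first failure relative to the MJobDestroy() entry would show whether every query after the kill fails or only some of them.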


Talking to the pbs_server works, and it sees that there are no jobs:

  05/10 16:36:48 INFO:     opened service socket on port 15004
  05/10 16:36:48 __MPBSSystemQuery(0,RCount,SC)
  05/10 16:36:48 INFO:     connected to PBS server servernode:0 on sd 1
  05/10 16:36:48 INFO:     queue is empty
  05/10 16:36:48 INFO:     0 PBS jobs detected on RM 0
  05/10 16:36:48 WARNING:  no workload detected
  05/10 16:36:48 MStatClearUsage(node,Active)
  05/10 16:36:48 MClusterUpdateNodeState()
  05/10 16:36:48 MParUpdate(ALL)
  05/10 16:36:48 INFO:     P[ALL]:  Total 4:24  Up 4:24  Idle 2:12  Active 2:0
  05/10 16:36:48 INFO:     MNode[compute1] added to MPar[DEFAULT] (0:6)
  05/10 16:36:48 INFO:     MNode[compute2] added to MPar[DEFAULT] (0:6)
  05/10 16:36:48 INFO:     MNode[compute3] added to MPar[DEFAULT] (6:6)
  05/10 16:36:48 INFO:     MNode[compute4] added to MPar[DEFAULT] (6:6)
  05/10 16:36:48 INFO:     P[ALL]:  Total 4:24  Up 4:24  Idle 2:12  Active 2:0
  05/10 16:36:48 INFO:     jobs in queue
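
To pick out which nodes Maui still treats as busy, the MNode lines can be filtered on the pair in parentheses. This is only a sketch in Python: the meaning of the pair is not documented here, and reading the first number as the count of idle processors is an assumption based on the "Idle 2:12" summary line. The function and field names are made up for illustration.

```python
import re

# Matches e.g. "MNode[compute1] added to MPar[DEFAULT] (0:6)".
MNODE_RE = re.compile(
    r"MNode\[(?P<node>[^\]]+)\] added to MPar\[(?P<par>[^\]]+)\] "
    r"\((?P<first>\d+):(?P<second>\d+)\)"
)


def nodes_with_zero_first_count(lines):
    """Return node names whose first number in the trailing pair is 0.

    Assumption: the first number is the node's idle processor count.
    """
    hits = []
    for line in lines:
        m = MNODE_RE.search(line)
        if m and int(m.group("first")) == 0:
            hits.append(m.group("node"))
    return hits


# Lines copied from the log excerpt above.
log = [
    "05/10 16:36:48 INFO:     MNode[compute1] added to MPar[DEFAULT] (0:6)",
    "05/10 16:36:48 INFO:     MNode[compute2] added to MPar[DEFAULT] (0:6)",
    "05/10 16:36:48 INFO:     MNode[compute3] added to MPar[DEFAULT] (6:6)",
    "05/10 16:36:48 INFO:     MNode[compute4] added to MPar[DEFAULT] (6:6)",
]

print(nodes_with_zero_first_count(log))
```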


So it looks like once a job has been killed by pbs_mom, Maui can no longer
get the node state. I don't understand where this communication is failing;
the same query succeeds after a job completes normally.

Restarting Maui fixes things.

Have I missed something in my configuration, maybe? Or does this indicate
a bug in either Torque or Maui?

Thanks in advance,
   Dave

--
David Chin, Ph.D.
chi...@wfu.edu                  High Performance Computing Systems Analyst
Office: +1.336.758.2964         Wake Forest University
Mobile: +1.336.608.0793         Winston-Salem, NC
Email-to-txt: 3366080...@mms.att.net           Google Talk: chi...@wfu.edu
Web: http://users.wfu.edu/chindw/  http://linuxfollies.blogspot.com/
     https://plus.google.com/108169173177119739731/about
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers