Which distro are you running OSCAR on? Have you patched the nodes (or the login node) recently? I am trying to think of reasons, besides cosmic rays, why it would have worked yesterday but not today.
You might also want to try posting this to the maui or torque lists. They might have a better idea of where to start looking.
Since we are planning on releasing 5.0 in a week or two, you could of course upgrade to that and see if subsequent maui or torque patches (or a clean install) fix the problem, but that's a fairly invasive solution.
Hello all!
I've got an OSCAR 4.2.1 cluster running on four quad-Opteron compute nodes.
The head node has the same hardware configuration. Interactive jobs were running fine until earlier today. There were no changes made to the configuration of TORQUE or MAUI.
Suddenly I'm getting the following from qsub when I try to submit an interactive job from a login node:
qsub: waiting for job 1106.brahe000.cluster to start
qsub: job 1106.brahe000.cluster apparently deleted
The login node *is* in the hosts.equiv file on the head node.
Interactive jobs do work from the head node itself.
The following is in the PBS server logs, but I've been seeing this all along:
11/08/2006 17:12:21;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to connect from 172.24.254.1:1023 (address not trusted)
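To see which addresses the server is rejecting, and how often, a small script along these lines could tally them. This is only a sketch: the regular expression is inferred from the single log line above rather than from the TORQUE documentation, and the function name `untrusted_sources` is made up for illustration.

```python
import re

# Pattern inferred from the one sample line in this thread (an assumption,
# not a documented TORQUE log format).
UNTRUSTED = re.compile(
    r"bad attempt to connect from (?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+) "
    r"\(address not trusted\)"
)


def untrusted_sources(lines):
    """Return a dict mapping each rejected IP address to its rejection count."""
    counts = {}
    for line in lines:
        match = UNTRUSTED.search(line)
        if match:
            ip = match.group("ip")
            counts[ip] = counts.get(ip, 0) + 1
    return counts


# The log line quoted above, used as sample input.
sample = [
    "11/08/2006 17:12:21;0001;PBS_Server;Svr;PBS_Server;is_request, "
    "bad attempt to connect from 172.24.254.1:1023 (address not trusted)",
]

if __name__ == "__main__":
    print(untrusted_sources(sample))
```

If the rejected address turns out to be the login node, that may be worth following up; if it belongs to some other host, the message is probably unrelated to the interactive-job failure, which would fit with it having appeared all along.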
And from maui's logs:
11/08 17:14:17 MReqCreate(1107,SrcRQ,DstRQ,DoCreate)
11/08 17:14:17 INFO: processing node request line '1:ppn=1'
11/08 17:14:17 INFO: job '1107' loaded: 1 apreece nweng 14400 Idle 0 1163031254 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1163031257
11/08 17:14:17 INFO: 21 PBS jobs detected on RM base
11/08 17:14:17 INFO: jobs detected: 21
11/08 17:14:17 INFO: total jobs selected (ALL): 6/21 [State: 15]
11/08 17:14:17 INFO: total jobs selected (ALL): 1/21 [State: 15][Policy: 5]
11/08 17:14:17 INFO: total jobs selected in partition ALL: 1/6 [Policy: 5]
11/08 17:14:17 MQueueScheduleRJobs(Q)
11/08 17:14:17 INFO: total jobs selected in partition ALL: 1/1
11/08 17:14:17 INFO: total jobs selected in partition ALL: 0/1 [PartitionAccess: 1]
11/08 17:14:17 INFO: total jobs selected in partition opteron: 1/1
11/08 17:14:17 MQueueScheduleIJobs(Q,opteron)
11/08 17:14:17 INFO: 16 feasible tasks found for job 1107:0 in partition opteron (1 Needed)
11/08 17:14:17 INFO: tasks located for job 1107: 1 of 1 required (1 feasible)
11/08 17:14:17 MJobStart(1107)
11/08 17:14:17 MRMJobStart(1107,Msg,SC)
11/08 17:14:17 MPBSJobStart(1107,base,Msg,SC)
11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,brahe003.cluster)
11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,1:ppn=1)
11/08 17:14:17 INFO: job '1107' successfully started
11/08 17:14:17 INFO: starting job '1107'
11/08 17:14:17 INFO: 1 jobs started on iteration 293
Active Jobs------
------------------
11/08 17:14:17 INFO: resources available after scheduling: N: 0 P: 3
11/08 17:14:17 INFO: total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]
11/08 17:14:17 INFO: total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]
11/08 17:14:17 MSchedUpdateStats()
11/08 17:14:17 INFO: iteration: 293 scheduling time: 0.117 seconds
11/08 17:14:17 INFO: current util[293]: 4/4 (100.00%) PH: 67.69% active jobs: 16 of 22 (completed: 9)
11/08 17:14:17 INFO: scheduling complete. sleeping 10 seconds
11/08 17:14:17 INFO: received service request from host 'brahe000.cluster'
11/08 17:14:17 MSURecvData(S,5000000,TRUE,SC,EMsg)
11/08 17:14:17 UIQueueShow(RBuffer,Buffer,1,root,BufSize)
11/08 17:14:17 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)
11/08 17:14:17 INFO: UIQueueShowAllJobs buffer size: 1686 bytes
11/08 17:14:17 MSUSendData(S,5000000,TRUE,TRUE)
11/08 17:14:17 INFO: packet sent (1772 bytes of 1772)
11/08 17:14:17 MSUDisconnect(S)
11/08 17:14:22 INFO: received service request from host 'brahe000.cluster'
11/08 17:14:22 MSURecvData(S,5000000,TRUE,SC,EMsg)
11/08 17:14:22 UIQueueShow(RBuffer,Buffer,1,root,BufSize)
11/08 17:14:22 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)
11/08 17:14:22 INFO: UIQueueShowAllJobs buffer size: 1686 bytes
11/08 17:14:22 MSUSendData(S,5000000,TRUE,TRUE)
11/08 17:14:22 INFO: packet sent (1772 bytes of 1772)
11/08 17:14:22 MSUDisconnect(S)
11/08 17:14:28 ServerProcessRequests()
11/08 17:14:28 MResAdjust(NULL,0,0)
11/08 17:14:28 MStatInitializeActiveSysUsage()
11/08 17:14:28 INFO: starting iteration 294
11/08 17:14:28 MRMGetInfo()
11/08 17:14:28 MRMClusterQuery()
11/08 17:14:28 MPBSClusterQuery(base,RCount,SC)
11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
11/08 17:14:28 INFO: PBS node brahe001.cluster set to state Busy (job-exclusive)
11/08 17:14:28 MPBSNodeUpdate(brahe001.cluster,brahe001.cluster,Busy,base)
11/08 17:14:28 MPBSLoadQueueInfo(base,brahe001.cluster,SC)
11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
11/08 17:14:28 INFO: PBS node brahe002.cluster set to state Busy (job-exclusive)
11/08 17:14:28 MPBSNodeUpdate(brahe002.cluster,brahe002.cluster,Busy,base)
11/08 17:14:28 MPBSLoadQueueInfo(base,brahe002.cluster,SC)
11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
11/08 17:14:28 INFO: PBS node brahe003.cluster set to state Idle (free)
11/08 17:14:28 INFO: node 'brahe003.cluster' changed states from Running to Idle
11/08 17:14:28 MPBSNodeUpdate(brahe003.cluster,brahe003.cluster,Idle,base)
11/08 17:14:28 MPBSLoadQueueInfo(base,brahe003.cluster,SC)
11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
11/08 17:14:28 INFO: PBS node brahe004.cluster set to state Busy (job-exclusive)
11/08 17:14:28 MPBSNodeUpdate(brahe004.cluster,brahe004.cluster,Busy,base)
11/08 17:14:28 MPBSLoadQueueInfo(base,brahe004.cluster,SC)
11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
11/08 17:14:28 INFO: 6 PBS resources detected on RM base
11/08 17:14:28 INFO: resources detected: 6
11/08 17:14:28 MRMWorkloadQuery()
11/08 17:14:28 MPBSWorkloadQuery(base,JCount,SC)
11/08 17:14:28 MPBSJobUpdate(1084,1084.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1085,1085.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1086,1086.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1087,1087.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1088,1088.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1089,1089.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1090,1090.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1091,1091.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1092,1092.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1093,1093.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1094,1094.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1095,1095.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1097,1097.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1098,1098.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1099,1099.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1100,1100.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1101,1101.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1102,1102.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1103,1103.brahe000.cluster,TaskList,0)
11/08 17:14:28 MPBSJobUpdate(1104,1104.brahe000.cluster,TaskList,0)
11/08 17:14:28 INFO: active PBS job 1107 has been removed from the queue. assuming successful completion
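The excerpt shows job 1107 being started at 17:14:17 and already gone from the queue by the next iteration at 17:14:28. To spot other jobs that vanish this quickly, the two messages can be paired up per job ID. The following is a rough sketch: the patterns are taken from the lines quoted above, the function name `job_lifetimes` is hypothetical, and because maui timestamps carry no year a placeholder year is assumed for the date arithmetic.

```python
import re
from datetime import datetime

# Patterns copied from the maui log lines in this thread (assumed to be stable).
STARTED = re.compile(
    r"^(\d\d/\d\d \d\d:\d\d:\d\d) INFO:\s+job '(\d+)' successfully started"
)
REMOVED = re.compile(
    r"^(\d\d/\d\d \d\d:\d\d:\d\d) INFO:\s+active PBS job (\d+) "
    r"has been removed from the queue"
)

# maui timestamps have no year; 2000 is an arbitrary placeholder (assumption).
PLACEHOLDER_YEAR = "2000"


def parse_stamp(stamp):
    """Parse an 'MM/DD HH:MM:SS' timestamp using the placeholder year."""
    return datetime.strptime(PLACEHOLDER_YEAR + "/" + stamp, "%Y/%m/%d %H:%M:%S")


def job_lifetimes(lines):
    """Map job ID to seconds between 'successfully started' and 'removed'."""
    started = {}
    lifetimes = {}
    for line in lines:
        match = STARTED.match(line)
        if match:
            started[match.group(2)] = parse_stamp(match.group(1))
            continue
        match = REMOVED.match(line)
        if match and match.group(2) in started:
            job = match.group(2)
            end = parse_stamp(match.group(1))
            lifetimes[job] = (end - started[job]).total_seconds()
    return lifetimes


# Three lines taken from the log excerpt above.
sample = [
    "11/08 17:14:17 INFO: job '1107' successfully started",
    "11/08 17:14:17 INFO: 1 jobs started on iteration 293",
    "11/08 17:14:28 INFO: active PBS job 1107 has been removed from the queue. "
    "assuming successful completion",
]

if __name__ == "__main__":
    for job, seconds in job_lifetimes(sample).items():
        print(job, seconds)
```

Note that the removal time is only when maui noticed the job was gone, so the real lifetime may be shorter than the figure reported here.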
Any help would be greatly appreciated!
Thanks,
Andrew.
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users
