Hello all!

I’ve got an OSCAR 4.2.1 cluster running on 4 quad Opteron compute nodes. The head node has the same hardware configuration. Interactive jobs were running fine until earlier today. There were no changes made to the configuration of TORQUE or MAUI. Suddenly I’m getting the following from qsub when I try to submit an interactive job from a login node:

  qsub: waiting for job 1106.brahe000.cluster to start
  qsub: job 1106.brahe000.cluster apparently deleted

The login node *is* in the hosts.equiv file on the head node. Interactive jobs do work from the head node itself.

The following is in the pbs server logs, but I’ve been seeing this all along:

  11/08/2006 17:12:21;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to connect from 172.24.254.1:1023 (address not trusted)

And from maui’s logs:

  11/08 17:14:17 MReqCreate(1107,SrcRQ,DstRQ,DoCreate)
  11/08 17:14:17 INFO: processing node request line '1:ppn=1'
  11/08 17:14:17 INFO: job '1107' loaded: 1 apreece nweng 14400 Idle 0 1163031254 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1163031257
  11/08 17:14:17 INFO: 21 PBS jobs detected on RM base
  11/08 17:14:17 INFO: jobs detected: 21
  11/08 17:14:17 INFO: total jobs selected (ALL): 6/21 [State: 15]
  11/08 17:14:17 INFO: total jobs selected (ALL): 1/21 [State: 15][Policy: 5]
  11/08 17:14:17 INFO: total jobs selected in partition ALL: 1/6 [Policy: 5]
  11/08 17:14:17 MQueueScheduleRJobs(Q)
  11/08 17:14:17 INFO: total jobs selected in partition ALL: 1/1
  11/08 17:14:17 INFO: total jobs selected in partition ALL: 0/1 [PartitionAccess: 1]
  11/08 17:14:17 INFO: total jobs selected in partition opteron: 1/1
  11/08 17:14:17 MQueueScheduleIJobs(Q,opteron)
  11/08 17:14:17 INFO: 16 feasible tasks found for job 1107:0 in partition opteron (1 Needed)
  11/08 17:14:17 INFO: tasks located for job 1107: 1 of 1 required (1 feasible)
  11/08 17:14:17 MJobStart(1107)
  11/08 17:14:17 MRMJobStart(1107,Msg,SC)
  11/08 17:14:17 MPBSJobStart(1107,base,Msg,SC)
  11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,brahe003.cluster)
  11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,1:ppn=1)
  11/08 17:14:17 INFO: job '1107' successfully started
  11/08 17:14:17 INFO: starting job '1107'
  11/08 17:14:17 INFO: 1 jobs started on iteration 293
  Active Jobs------ ------------------
  11/08 17:14:17 INFO: resources available after scheduling: N: 0 P: 3
  11/08 17:14:17 INFO: total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]
  11/08 17:14:17 INFO: total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]
  11/08 17:14:17 MSchedUpdateStats()
  11/08 17:14:17 INFO: iteration: 293 scheduling time: 0.117 seconds
  11/08 17:14:17 INFO: current util[293]: 4/4 (100.00%) PH: 67.69% active jobs: 16 of 22 (completed: 9)
  11/08 17:14:17 INFO: scheduling complete. sleeping 10 seconds
  11/08 17:14:17 INFO: received service request from host 'brahe000.cluster'
  11/08 17:14:17 MSURecvData(S,5000000,TRUE,SC,EMsg)
  11/08 17:14:17 UIQueueShow(RBuffer,Buffer,1,root,BufSize)
  11/08 17:14:17 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)
  11/08 17:14:17 INFO: UIQueueShowAllJobs buffer size: 1686 bytes
  11/08 17:14:17 MSUSendData(S,5000000,TRUE,TRUE)
  11/08 17:14:17 INFO: packet sent (1772 bytes of 1772)
  11/08 17:14:17 MSUDisconnect(S)
  11/08 17:14:22 INFO: received service request from host 'brahe000.cluster'
  11/08 17:14:22 MSURecvData(S,5000000,TRUE,SC,EMsg)
  11/08 17:14:22 UIQueueShow(RBuffer,Buffer,1,root,BufSize)
  11/08 17:14:22 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)
  11/08 17:14:22 INFO: UIQueueShowAllJobs buffer size: 1686 bytes
  11/08 17:14:22 MSUSendData(S,5000000,TRUE,TRUE)
  11/08 17:14:22 INFO: packet sent (1772 bytes of 1772)
  11/08 17:14:22 MSUDisconnect(S)
  11/08 17:14:28 ServerProcessRequests()
  11/08 17:14:28 MResAdjust(NULL,0,0)
  11/08 17:14:28 MStatInitializeActiveSysUsage()
  11/08 17:14:28 INFO: starting iteration 294
  11/08 17:14:28 MRMGetInfo()
  11/08 17:14:28 MRMClusterQuery()
  11/08 17:14:28 MPBSClusterQuery(base,RCount,SC)
  11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
  11/08 17:14:28 INFO: PBS node brahe001.cluster set to state Busy (job-exclusive)
  11/08 17:14:28 MPBSNodeUpdate(brahe001.cluster,brahe001.cluster,Busy,base)
  11/08 17:14:28 MPBSLoadQueueInfo(base,brahe001.cluster,SC)
  11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
  11/08 17:14:28 INFO: PBS node brahe002.cluster set to state Busy (job-exclusive)
  11/08 17:14:28 MPBSNodeUpdate(brahe002.cluster,brahe002.cluster,Busy,base)
  11/08 17:14:28 MPBSLoadQueueInfo(base,brahe002.cluster,SC)
  11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
  11/08 17:14:28 INFO: PBS node brahe003.cluster set to state Idle (free)
  11/08 17:14:28 INFO: node 'brahe003.cluster' changed states from Running to Idle
  11/08 17:14:28 MPBSNodeUpdate(brahe003.cluster,brahe003.cluster,Idle,base)
  11/08 17:14:28 MPBSLoadQueueInfo(base,brahe003.cluster,SC)
  11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
  11/08 17:14:28 INFO: PBS node brahe004.cluster set to state Busy (job-exclusive)
  11/08 17:14:28 MPBSNodeUpdate(brahe004.cluster,brahe004.cluster,Busy,base)
  11/08 17:14:28 MPBSLoadQueueInfo(base,brahe004.cluster,SC)
  11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)
  11/08 17:14:28 INFO: 6 PBS resources detected on RM base
  11/08 17:14:28 INFO: resources detected: 6
  11/08 17:14:28 MRMWorkloadQuery()
  11/08 17:14:28 MPBSWorkloadQuery(base,JCount,SC)
  11/08 17:14:28 MPBSJobUpdate(1084,1084.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1085,1085.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1086,1086.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1087,1087.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1088,1088.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1089,1089.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1090,1090.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1091,1091.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1092,1092.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1093,1093.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1094,1094.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1095,1095.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1097,1097.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1098,1098.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1099,1099.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1100,1100.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1101,1101.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1102,1102.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1103,1103.brahe000.cluster,TaskList,0)
  11/08 17:14:28 MPBSJobUpdate(1104,1104.brahe000.cluster,TaskList,0)
  11/08 17:14:28 INFO: active PBS job 1107 has been removed from the queue. assuming successful completion

Any help would be greatly appreciated!

Thanks,
Andrew.
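
For anyone triaging the same symptom: in the Maui excerpt, job 1107 is reported as "successfully started" at 17:14:17 and as "removed from the queue" at 17:14:28, one scheduling iteration later. Below is a minimal Python sketch that flags such short-lived jobs in a pasted Maui log. It is not part of Maui or TORQUE; the function name `short_lived_jobs`, the 60-second threshold, and the `year` default are assumptions made for illustration (Maui timestamps carry no year, so 2006 is borrowed from the pbs_server entry).

```python
import re
from datetime import datetime

# Maui entries in the excerpt begin with "MM/DD HH:MM:SS ".
ENTRY = re.compile(r"(\d\d/\d\d \d\d:\d\d:\d\d) ")
STARTED = re.compile(r"job '(\d+)' successfully started")
REMOVED = re.compile(r"active PBS job (\d+) has been removed from the queue")


def short_lived_jobs(log_text, max_seconds=60, year=2006):
    """Return {job_id: seconds} for jobs removed soon after they started.

    Hypothetical helper: the threshold and year are illustrative assumptions.
    """
    # Pasted logs are often re-wrapped, so collapse whitespace before parsing.
    flat = " ".join(log_text.split())
    # With one capture group, split yields [prefix, stamp1, msg1, stamp2, msg2, ...].
    parts = ENTRY.split(flat)
    started, flagged = {}, {}
    for stamp, msg in zip(parts[1::2], parts[2::2]):
        when = datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")
        match = STARTED.search(msg)
        if match:
            started[match.group(1)] = when
        match = REMOVED.search(msg)
        if match and match.group(1) in started:
            elapsed = (when - started[match.group(1)]).total_seconds()
            if elapsed <= max_seconds:
                flagged[match.group(1)] = elapsed
    return flagged


# Two entries copied from the excerpt above, wrapped as they were pasted.
sample = """11/08 17:14:17 INFO: job '1107'
successfully started 11/08 17:14:28 INFO: active PBS job
1107 has been removed from the queue. assuming successful completion"""

print(short_lived_jobs(sample))
```

A job that shows up here started on the scheduler side but vanished before the next poll, which suggests looking at what happened on the execution node rather than at scheduling policy.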
_______________________________________________ Oscar-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/oscar-users
