Hello all!

I’ve got an OSCAR 4.2.1 cluster running on four quad-Opteron compute nodes.

The head node has the same hardware configuration. Interactive jobs were running fine until earlier today, and no changes were made to the TORQUE or Maui configuration.

 

Suddenly I’m getting the following from qsub when I try to submit an interactive job from a login node:

qsub: waiting for job 1106.brahe000.cluster to start
qsub: job 1106.brahe000.cluster apparently deleted

 

The login node *is* listed in the hosts.equiv file on the head node, and interactive jobs do work when submitted from the head node itself.

 

The following is in the PBS server logs, but I’ve been seeing this all along:

 

11/08/2006 17:12:21;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to connect from 172.24.254.1:1023 (address not trusted)
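To check whether that rejected address belongs to the login node or to something else, the rejections can be tallied per source IP. This is only a minimal sketch: the six-field, semicolon-separated layout is inferred from the single line quoted above, and the names `REJECT_RE` and `untrusted_addresses` are made up for illustration.

```python
import re
from collections import Counter

# Assumed record layout, inferred from the one line quoted above:
#   timestamp;code;daemon;class;object;message
REJECT_RE = re.compile(
    r"bad attempt to connect from (\d+\.\d+\.\d+\.\d+):(\d+) \(address not trusted\)"
)

def untrusted_addresses(lines):
    """Count 'address not trusted' rejections per source IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # skip lines that do not fit the assumed layout
        match = REJECT_RE.search(fields[5])
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "11/08/2006 17:12:21;0001;PBS_Server;Svr;PBS_Server;is_request, "
    "bad attempt to connect from 172.24.254.1:1023 (address not trusted)",
]
print(dict(untrusted_addresses(sample)))  # {'172.24.254.1': 1}
```

Feeding it the full server log file (one line per list element) would show whether 172.24.254.1 is the only address being refused.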

 

And from Maui’s logs:

 

11/08 17:14:17 MReqCreate(1107,SrcRQ,DstRQ,DoCreate)

11/08 17:14:17 INFO:     processing node request line '1:ppn=1'

11/08 17:14:17 INFO:     job '1107' loaded:   1  apreece    nweng  14400       Idle   0 1163031254   [NONE] [NONE] [NONE] >=      0 >=      0 [NONE] 1163031257

11/08 17:14:17 INFO:     21 PBS jobs detected on RM base

11/08 17:14:17 INFO:     jobs detected: 21

11/08 17:14:17 INFO:     total jobs selected (ALL): 6/21 [State: 15]

11/08 17:14:17 INFO:     total jobs selected (ALL): 1/21 [State: 15][Policy: 5]

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 1/6 [Policy: 5]

11/08 17:14:17 MQueueScheduleRJobs(Q)

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 1/1

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 0/1 [PartitionAccess: 1]

11/08 17:14:17 INFO:     total jobs selected in partition opteron: 1/1

11/08 17:14:17 MQueueScheduleIJobs(Q,opteron)

11/08 17:14:17 INFO:     16 feasible tasks found for job 1107:0 in partition opteron (1 Needed)

11/08 17:14:17 INFO:     tasks located for job 1107:  1 of 1 required (1 feasible)

11/08 17:14:17 MJobStart(1107)

11/08 17:14:17 MRMJobStart(1107,Msg,SC)

11/08 17:14:17 MPBSJobStart(1107,base,Msg,SC)

11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,brahe003.cluster)

11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,1:ppn=1)

11/08 17:14:17 INFO:     job '1107' successfully started

11/08 17:14:17 INFO:     starting job '1107'

11/08 17:14:17 INFO:     1 jobs started on iteration 293

Active Jobs------

------------------

11/08 17:14:17 INFO:     resources available after scheduling: N: 0  P: 3

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]

11/08 17:14:17 MSchedUpdateStats()

11/08 17:14:17 INFO:     iteration:  293   scheduling time:  0.117 seconds

11/08 17:14:17 INFO:     current util[293]:  4/4 (100.00%)  PH: 67.69%  active jobs: 16 of 22 (completed: 9)

11/08 17:14:17 INFO:     scheduling complete.  sleeping 10 seconds

11/08 17:14:17 INFO:     received service request from host 'brahe000.cluster'

11/08 17:14:17 MSURecvData(S,5000000,TRUE,SC,EMsg)

11/08 17:14:17 UIQueueShow(RBuffer,Buffer,1,root,BufSize)

11/08 17:14:17 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)

11/08 17:14:17 INFO:     UIQueueShowAllJobs buffer size: 1686 bytes

11/08 17:14:17 MSUSendData(S,5000000,TRUE,TRUE)

11/08 17:14:17 INFO:     packet sent (1772 bytes of 1772)

11/08 17:14:17 MSUDisconnect(S)

11/08 17:14:22 INFO:     received service request from host 'brahe000.cluster'

11/08 17:14:22 MSURecvData(S,5000000,TRUE,SC,EMsg)

11/08 17:14:22 UIQueueShow(RBuffer,Buffer,1,root,BufSize)

11/08 17:14:22 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)

11/08 17:14:22 INFO:     UIQueueShowAllJobs buffer size: 1686 bytes

11/08 17:14:22 MSUSendData(S,5000000,TRUE,TRUE)

11/08 17:14:22 INFO:     packet sent (1772 bytes of 1772)

11/08 17:14:22 MSUDisconnect(S)

11/08 17:14:28 ServerProcessRequests()

11/08 17:14:28 MResAdjust(NULL,0,0)

11/08 17:14:28 MStatInitializeActiveSysUsage()

11/08 17:14:28 INFO:     starting iteration 294

11/08 17:14:28 MRMGetInfo()

11/08 17:14:28 MRMClusterQuery()

11/08 17:14:28 MPBSClusterQuery(base,RCount,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe001.cluster set to state Busy (job-exclusive)

11/08 17:14:28 MPBSNodeUpdate(brahe001.cluster,brahe001.cluster,Busy,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe001.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe002.cluster set to state Busy (job-exclusive)

11/08 17:14:28 MPBSNodeUpdate(brahe002.cluster,brahe002.cluster,Busy,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe002.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe003.cluster set to state Idle (free)

11/08 17:14:28 INFO:     node 'brahe003.cluster' changed states from Running to Idle

11/08 17:14:28 MPBSNodeUpdate(brahe003.cluster,brahe003.cluster,Idle,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe003.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe004.cluster set to state Busy (job-exclusive)

11/08 17:14:28 MPBSNodeUpdate(brahe004.cluster,brahe004.cluster,Busy,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe004.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     6 PBS resources detected on RM base

11/08 17:14:28 INFO:     resources detected: 6

11/08 17:14:28 MRMWorkloadQuery()

11/08 17:14:28 MPBSWorkloadQuery(base,JCount,SC)

11/08 17:14:28 MPBSJobUpdate(1084,1084.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1085,1085.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1086,1086.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1087,1087.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1088,1088.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1089,1089.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1090,1090.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1091,1091.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1092,1092.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1093,1093.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1094,1094.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1095,1095.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1097,1097.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1098,1098.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1099,1099.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1100,1100.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1101,1101.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1102,1102.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1103,1103.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1104,1104.brahe000.cluster,TaskList,0)

11/08 17:14:28 INFO:     active PBS job 1107 has been removed from the queue.  assuming successful completion
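The excerpt shows job 1107 being started in iteration 293 and already gone from the queue by iteration 294. The gap can be pulled out of a longer log with something like the sketch below. It assumes the two message formats seen above ("successfully started" and "has been removed from the queue") and takes the year from the PBS server log line, since the Maui timestamps omit it. The function names are illustrative only.

```python
import re
from datetime import datetime

STAMP = r"(\d\d/\d\d \d\d:\d\d:\d\d)"
STARTED_RE = re.compile(r"^" + STAMP + r" INFO:\s+job '(\d+)' successfully started")
REMOVED_RE = re.compile(
    r"^" + STAMP + r" INFO:\s+active PBS job (\d+) has been removed from the queue"
)

def parse_stamp(text, year=2006):
    # The log lines carry no year; 2006 is assumed from the PBS server log.
    return datetime.strptime(f"{year}/{text}", "%Y/%m/%d %H:%M:%S")

def job_lifetimes(lines):
    """Map job id -> seconds between 'started' and 'removed' messages."""
    started, lifetimes = {}, {}
    for line in lines:
        m = STARTED_RE.match(line)
        if m:
            started[m.group(2)] = parse_stamp(m.group(1))
            continue
        m = REMOVED_RE.match(line)
        if m and m.group(2) in started:
            delta = parse_stamp(m.group(1)) - started[m.group(2)]
            lifetimes[m.group(2)] = delta.total_seconds()
    return lifetimes

sample = [
    "11/08 17:14:17 INFO:     job '1107' successfully started",
    "11/08 17:14:28 INFO:     active PBS job 1107 has been removed from the queue."
    "  assuming successful completion",
]
print(job_lifetimes(sample))  # {'1107': 11.0}
```

For the lines quoted here it reports 11 seconds, which is just one scheduling iteration, so the job disappears almost as soon as it is started.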

 

Any help would be greatly appreciated!

Thanks,
Andrew.

_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users
