I have an interesting issue that I'm having trouble diagnosing.  Periodically, 
Maui simply stops scheduling jobs, even though adequate resources exist.  
Restarting the Torque and Maui processes on the scheduler node clears the 
issue, only for it to reappear some time later.

Maui Version 3.3.1
Torque Version 4.2.6.1

Here's a sample of Maui.log from when it is misbehaving:
04/15 06:45:01 MResAdjust(NULL,0,0)
04/15 06:45:01 MStatInitializeActiveSysUsage()
04/15 06:45:01 MStatClearUsage([NONE],Active)
04/15 06:45:01 ServerUpdate()
04/15 06:45:01 MSysUpdateTime()
04/15 06:45:01 INFO:     starting iteration 993
04/15 06:45:01 MRMGetInfo()
04/15 06:45:01 MClusterClearUsage()
04/15 06:45:01 MRMClusterQuery()
04/15 06:45:01 MPBSClusterQuery(xxxx.xxxx.COM,RCount,SC)
04/15 06:45:10 ERROR:    cannot get node info: End of File
04/15 06:45:15 ALERT:    cannot load cluster resources on RM (RM 'xxxxx.xxxxx.COM' failed in function 'clusterquery')
04/15 06:45:15 WARNING:  no resources detected
.
.
.
04/15 06:45:15 INFO:     attempting BF backfill with job '        1068124' (p:   3 t: 86400)
04/15 06:45:15 INFO:     580 feasible tasks found for job 1068124:0 in partition DEFAULT (3 Needed)
04/15 06:45:15 INFO:     inadequate feasible tasks found for job 1068124:0 (0 < 3)
04/15 06:45:15 INFO:     inadequate nodes found for job 1068124:0 (0 < 1)
04/15 06:45:15 INFO:     attempting BF backfill with job '        1068125' (p:   3 t: 86400)
04/15 06:45:15 INFO:     580 feasible tasks found for job 1068125:0 in partition DEFAULT (3 Needed)
04/15 06:45:15 INFO:     inadequate feasible tasks found for job 1068125:0 (0 < 3)
04/15 06:45:15 INFO:     inadequate nodes found for job 1068125:0 (0 < 1)
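
To check whether the hang always coincides with a failed node query, the 
iterations in which the query failed can be pulled out of the log.  This is 
only a minimal sketch, assuming the line format shown in the sample above; 
the function name failed_iterations is hypothetical and not part of Maui.

```python
import re

# Assumed formats, taken from the sample log lines above:
#   "04/15 06:45:01 INFO:     starting iteration 993"
#   "04/15 06:45:10 ERROR:    cannot get node info: End of File"
ITERATION = re.compile(r"INFO:\s+starting iteration (\d+)")
FAILURE = re.compile(r"^(\d\d/\d\d \d\d:\d\d:\d\d) ERROR:\s+cannot get node info")


def failed_iterations(lines):
    """Return (iteration, timestamp) pairs for iterations whose node query failed."""
    current = None  # iteration number most recently seen in the log
    failures = []
    for line in lines:
        started = ITERATION.search(line)
        if started:
            current = int(started.group(1))
            continue
        failed = FAILURE.match(line)
        if failed:
            failures.append((current, failed.group(1)))
    return failures


sample = [
    "04/15 06:45:01 INFO:     starting iteration 993",
    "04/15 06:45:01 MPBSClusterQuery(xxxx.xxxx.COM,RCount,SC)",
    "04/15 06:45:10 ERROR:    cannot get node info: End of File",
]
print(failed_iterations(sample))
```

Running the same function over the whole log file would show whether every 
stalled iteration reports the "End of File" error or only some of them.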

The Torque server logs are littered with errors like the following while the 
system is hung:

04/15/2015 06:41:29;0001;PBS_Server.6443;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in contact_listener, Could not contact Listener 0 - port 15002 cannot connect to port 1015 in client_to_svr - connection refused.
04/15/2015 06:41:34;0008;PBS_Server.6391;Req;decode_DIS_replySvr;failed to get PROT_TYPE: 0, (rc: 11)
04/15/2015 06:41:34;0008;PBS_Server.6391;Job;send_request_to_remote_server;DIS_reply_read failed: 11
04/15/2015 06:41:34;0080;PBS_Server.6391;Req;req_reject;Reject reply code=11(Resource temporarily unavailable), aux=0, type=DeleteJob, from r...@xxxx.xxxx.com
04/15/2015 06:41:34;0008;PBS_Server.6391;Job;1064040.xxxx.xxxxxx.com;Job sent signal SIGTERM on delete
04/15/2015 06:41:34;0002;PBS_Server.6391;Req;dis_reply_write;DIS reply failure, -1
04/15/2015 06:41:35;0001;PBS_Server.6375;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.114:858 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:41:39;0008;PBS_Server.6459;Job;1064040.xxxxx.xxxxxxx.com;Job deleted at request of r...@xxxx.xxxx.com
04/15/2015 06:41:45;0001;PBS_Server.6457;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.113:819 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:41:49;0001;PBS_Server.6390;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.110:799 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:41:51;0001;PBS_Server.6443;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.210:797 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:04;0001;PBS_Server.6440;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.112:741 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:10;0001;PBS_Server.6448;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.111:238 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:13;0001;PBS_Server.6373;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.3.48:568 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:20;0001;PBS_Server.6396;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.114:858 (address not trusted - check entry in server_priv/nodes)
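
Since the "address not trusted" message points at server_priv/nodes, one 
starting point is to tally which source addresses are being rejected and 
compare them with the entries in that file.  The following is a minimal 
sketch, assuming the message wording shown above; the function name 
untrusted_sources is hypothetical and not part of Torque.

```python
import re
from collections import Counter

# Assumed wording, taken from the server log entries above:
#   "bad attempt to connect from 192.168.21.114:858 (address not trusted ..."
UNTRUSTED = re.compile(
    r"bad attempt to connect from (\d{1,3}(?:\.\d{1,3}){3}):\d+"
)


def untrusted_sources(log_text):
    """Count rejected connection attempts per source IP address."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter(UNTRUSTED.findall(flat))


sample = """\
04/15/2015
06:41:35;0001;PBS_Server.6375;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
attempt to connect from 192.168.21.114:858 (address not trusted - check entry
in server_priv/nodes)
04/15/2015
06:42:13;0001;PBS_Server.6373;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
attempt to connect from 192.168.3.48:568 (address not trusted - check entry in
server_priv/nodes)
04/15/2015
06:42:20;0001;PBS_Server.6396;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad
attempt to connect from 192.168.21.114:858 (address not trusted - check entry
in server_priv/nodes)
"""

for address, count in untrusted_sources(sample).most_common():
    print(address, count)
```

Any address that shows up here but resolves to a host that is listed in 
server_priv/nodes would suggest a name-resolution or interface mismatch 
rather than a missing entry, though that is only a guess on my part.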


I've tried looking around for solutions but to no avail.  Any thoughts on where 
to start in diagnosing what is going wrong, and why it does so intermittently?

-Eric
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
