I have an interesting issue that I'm having trouble diagnosing. Periodically, Maui will simply stop scheduling jobs, even though adequate resources exist. A restart of the Torque-Maui processes on the scheduler node alleviates the issue, only for it to reappear some time later. Here's a sample of the Maui log when it is misbehaving:

Maui Version 3.3.1
Torque Version 4.2.6.1

Maui.log

04/15 06:45:01 MResAdjust(NULL,0,0)
04/15 06:45:01 MStatInitializeActiveSysUsage()
04/15 06:45:01 MStatClearUsage([NONE],Active)
04/15 06:45:01 ServerUpdate()
04/15 06:45:01 MSysUpdateTime()
04/15 06:45:01 INFO: starting iteration 993
04/15 06:45:01 MRMGetInfo()
04/15 06:45:01 MClusterClearUsage()
04/15 06:45:01 MRMClusterQuery()
04/15 06:45:01 MPBSClusterQuery(xxxx.xxxx.COM,RCount,SC)
04/15 06:45:10 ERROR: cannot get node info: End of File
04/15 06:45:15 ALERT: cannot load cluster resources on RM (RM 'xxxxx.xxxxx.COM' failed in function 'clusterquery')
04/15 06:45:15 WARNING: no resources detected
. . .
04/15 06:45:15 INFO: attempting BF backfill with job ' 1068124' (p: 3 t: 86400)
04/15 06:45:15 INFO: 580 feasible tasks found for job 1068124:0 in partition DEFAULT (3 Needed)
04/15 06:45:15 INFO: inadequate feasible tasks found for job 1068124:0 (0 < 3)
04/15 06:45:15 INFO: inadequate nodes found for job 1068124:0 (0 < 1)
04/15 06:45:15 INFO: attempting BF backfill with job ' 1068125' (p: 3 t: 86400)
04/15 06:45:15 INFO: 580 feasible tasks found for job 1068125:0 in partition DEFAULT (3 Needed)
04/15 06:45:15 INFO: inadequate feasible tasks found for job 1068125:0 (0 < 3)
04/15 06:45:15 INFO: inadequate nodes found for job 1068125:0 (0 < 1)

The torque logs are littered with errors like the following when the system is hung:

04/15/2015 06:41:29;0001;PBS_Server.6443;Svr;PBS_Server;LOG_ERROR::Connection refused (111) in contact_listener, Could not contact Listener 0 - port 15002 cannot connect to port 1015 in client_to_svr - connection refused.
04/15/2015 06:41:34;0008;PBS_Server.6391;Req;decode_DIS_replySvr;failed to get PROT_TYPE: 0, (rc: 11)
04/15/2015 06:41:34;0008;PBS_Server.6391;Job;send_request_to_remote_server;DIS_reply_read failed: 11
04/15/2015 06:41:34;0080;PBS_Server.6391;Req;req_reject;Reject reply code=11(Resource temporarily unavailable), aux=0, type=DeleteJob, from r...@xxxx.xxxx.com
04/15/2015 06:41:34;0008;PBS_Server.6391;Job;1064040.xxxx.xxxxxx.com;Job sent signal SIGTERM on delete
04/15/2015 06:41:34;0002;PBS_Server.6391;Req;dis_reply_write;DIS reply failure, -1
04/15/2015 06:41:35;0001;PBS_Server.6375;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.114:858 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:41:39;0008;PBS_Server.6459;Job;1064040.xxxxx.xxxxxxx.com;Job deleted at request of r...@xxxx.xxxx.com
04/15/2015 06:41:45;0001;PBS_Server.6457;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.113:819 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:41:49;0001;PBS_Server.6390;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.110:799 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:41:51;0001;PBS_Server.6443;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.210:797 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:04;0001;PBS_Server.6440;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.112:741 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:10;0001;PBS_Server.6448;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.111:238 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:13;0001;PBS_Server.6373;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.3.48:568 (address not trusted - check entry in server_priv/nodes)
04/15/2015 06:42:20;0001;PBS_Server.6396;Svr;PBS_Server;LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.21.114:858 (address not trusted - check entry in server_priv/nodes)

I've tried looking around for solutions but to no avail. Any thoughts on where to start in diagnosing what is going wrong, and why it does so intermittently?

-Eric
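
P.S. Since the server log itself says to check server_priv/nodes, one starting point might be to tally which source addresses are being rejected. The following is only a rough sketch: the function name count_untrusted and the regular expression are my own, written against the log format shown above, and are not part of Torque or Maui.

```python
import re
from collections import Counter

# Pattern written against the "address not trusted" lines quoted above.
# It captures the source IP and ignores the source port.
UNTRUSTED = re.compile(
    r"bad attempt to connect from (\d{1,3}(?:\.\d{1,3}){3}):\d+ "
    r"\(address not trusted"
)


def count_untrusted(lines):
    """Map each rejected source IP to the number of rejected attempts."""
    hits = Counter()
    for line in lines:
        match = UNTRUSTED.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Example use (the file name is hypothetical):
#   with open("pbs_server.log") as fh:
#       for ip, n in count_untrusted(fh).most_common():
#           print(ip, n)
```

Run over the excerpt above, this would list 192.168.21.114 twice and each of the other addresses once. The resulting list could then be compared by hand against the hosts listed in server_priv/nodes.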
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers