We have jobs that are randomly getting killed. We are running GE 6.2u5. This is from the messages log:

02/13/2014 08:34:49|worker|kepler|W|job 31233.1 failed on host c052.cm.cluster assumedly after job because: job 31233.1 died through signal KILL (9)
02/13/2014 09:23:00| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
02/13/2014 09:44:55|worker|kepler|W|job 30895.1 failed on host c062.cm.cluster assumedly after job because: job 30895.1 died through signal KILL (9)
02/13/2014 11:28:26| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
02/13/2014 11:41:17|event_|kepler|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "kepler"
02/13/2014 11:48:34| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"

I do have a hard limit set:

s_rt                  INFINITY
h_rt                  96:00:00
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

I am running GE from an NFS share. Would this have something to do with the exehost spool directory configuration?

Thanks,
Eric

--
Eric Kaufmann | Application Support Analyst - Advanced Technology Group | Saint Louis University | 314-977-2257 | [email protected]
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
