[gridengine users] Jobs Getting Killed

Eric Kaufmann Thu, 13 Feb 2014 13:44:08 -0800

We have jobs that are randomly getting killed. We are running GE 6.2u5.

This is from the messages log:


02/13/2014 08:34:49|worker|kepler|W|job 31233.1 failed on host c052.cm.cluster assumedly after job because: job 31233.1 died through signal KILL (9)
02/13/2014 09:23:00| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
02/13/2014 09:44:55|worker|kepler|W|job 30895.1 failed on host c062.cm.cluster assumedly after job because: job 30895.1 died through signal KILL (9)
02/13/2014 11:28:26| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
02/13/2014 11:41:17|event_|kepler|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "kepler"
02/13/2014 11:48:34| timer|kepler|W|got timeout error while write data to heartbeat file "heartbeat"
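
For sifting a longer messages file, a rough Python sketch along these lines could pull out just the killed jobs. It assumes each entry has five pipe-separated fields as in the excerpt above; the field names (timestamp, thread, host, level, text) and the helper names are my own labels for illustration, not anything defined by Grid Engine. The sample lines are copied from the excerpt with the mail wrapping undone.

```python
import re

# Sample entries copied from the log excerpt, one entry per line.
SAMPLE = [
    "02/13/2014 08:34:49|worker|kepler|W|job 31233.1 failed on host "
    "c052.cm.cluster assumedly after job because: job 31233.1 died through "
    "signal KILL (9)",
    '02/13/2014 09:23:00| timer|kepler|W|got timeout error while write data '
    'to heartbeat file "heartbeat"',
    "02/13/2014 09:44:55|worker|kepler|W|job 30895.1 failed on host "
    "c062.cm.cluster assumedly after job because: job 30895.1 died through "
    "signal KILL (9)",
]

# Matches the "job N.T failed on host H ... died through signal S (n)" text.
KILLED = re.compile(
    r"job (\d+)\.(\d+) failed on host (\S+) .*died through signal (\w+) \((\d+)\)"
)


def parse_line(line):
    """Split one entry into five fields; the field names are assumed labels."""
    parts = line.split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected format, skip it
    timestamp, thread, host, level, text = (p.strip() for p in parts)
    return {"timestamp": timestamp, "thread": thread, "host": host,
            "level": level, "text": text}


def killed_jobs(lines):
    """Return (timestamp, job.task, exec host, signal name) per killed job."""
    found = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = KILLED.search(rec["text"])
        if m:
            job_id = f"{m.group(1)}.{m.group(2)}"
            found.append((rec["timestamp"], job_id, m.group(3), m.group(4)))
    return found


if __name__ == "__main__":
    for entry in killed_jobs(SAMPLE):
        print(entry)
```

Grouping the result by execution host or by time of day may show whether the kills cluster around particular nodes or around the heartbeat timeouts.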

I do have a hard limit set on run time (h_rt); every other limit is INFINITY:

s_rt                  INFINITY
h_rt                  96:00:00
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
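
To rule the h_rt limit in or out, one option is to compare each killed job's wallclock time against the limit. The sketch below only converts the HH:MM:SS value to seconds (96:00:00 works out to 345600) and tests a wallclock figure against it. The function names and the 60-second slack are hypothetical, and the wallclock values would have to come from the accounting data, which is not shown in this post.

```python
def limit_to_seconds(value):
    """Convert an HH:MM:SS limit to seconds; INFINITY maps to None."""
    if value.strip().upper() == "INFINITY":
        return None
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def reached_limit(wallclock_seconds, limit, slack=60):
    """True if the job ran to within `slack` seconds of the limit.

    `slack` is an arbitrary tolerance chosen for illustration.
    """
    limit_seconds = limit_to_seconds(limit)
    if limit_seconds is None:
        return False  # no limit configured, so it cannot have been hit
    return wallclock_seconds >= limit_seconds - slack


if __name__ == "__main__":
    print(limit_to_seconds("96:00:00"))
    print(reached_limit(3600, "96:00:00"))
```

A job killed after only a few hours would not be explained by the 96-hour h_rt setting, which would point back toward something else sending the KILL signal.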

I am running GE from an NFS share. Could this have something to do with the
execution host spool directory configuration?

Thanks,

Eric



-- 
Eric Kaufmann |  Application Support Analyst -  Advanced Technology Group |
Saint Louis University | 314-977-2257 | [email protected]

