We are using Grid Engine 6.2u5 on CentOS 6.4.

I have jobs that are randomly being killed. The killed jobs get an exit
status of 127 or 137. I checked /var/log/messages on the nodes and didn't
see anything out of the ordinary. Here are the log entries:

03/31/2014 09:55:30|worker|kepler|W|job 33393.1 failed on host research029.cm.cluster assumedly after job because: job 33393.1 died through signal KILL (9)

03/31/2014 09:55:34|worker|kepler|W|job 33394.1 failed on host research026.cm.cluster assumedly after job because: job 33394.1 died through signal KILL (9)

qacct -j 33394

qname        std
hostname     research026.cm.cluster
group        justinchem
owner        justinchem
project      NONE
department   defaultdepartment
jobname      runCHO-C6H5-Cs_opt.24081
jobnumber    33394
taskid       undefined
account      sge
priority     0
qsub_time    Mon Mar 31 09:54:53 2014
start_time   Mon Mar 31 09:55:10 2014
end_time     Mon Mar 31 09:55:33 2014
granted_pe   gauss
slots        4
failed       100 : assumedly after job
exit_status  137
ru_wallclock 23
ru_utime     0.003
ru_stime     0.008
ru_maxrss    1380
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    1957
ru_majflt    5
ru_nswap     0
ru_inblock   584
ru_oublock   40
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     58
ru_nivcsw    6
cpu          82.570
mem          452.669
io           0.084
iow          0.000
maxvmem      5.710G
arid         undefined
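
For reference, the exit statuses above appear to follow the usual shell
convention: a status above 128 is 128 plus the signal number, so 137 would
correspond to signal 9 (KILL), which matches the log entries, and 127
conventionally means the shell could not find the command. Below is a
minimal Python sketch of that decoding. The helper name
describe_exit_status is hypothetical and not part of Grid Engine, and the
127 interpretation is an assumption based on shell behaviour.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the common 128 + signal convention.

    This is an illustrative helper, not a Grid Engine API.
    """
    if status > 128:
        signum = status - 128
        try:
            # Map the number to a symbolic name where the platform knows it.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name} ({signum})"
    if status == 127:
        # Assumption: shell convention for "command not found".
        return "command not found (shell convention)"
    if status == 0:
        return "success"
    return f"exited with status {status}"


# The two statuses reported for the killed jobs.
for status in (127, 137):
    print(status, "->", describe_exit_status(status))
```

If that reading is right, the 137 jobs were killed from outside, while the
127 jobs may never have started their command at all.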

Thanks,

Eric

-- 
Eric Kaufmann |  Application Support Analyst -  Advanced Technology Group |
Saint Louis University | 314-977-2257 | [email protected]