Dear all,
I have a problem: jobs submitted to gridengine randomly die.
The gridengine version is 8.1.9.
The OS is openSUSE 15.0.
The gridengine messages file says:
05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)

qacct -j 635659 says:
failed       100 : assumedly after job
exit_status  137                  (Killed)
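
For reference, the two outputs agree. By the usual shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 137 = 128 + 9, i.e. SIGKILL. Here is a minimal Python sketch of that decoding. The helper name describe_exit_status is only for illustration, and it assumes the 128 + N convention applies to the value qacct reports:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status (assumes the 128 + N convention)."""
    if status > 128:
        # Values above 128 are read as "terminated by signal (status - 128)".
        sig = signal.Signals(status - 128)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited normally with code {status}"


print(describe_exit_status(137))  # killed by signal 9 (SIGKILL)
```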


There was no kill triggered by the user. There are also no other limits set,
neither via ulimit nor in the gridengine queue.
The command 'qconf -sq all.q' shows:
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

Years ago there were some threads about the same issue, but I did not find a solution.

Does somebody have a hint what i can do or check/debug?

With kind regards and many thanks for any help, ulrich
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users