AFAICS the kill sent by SGE happens after a task has already returned with an error. In that case SGE would use the kill signal to make sure that all child processes are gone. Hence the question would be: what was the initial command in the job script, and what output/error did it generate?
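To help answer that, the job script could record the exit status of its main command. What follows is only a minimal sketch: `describe_status` and `run_logged` are hypothetical helper names, not part of SGE. It assumes the usual shell convention that a status above 128 means the process was terminated by signal (status - 128), under which the 137 reported by qacct corresponds to signal 9 (KILL).

```shell
#!/bin/sh
# Hypothetical helpers for a job script; not part of SGE.

# Translate a shell exit status into a short description.
# Assumption: a status above 128 means "terminated by signal (status - 128)".
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "exited normally"
    elif [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name; fall back if it is unknown
        name=$(kill -l "$sig" 2>/dev/null) || name=unknown
        echo "killed by signal $sig ($name)"
    else
        echo "exited with error status $status"
    fi
}

# Run a command, log how it ended to stderr, and pass its status on unchanged.
run_logged() {
    "$@"
    rc=$?
    echo "$(date) command '$*': $(describe_status "$rc")" >&2
    return "$rc"
}
```

In the job script the main command would then be started as `run_logged ./my_program` (a placeholder name), so that the job's stderr file shows whether the command itself failed before SGE sent the kill.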
-- Reuti

> On 14.05.2019 at 11:36, hiller <hil...@mpia-hd.mpg.de> wrote:
>
> Dear all,
> I have a problem that jobs sent to gridengine randomly die.
> The gridengine version is 8.1.9
> The OS is opensuse 15.0
> The gridengine messages file says:
> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
>
> qacct -j 635659 says:
> failed       100 : assumedly after job
> exit_status  137 (Killed)
>
> There was no kill triggered by the user. There are also no other limits,
> neither via ulimit nor in the gridengine queue.
> The 'qconf -sq all.q' command gives:
> s_rt     INFINITY
> h_rt     INFINITY
> s_cpu    INFINITY
> h_cpu    INFINITY
> s_fsize  INFINITY
> h_fsize  INFINITY
> s_data   INFINITY
> h_data   INFINITY
> s_stack  INFINITY
> h_stack  INFINITY
> s_core   INFINITY
> h_core   INFINITY
> s_rss    INFINITY
> h_rss    INFINITY
> s_vmem   INFINITY
> h_vmem   INFINITY
>
> Years ago there were some threads about the same issue, but I did not find a
> solution.
>
> Does somebody have a hint as to what I can do or check/debug?
>
> With kind regards and many thanks for any help, ulrich
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users