~> qconf -srqs
No resource quota set found
'dmesg -T' shows no OOM or other unusual messages.
'free -h' looks fine, and it also looked fine at the time the job was killed:
~> free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        1.0G        185G        2.6M        2.0G        186G
Swap:           49G          0B         49G
Full output of qacct:
~> qacct -j 635659
==============================================================
qname all.q
hostname karun10
group users
owner calj
project NONE
department defaultdepartment
jobname dsc_gdr2
jobnumber 635659
taskid undefined
account sge
priority 0
qsub_time Mon May 13 13:06:58 2019
start_time Mon May 13 13:06:56 2019
end_time Mon May 13 18:31:42 2019
granted_pe make
slots 1
failed 100 : assumedly after job
exit_status 137 (Killed)
ru_wallclock 19486s
ru_utime 0.048s
ru_stime 0.006s
ru_maxrss 11.566KB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 7885
ru_majflt 0
ru_nswap 0
ru_inblock 0
ru_oublock 8
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 142
ru_nivcsw 3
cpu 19305.760s
mem 7.463TBs
io 70.435GB
iow 0.000s
maxvmem 532.004MB
arid undefined
ar_sub_time undefined
category -l hostname=karun10 -pe make 1
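For reference, exit_status 137 follows the usual shell convention of 128 + signal number, so it corresponds to signal 9 (SIGKILL), which matches the messages file. A minimal Python sketch of that arithmetic, which also cross-checks ru_wallclock against start_time and end_time, is below. It is not part of gridengine: the helper names `parse_qacct` and `signal_from_exit_status` are made up for illustration, and it assumes a Linux host and the default C locale for the date format.

```python
import signal
from datetime import datetime

# Excerpt of the qacct output above (only the fields used here).
QACCT_EXCERPT = """\
start_time Mon May 13 13:06:56 2019
end_time Mon May 13 18:31:42 2019
exit_status 137 (Killed)
ru_wallclock 19486s
"""


def parse_qacct(text):
    """Split each 'key value...' line into a dict entry (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def signal_from_exit_status(status):
    """Return the signal number if status follows the 128+N convention, else None."""
    if status > 128:
        return status - 128
    return None


fields = parse_qacct(QACCT_EXCERPT)

# "137 (Killed)" -> 137 -> 9 -> "SIGKILL"
status = int(fields["exit_status"].split()[0])
signum = signal_from_exit_status(status)
signame = signal.Signals(signum).name if signum is not None else None

# Elapsed time between start_time and end_time, in seconds.
fmt = "%a %b %d %H:%M:%S %Y"
elapsed = (
    datetime.strptime(fields["end_time"], fmt)
    - datetime.strptime(fields["start_time"], fmt)
).total_seconds()

print(status, signum, signame, int(elapsed))
```

The elapsed time computed this way is 19486 seconds, the same as the ru_wallclock value, so the job ran for roughly 5.4 hours before it received the KILL.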
Thanks, ulrich
On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> It's a limit of some sort being reached. Do you have an RQS of any kind
> (qconf -srqs)? We see this for job-requested or system-set RAM exhaustion
> (OOM killer; as mentioned, 'dmesg -T' on the compute nodes is often useful),
> as well as when time limits are reached. What is the whole output of
> 'qacct -j JOBID'?
>
> Cheers,
> -Hugh
>
> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf
> Of hiller
> Sent: Tuesday, May 14, 2019 9:02 AM
> To: [email protected]
> Subject: Re: [gridengine users] jobs randomly die
>
> Hi,
> nope, there are no oom messages in the journal.
> Regards, ulrich
>
>
> On 5/14/19 12:49 PM, Arnau wrote:
>> Hi,
>>
>> _maybe_ the OOM killer killed the job? A look at the messages will give you
>> an answer (I've seen this in my cluster).
>>
>> HTH,
>> Arnau
>>
>> On Tue, 14 May 2019 at 12:37, hiller (<[email protected]
>> <mailto:[email protected]>>) wrote:
>>
>> Dear all,
>> I have the problem that jobs sent to gridengine randomly die.
>> The gridengine version is 8.1.9
>> The OS is opensuse 15.0
>> The gridengine messages file says:
>> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed -
>> killing job
>> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10
>> assumedly after job because: job 635659.1 died through signal KILL (9)
>>
>> qacct -j 635659 says:
>> failed 100 : assumedly after job
>> exit_status 137 (Killed)
>>
>>
>> There was no kill triggered by the user. Also, there are no other
>> limitations, neither via ulimit nor in the gridengine queue.
>> The 'qconf -sq all.q' command gives:
>> s_rt INFINITY
>> h_rt INFINITY
>> s_cpu INFINITY
>> h_cpu INFINITY
>> s_fsize INFINITY
>> h_fsize INFINITY
>> s_data INFINITY
>> h_data INFINITY
>> s_stack INFINITY
>> h_stack INFINITY
>> s_core INFINITY
>> h_core INFINITY
>> s_rss INFINITY
>> h_rss INFINITY
>> s_vmem INFINITY
>> h_vmem INFINITY
>>
>> Years ago there were some threads about the same issue, but I did not
>> find a solution in them.
>>
>> Does anybody have a hint as to what I can do or check/debug?
>>
>> With kind regards and many thanks for any help, ulrich
>> _______________________________________________
>> users mailing list
>> [email protected] <mailto:[email protected]>
>> https://gridengine.org/mailman/listinfo/users
>>