On Tue, 2019-05-14 at 10:03 -0400, Feng Zhang wrote:
> looks like your job used a lot of ram:
>
> mem          7.463TBs
> io           70.435GB
> iow          0.000s
> maxvmem      532.004MB

Not really; 532 MB isn't a lot of memory these days. The mem figure is in
terabyte-seconds, which accumulate fairly quickly: at 512 MB you get one
TB-second roughly every 2000 seconds. However, the fact that it is
reporting these numbers indicates that some sort of built-in memory limit
was enabled. Grid Engine won't measure memory usage unless it has some
sort of limit to enforce.

William

> Do you have CGROUP to limit resource of jobs?
>
> Best,
>
> Feng
>
> On Tue, May 14, 2019 at 9:53 AM hiller <hil...@mpia-hd.mpg.de> wrote:
> >
> > ~> qconf -srqs
> > No resource quota set found
> >
> > 'dmesg -T' does not give an oom or other weird messages.
> >
> > 'free -h' looks good and also looked good at 'kill time':
> >
> > ~> free -h
> >         total   used   free   shared   buff/cache   available
> > Mem:    188G    1.0G   185G   2.6M     2.0G         186G
> > Swap:   49G     0B     49G
> >
> > Full output of qacct:
> > ~> qacct -j 635659
> > ==============================================================
> > qname        all.q
> > hostname     karun10
> > group        users
> > owner        calj
> > project      NONE
> > department   defaultdepartment
> > jobname      dsc_gdr2
> > jobnumber    635659
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Mon May 13 13:06:58 2019
> > start_time   Mon May 13 13:06:56 2019
> > end_time     Mon May 13 18:31:42 2019
> > granted_pe   make
> > slots        1
> > failed       100 : assumedly after job
> > exit_status  137 (Killed)
> > ru_wallclock 19486s
> > ru_utime     0.048s
> > ru_stime     0.006s
> > ru_maxrss    11.566KB
> > ru_ixrss     0.000B
> > ru_ismrss    0.000B
> > ru_idrss     0.000B
> > ru_isrss     0.000B
> > ru_minflt    7885
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   8
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     142
> > ru_nivcsw    3
> > cpu          19305.760s
> > mem          7.463TBs
> > io           70.435GB
> > iow          0.000s
> > maxvmem      532.004MB
> > arid         undefined
> > ar_sub_time  undefined
> > category     -l hostname=karun10 -pe make 1
> >
> >
> > Thanks, ulrich
> >
> >
> > On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> > > It's a
> > > limit being reached, of some sort. Do you have an RQS of
> > > any kind (qconf -srqs)? We see this for job-requested or
> > > system-set RAM exhaustion (OOM killer; as mentioned, 'dmesg -T' on
> > > compute nodes is often useful), as well as time limits reached. What
> > > is the whole output from 'qacct -j JOBID'?
> > >
> > > Cheers,
> > > -Hugh
> > >
> > > -----Original Message-----
> > > From: users-boun...@gridengine.org <users-boun...@gridengine.org>
> > > On Behalf Of hiller
> > > Sent: Tuesday, May 14, 2019 9:02 AM
> > > To: users@gridengine.org
> > > Subject: Re: [gridengine users] jobs randomly die
> > >
> > > Hi,
> > > nope, there are no oom messages in the journal.
> > > Regards, ulrich
> > >
> > >
> > > On 5/14/19 12:49 PM, Arnau wrote:
> > > > Hi,
> > > >
> > > > _maybe_ the OOM killer killed the job? A look at messages will
> > > > give you an answer (I've seen this in my cluster).
> > > >
> > > > HTH,
> > > > Arnau
> > > >
> > > > On Tue, 14 May 2019 at 12:37, hiller
> > > > (<hil...@mpia-hd.mpg.de>) wrote:
> > > >
> > > > Dear all,
> > > > i have a problem that jobs sent to gridengine randomly die.
> > > > The gridengine version is 8.1.9
> > > > The OS is opensuse 15.0
> > > > The gridengine messages file says:
> > > > 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
> > > > 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
> > > >
> > > > qacct -j 635659 says:
> > > > failed       100 : assumedly after job
> > > > exit_status  137 (Killed)
> > > >
> > > >
> > > > There was no kill triggered by the user.
> > > > Also there are no other limitations, neither ulimit nor in the
> > > > gridengine queue.
> > > > The 'qconf -sq all.q' command gives:
> > > > s_rt      INFINITY
> > > > h_rt      INFINITY
> > > > s_cpu     INFINITY
> > > > h_cpu     INFINITY
> > > > s_fsize   INFINITY
> > > > h_fsize   INFINITY
> > > > s_data    INFINITY
> > > > h_data    INFINITY
> > > > s_stack   INFINITY
> > > > h_stack   INFINITY
> > > > s_core    INFINITY
> > > > h_core    INFINITY
> > > > s_rss     INFINITY
> > > > h_rss     INFINITY
> > > > s_vmem    INFINITY
> > > > h_vmem    INFINITY
> > > >
> > > > Years ago there were some threads about the same issue, but
> > > > i did not find a solution.
> > > >
> > > > Does somebody have a hint what i can do or check/debug?
> > > >
> > > > With kind regards and many thanks for any help, ulrich
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
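
For reference, the terabyte-second arithmetic in William's reply can be checked with a short Python sketch. It is only a rough illustration: it assumes binary units (1 MB = 2**20 bytes, 1 TB = 2**40 bytes), which the thread does not confirm, and the helper names are hypothetical. Dividing the mem figure by ru_wallclock to get an average is likewise an approximation, since the thread does not say which time base the integral uses.

```python
# Rough check of the "mem" arithmetic discussed in the thread.
# Assumption: binary units; Grid Engine's exact convention is not stated here.
MB = 2**20
TB = 2**40


def seconds_per_tb_second(resident_bytes):
    """Seconds a job holding resident_bytes needs to accumulate 1 TB-second."""
    return TB / resident_bytes


def average_bytes(mem_tb_seconds, duration_seconds):
    """Average memory implied by an integrated mem figure over a duration."""
    return mem_tb_seconds * TB / duration_seconds


if __name__ == "__main__":
    # A constant 512 MB footprint: about 2000 seconds per TB-second.
    print(seconds_per_tb_second(512 * MB))
    # Job 635659: mem 7.463 TBs over ru_wallclock 19486 s, printed in MB.
    print(average_bytes(7.463, 19486) / MB)
```

Under these assumptions the job averaged roughly 400 MB, which is consistent with the reported maxvmem of 532.004MB and far below the 188G of RAM on the node.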