Hi. Interesting.
From /opt/gridengine/default/spool/compute-0-4/messages, we are seeing
some unusual stuff (or maybe it is entirely run of the mill?):

01/16/2013 18:05:50| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:07:56| main|compute-0-4|E|removing unreferenced job 1350379.4111 without job report from ptf
01/16/2013 18:08:09| main|compute-0-4|E|removing unreferenced job 1350379.4123 without job report from ptf
01/16/2013 18:09:41| main|compute-0-4|E|removing unreferenced job 1350379.4165 without job report from ptf
01/16/2013 18:13:21| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:14:05| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:21:29| main|compute-0-4|E|removing unreferenced job 1350379.4555 without job report from ptf
01/16/2013 18:24:44| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:27:26| main|compute-0-4|E|removing unreferenced job 1350379.4825 without job report from ptf
01/16/2013 18:36:38| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:41:41| main|compute-0-4|E|removing unreferenced job 1350379.5401 without job report from ptf
01/16/2013 18:42:49| main|compute-0-4|E|removing unreferenced job 1350379.5455 without job report from ptf
01/16/2013 18:44:30| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:51:56| main|compute-0-4|E|removing unreferenced job 1350379.5857 without job report from ptf
01/16/2013 18:52:19| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:53:49| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:57:10| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:58:27| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:58:46| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 19:00:25| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 19:20:56| main|compute-0-4|E|removing unreferenced job 1350379.7048 without job report from ptf
01/16/2013 19:29:55| main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
(END)

So, I immediately jump back to thinking that hard runtime limits are
being reached, but we don't use them as defaults at all!

[root@cluster ~]# qconf -sq `qconf -sql` | grep [hs]_|sort -u
epilog                /share/apps/file_trans_eplg.sh
h_core                INFINITY
h_cpu                 INFINITY
h_data                INFINITY
h_fsize               INFINITY
h_rss                 INFINITY
h_rt                  INFINITY
h_stack               INFINITY
h_vmem                INFINITY
prolog                /share/apps/file_trans_prlg.sh
s_core                INFINITY
s_cpu                 INFINITY
s_data                INFINITY
s_fsize               INFINITY
s_rss                 INFINITY
s_rt                  INFINITY
s_stack               INFINITY
s_vmem                INFINITY

Reuti: I recall way back in October 2011 you saw an issue like this,
where jobs simply started disappearing on a user, who posted about it,
but I don't think anyone truly ever got to the bottom of why?

At this point, we're scratching our heads and considering a reboot of
the head node on Friday, as we really don't understand what is going
wrong here.

Thoughts?

Thanks!

---JC

On 15/01/13 8:24 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:

>Hi,
>
>On 14.01.2013 at 23:08, Jake Carroll wrote:
>
>> So we tested out trying to hard set wall-time different for the specific
>> user who's experiencing the Exit 137 issue. We noticed the jobs are still
>> failing, however.
>
>is there any message about the kill signal in the spooling directory's
>messages file of the node, i.e.:
>
>/opt/gridengine/default/spool/compute-0-4/messages (search for the job id)
>
>-- Reuti
>
>
>> One job that was killed that included the wall-time setting. Obviously the
>> job did not run for 24h, anyway input and outputs shown below.
>>
>> --------
>> - qsub b5_set112.sh
>>
>>
>> - b5_set11_2.sh:
>>
>> #$ -cwd
>> #$ -l h_rt=24:00:00
>>
>> #$ -l vf=20G
>> #$ -N b5_set11_2
>> #$ -m eas
>> #$ -M someguy@somewhere
>> /blah/blah/blah/bayesRsim <b5_set11_2.par
>>
>>
>> - cat b5_set11_2.e1325823:
>> /opt/gridengine/default/spool/compute-0-4/job_scripts/1325823: line 7:
>> 8117 Killed /blah/blah/blag/bayesRsim < b5_set11_2.par
>>
>>
>> - qacct -j 1325823
>> ==============================================================
>> qname        medium.q
>> hostname     compute-0-4.local
>> group        users
>> owner        someguy
>> project      NONE
>> department   defaultdepartment
>> jobname      b5_set11_2
>> jobnumber    1325823
>> taskid       undefined
>> account      sge
>> priority     0
>> qsub_time    Mon Jan 14 15:36:49 2013
>> start_time   Mon Jan 14 15:36:55 2013
>> end_time     Mon Jan 14 18:11:56 2013
>> granted_pe   NONE
>> slots        1
>> failed       0
>> exit_status  137
>> ru_wallclock 9301
>> ru_utime     9262.906
>> ru_stime     7.916
>> ru_maxrss    13820636
>> ru_ixrss     0
>> ru_ismrss    0
>> ru_idrss     0
>> ru_isrss     0
>> ru_minflt    46056
>> ru_majflt    26
>> ru_nswap     0
>> ru_inblock   392840
>> ru_oublock   32
>> ru_msgsnd    0
>> ru_msgrcv    0
>> ru_nsignals  0
>> ru_nvcsw     536
>> ru_nivcsw    30791
>> cpu          9270.822
>> mem          61688.906
>> io           0.430
>> iow          0.000
>> maxvmem      13.302G
>> arid         undefined
>>
>> So, you mentioned "default time limit of your shell". My googling
>> suggested trying to set a wall time limit, or have the user specify the
>> wall time, but that did not help. A few google searches show the use of a
>> global time limit for jobs in general, but make no reference to a default
>> time limit of the shell. Am I supposed to be looking at limits such as
>> s_rt and h_rt? If so, how do I manipulate these for the specific user? The
>> queue_conf man page makes some reference to this, but it doesn't explain
>> explicitly how to manipulate it globally or on a per user basis making
>> reference to defaults or "shell".
>>
>> Sorry - just stumbling through this and not finding it too intuitive.
>>
>>
>> --JC
>>
>>
>> On 14/01/13 10:34 AM, "Ron Chen" <ron_chen_...@yahoo.com> wrote:
>>
>>> Exit code 137 = process was killed because it exceeded the time limit,
>>> and Google is your best friend if you have similar issues - and the
>>> solution is to check the default time limit of your shell.
>>>
>>> -Ron
>>>
>>>
>>>************************************************************************
>>>
>>> Open Grid Scheduler - the official open source Grid Engine:
>>> http://gridscheduler.sourceforge.net/
>>>
>>>
>>> ________________________________
>>> From: Jake Carroll <jake.carr...@uq.edu.au>
>>> To: "users@gridengine.org" <users@gridengine.org>
>>> Sent: Sunday, January 13, 2013 6:56 PM
>>> Subject: [gridengine users] Error 137 - trying to figure out what it
>>> means.
>>>
>>>
>>> Hi all.
>>>
>>> We're trying to figure out the answer to a problem that is escaping us.
>>> We can usually self solve most of these issues, but this one, we're
>>> having problems trapping and can't find any solid answers for after a lot
>>> of looking around on online resources.
>>>
>>> One of our quite capable users [read: he rarely needs our help with grid
>>> engine] has an unusual issue with certain jobs (seemingly, randomly?)
>>> crashing out on error 137. The code is predominantly C++ based running
>>> atop SGE 6.2u5 on the ROCKS clusters platform. What is making it hard for
>>> us is that these array based jobs (non PE's/parallel
>>> environments and no mpi/mpich explicit in use) are only crashing
>>> sometimes. Some, and not others. It seems almost quasi-random.
>>>
>>> The code is written in Fortran compiled with Intel's ifort, using standard
>>> code optimisation (compile flag -O2).
>>> However, the code is also compiled
>>> with optimisation turned off and traceback and error reporting turned on,
>>> and in both cases programs failed and no run-time error was printed. The
>>> same code was also compiled with gfortran and did also produce error
>>> '137'.
>>>
>>> The code ran successfully numerous times, but is doing something slightly
>>> different each time due to random sampling and different model
>>> specifications. There are 20 jobs because analyses are run across 20
>>> replicates of a simulation. Previously our user had
>>> no problems running these 20 replicates across 11 different models
>>> (20x11=220 runs).
>>>
>>> Some specifics:
>>>
>>> Array job.
>>> Memory allocation is 20GB, and the job uses less than 14GB.
>>>
>>> Submitted through a shell script qsub test.sh, where test.sh looks like:
>>>
>>> -------------------------------------------------------
>>> #$ -cwd
>>> #$ -l vf=20G
>>> #$ -N b1_set12_1
>>> #$ -m eas
>>> #$ -M someu...@somedomain.com
>>> /path/to/some/stuff/here/bayesRsim <b1_set12_1.par
>>>
>>> -------------------------------------------------------
>>>
>>> Intel's default is 'static compiling' from what we understand, in any
>>> case no external libraries are used (although Intel uses its own MKL
>>> library).
>>>
>>>
>>> We can't see any obvious memory starvation issues or resource contention
>>> problems. Do you have any suggestions on things we could look at to trap
>>> this? The error 137 stuff online, after looking around a little, seems
>>> sparse at best.
>>>
>>> Any help would be appreciated.
>>>
>>> --JC
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
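
A note on reading the exit status discussed in this thread: by shell convention, a status above 128 means the process was terminated by a signal, and the signal number is the status minus 128. On that reading, 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" line in the job's .e file. The number alone does not say who sent the signal; a runtime limit, the kernel's out-of-memory killer, or a manual kill would all look the same. The following is only a minimal sketch of the decoding; the function name decode_exit_status is made up for illustration and is not part of Grid Engine.

```shell
#!/bin/sh
# Decode a shell exit status such as the exit_status field reported by qacct.
# decode_exit_status is a hypothetical helper, not a Grid Engine command.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Statuses above 128 conventionally mean
        # "terminated by signal number (status - 128)".
        sig=$((status - 128))
        # kill -l <number> prints the name of that signal.
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited on its own with status $status"
    fi
}

decode_exit_status 137
```

On Linux this should report signal 9 (KILL) for 137. For a finished job, the value to pass in would be the exit_status line from the qacct -j output quoted earlier in the thread.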