Re: [gridengine users] Error 137 - trying to figure out what it means.

Jake Carroll Wed, 16 Jan 2013 03:36:33 -0800

Hi.

Interesting.


From /opt/gridengine/default/spool/compute-0-4/messages, we are seeing
some unusual stuff (or, maybe it is entirely run of the mill?):

01/16/2013 18:05:50|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:07:56|  main|compute-0-4|E|removing unreferenced job 1350379.4111 without job report from ptf
01/16/2013 18:08:09|  main|compute-0-4|E|removing unreferenced job 1350379.4123 without job report from ptf
01/16/2013 18:09:41|  main|compute-0-4|E|removing unreferenced job 1350379.4165 without job report from ptf
01/16/2013 18:13:21|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:14:05|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:21:29|  main|compute-0-4|E|removing unreferenced job 1350379.4555 without job report from ptf
01/16/2013 18:24:44|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:27:26|  main|compute-0-4|E|removing unreferenced job 1350379.4825 without job report from ptf
01/16/2013 18:36:38|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:41:41|  main|compute-0-4|E|removing unreferenced job 1350379.5401 without job report from ptf
01/16/2013 18:42:49|  main|compute-0-4|E|removing unreferenced job 1350379.5455 without job report from ptf
01/16/2013 18:44:30|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:51:56|  main|compute-0-4|E|removing unreferenced job 1350379.5857 without job report from ptf
01/16/2013 18:52:19|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:53:49|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:57:10|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:58:27|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 18:58:46|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 19:00:25|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
01/16/2013 19:20:56|  main|compute-0-4|E|removing unreferenced job 1350379.7048 without job report from ptf
01/16/2013 19:29:55|  main|compute-0-4|W|reaping job "1350379" ptf complains: Job does not exist
(END) 
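
To see at a glance how these entries split up, something along the
following lines could tally warnings and errors by message type. This is
only a sketch: it assumes the messages file on disk holds one entry per
line with '|'-separated fields (date, thread, host, severity, text), and
the function name summarise_messages is made up for illustration. Counted
by hand, the excerpt above comes to 13 "reaping job" warnings and 9
"removing unreferenced job" errors, all for job 1350379.

```shell
#!/bin/sh
# Summarise an execd messages file: count W and E entries per message type.
# Assumption: one entry per line, laid out as
#   date time|thread|host|severity|message text

summarise_messages() {
    # $1: path to a messages file
    awk -F'|' '
        $4 == "W" || $4 == "E" {
            msg = $5
            # Collapse job ids and job.task ids so that entries of the
            # same type are counted together.
            gsub(/[0-9]+(\.[0-9]+)?/, "N", msg)
            count[$4 "|" msg]++
        }
        END {
            for (key in count) printf "%d|%s\n", count[key], key
        }
    ' "$1" | sort -t'|' -k1,1nr
}
```

It would be called as, for example,
summarise_messages /opt/gridengine/default/spool/compute-0-4/messages
and prints one line per message type, most frequent first.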

So, I immediately jump back to thinking that hard runtime limits are being
reached, but we don't use them as defaults at all!
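
As a sanity check on the runtime-limit theory, the numbers already quoted
further down this thread can be compared directly: job 1325823 requested
h_rt=24:00:00 and qacct reports ru_wallclock 9301. A minimal sketch of
that arithmetic (the helper name hms_to_seconds is purely illustrative):

```shell
#!/bin/sh
# Convert an h:m:s time limit into seconds and compare it with the
# wall-clock seconds that qacct reports as ru_wallclock.

hms_to_seconds() {
    # $1: limit written as hours:minutes:seconds, e.g. 24:00:00
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

limit=$(hms_to_seconds 24:00:00)   # h_rt requested in b5_set11_2.sh
used=9301                          # ru_wallclock of job 1325823

echo "limit=${limit}s used=${used}s"
if [ "$used" -lt "$limit" ]; then
    echo "wall clock was below h_rt"
fi
```

That is 86400 s allowed against 9301 s used (about 2 h 35 min, which
matches the start_time and end_time in the same record), so the job-level
limit was nowhere near being hit.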


[root@cluster ~]# qconf -sq `qconf -sql` | grep [hs]_|sort -u
epilog                /share/apps/file_trans_eplg.sh
h_core                INFINITY
h_cpu                 INFINITY
h_data                INFINITY
h_fsize               INFINITY
h_rss                 INFINITY
h_rt                  INFINITY
h_stack               INFINITY
h_vmem                INFINITY
prolog                /share/apps/file_trans_prlg.sh
s_core                INFINITY
s_cpu                 INFINITY
s_data                INFINITY
s_fsize               INFINITY
s_rss                 INFINITY
s_rt                  INFINITY
s_stack               INFINITY
s_vmem                INFINITY
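
A side note on the filter itself: the epilog and prolog lines show up only
because their paths contain "s_" (file_trans_...), and an unquoted [hs]_
can be expanded by the shell if a matching file name exists in the current
directory. A slightly tighter variant, as a sketch (filter_limits is a
made-up helper name; qconf -sq and qconf -sql are used exactly as above):

```shell
#!/bin/sh
# Keep only the limit attributes (h_* and s_*) from queue configuration
# output. Quoting the pattern stops the shell from treating [hs]_ as a
# glob, and the ^ anchor drops lines that merely contain "h_" or "s_"
# somewhere in a path.

filter_limits() {
    grep -E '^[hs]_' | sort -u
}

# Intended use on the cluster (needs qconf in PATH), not run here:
#   qconf -sq $(qconf -sql) | filter_limits
```

With the output above as input, this would leave just the sixteen h_* and
s_* lines, all INFINITY.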



Reuti: I recall that way back in October 2011 you saw an issue like this,
where jobs simply started disappearing on a user who posted about it, but I
don't think anyone ever truly got to the bottom of why?

At this point, we're scratching our heads and considering a reboot of the
head node on Friday, as we really don't understand what is going wrong
here.
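
For what it's worth, the number 137 on its own only says so much. By shell
convention (which, as far as I understand, the accounting exit_status
follows), a status above 128 means the process was terminated by signal
(status - 128), so 137 corresponds to signal 9, SIGKILL. That fits the
"Killed" line in the job's stderr file, but it does not say who sent the
signal: an enforced limit, a qdel, or the kernel's OOM killer would all
look the same from here. A small sketch for decoding such values (the
function name decode_exit_status is illustrative):

```shell
#!/bin/sh
# Decode a shell-style exit status into a signal or a plain exit code.
# Convention: status > 128 means "terminated by signal (status - 128)".
# Caveat: a program may also call exit() with such a value itself.

decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        name=$(kill -l "$sig" 2>/dev/null) || name=unknown
        echo "killed by signal $sig (SIG$name)"
    else
        echo "exited with code $status"
    fi
}

decode_exit_status 137
```

For 137 this reports signal 9, which cannot be caught or ignored by the
job, so nothing the program itself logs will explain it.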

Thoughts?

Thanks!

---JC



On 15/01/13 8:24 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:

>Hi,
>
>Am 14.01.2013 um 23:08 schrieb Jake Carroll:
>
>> So we tried hard-setting a different wall-time for the specific user
>> who's experiencing the Exit 137 issue. The jobs are still failing,
>> however.
>
>is there any message about the kill signal in the spooling directory's
>messages file of the node, i.e.:
>
>/opt/gridengine/default/spool/compute-0-4/messages (search for the job id)
>
>-- Reuti
>
>
>> Here is one job that was killed even though it included the wall-time
>> setting. Obviously the job did not run for 24h; inputs and outputs are
>> shown below.
>> 
>> --------
>> - qsub b5_set11_2.sh
>> 
>> 
>> 
>> - b5_set11_2.sh:
>> 
>> #$ -cwd 
>> #$ -l h_rt=24:00:00
>> 
>> #$ -l vf=20G
>> #$ -N b5_set11_2
>> #$ -m eas
>> #$ -M someguy@somewhere
>> /blah/blah/blah/bayesRsim <b5_set11_2.par
>> 
>> 
>> - cat b5_set11_2.e1325823:
>> /opt/gridengine/default/spool/compute-0-4/job_scripts/1325823: line 7:
>> 8117 Killed                  /blah/blah/blag/bayesRsim < b5_set11_2.par
>> 
>> 
>> - qacct -j 1325823
>> ==============================================================
>> qname        medium.q
>> hostname     compute-0-4.local
>> group        users
>> owner        someguy
>> project      NONE
>> department   defaultdepartment
>> jobname      b5_set11_2
>> jobnumber    1325823
>> taskid       undefined
>> account      sge
>> priority     0  
>> qsub_time    Mon Jan 14 15:36:49 2013
>> start_time   Mon Jan 14 15:36:55 2013
>> end_time     Mon Jan 14 18:11:56 2013
>> granted_pe   NONE
>> slots        1  
>> failed       0  
>> exit_status  137
>> ru_wallclock 9301
>> ru_utime     9262.906
>> ru_stime     7.916
>> ru_maxrss    13820636
>> ru_ixrss     0  
>> ru_ismrss    0  
>> ru_idrss     0  
>> ru_isrss     0  
>> ru_minflt    46056
>> ru_majflt    26 
>> ru_nswap     0  
>> ru_inblock   392840
>> ru_oublock   32 
>> ru_msgsnd    0  
>> ru_msgrcv    0  
>> ru_nsignals  0  
>> ru_nvcsw     536
>> ru_nivcsw    30791
>> cpu          9270.822
>> mem          61688.906
>> io           0.430
>> iow          0.000
>> maxvmem      13.302G
>> arid         undefined
>> 
>> So, you mentioned "default time limit of your shell". My googling
>> suggested setting a wall time limit, or having the user specify the
>> wall time, but that did not help. A few Google searches show the use of
>> a global time limit for jobs in general, but make no reference to a
>> default time limit of the shell. Am I supposed to be looking at limits
>> such as s_rt and h_rt? If so, how do I manipulate these for the
>> specific user? The queue_conf man page makes some reference to this,
>> but it doesn't explain explicitly how to manipulate them globally or
>> on a per-user basis with any reference to defaults or "shell".
>> 
>> Sorry - just stumbling through this and not finding it too intuitive.
>> 
>> 
>> --JC
>> 
>> 
>> 
>> 
>> On 14/01/13 10:34 AM, "Ron Chen" <ron_chen_...@yahoo.com> wrote:
>> 
>>> Exit code 137 = process was killed because it exceeded the time limit,
>>> and Google is your best friend if you have similar issues - and the
>>> solution is to check the default time limit of your shell.
>>> 
>>> -Ron
>>> 
>>> 
>>> 
>>>************************************************************************
>>> 
>>> Open Grid Scheduler - the official open source Grid Engine:
>>> http://gridscheduler.sourceforge.net/
>>> 
>>> 
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Jake Carroll <jake.carr...@uq.edu.au>
>>> To: "users@gridengine.org" <users@gridengine.org>
>>> Sent: Sunday, January 13, 2013 6:56 PM
>>> Subject: [gridengine users] Error 137 - trying to figure out what it
>>> means.
>>> 
>>> 
>>> Hi all.
>>> 
>>> We're trying to figure out the answer to a problem that is escaping us.
>>> We can usually solve most of these issues ourselves, but we're having
>>> problems trapping this one and can't find any solid answers after a
>>> lot of looking around in online resources.
>>> 
>>> One of our quite capable users [read: he rarely needs our help with
>>> grid engine] has an unusual issue with certain jobs (seemingly
>>> randomly?) crashing out on error 137. The code is predominantly C++
>>> based, running atop SGE 6.2u5 on the ROCKS cluster platform. What is
>>> making it hard for us is that these array-based jobs (no PEs/parallel
>>> environments and no MPI/MPICH explicitly in use) only crash
>>> sometimes. Some, and not others. It seems almost quasi-random.
>>> 
>>> The code is written in Fortran and compiled with Intel's ifort, using
>>> standard code optimisation (compile flag -O2). The code was also
>>> compiled with optimisation turned off and traceback and error
>>> reporting turned on; in both cases the program failed and no run-time
>>> error was printed. The same code compiled with gfortran also produced
>>> error '137'.
>>> 
>>> The code has run successfully numerous times, but it does something
>>> slightly different each time due to random sampling and different
>>> model specifications. There are 20 jobs because analyses are run
>>> across 20 replicates of a simulation. Previously our user had no
>>> problems running these 20 replicates across 11 different models
>>> (20x11=220 runs).
>>> 
>>> Some specifics:
>>> 
>>> Array job. Memory allocation is 20GB, and the job uses less than 14GB.
>>> 
>>> Submitted through a shell script (qsub test.sh), where test.sh looks
>>> like:
>>> 
>>> -------------------------------------------------------
>>> #$ -cwd 
>>> #$ -l vf=20G
>>> #$ -N b1_set12_1
>>> #$ -m eas
>>> #$ -M someu...@somedomain.com
>>> /path/to/some/stuff/here/bayesRsim <b1_set12_1.par
>>> 
>>> 
>>> -------------------------------------------------------
>>> 
>>> Intel's default is 'static compiling' from what we understand; in any
>>> case, no external libraries are used (although Intel uses its own MKL
>>> library).
>>> 
>>> 
>>> We can't see any obvious memory starvation issues or resource
>>> contention problems. Do you have any suggestions for things we could
>>> look at to trap this? After looking around a little, the information
>>> online about error 137 seems sparse at best.
>>> 
>>> Any help would be appreciated.
>>> 
>>> --JC
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> 
>> 
>


