Re: [gridengine users] Error 137 - trying to figure out what it means.

Ron Chen Sun, 13 Jan 2013 16:35:30 -0800

Exit code 137 = 128 + 9, i.e. the process was killed with SIGKILL, most likely 
because it exceeded a limit such as the time limit. Google is your best friend 
if you have similar issues. The solution is to check the default limits of 
your shell.
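
For reference, a minimal sketch of decoding such a status, assuming the usual 
shell convention that a status above 128 means "killed by signal (status - 128)". 
The helper name describe_exit is made up for illustration and is not part of 
Grid Engine:

```shell
# Hypothetical helper: explain a shell exit status.
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l N prints the name of signal number N without the SIG prefix
        name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
        echo "killed by signal $sig (SIG$name)"
    else
        echo "exited with status $status"
    fi
}

describe_exit 137   # prints: killed by signal 9 (SIGKILL)
```

This only says which signal ended the process, not who sent it; a scheduler 
limit, a shell limit or the kernel's out-of-memory killer can all deliver 
SIGKILL.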


 -Ron


************************************************************************

Open Grid Scheduler - the official open source Grid Engine: 
http://gridscheduler.sourceforge.net/





________________________________
From: Jake Carroll <jake.carr...@uq.edu.au>
To: "users@gridengine.org" <users@gridengine.org> 
Sent: Sunday, January 13, 2013 6:56 PM
Subject: [gridengine users] Error 137 - trying to figure out what it means.


Hi all.

We're trying to figure out the answer to a problem that is escaping us. We can 
usually solve most of these issues ourselves, but we're having trouble 
trapping this one and can't find any solid answers after a lot of looking 
around online.

One of our quite capable users [read: he rarely needs our help with grid 
engine] has an unusual issue with certain jobs (seemingly randomly?) crashing 
out on error 137. The code is predominantly C++ based running atop SGE 6.2u5 on 
the ROCKS clusters platform. What is making it hard for us is that only some 
of these array-based jobs (no PEs/parallel environments and no explicit 
MPI/MPICH in use) crash, and not others. It seems almost quasi-random.

The code is written in Fortran, compiled with Intel's ifort using standard 
code optimisation (compile flag -O2). However, the code was also compiled with 
optimisation turned off and traceback and error reporting turned on, and in 
both cases the programs failed and no run-time error was printed. The same code 
was also compiled with gfortran and also produced error '137'.

The code has run successfully numerous times, but does something slightly 
different each time due to random sampling and different model specifications. 
There are 20 jobs because analyses are run across 20 replicates of a 
simulation. Previously our user had no problems running these 20 replicates 
across 11 different models (20x11=220 runs).

Some specifics:

Array job. Memory allocation is 20GB, and the job uses less than 14GB.

Submitted through a shell script with qsub test.sh, where test.sh looks like:

-------------------------------------------------------
#$ -cwd 
#$ -l vf=20G
#$ -N b1_set12_1
#$ -m eas
#$ -M someu...@somedomain.com
/path/to/some/stuff/here/bayesRsim <b1_set12_1.par
       
-----------------------------------------------------------------------------------------------------------------
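
One way to trap this might be to have the job script record where it ran, 
which limits were in force and the raw exit status. A minimal sketch, assuming 
a POSIX shell on the execution host; the function name run_logged is made up 
for illustration:

```shell
# Sketch: log context before running a command, then report its exit status.
run_logged() {
    echo "host: $(uname -n)"      # which execution node ran this task
    echo "limits in effect:"
    ulimit -a                     # shell resource limits at run time
    status=0
    "$@" || status=$?             # run the real command, keep its status
    echo "exit status: $status"
    return "$status"
}

# In test.sh this would wrap the existing command line, e.g.:
# run_logged /path/to/some/stuff/here/bayesRsim <b1_set12_1.par
```

The extra lines end up in the job's output file, so failed and successful 
array tasks can be compared by host and by limits afterwards.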

Intel's default is 'static compiling' from what we understand; in any case, no 
external libraries are used (although Intel uses its own MKL library).


We can't see any obvious memory starvation issues or resource contention 
problems. Do you have any suggestions for things we could look at to trap this? 
The error 137 material online, after looking around a little, seems sparse at 
best.

Any help would be appreciated.

--JC
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users     
