Exit code 137 means the process was killed with SIGKILL (137 = 128 + 9). Here the most likely cause is that it exceeded the time limit, and Google is your best friend if you have similar issues. The solution is to check the default time limit of your shell.
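
To make the arithmetic concrete, here is a minimal sketch in plain sh. The function name decode_exit is made up for illustration, and the sketch assumes the usual shell convention that a status above 128 means "128 + signal number" (a program could also return such a value on its own). The two ulimit lines print the soft and hard CPU-time limits of the current shell in seconds, or "unlimited".

```shell
#!/bin/sh
# decode_exit is a hypothetical helper, not part of Grid Engine.
# Assumption: a status above 128 encodes "128 + signal number".
decode_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l N prints the name of signal N, without the SIG prefix
        name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
        echo "killed by signal $sig ($name)"
    else
        echo "exited on its own with status $status"
    fi
}

decode_exit 137

# CPU-time limits of this shell, in seconds (or "unlimited")
echo "soft CPU limit: $(ulimit -St)"
echo "hard CPU limit: $(ulimit -Ht)"
```

Limits seen by a job can differ from those of a login shell, so it is probably more informative to run the ulimit lines from inside the job script itself.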
-Ron

************************************************************************
Open Grid Scheduler - the official open source Grid Engine:
http://gridscheduler.sourceforge.net/

________________________________
From: Jake Carroll <jake.carr...@uq.edu.au>
To: "users@gridengine.org" <users@gridengine.org>
Sent: Sunday, January 13, 2013 6:56 PM
Subject: [gridengine users] Error 137 - trying to figure out what it means.

Hi all.

We're trying to figure out the answer to a problem that is escaping us. We can usually solve most of these issues ourselves, but this one we're having trouble trapping, and we can't find any solid answers after a lot of looking around online.

One of our quite capable users [read: he rarely needs our help with grid engine] has an unusual issue with certain jobs (seemingly randomly?) crashing out on error 137. The code is predominantly C++ based, running atop SGE 6.2u5 on the ROCKS clusters platform.

What is making it hard for us is that these array-based jobs (no PEs/parallel environments and no explicit mpi/mpich in use) only crash sometimes. Some fail and others don't. It seems almost quasi-random.

The code is written in Fortran and compiled with Intel's ifort, using standard code optimisation (compile flag -O2). However, the code was also compiled with optimisation turned off and with traceback and error reporting turned on; in both cases programs failed and no run-time error was printed. The same code was also compiled with gfortran and also produced error '137'.

The code has run successfully numerous times, but it does something slightly different each time due to random sampling and different model specifications. There are 20 jobs because analyses are run across 20 replicates of a simulation. Previously our user had no problems running these 20 replicates across 11 different models (20x11=220 runs).

Some specifics:

Array job. Memory allocation is 20GB, and the job uses less than 14GB.
Submitted through a shell script with "qsub test.sh", where test.sh looks like:

-------------------------------------------------------
#$ -cwd
#$ -l vf=20G
#$ -N b1_set12_1
#$ -m eas
#$ -M someu...@somedomain.com
/path/to/some/stuff/here/bayesRsim <b1_set12_1.par
-------------------------------------------------------

Intel's default is 'static compiling', from what we understand; in any case, no external libraries are used (although Intel uses its own MKL library).

We can't see any obvious memory starvation issues or resource contention problems.

Do you have any suggestions on things we could look at to trap this? The error 137 material online, after looking around a little, seems sparse at best.

Any help would be appreciated.

--JC

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users