Hi,
We've been using Grid (currently Son of Grid Engine 8.0.0b) for about 9
months now, but the person who knew it best recently left, so I'm spending
a lot of my time trying to get my head around things.
I'm trying to get Grid to respond correctly to job exit codes (99 for retry,
100 for error).
It seems that if I use 'qsub' to submit a job, it works fine, but if I
submit the same job through DRMAA, Grid doesn't act on the exit code.
The script I've been running is purely the following:
sleep 5
exit 100
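The test script above always exits 100. For context, the behaviour I'm ultimately after in real job scripts looks roughly like the sketch below. It assumes the usual Grid Engine convention (99 means reschedule, 100 means put the job into the error state); the function name and the failure categories are made up purely for illustration.

```shell
#!/bin/sh
# Sketch only: map a failure category to the exit code Grid Engine acts on.
# exit_code_for and its categories are hypothetical, not part of Grid Engine.
exit_code_for() {
    case "$1" in
        ok)        echo 0 ;;    # job finished normally
        transient) echo 99 ;;   # ask Grid to reschedule (retry) the job
        fatal)     echo 100 ;;  # ask Grid to put the job into the error state
        *)         echo 1 ;;    # anything else: plain failure, no special handling
    esac
}

# Intended use at the end of a job script (commented out here so the
# sketch can be sourced without terminating the shell):
#   exit "$(exit_code_for fatal)"
```

With this in place, a job script would end with something like `exit "$(exit_code_for transient)"` when it hits a failure it wants retried.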
When I submitted it using DRMAA (via our own internal interface), I checked
the DRMAA job template and found a load of command-line args in
'nativeSpecification'. I used these to replicate, as closely as possible, the
same command in qsub.
This is the qsub command line that I used:
qsub -w n -l clamp=0,arch=lx-amd64 -p -800 -shell yes -js 0 -P myproject -V \
    -q default -R yes -pe XX 1-1 -wd `pwd` -j y ./test_exit.sh
I find that the DRMAA jobs disappear from Grid as soon as they finish.
However, when the qsub-submitted jobs finish, they get (correctly) marked
as Errored.
I grabbed the 'qstat -j <jobid>' output for a job submitted each way. The
only differences were the job name, and that the DRMAA job has the
following line in it:
stdout_path_list: NONE:myhostname:/jobs/myproject/seq/shot/work/app/logs
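In case anyone wants to repeat the comparison, this is roughly how it can be done on saved 'qstat -j' output. It is only a sketch: the function name is made up, and the list of fields filtered out (job_number, job_name, submission_time) is my assumption about what is expected to differ between any two jobs.

```shell
#!/bin/sh
# Sketch only: diff two saved 'qstat -j <jobid>' dumps, ignoring fields
# that are expected to differ between any two jobs (assumed list).
diff_job_dumps() {
    # $1, $2: files containing saved 'qstat -j <jobid>' output
    ignore='^(job_number|job_name|submission_time):'
    a=$(mktemp)
    b=$(mktemp)
    grep -v -E "$ignore" "$1" > "$a"
    grep -v -E "$ignore" "$2" > "$b"
    diff "$a" "$b"
    status=$?          # 0 = identical, 1 = differences found
    rm -f "$a" "$b"
    return $status
}

# Example use (the job ids are placeholders):
#   qstat -j 1001 > qsub_job.txt
#   qstat -j 1002 > drmaa_job.txt
#   diff_job_dumps qsub_job.txt drmaa_job.txt
```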
Any ideas what might be going on here? It'd be good to get my jobs to
error/retry properly.
Cheers
Hugh Macdonald
nvizible – VISUAL EFFECTS
www.nvizible.com