Thanks for all of the helpful responses.  It turns out that this was simply a 
memory limit problem.  Apparently the combination of packages loaded by this 
Python script meant that, even though I was only trying to show the help page, 
it needed more than 3GB of RAM to be allocated before it would run.
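
For anyone who hits the same thing, the immediate workaround is just to ask for 
more memory at submission time.  This is a sketch rather than exactly what we 
ran, and it assumes h_vmem is the resource being enforced on your queue:

qsub -l h_vmem=4G -o test.log -cwd -V -j y -b y htseq-count

(qconf -sq <queue_name> will show whether the queue itself sets an h_vmem 
limit.)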

A few useful things that I learned whilst debugging this:

Normally when jobs fail because of memory problems I'd expect to see high 
memory usage reported by qacct -j [job_id].  In this case, though, it reported 
zero memory used.  It looks like qacct only records memory that was 
successfully allocated, and here the failure was an initial large allocation, 
so for jobs which don't grow their memory usage gradually you might not see 
anything useful there.
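
For reference, the fields I was looking at were maxvmem and mem (alongside 
failed and exit_status).  Something like this pulls them out, though the exact 
field names may vary with your Grid Engine version:

qacct -j [job_id] | egrep 'maxvmem|^mem|failed|exit_status'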

I also half expected that a program failing to allocate memory would produce 
some kind of error message to that effect (some certainly do, as we've seen 
before), but in this case it just segfaulted immediately.
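
A handy way to reproduce this kind of failure interactively, rather than going 
through the scheduler each time, is to impose a similar cap with ulimit first.  
A rough sketch, assuming bash and a limit of around 3GB (ulimit -v takes 
kilobytes):

ulimit -v 3000000
htseq-count

That should fail in much the same way as the batch job did.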

The way I eventually diagnosed this was to switch the command to:

qsub -o test.log strace htseq-count

...which gave me the sequence of system calls the program made up to the point 
where it failed.  The end of the log said:

mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a39ac5000
mprotect(0x2b3a39ac5000, 4096, PROT_NONE) = 0
clone(child_stack=0x2b3a39cc4ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b3a39cc59d0, tls=0x2b3a39cc5700, child_tidptr=0x2b3a39cc59d0) = 61710
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a3bcc6000
mprotect(0x2b3a3bcc6000, 4096, PROT_NONE) = 0
clone(child_stack=0x2b3a3bec5ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b3a3bec69d0, tls=0x2b3a3bec6700, child_tidptr=0x2b3a3bec69d0) = 61711
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a57ec9000
+++ killed by SIGSEGV (core dumped) +++

So I could see that it was trying to map memory at the point where it died, 
which told me where to start playing.
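
If anyone wants a less noisy version of the strace trick, the output can be 
restricted to memory-related calls and made to follow the threads the 
interpreter spawns.  I haven't re-run it this way, but it should be along the 
lines of:

qsub -o test.log strace -f -e trace=memory htseq-count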

Thanks again for the help.  I've learned some new techniques to try for next 
time!

Simon.


-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Skylar Thompson
Sent: 04 May 2018 15:10
To: [email protected]
Subject: Re: [gridengine users] Debugging crash when running program through 
GridEngine

Do you have any memory limits (in particular, h_vmem) imposed on your batch 
jobs?

On Fri, May 04, 2018 at 01:45:24PM +0000, Simon Andrews wrote:
> I've got a strange problem on our cluster where some python programs are 
> segfaulting when run through qsub, but work fine on the command line, or even 
> if run remotely through SSH.
>
> Really simple (hello world) programs work OK, but anything which does
> a significant amount of imports seems to fail.  So for example;
>
> htseq-count
>
> works locally, but
>
> qsub -o test.log -cwd -V -j y -b y htseq-count
>
> Produces a segfault in the executed program.
>
> ssh compute-0-0 htseq-count
>
> ..works fine (we're using ssh to launch jobs on our cluster)
>
> Any suggestions for how to go about trying to track this down?
>
> Thanks
>
> Simon.
>

--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine 