Thanks for all of the helpful responses. It turns out that this was simply a memory limit problem. Apparently the combination of packages loaded by this python script meant that, even though I was only trying to show the help page, it needed more than 3GB of RAM allocated before it would run.
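For anyone who hits the same thing, something along these lines should show where the limit is coming from and let you ask for more. The all.q queue name, the 12345 job id and the 4G figure are just placeholders for your own setup:

# How h_vmem is defined cluster-wide (consumable, default value etc.)
qconf -sc | grep h_vmem

# Any per-queue memory limit (queue name is a placeholder)
qconf -sq all.q | grep vmem

# The resource list a queued or running job was submitted with (job id is a placeholder)
qstat -j 12345 | grep resource_list

# Re-submit with an explicit, larger per-slot memory request
qsub -l h_vmem=4G -o test.log -cwd -V -j y -b y htseq-count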
A few useful things that I learned whilst debugging this:

Normally when jobs fail because of memory problems I'd expect to see high memory usage reported in qacct -j [job_id]. In this case, though, it reported zero memory used. It looks like qacct only records memory which was successfully allocated, and here the very first large allocation was the one that failed, so for jobs which don't grow their memory usage gradually you may not see anything useful there.

I also expected that a program failing to allocate memory would report an error to that effect (some certainly do, as we've seen before), but in this case it just segfaulted immediately.

The way I eventually diagnosed it was to switch the command to:

qsub -o test.log strace htseq-count

..which gave me the system calls the program made up to the point where it failed. The end of that log read:

mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a39ac5000
mprotect(0x2b3a39ac5000, 4096, PROT_NONE) = 0
clone(child_stack=0x2b3a39cc4ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b3a39cc59d0, tls=0x2b3a39cc5700, child_tidptr=0x2b3a39cc59d0) = 61710
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a3bcc6000
mprotect(0x2b3a3bcc6000, 4096, PROT_NONE) = 0
clone(child_stack=0x2b3a3bec5ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b3a3bec69d0, tls=0x2b3a3bec6700, child_tidptr=0x2b3a3bec69d0) = 61711
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a57ec9000
+++ killed by SIGSEGV (core dumped) +++

So I could see that it died while trying to map more memory, which told me where to start looking. (I've put a slightly tidier version of that strace command, plus the qacct check, below the quoted message for anyone who wants to try it.)

Thanks again for the help. I've learned some new techniques to try next time!

Simon.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Skylar Thompson
Sent: 04 May 2018 15:10
To: [email protected]
Subject: Re: [gridengine users] Debugging crash when running program through GridEngine

Do you have any memory limits (in particular, h_vmem) imposed on your batch jobs?

On Fri, May 04, 2018 at 01:45:24PM +0000, Simon Andrews wrote:
> I've got a strange problem on our cluster where some python programs are
> segfaulting when run through qsub, but work fine on the command line, or even
> if run remotely through SSH.
>
> Really simple (hello world) programs work OK, but anything which does
> a significant amount of imports seems to fail. So for example;
>
> htseq-count
>
> works locally, but
>
> qsub -o test.log -cwd -V -j y -b y htseq-count
>
> Produces a segfault in the executed program.
>
> ssh compute-0-0 htseq-count
>
> ..works fine (we're using ssh to launch jobs on our cluster)
>
> Any suggestions for how to go about trying to track this down?
>
> Thanks
>
> Simon.
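As promised above, here's the tidier version of the diagnostics. The 12345 job id is a placeholder and strace.out is just a name I picked; strace's -f option follows the child threads (the clone() calls in the log above) and -o keeps the trace out of the job log:

# Check what accounting actually recorded; maxvmem stays near zero when the
# very first big allocation is the one that fails
qacct -j 12345 | grep -E 'maxvmem|failed|exit_status'

# Trace the program under the scheduler, following child threads too (-f),
# and write the trace to its own file (-o) rather than mixing it into test.log
qsub -o test.log -cwd -V -j y -b y strace -f -o strace.out htseq-count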
