IIRC, the memory data that qstat reports is sampled, so if you have a job that changes its memory usage quickly and then dies, you're not necessarily going to get accurate information.
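For reference, the two places to look, roughly (the job ID is a placeholder,
and exact field names can vary a bit between SGE/UGE versions):

    # Sampled usage for a running job (the "usage" line)
    qstat -j <job_id> | grep usage

    # Accounting record once the job has finished; maxvmem here is also
    # derived from periodic samples, so a fast allocate-and-die can still
    # show up as (close to) zero
    qacct -j <job_id> | grep -E 'maxvmem|exit_status|failed'

There's a fuller sketch of the strace/h_vmem sequence from the thread at the
end of this message.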
Glad you figured it out!

On Tue, May 08, 2018 at 08:44:21AM +0000, Simon Andrews wrote:
> Thanks for all of the helpful responses. It turns out that this was simply a
> memory limit problem. Apparently the combination of packages being loaded by
> this python script meant that even though I was only trying to show the help
> page it required more than 3GB of RAM allocation to allow it to run.
>
> A few useful things that I learned whilst debugging this.
>
> Normally when jobs fail because of memory problems I'd expect to see a high
> memory usage in qacct -j [job_id]. In this case though that reported zero
> memory taken. It looks like it only reports the amount of memory
> successfully allocated and this was an initial failure of a large allocation,
> so for jobs which don't increase their memory allocation sequentially you
> might not see anything in here.
>
> I also kind of expected that programs failing to allocate memory would
> generate some kind of error saying something to this effect (some certainly
> do, as we've seen that before), but in this case it just segfaulted
> immediately.
>
> The way I eventually diagnosed this was to switch the command to:
>
> qsub -o test.log strace htseq-count
>
> ..which gave me the set of system calls for the program until it failed. The
> end of this log said:
>
> mmap(NULL, 2101248, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a39ac5000
> mprotect(0x2b3a39ac5000, 4096, PROT_NONE) = 0
> clone(child_stack=0x2b3a39cc4ff0,
> flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
> parent_tidptr=0x2b3a39cc59d0, tls=0x2b3a39cc5700,
> child_tidptr=0x2b3a39cc59d0) = 61710
> mmap(NULL, 2101248, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a3bcc6000
> mprotect(0x2b3a3bcc6000, 4096, PROT_NONE) = 0
> clone(child_stack=0x2b3a3bec5ff0,
> flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
> parent_tidptr=0x2b3a3bec69d0, tls=0x2b3a3bec6700,
> child_tidptr=0x2b3a3bec69d0) = 61711
> mmap(NULL, 2101248, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a57ec9000
> +++ killed by SIGSEGV (core dumped) +++
>
> So I could then see that it was trying to do a memory map at the point where
> it died so I knew where to start playing.
>
> Thanks again for the help. I've learned some new techniques to try for next
> time!
>
> Simon.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Skylar Thompson
> Sent: 04 May 2018 15:10
> To: [email protected]
> Subject: Re: [gridengine users] Debugging crash when running program through
> GridEngine
>
> Do you have any memory limits (in particular, h_vmem) imposed on your batch
> jobs?
>
> On Fri, May 04, 2018 at 01:45:24PM +0000, Simon Andrews wrote:
> > I've got a strange problem on our cluster where some python programs are
> > segfaulting when run through qsub, but work fine on the command line, or
> > even if run remotely through SSH.
> >
> > Really simple (hello world) programs work OK, but anything which does
> > a significant amount of imports seems to fail. So for example;
> >
> > htseq-count
> >
> > works locally, but
> >
> > qsub -o test.log -cwd -V -j y -b y htseq-count
> >
> > Produces a segfault in the executed program.
> >
> > ssh compute-0-0 htseq-count
> >
> > ..works fine (we're using ssh to launch jobs on our cluster)
> >
> > Any suggestions for how to go about trying to track this down?
> >
> > Thanks
> >
> > Simon.

--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
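A rough sketch of the diagnose-and-fix sequence described in the thread,
assuming h_vmem is configured as a requestable resource on the cluster (the
8G figure and the extra qsub switches are illustrative rather than taken from
the thread):

    # 1. Check the limits the job actually runs under; h_vmem is typically
    #    enforced as an address-space limit, which shows up as
    #    "virtual memory" (kbytes) in the ulimit output written to limits.log
    qsub -o limits.log -j y -b y -cwd bash -c 'ulimit -a'

    # 2. Trace the failing program under the same conditions, as in the
    #    thread; look at the last few system calls before the
    #    "killed by SIGSEGV" line (here, a series of mmap()/clone() calls)
    qsub -o test.log -j y -b y -cwd -V strace htseq-count

    # 3. Resubmit with a larger per-slot memory request and see whether
    #    the segfault goes away
    qsub -o test.log -j y -b y -cwd -V -l h_vmem=8G htseq-count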
