IIRC the memory data that qstat reports is sampled, so if you have a
job that changes its memory usage quickly and then dies, you won't
necessarily get accurate information.
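For a job that's still running you can at least inspect the most recent
sample; a minimal sketch (the job ID is made up):

qstat -j 1234567 | grep -i usage

That should print the sampled usage line (cpu, mem, vmem, maxvmem), at
least on the SGE builds I've seen, but a short-lived spike can still fall
between polling intervals.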

Glad you figured it out!

On Tue, May 08, 2018 at 08:44:21AM +0000, Simon Andrews wrote:
> Thanks for all of the helpful responses.  It turns out that this was simply a 
> memory limit problem.  Apparently the combination of packages loaded by this 
> Python script meant that, even though I was only trying to show the help 
> page, it needed more than 3GB of RAM to run.
> 
> A few useful things I learned whilst debugging this:
> 
> Normally when jobs fail because of memory problems I'd expect to see high 
> memory usage in qacct -j [job_id].  In this case, though, it reported zero 
> memory used.  It looks like qacct only reports memory that was successfully 
> allocated, and here the very first large allocation failed, so for jobs that 
> don't grow their memory usage incrementally you may see nothing useful here.
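> For reference, the accounting fields I'd normally check look like this 
> (the job ID is made up):
>
> qacct -j 1234567 | grep -E 'maxvmem|ru_maxrss|failed|exit_status'
>
> Here maxvmem was essentially zero, even though the allocation that killed 
> the job was large.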
> 
> I also rather expected that a program failing to allocate memory would 
> generate some kind of error to that effect (some certainly do, as we've 
> seen before), but in this case it just segfaulted immediately.
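> One way to reproduce this outside the scheduler, assuming the limit in 
> play is virtual memory and a bash shell, is to impose a similar cap 
> interactively (ulimit -v takes kilobytes, so 3GB is 3145728):
>
> ulimit -v 3145728   # cap virtual memory for this shell at ~3GB
> htseq-count         # should now fail in a similar way to the qsub run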
> 
> The way I eventually diagnosed this was to switch the command to:
> 
> qsub -o test.log strace htseq-count
> 
> ...which gave me the sequence of system calls the program made up to the 
> point where it failed.  The end of the log said:
> 
> mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a39ac5000
> mprotect(0x2b3a39ac5000, 4096, PROT_NONE) = 0
> clone(child_stack=0x2b3a39cc4ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b3a39cc59d0, tls=0x2b3a39cc5700, child_tidptr=0x2b3a39cc59d0) = 61710
> mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a3bcc6000
> mprotect(0x2b3a3bcc6000, 4096, PROT_NONE) = 0
> clone(child_stack=0x2b3a3bec5ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b3a3bec69d0, tls=0x2b3a3bec6700, child_tidptr=0x2b3a3bec69d0) = 61711
> mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x2b3a57ec9000
> +++ killed by SIGSEGV (core dumped) +++
> 
> So I could see that it was trying to map memory (for a new thread stack, 
> by the look of the MAP_STACK and clone flags) at the point where it died, 
> so I knew where to start looking.
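> For next time, a submission requesting a larger limit would look 
> something like this (assuming h_vmem is the resource our queues enforce; 
> 6G is an arbitrary figure):
>
> qsub -o test.log -cwd -V -j y -b y -l h_vmem=6G htseq-count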
> 
> Thanks again for the help.  I've learned some new techniques to try for next 
> time!
> 
> Simon.
> 
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On 
> Behalf Of Skylar Thompson
> Sent: 04 May 2018 15:10
> To: [email protected]
> Subject: Re: [gridengine users] Debugging crash when running program through 
> GridEngine
> 
> Do you have any memory limits (in particular, h_vmem) imposed on your batch 
> jobs?
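> If it's not obvious, a couple of places to look (the queue name is just 
> an example):
>
> qconf -sc | grep h_vmem        # is h_vmem defined as a complex?
> qconf -sq all.q | grep h_vmem  # per-queue limit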
> 
> On Fri, May 04, 2018 at 01:45:24PM +0000, Simon Andrews wrote:
> > I've got a strange problem on our cluster where some Python programs 
> > segfault when run through qsub, but work fine on the command line, or 
> > even when run remotely through SSH.
> >
> > Really simple (hello world) programs work OK, but anything which does
> > a significant number of imports seems to fail.  So, for example:
> >
> > htseq-count
> >
> > works locally, but
> >
> > qsub -o test.log -cwd -V -j y -b y htseq-count
> >
> > ...produces a segfault in the executed program.
> >
> > ssh compute-0-0 htseq-count
> >
> > ...works fine (we're using SSH to launch jobs on our cluster).
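> >
> > One difference I can think of is the environment the job inherits; to
> > compare the two, something like this ought to work (file names are
> > arbitrary):
> >
> > qsub -cwd -b y -j y -o env_qsub.log env
> > ssh compute-0-0 env > env_ssh.log
> > diff <(sort env_qsub.log) <(sort env_ssh.log)   # once the job finishes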
> >
> > Any suggestions for how to go about trying to track this down?
> >
> > Thanks
> >
> > Simon.
> >

-- 
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine