Re: [gridengine users] Rescheduling jobs leaving zombie process on compute node

2017-05-03 Thread Lars van der Bijl
It's not on the command line; it's in the configuration. From the command line you modify the configuration like this: *qconf -mconf*. Replace the line with *execd_params* and make it: *execd_params ENABLE_ADDGRP_KILL=TRUE* On Wed, May 3, 2017 at 9:59 AM, Guillermo Marco

Re: [gridengine users] "Decoding gridengine" workshop

2016-08-24 Thread Lars van der Bijl
Hey Mark, I would be up for that; sounds like a good idea. On Wed, Aug 24, 2016 at 1:41 PM, Jones, Thomas wrote: Hi Mark, I think that would be a really good idea. Regards, Thomas Jones

Re: [gridengine users] reporting doesn't log which host receives a task

2016-08-19 Thread Lars van der Bijl
at 3:53 PM, Reuti <re...@staff.uni-marburg.de> wrote: Hi, On 19.08.2016 at 15:58, Lars van der Bijl <com48...@gmail.com> wrote: Hey all, I'm trying to parse the reporting file and load the data into MongoDB.

[gridengine users] reporting doesn't log which host receives a task

2016-08-19 Thread Lars van der Bijl
Hey all, I'm trying to parse the reporting file and load the data into MongoDB. With joblog enabled in the reporting, we get messages that look like: *1471613763:job_log:1471613763:sent:203471:1140:NONE:t:master:qmaster:0:924:1471613584:jobname:lars:users:job:defaultdepartment:sge:sent to
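A minimal sketch of how a job_log record like the one quoted above could be split into named fields before loading it into a database. The field names follow one reading of the reporting(5) layout and are assumptions to verify against your own installation; the sample is the line from the post, which is cut off in this listing after "sent to".

```python
# Field names are an assumption based on the reporting(5) layout for job_log
# records; verify them against your own Grid Engine version.
JOB_LOG_FIELDS = [
    "event_time", "event", "job_number", "task_number", "pe_taskid",
    "state", "user", "host", "state_time", "priority", "submission_time",
    "job_name", "owner", "group", "project", "department", "account",
    "message",
]


def parse_job_log(line):
    """Return a dict for a job_log record, or None for other record types."""
    # Limit the split so that colons inside the trailing message are kept.
    parts = line.rstrip("\n").split(":", len(JOB_LOG_FIELDS) + 1)
    if len(parts) != len(JOB_LOG_FIELDS) + 2 or parts[1] != "job_log":
        return None
    record = {"time": parts[0], "type": parts[1]}
    record.update(zip(JOB_LOG_FIELDS, parts[2:]))
    return record


# Sample line from the post (truncated there after "sent to").
SAMPLE = ("1471613763:job_log:1471613763:sent:203471:1140:NONE:t:master:"
          "qmaster:0:924:1471613584:jobname:lars:users:job:"
          "defaultdepartment:sge:sent to")

record = parse_job_log(SAMPLE)
print(record["event"], record["job_number"], record["task_number"])
```

The resulting dict can be handed to whatever storage layer is in use; records of other types return None so a caller can skip them.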

Re: [gridengine users] array tasks memory usage

2014-04-15 Thread Lars van der bijl
The question is: does gridengine remember all PIDs over the lifetime of the submission and aggregate them? On 14 April 2014 16:40, Reuti re...@staff.uni-marburg.de wrote: Hi, On 14.04.2014 at 16:20, Lars van der bijl wrote: over the last few weeks we have been having some problems with s_vmem

[gridengine users] array tasks memory usage

2014-04-14 Thread Lars van der bijl
hey everyone. over the last few weeks we have been having some problems with s_vmem and array tasks. we submit a task with a memory requirement s_vmem of 2G. it starts running and I follow it on the machine using pmap [root@atom08 ~]# pmap -x 19893 hythonRender -f 80 170 -i 1.0

[gridengine users] problems with maxvmem

2014-03-03 Thread Lars van der bijl
Hey everyone, I'm having some issues with gridengine and its memory usage. I'm submitting a task to my queue with smp = 1, mem_free = 1.9G, s_vmem = 2.0G as resources. I'm measuring my task's memory usage with psutil from within Python and I'm hitting about 864 MB, but it's hitting the s_vmem

[gridengine users] what is IO in qstat

2013-03-20 Thread Lars van der bijl
Hey everyone, a few weeks ago we were having issues with users submitting jobs that did massive IO to the file system. Each task was writing out about 4 GB of data and reading in about 2 GB. As this was happening in parallel across our farm, it brought our server to its knees and caused people to moan

Re: [gridengine users] 152 kills tasks.

2013-02-14 Thread Lars van der bijl
On 14 February 2013 12:50, Reuti re...@staff.uni-marburg.de wrote: On 13.02.2013 at 16:05, Lars van der bijl wrote: On 13 February 2013 15:35, Reuti re...@staff.uni-marburg.de wrote: On 13.02.2013 at 15:16, Lars van der bijl wrote: Hey everyone, we always set an s_vmem value

[gridengine users] 152 exit_code not caught in epilog.

2012-11-09 Thread Lars van der bijl
Hey everyone, when I submit a task with an s_vmem limit and the task takes too much memory, it throws a 152 exit code. I see this in my epilog and I try to raise my own 99 exit code to retry it, but it doesn't seem to take. I understand this behavior for a 137 but not for 152. other exit codes

Re: [gridengine users] subordinate_list not suspending tasks

2012-10-24 Thread Lars van der bijl
On 23 October 2012 22:08, Reuti re...@staff.uni-marburg.de wrote: Hi, On 23.10.2012 at 21:41, Lars van der bijl wrote: I've got 2 queues $ qconf -sq final.q qname final.q hostlist @allhosts suspend_thresholds NONE nsuspend 1

[gridengine users] reschedule instead of suspend on subordinate_list

2012-10-16 Thread Lars van der bijl
hey guys, is it possible to reschedule a task from a machine when it gets taken in a subordinate queue instead of suspended? Lars ___ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users

[gridengine users] new users getting little action on the queue

2012-09-18 Thread Lars van der bijl
Hey guys, I've recently added 3 new users to the queue and they're all getting very little action. Older users will always get higher priority than them. Any ideas? Lars

Re: [gridengine users] task exit status problems

2012-09-12 Thread Lars van der bijl
Ah, OK. Thanks Reuti, and thanks for taking the time to help me out with these things. On 12 September 2012 13:11, Reuti re...@staff.uni-marburg.de wrote: On 12.09.2012 at 13:08, Lars van der bijl wrote: On 11 September 2012 17:25, Reuti re...@staff.uni-marburg.de wrote: On 11.09.2012 at 17:20

Re: [gridengine users] task exit status problems

2012-09-12 Thread Lars van der bijl
So in the epilog I've added the code to re-queue the task, but now its dependencies won't start. Is there any way to get those to start while retaining the parent task in the queue? On 12 September 2012 13:13, Lars van der bijl l...@realisestudio.com wrote: Ah, OK. Thanks Reuti, and thanks for taking

[gridengine users] task exit status problems

2012-09-07 Thread Lars van der bijl
Hey everyone, We have been using the grid for VFX for a few years and our job dependencies have grown a lot. A job is a collection of tasks. All our tasks have batches so we regularly run a job of 50 tasks with 2000+ batches. Very often a batch dies for reasons such as memory limits, seg fault,

Re: [gridengine users] task exit status problems

2012-09-07 Thread Lars van der bijl
On 7 September 2012 17:23, Reuti re...@staff.uni-marburg.de wrote: Hi Lars, On 07.09.2012 at 16:55, Lars van der bijl wrote: Hey everyone, We have been using the grid for VFX for a few years and our job dependencies have grown a lot. A job is a collection of tasks. All our tasks have

Re: [gridengine users] task exit status problems

2012-09-07 Thread Lars van der bijl
On 7 September 2012 19:41, Reuti re...@staff.uni-marburg.de wrote: On 07.09.2012 at 18:39, Lars van der bijl wrote: On 7 September 2012 17:48, Reuti re...@staff.uni-marburg.de wrote: On 07.09.2012 at 17:45, Lars van der bijl wrote: On 7 September 2012 17:23, Reuti re...@staff.uni

Re: [gridengine users] difference between a task reschedule and a task kill in the epilog?

2012-04-04 Thread Lars van der bijl
at 16:33, Lars van der bijl wrote: Is there a way to tell the difference? If I reschedule a job I get these values in the usage file in the epilog: wait_status=3727362 exit_status=137 signal=9 start_time=1333549517 end_time=1333549565 ru_wallclock=48 ru_utime=0.226965 ru_stime=0.306953
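A minimal sketch of reading those key=value pairs in an epilog helper. How the text reaches the script (the usage file's location) is not given in the excerpt, so the parser below simply takes a string; `parse_usage` is a hypothetical helper name. On its own this does not distinguish a reschedule from a kill; it only makes the fields easy to inspect.

```python
def parse_usage(text):
    """Parse whitespace-separated key=value pairs into a dict of strings."""
    usage = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip anything that is not a key=value pair
            usage[key] = value
    return usage


# Values quoted in the post above.
SAMPLE = ("wait_status=3727362 exit_status=137 signal=9 "
          "start_time=1333549517 end_time=1333549565 ru_wallclock=48")

usage = parse_usage(SAMPLE)
killed_by_signal = int(usage["signal"]) != 0
elapsed = int(usage["end_time"]) - int(usage["start_time"])
print(usage["exit_status"], killed_by_signal, elapsed)
```

Because `str.split()` breaks on any whitespace, the same function works whether the pairs sit on one line or one per line.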

Re: [gridengine users] difference between a task reschedule and a task kill in the epilog?

2012-04-04 Thread Lars van der bijl
In our case the application has no checkpointing capabilities. For us a reschedule just means running from the start on a new host. So a checkpoint with a signal 9 should be enough? On 4 April 2012 17:50, Reuti re...@staff.uni-marburg.de wrote: On 04.04.2012 at 17:42, Lars van der bijl wrote: Hey

[gridengine users] re-submitting finished taskes

2012-03-30 Thread Lars van der bijl
Hey everyone, is it possible to resubmit a task that is listed in qmon as finished? Or has anyone built a system to allow for such functionality? Lars

Re: [gridengine users] re-submitting finished taskes

2012-03-30 Thread Lars van der bijl
http://arc.liv.ac.uk/pipermail/gridengine-users/2010-November/032986.html Found my answer. On 30 March 2012 10:19, Lars van der bijl l...@realisestudio.com wrote: Hey everyone, is it possible to resubmit a task that is listed in qmon as finished? Or has anyone built a system to allow

[gridengine users] strange reschedule behavior

2012-03-23 Thread Lars van der bijl
Hey everyone, I have a small script. #!/bin/bash echo start NUMBER=$[ ( $RANDOM % 100 ) + 1 ] path=/production/people/lars/sge-test/output.$NUMBER.txt for I in {1..20}; do echo $I $path; sleep 1; done echo end I submit it to sge (6.2u5) qsub -r y -ckpt realise-checkpoint -q

Re: [gridengine users] strange reschedule behavior

2012-03-23 Thread Lars van der bijl
On 23 March 2012 13:03, Reuti re...@staff.uni-marburg.de wrote: On 23.03.2012 at 11:55, Lars van der bijl wrote: On 23 March 2012 11:46, Reuti re...@staff.uni-marburg.de wrote: Hi, On 23.03.2012 at 10:46, Lars van der bijl wrote: Hey everyone, I have a small script. #!/bin/bash

[gridengine users] hosts in multiple queues base on calender

2012-03-19 Thread Lars van der bijl
Hey everyone, I've got the following request. I currently have 1 queue, all.q, and all our hosts are in this queue. We now want an additional queue that people can submit to Monday to Friday, 9am to 8pm, to get quick-turnaround jobs sent through. Some of the machines that are in all.q should be

Re: [gridengine users] vfx/animation users of grid

2011-12-31 Thread Lars van der bijl
Hey Ben, we are using it here at Realise studio. It handles a lot of our simple stuff well, and doing multi-machine fluid sims didn't take long to get working either. Lars On 29 December 2011 17:02, Ben De Luca bdel...@gmail.com wrote: Hi All, I know there are a lot of science types on

[gridengine users] round robin PE config

2011-12-13 Thread Lars van der bijl
Hey everyone, we have been running our sge for a while now but we implemented a new technique and I'm having trouble figuring out how to make the grid help with it. I have the following task / dependency structure. task1 task2_seed_0 = dependent on output of task1 task2_seed_1 = dependent on

[gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
Hey everyone. We're having some issues with jobs being killed with exit status 137. This causes the task to finish and start its dependent task, which is causing all kinds of havoc. Submitting a job with a very small max memory limit gives me this as an example: $ qacct -j 21141

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
April 2011 11:41, Reuti re...@staff.uni-marburg.de wrote: Hi, On 01.04.2011 at 12:33, lars van der bijl wrote: Hey everyone. We're having some issues with jobs being killed with exit status 137. 137 = 128 + 9 $ kill -l 1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5
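The arithmetic in the reply above (137 = 128 + 9) can be sketched as a small helper that maps a shell-style exit status back to a signal name. `signal_from_exit_status` is a hypothetical name, and signal numbering beyond the common low values is platform-dependent, so the lookup is done against the local platform's table.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name implied by a shell-style exit status, or None."""
    # Statuses above 128 conventionally mean "terminated by signal status - 128".
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # No signal with that number on this platform.
        return None


print(signal_from_exit_status(137))  # SIGKILL
```

By the same rule, the 152 seen in other threads corresponds to signal number 152 - 128 = 24; which name that maps to depends on the platform.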

Re: [gridengine users] Jobs being killed with exit status 137

2011-04-01 Thread lars van der bijl
on: you can check the messages file of the execd on the nodes to see whether anything about the reason was recorded there. -- Reuti On 01.04.2011 at 16:39, lars van der bijl wrote: The problem is that I don't have any such limits enforced currently on submission. The submissions to qsub are hidden