Re: [gridengine users] Tightly integrated parallel environment - Cleanly stopping "qrsh -inherit" sub-processes

2012-08-29 Thread Julien Nicoulaud
Yes, as I said in one of my previous messages, I was wrong about this: the job only dies if you explicitly send a SIGKILL/INT/TERM to the qrsh process, so this is a non-issue. Thanks for your help. 2012/8/28 Reuti > On 28.08.2012 at 19:20, Julien Nicoulaud wrote: > > > Yes, exactly that. > >

[gridengine users] RE : Dispatching job over grid nodes

2012-08-29 Thread Lionel SPINELLI
Hello all, thanks for the advice. I applied the procedure detailed on the wiki and jobs are now correctly dispatched. Just note that all the attributes shown by qconf -msconf must be changed to match those displayed in the procedure (queue_sort_method and load_formula, but also the others). Than

[gridengine users] Linux OOM killer oom_adj

2012-08-29 Thread Ben De Luca
I was wondering how people deal with OOM conditions on their cluster. We constantly have machines that die because the OOM killer takes out critical system services. Has anyone any experience with the oom_adj proc value, or a patch to grid to support it? /proc/[pid]/oom_adj (since Linux 2.6.11)
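
As a rough illustration of the oom_adj knob mentioned in this message, here is a minimal Python sketch. It assumes the documented range of -16 to +15 plus the special value -17, which disables OOM killing for that process. The function name set_oom_adj and the proc_root parameter are hypothetical, added only so the sketch can be tried against a scratch directory. Lowering the value on a real system needs root privileges, and newer kernels (2.6.36 and later) deprecate oom_adj in favour of oom_score_adj.

```python
from pathlib import Path

OOM_ADJ_MIN = -17  # special value: disables OOM killing for the process
OOM_ADJ_MAX = 15   # most likely to be picked by the OOM killer


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/oom_adj and return the path written.

    `proc_root` is a hypothetical parameter for this sketch; it defaults to
    the real /proc but can point at a scratch directory for a dry run.
    """
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_adj must be in [{OOM_ADJ_MIN}, {OOM_ADJ_MAX}]")
    path = Path(proc_root) / str(pid) / "oom_adj"
    path.write_text(f"{value}\n")
    return path
```

For example, set_oom_adj(pid, -17) for the PID of a critical daemon would exempt it from the OOM killer; whether that is the right policy for a given cluster is a separate question.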

Re: [gridengine users] Verifying behavior of max_reservations

2012-08-29 Thread Brian Smith
I have a mix of high-throughput and long wait jobs. We classify and prioritize jobs based on runtime. We use a jsv to set
devel # job length < 1hr
short # job length < 6hr
medium # job length < 2day
long # job length < 1wk
xlong # job length > 1wk (goes to ACLed queue)
Users have to spec
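
The runtime classes listed in this message can be sketched as a small lookup. This is only the classification step, not an actual JSV script: the function name classify_runtime is hypothetical, the input is assumed to be the requested runtime in seconds, and a job of exactly one week is assumed to fall into xlong because the message does not say.

```python
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY

# (exclusive upper bound in seconds, class name), checked in order
RUNTIME_CLASSES = [
    (HOUR, "devel"),
    (6 * HOUR, "short"),
    (2 * DAY, "medium"),
    (WEEK, "long"),
]


def classify_runtime(seconds):
    """Return the class name for a requested runtime given in seconds."""
    for limit, name in RUNTIME_CLASSES:
        if seconds < limit:
            return name
    # Assumption: anything of one week or more is xlong (ACLed queue).
    return "xlong"
```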

Re: [gridengine users] Linux OOM killer oom_adj

2012-08-29 Thread Reuti
On 29.08.2012 at 17:02, Ben De Luca wrote: > I was wondering how people deal with OOM conditions on their cluster. > We constantly have machines that die because the OOM killer takes out > critical system services. > > Has anyone any experience with the oom_adj proc value, or a patch to grid to > supp

Re: [gridengine users] Linux OOM killer oom_adj

2012-08-29 Thread Brian Smith
We use the mem_free variable as a consumable. Then, we use a cronjob called memkiller that terminates jobs if they go over their requested (or default) memory allocation and
1. Swap space on node is used
2. Swap rate is greater than 100 I/Os per second
The user gets emailed with a report if this
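
The memkiller script itself is not shown in the thread, so the following is only a sketch of the decision rule as described here. It assumes all three conditions must hold together (over the allocation, swap in use, swap rate above 100 I/Os per second). The function and parameter names are hypothetical, and gathering the inputs from the node is left out.

```python
def should_terminate(used_bytes, requested_bytes,
                     swap_used_bytes, swap_ios_per_sec,
                     max_swap_rate=100.0):
    """Return True if a job meets every termination condition.

    Assumption: the conditions in the message are combined with AND.
    """
    over_allocation = used_bytes > requested_bytes
    swap_in_use = swap_used_bytes > 0
    swapping_hard = swap_ios_per_sec > max_swap_rate
    return over_allocation and swap_in_use and swapping_hard
```

A job that exceeds its request on a node that is not swapping would be left alone under this reading.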

Re: [gridengine users] Linux OOM killer oom_adj

2012-08-29 Thread Reuti
On 29.08.2012 at 17:21, Brian Smith wrote: > We use mem_free variable as a consumable. Then, we use a cronjob called > memkiller that terminates jobs if they go over their requested (or default) > memory allocation and It would be more straightforward to use h_vmem directly. This is controll

Re: [gridengine users] Linux OOM killer oom_adj

2012-08-29 Thread Iwona Sakrejda
On Wed, Aug 29, 2012 at 8:33 AM, Reuti wrote: > On 29.08.2012 at 17:21, Brian Smith wrote: > >> We use mem_free variable as a consumable. Then, we use a cronjob called >> memkiller that terminates jobs if they go over their requested (or default) >> memory allocation and > > It would be more s

Re: [gridengine users] Linux OOM killer oom_adj

2012-08-29 Thread Brian Smith
We found h_vmem to be highly unpredictable, especially with Java-based applications. Stack settings were screwed up, certain applications wouldn't launch (segfaults), and hard limits were hard to determine for things like MPI applications. When your master has to launch 1024 MPI sub-tasks (qr

Re: [gridengine users] sge inspect

2012-08-29 Thread Chakravarthy Girda
Dave, that's exciting to know. So, if I understand correctly:
* I should be able to build a full-blown cluster of possibly 400 nodes; I know you tested it up to 250.
* If I install the following snapshot, e.g. http://arc.liv.ac.uk/downloads/SGE/snapshots/sge-20120816.tar.gz

Re: [gridengine users] sge inspect

2012-08-29 Thread Chakravarthy Girda
Dave, I didn't have much success installing sgeinspect. Maybe the old instructions are not good enough. I am trying to do this on an Ubuntu 11.04-based GridEngine cluster. Platform-RTM is a web-based real-time solution built on Cacti. Maintains the full history of cluster

[gridengine users] Solaris 5.8 no go?

2012-08-29 Thread Harris He, Kun - CD
Dear All, I recently encountered a problem. Version GE2011.11p1 installed on Linux without trouble. Then I wanted to add Solaris_64 to the group, but no go.
./aimk -no-java -no-jni -no-secure -spool-classic -no-dump -no-hwloc -only-depend
Readlink: command not found
Making in SOLARIS64/

Re: [gridengine users] Solaris 5.8 no go?

2012-08-29 Thread Rayson Ho
Hi Harris, From the error message, it looks like you don't have make installed... Are you able to run "make" from an interactive shell? Rayson On Wed, Aug 29, 2012 at 10:08 PM, Harris He, Kun - CD wrote: > Dear All, > > > > I encounter a problem recently. > > The version GE2011.11p1 that i
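
One quick way to check the suggestion above is to look the tools up on the current PATH. The sketch below uses Python's shutil.which. The helper name missing_tools is hypothetical, and checking for readlink as well as make is an assumption based on the "Readlink: command not found" line in the original report.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Tools assumed to matter for the build described in the thread.
    missing = missing_tools(["make", "readlink"])
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("all tools found")
```

Run this from the same shell and environment used for aimk, since a different PATH there would give a different answer.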