[gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-11 Thread Brendan Moloney
I seem to have found a combination of resource quotas that is preventing the scheduler from scheduling parallel jobs across multiple queues. I have multiple queues for jobs with different run times: veryshort.q, short.q, long.q, and verylong.q. Each of these queues has an increasing 'h_rt' limi

Re: [gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-12 Thread Brendan Moloney
Hello, >> { >> name shortlimit >> description NONE >> enabled TRUE >> limit queues short.q hosts * to slots=32 > I think you can leave the "hosts *" out here and the other RQS below. It > means "used slots across all machines" limited to 32 in this queue. The same >

Re: [gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-12 Thread Brendan Moloney
>> All the queues are on the same machines. I am not sure which "algorithm" you >> refer to. > >I refer to the internal algorithm of SGE how to collect slots from various >queues. > >> As mentioned, the scheduler sorts by sequence number so the queues are >> checked in shortest to longest order.

Re: [gridengine users] RQS Help

2012-06-19 Thread Brendan Moloney
In my experience, resource quotas that try to enforce multiple constraints are quite buggy. Brendan From: users-boun...@gridengine.org [users-boun...@gridengine.org] On Behalf Of Ray Spence [r3spe...@gmail.com] Sent: Tuesday, June 19, 2012 4:40 PM To: use

Re: [gridengine users] Futex leap-second bug for GridEngine?

2012-07-13 Thread Brendan Moloney
er-crashes-during-a-leap-second Brendan Moloney Senior Research Assistant / Programmer Advanced Imaging Research Center Oregon Health & Science University From: users-boun...@gridengine.org [users-boun...@gridengine.org] On Behalf Of Daniel Povey [dpo...

Re: [gridengine users] Memory values reported by SGE too high

2012-09-26 Thread Brendan Moloney
Virtual memory includes things like shared libraries (even though these are only loaded into memory once for all processes that use them). -Brendan From: users-boun...@gridengine.org [users-boun...@gridengine.org] On Behalf Of Jérémie Dubois-Lacoste [jere

Re: [gridengine users] Memory values reported by SGE too high

2012-09-26 Thread Brendan Moloney
o check whether they exceed the memory limit for the node or not. While actually this shared-libraries-memory will be only once in the RAM, thus SGE overestimating their usage. I guess I am wrong somewhere... Jérémie 2012/9/26 Brendan Moloney : > Virtual memory includes things like shared libr
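
The distinction discussed in this thread, between virtual memory (which counts mapped shared libraries) and the memory actually resident in RAM, can be checked directly on a Linux node. The following is a minimal sketch, not part of the original messages: it assumes a Linux `/proc` filesystem, and the helper name `read_memory_kib` is hypothetical.

```python
def read_memory_kib(pid="self"):
    """Return (VmSize, VmRSS) in KiB parsed from /proc/<pid>/status.

    Assumption: Linux only. VmSize is the total virtual address space,
    including mapped shared libraries; VmRSS is the part resident in RAM.
    """
    fields = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                # Values look like "   123456 kB"; keep the number only.
                fields[key] = int(value.split()[0])
    return fields["VmSize"], fields["VmRSS"]


if __name__ == "__main__":
    vsize, rss = read_memory_kib()
    print(f"virtual: {vsize} KiB, resident: {rss} KiB")
```

Running it inside a job and comparing the two numbers gives a rough idea of how far a virtual-memory figure can sit above what the process really holds in RAM.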

[gridengine users] Intermittent commlib errors with MPI jobs

2012-11-07 Thread Brendan Moloney
Hello, I have MPICH2 tightly integrated with OGS 2011.11. Everything is working great in general. I have noticed when I submit a moderate number of small MPI jobs (e.g. 100 jobs each using two cores) that I will get intermittent commlib errors like: commlib error: got select error (Broken pi

Re: [gridengine users] Intermittent commlib errors with MPI jobs

2012-11-08 Thread Brendan Moloney
>> Hello, >> >> I have MPICH2 tightly > >Which version? It should work out-of-the-box with SGE. Version is 1.4 and yes it does have built in integration. >> integrated with OGS 2011.11. Everything is working great in general. I >> have noticed when I submit a moderate number of small MPI jobs

Re: [gridengine users] Intermittent commlib errors with MPI jobs

2012-11-09 Thread Brendan Moloney
I spent some time researching this issue in the context of OpenSSH and found some mentions of similar problems due to the initial handshake packet being too large (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer). I was dubious that this was my problem but

Re: [gridengine users] Intermittent commlib errors with MPI jobs

2012-11-09 Thread Brendan Moloney
s, Brendan From: Brendan Moloney Sent: Friday, November 09, 2012 3:31 PM To: Reuti Cc: users@gridengine.org Subject: RE: [gridengine users] Intermittent commlib errors with MPI jobs I spent some time researching this issue in the context of OpenSSH and found some mentions of similar probl

Re: [gridengine users] Intermittent commlib errors with MPI jobs

2012-11-12 Thread Brendan Moloney
I suppose it could be the switch. Is the only way to test this to swap it out for a different switch? Thanks again, Brendan From: Reuti [re...@staff.uni-marburg.de] Sent: Monday, November 12, 2012 4:17 AM To: Brendan Moloney Cc: users@gridengine.org

Re: [gridengine users] Intermittent commlib errors with MPI jobs

2012-11-13 Thread Brendan Moloney
Ok I will test that out once I can schedule some down time. I might even be able to get my hands on another switch by then. I appreciate all the help. From: Reuti [re...@staff.uni-marburg.de] Sent: Tuesday, November 13, 2012 3:33 AM To: Brendan Moloney

Re: [gridengine users] Intermittent commlib errors with MPI jobs

2012-12-11 Thread Brendan Moloney
Sent: Wednesday, November 14, 2012 10:02 AM To: Brendan Moloney Cc: users@gridengine.org Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs Am 14.11.2012 um 00:56 schrieb Brendan Moloney: > Ok I will test that out once I can schedule some down time. I might even be > abl

Re: [gridengine users] MPI jobs on a multi-architecture cluster?

2012-12-12 Thread Brendan Moloney
> We've got a locally-written program that dynamically links against > a package that's compiled with optimizations for different chipsets > (ATLAS[2]). We've built ATLAS with multiple versions, optimized > for each architecture in our cluster. > > This is fine for serial jobs--the login environme

Re: [gridengine users] Intermittent commlib errors with MPI jobs

2012-12-17 Thread Brendan Moloney
E for small MPI jobs. However, I would still like to know the root cause of this issue. Thanks, Brendan From: users-boun...@gridengine.org [users-boun...@gridengine.org] On Behalf Of Brendan Moloney [molo...@ohsu.edu] Sent: Tuesday, December 11, 2012 2:08

[gridengine users] Problems with multi-queue parallel jobs

2013-09-12 Thread Brendan Moloney
Hello, I use multiple queues to divide up available resources based on job run times. Large parallel jobs will typically span multiple queues and this has generally been working fine thus far. However I recently increased the number of queues (from 4 to 9) so that the time limits can be more f

Re: [gridengine users] Problems with multi-queue parallel jobs

2013-09-16 Thread Brendan Moloney
Hi Reuti, >> I use multiple queues to divide up available resources based on job run >> times. > > So you are requesting "-l h_rt=..."? Yes. > Yes, the problem is that you can't address a specific queue in `qrsh -inherit > ...` and if you get several queues on a machine you might have used up

Re: [gridengine users] Problems with multi-queue parallel jobs

2013-09-16 Thread Brendan Moloney
From: Dave Love [d.l...@liverpool.ac.uk] Sent: Monday, September 16, 2013 3:01 PM To: Brendan Moloney Cc: users@gridengine.org Subject: Re: [gridengine users] Problems with multi-queue parallel jobs Brendan Moloney writes: > Hello, > > I use multiple queues to divide up available

Re: [gridengine users] Problems with multi-queue parallel jobs

2013-09-17 Thread Brendan Moloney
Hi Reuti, > Yes. But this queue can have the total slot count of the machine. Or are you > assigning right now 4 cores to a short queue, 8 to a medium one and the > remaining 4 cores of a 16 cores machine to a long queue? I limit the total number of slots for each queue using an RQS. The short

Re: [gridengine users] Problems with multi-queue parallel jobs

2013-09-20 Thread Brendan Moloney
ime a node is added or removed from the cluster. Any thoughts or suggestions? Thanks, Brendan From: Brendan Moloney Sent: Monday, September 16, 2013 5:03 PM To: Dave Love Cc: users@gridengine.org Subject: RE: [gridengine users] Problems with multi-

Re: [gridengine users] Problems with multi-queue parallel jobs

2013-09-24 Thread Brendan Moloney
F jsv_sub_add_param l_hard slots_vvl 1 fi jsv_accept "Job OK" return } . ${SGE_ROOT}/util/resources/jsv/jsv_include.sh jsv_main I hope someone else finds this helpful. Thanks, Brendan From: users-boun...@gridengine.org

Re: [gridengine users] Problems with multi-queue parallel jobs

2013-09-26 Thread Brendan Moloney
away from the multiple queue configuration, as this does not seem to be very well supported. Thanks, Brendan From: Reuti [re...@staff.uni-marburg.de] Sent: Thursday, September 26, 2013 2:39 AM To: Brendan Moloney Cc: users@gridengine.org Subjec

Re: [gridengine users] suggestions on setting up queues

2015-01-16 Thread Brendan Moloney
hich each user can use at most one slot for 12 hours. Brendan Moloney Research Associate Advanced Imaging Research Center Oregon Health & Science University From: users-boun...@gridengine.org [users-boun...@gridengine.org] on behalf of Prentice Bisbal [prentice

Re: [gridengine users] no memory usage for qlogin jobs

2015-04-10 Thread Brendan Moloney
No, that is not normal. I am guessing you use SSH for qlogin and don't have tight integration set up? From: users-boun...@gridengine.org [users-boun...@gridengine.org] on behalf of Michael Stauffer [mgsta...@gmail.com] Sent: Friday, April 10, 2015 12:31 PM To: Gri

Re: [gridengine users] no memory usage for qlogin jobs

2015-04-10 Thread Brendan Moloney
Reuti posted this link yesterday: https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html section "SSH TIGHT INTEGRATION" Brendan Moloney Research Associate Advanced Imaging Research Center Oregon Health & Science University From: Michael Stauf

[gridengine users] Dealing with CUDA and virtual memory

2015-08-18 Thread Brendan Moloney
Hi, I have a small cluster so exclusive node access isn't really an option. We enforce memory constraints on jobs using h_vmem. I discovered that more recent versions of CUDA will always allocate a huge amount of virtual memory (RAM + SWAP + GPU_RAM) so that it can supply a "Unified Virtual Add