Re: [gridengine users] Memory errors even after setting h_vmem

2015-02-24 Thread Mikael Brandström Durling
> On 24 Feb 2015, at 17:55, Simon Andrews wrote: >> On 24/02/2015 16:34, "Reuti" wrote: >>> Thanks for the reply. I kind of guessed this might be the answer. I think the issue is that we're assuming that the allocated memory will be available as a single block, whereas…
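The thread turns on h_vmem being a per-slot limit rather than one contiguous allocation. A minimal submission sketch (the PE name and values are assumptions, not from the thread):

    # Request 4 slots with a 2G per-slot virtual memory limit; the job
    # may use up to 4 x 2G = 8G in total on the node, but nothing
    # guarantees that memory as a single contiguous block.
    qsub -pe smp 4 -l h_vmem=2G myjob.sh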

Re: [gridengine users] can't get password entry for user "xxxx". Either the user does not exist or NIS error!

2014-04-22 Thread Mikael Brandström Durling
On 21 Apr 2014, at 19:59, Prentice Bisbal wrote: > After one of these qrsh jobs fails, I get the following e-mail: > > Job 5326173 caused action: Job 5326173 set to ERROR > User = > Queue = pow1...@. > Start Time = > End Time = > failed assumedly before job: can't get password entry…
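A quick diagnostic one might run on the affected execution host (assumed commands, not taken from the thread) to check whether the user actually resolves through NIS:

    # 'someuser' is a placeholder for the failing account.
    getent passwd someuser    # does the name service switch resolve it?
    ypmatch someuser passwd   # NIS-specific lookup, if ypbind is running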

Re: [gridengine users] How to manage grid nodes

2013-10-03 Thread Mikael Brandström Durling
Hi, I was about to suggest oneSIS when I saw that Dave was suggesting it. I would still put my two cents on it, given that we run a small but very heterogeneous cluster using oneSIS for node management. One of the aspects I like most about it is that adding a new node is just a matter of setting…

Re: [gridengine users] how to configure queue to terminate job not including h_rt option in batch file

2013-03-27 Thread Mikael Brandström Durling
Another option would be to make h_rt a forced complex in the complex configuration, with no default. We have the following line: #name shortcut type relop requestable consumable default urgency #--…
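The preview cuts off before the actual entry; a plausible reconstruction of such a line in the complex configuration (the values are assumptions, not the poster's exact line):

    #name   shortcut   type   relop   requestable   consumable   default   urgency
    #-----------------------------------------------------------------------------
    h_rt    h_rt       TIME   <=      FORCED        NO           NONE      0

With requestable set to FORCED and no default, any job submitted without an explicit h_rt request is rejected at submission time.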

Re: [gridengine users] gridengine 8.1.3 and queue in alarm state: null np_load_avg

2013-03-13 Thread Mikael Brandström Durling
Try setting your locale to C or en_US.UTF-8 (e.g. export LC_ALL=C) before running qstat. qhost failed on me when using sv_SE.UTF-8. Mikael On 13 Mar 2013, at 23:31, Stefano Bridi wrote: > The setup is a fresh install: CentOS 6.4 with > gridengine-qmaster-8.1.3-1.el6.x86_64 > gridengine-execd-8.…
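The workaround as a shell sketch; the likely culprit is a locale such as sv_SE.UTF-8 that formats numbers with a decimal comma, which would break parsing of np_load_avg (my inference, not stated outright in the preview):

    # Force a POSIX locale for the client tools so numeric output
    # parses with a decimal point.
    export LC_ALL=C
    qstat -f
    qhost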

Re: [gridengine users] SGE 8.1.3 and USE_CGROUPS sets hosts in error state

2013-03-13 Thread Mikael Brandström Durling
…Permission denied. Mikael --- Sent from a crippled computer (a.k.a. a phone) On 13 Mar 2013, at 23:10, Mikael Brandström Durling wrote: > Hi sge users, > I have been testing the USE_CGROUPS option that is available to execd. When > USE_CGROUPS is enabled it works fine to submit job…

[gridengine users] SGE 8.1.3 and USE_CGROUPS sets hosts in error state

2013-03-13 Thread Mikael Brandström Durling
Hi sge users, I have been testing the USE_CGROUPS option that is available to execd. With USE_CGROUPS enabled, submitting jobs one by one works fine, but when I submitted 70 serial jobs, all queues on all hosts were set to error state. It happens after 2 or more jobs have started on the host…
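For context, a sketch of how the option is switched on (the parameter name comes from the thread; the placement in execd_params and the value syntax are my assumptions):

    # Edit the global cluster configuration:
    qconf -mconf
    # ...and in the editor add USE_CGROUPS to the execd_params list:
    # execd_params   USE_CGROUPS=true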

Re: [gridengine users] SGE 8.1.3 and cpusets/cgroups

2013-03-12 Thread Mikael Brandström Durling
On 12 Mar 2013, at 15:33, Reuti wrote: >> The problem we would like to solve is when users submit a job with software that defaults to starting a number of worker threads equal to the number of cores, and thus parasitising on other jobs' allocations within the node. > Often this comes due…
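A common mitigation in the job script itself (a sketch, not necessarily what the thread settled on): pin the thread count to the granted slots instead of letting the program default to every core on the node.

    # $NSLOTS is set by gridengine inside the job environment.
    export OMP_NUM_THREADS=${NSLOTS:-1}
    ./threaded_program    # placeholder for the user's binary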

Re: [gridengine users] SGE 8.1.3 and cpusets/cgroups

2013-03-12 Thread Mikael Brandström Durling
On 12 Mar 2013, at 12:12, Reuti wrote: > Hi, > On 12.03.2013 at 10:41, Mikael Brandström Durling wrote: >> I noticed in the man page for sge_conf that there is an experimental option for enabling cpusets in SGE. I tried to search the mailing list archives and…

[gridengine users] SGE 8.1.3 and cpusets/cgroups

2013-03-12 Thread Mikael Brandström Durling
Hi, I noticed in the man page for sge_conf that there is an experimental option for enabling cpusets in SGE. I tried to search the mailing list archives and Google for documentation on how to enable it, which led to util/resources/scripts/setup-cgroups-etc. Calling it from the sgeexecd init…

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-03-12 Thread Mikael Brandström Durling
On 8 Mar 2013, at 23:38, Dave Love wrote: > I'm using builtin remote startup. OK, I switched to that and now the h_vmem limits behave as expected. >> I'm using the rshd-wrapper and pam_sge-qrsh-setup.so. However, now after the upgrade I noticed that the rshd-wrapper will just create an…
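For reference, a sketch of what switching to builtin startup looks like in the cluster configuration (the parameter names are from sge_conf(5); treat the snippet as an assumption about what "builtin remote startup" means here):

    # In qconf -mconf, replace any rshd wrapper with the builtin method:
    # rsh_command      builtin
    # rsh_daemon       builtin
    # qlogin_command   builtin
    # qlogin_daemon    builtin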

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-03-08 Thread Mikael Brandström Durling
On 6 Mar 2013, at 23:08, Mikael Brandström Durling wrote: > On 6 Mar 2013, at 19:33, Dave Love wrote: >> Reuti writes: >>>> I can't reproduce that (with openmpi tight integration). Doing this (which gets three four-core…

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-03-06 Thread Mikael Brandström Durling
On 6 Mar 2013, at 19:33, Dave Love wrote: > Reuti writes: >>> I can't reproduce that (with openmpi tight integration). Doing this (which gets three four-core nodes): >>> qsub -pe openmpi 12 -l h_vmem=256M >>> echo "Script $(hostname): $TMPDIR $NSLOTS" >>> ulimit -v >>> for HOST in $(…
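The archive truncates the test script; a self-contained sketch in the same spirit (the loop body and hostfile handling are my reconstruction, not Dave's exact script), printing the virtual memory limit seen on the master node and on every granted host:

    #!/bin/sh
    #$ -pe openmpi 12
    #$ -l h_vmem=256M
    echo "Script $(hostname): $TMPDIR $NSLOTS"
    ulimit -v
    # Visit each host granted to the job and report its limit there.
    for HOST in $(awk '{print $1}' "$PE_HOSTFILE" | sort -u); do
        qrsh -inherit "$HOST" sh -c 'echo "$(hostname):"; ulimit -v'
    done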

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-02-27 Thread Mikael Brandström Durling
On 27 Feb 2013, at 20:38, Reuti wrote: > On 27.02.2013 at 16:22, Mikael Brandström Durling wrote: >> Ok, seems somewhat hard to patch without deep knowledge of the inner workings of GE. Interestingly, if I manually start a qrsh -pe openmpi_span N, and then qrsh…
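The manual reproduction hinted at here, spelled out as a sketch (the host name and values are placeholders):

    # Start an interactive allocation spanning nodes...
    qrsh -pe openmpi_span 8 -l h_vmem=512M bash
    # ...then, from inside that session, hop to a slave host and
    # inspect the limit the sub-job actually receives:
    qrsh -inherit node02 sh -c 'ulimit -v'    # node02 is a placeholder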

Re: [gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-02-27 Thread Mikael Brandström Durling
…reasonable request. Thanks for your rapid reply, Mikael On 26 Feb 2013, at 21:32, Reuti wrote: > On 26.02.2013 at 19:45, Mikael Brandström Durling wrote: >> I have recently been trying to run OpenMPI jobs spanning several nodes on our small cluster. However, it seems to…

[gridengine users] MPI jobs spanning several nodes and h_vmem limits

2013-02-26 Thread Mikael Brandström Durling
Hi, I have recently been trying to run OpenMPI jobs spanning several nodes on our small cluster. However, it seems to me that sub-jobs launched with qrsh -inherit (by OpenMPI) get killed at a memory limit of h_vmem, instead of h_vmem times the number of slots allocated on the sub-node. Is there…
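The expected per-node arithmetic, as a worked sketch (the numbers are assumed for illustration):

    # With:  qsub -pe openmpi 12 -l h_vmem=256M
    # and 4 of the 12 slots granted on a given node, the processes on
    # that node should together be limited to 4 x 256M = 1024M; the
    # reported problem is that qrsh -inherit sub-jobs get only 256M.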