Re: [gridengine users] SoGE file descriptor limit and MAX_DYN_EC

2020-02-20 Thread Daniel Povey
That's a GridEngine bug whereby the event client IDs (or whatever they are
called) don't get cleaned up properly.
The workaround is to restart the qmaster when that happens.
Be careful: sometimes restarting the service doesn't work and you may need
to kill the process.
At the cluster I used to manage at JHU, we had a process which checked the
output of
qconf -secl
and restarted the qmaster whenever it reported more than 900 event clients.
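
Something like this would do it (a rough sketch; it assumes the qmaster runs
under systemd as gridengine-qmaster, as elsewhere on this list, and that the
first two lines of qconf -secl output are headers -- adjust for your setup):

#!/bin/bash
# restart the qmaster if the number of registered event clients gets too high
n=$(qconf -secl | tail -n +3 | wc -l)
if [ "$n" -gt 900 ]; then
    systemctl restart gridengine-qmaster
fi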


On Fri, Feb 21, 2020 at 1:35 AM Lana Deere  wrote:

> On CentOS 7 using  SoGE 8.1.9, I'm getting an error using qsub:
> QSUB:Unable to initialize environment because of error: cannot register
> event client. Only 979 event clients are allowed in the system
>
> Supposedly I have this limit configured much higher:
> root# qconf -sconf | grep MAX_DYN_EC
> qmaster_params   MAX_DYN_EC=25000,gdi_retries=5
>
> However, the qmaster at startup is reporting that it is not honoring the
> limit:
> |nr of dynamic event clients exceeds max file descriptor limit, setting
> MAX_DYN_EC=979
> |qmaster hard descriptor limit is set to 4096
> |qmaster soft descriptor limit is set to 1024
> |qmaster will use max. 1004 file descriptors for communication
> |qmaster will accept max. 979 dynamic event clients
> |starting up SGE 8.1.9 (lx-amd64)
>
> This is surprising to me since my system's file descriptor limit is set
> much higher than 1024/4096:
> root# pwd
> /etc/security/limits.d
> root# cat 99*nofile*conf
> * soft nofile 10
> * hard nofile 10
> root# ulimit -a -S | grep 'open files'
> open files  (-n) 10
>
> I hacked the script in /etc/init.d which starts the qmaster and it shows
> the higher limit.  However, if I look at /proc//limits I can
> see that it has the lower limits it reports.  What I can't figure out is
> why it is seeing the lower limit.  Anyone know whether there's a
> configuration parameter somewhere overriding the system limit?  Any
> suggestions on how to make it get the system's limit?
>
> Thanks.
>
> .. Lana (lana.de...@gmail.com)
>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
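
[For reference: a likely culprit on CentOS 7 is that the limits in
/etc/security/limits.d are applied by pam_limits to login sessions only;
daemons started by systemd get systemd's own defaults instead -- which are
exactly the soft 1024 / hard 4096 shown in the log above.  A sketch of a
drop-in override (the unit name here is a guess; use whichever unit actually
starts your qmaster):

# /etc/systemd/system/sgemaster.service.d/limits.conf
[Service]
LimitNOFILE=100000

followed by "systemctl daemon-reload" and a qmaster restart.]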


Re: [gridengine users] qsh not working

2019-11-20 Thread Daniel Povey
The accidental cc!

On Wed, Nov 20, 2019 at 9:22 PM Friedrich Ferstl  wrote:

> This person was at the Smithsonian until very recently and shows up in
> our support records but he is now CFA/Harvard, apparently, using open
> source.
>
>
> Am 19.11.2019 um 23:41 schrieb Korzennik, Sylvain <
> skorzen...@cfa.harvard.edu>:
>
> While qrsh and qlogin work fine, qsh fails on
> % qsh
> Your job 2657108 ("INTERACTIVE") has been submitted
> waiting for interactive job to be scheduled ...
> Could not start interactive job (could be some network/firewall related
> problem)
>
> from the same prompt/login node, I can
> % ssh -X compute-64-16 /usr/bin/xterm
> or
> % ssh -Y compute-64-16 /usr/bin/xterm
> but not
> % ssh compute-64-16 /usr/bin/xterm
>
> I do not want to run these 'out-of-band'. I use ssh -Y in the conf (qconf
> -sconf), yet it fails and I can trace this to:
>
> 11/19/2019 17:18:19.296800 [10541:143099]: closing all filedescriptors
> from fd 0 to fdmax=1024
> 11/19/2019 17:18:19.296828 [10541:143099]: further messages are in "error"
> and "trace"
> 11/19/2019 17:18:19.299121 [10541:143099]: now running with uid=10541,
> euid=10541
> 11/19/2019 17:18:19.299172 [10541:143099]: execvp(/usr/bin/xterm,
> "/usr/bin/xterm" "-display" "localhost:15.0" "-n" "SGE Interactive
>  Job 2657108 on compute-64-16.cm.cluster in Queue qrsh.iq" "-e"
> "/bin/csh")
> 11/19/2019 17:18:19.303787 [446:143093]: wait3 returned 143099 (status:
> 256; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 1, WTERMSIG: 0)
> 11/19/2019 17:18:19.303843 [446:143093]: job exited with exit status 1
> 11/19/2019 17:18:19.303872 [446:143093]: reaped "job" with pid 143099
> 11/19/2019 17:18:19.303893 [446:143093]: job exited not due to signal
> 11/19/2019 17:18:19.303914 [446:143093]: job exited with status 1
>
> What magic is needed for the GE to start xterm right? Is this some xauth
> problem?
>
>   Thanks,
> Sylvain
> --
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] issue compiling SoGE on Debian 10.1

2019-10-30 Thread Daniel Povey
That looks like it's by design... he was signing the builds using his
secret key.  You'd have to figure out where he configured that and either
insert your own details or turn off signing (if that's allowed).
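
For the latter, dpkg-buildpackage can be told not to sign anything with its
standard flags, e.g.:

dpkg-buildpackage -b -us -uc

or, on newer dpkg versions, equivalently:

dpkg-buildpackage -b --no-sign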


On Wed, Oct 30, 2019 at 10:35 AM Jerome  wrote:

> Dear all
>
> I've been trying to compile a deb package of SoGE, using the repo on GitLab
> "https://gitlab.com/loveshack/sge.git".
>
> I could generate some deb files, as sge_8.1.10-1_amd64.deb,
> sge-common_8.1.10-1_all.deb. But I got this issue:
>
> $ dpkg-buildpackage -b
>
> ../..
> dpkg-deb: building package 'sge-dbg' in '../sge-dbg_8.1.10-1_amd64.deb'.
>  dpkg-genbuildinfo --build=binary
>  dpkg-genchanges --build=binary >../sge_8.1.10-1_amd64.changes
> dpkg-genchanges: info: binary-only upload (no source code included)
>  dpkg-source --after-build .
> dpkg-buildpackage: info: binary-only upload (no source included)
>  signfile sge_8.1.10-1_amd64.buildinfo
> gpg: skipped "Dave Love ": No secret key
> gpg: dpkg-sign.rxsZlToR/sge_8.1.10-1_amd64.buildinfo: clear-sign failed:
> No secret key
>
> dpkg-buildpackage: error: failed to sign .buildinfo file
>
>
> Does anyone know what this is?
>
> Regards!
>
> --
> -- Jérôme
> L'adulte ne croit pas au père Noël. Il vote.
> (Pierres Desproges)
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-28 Thread Daniel Povey
I always use the FQDN.  I recall running into problems with SunRPC if
not... there may be ways to get around that, e.g. have each host announce
its raw hostname as its FQDN, but it might not be compatible with the
hosts having normal network access.
I forget what specific mechanism SunRPC uses to find the hostname.

On Mon, Oct 28, 2019 at 2:18 PM Mun Johl  wrote:

> Hi all,
>
>
>
> I do have a follow-up question: When I am specifying hostnames for the
> execution hosts, admin hosts, etc.; do I need to use the FQDN?  Or can I
> simply use the hostname in order for grid to operate correctly?  That is,
> do I have to use hostname.domain.com (as I am currently doing).  Or is it
> sufficient to simply use “hostname”?
>
>
>
> Regards,
>
>
>
> --
>
> Mun
>
>
>
>
>
> *From:* Mun Johl 
> *Sent:* Friday, October 25, 2019 5:42 PM
> *To:* dpo...@gmail.com
> *Cc:* Skylar Thompson ; users@gridengine.org
> *Subject:* RE: [gridengine users] What is the easiest/best way to update
> our servers' domain name?
>
>
>
> Hi Daniel,
>
>
>
> Thank you for your reply.
>
>
>
> *From:* Daniel Povey 
>
> You may have to write a script to do that, but it could be something like
>
>
>
> for exechost in $(qconf -sel); do
>
>qconf -se $exechost  | sed s/old_domain_name/new_domain_name/ > tmp
>
>qconf -de $exechost
>
>qconf -Ae tmp
>
> done
>
>
>
> but you might need to tweak that to get it to work, e.g. get rid of
> load_values from the tmp file.
>
>
>
> *[Mun] Understood.  Since we have a fairly small set of servers currently,
> I may just update them by hand via “qconf -me ”; and then address
> the queues via “qconf -mq ”.  Oh, and I just noticed I can modify
> hostgroups via “qconf -mhgrp @name”.*
>
>
>
> *After that I can re-start the daemons and I “should” be good to go,
> right?*
>
>
>
> *Thanks again Daniel.*
>
>
>
> *Best regards,*
>
>
>
> *-- *
>
> *Mun*
>
>
>
>
>
> On Fri, Oct 25, 2019 at 5:24 PM Mun Johl  wrote:
>
> Hi Daniel and Skylar,
>
> Thank you for your replies.
>
> > -Original Message-
> > I think it might depend on the setting of ignore_fqdn in the bootstrap
> file
> > (can't remember if this just tunes load reporting or also things like
> which
> > qmaster the execd's talk to). I wouldn't count on it working, though, and
> > agree with Daniel that you probably want to plan on an outage.
>
> [Mun] An outage is acceptable; but I'm not sure what is the best/easiest
> approach to take in order to change the domain names within SGE for all of
> the servers as well as update the hostgroups and queues.  I mean, I know I
> can delete the hosts and add them back in; and the same for the queue
> specifications, etc.  However, I'm not sure if that is an adequate solution
> or one that will cause problems for me.  I'm also not sure if that is the
> best approach to take for this task.
>
> Thanks,
>
> --
> Mun
>
>
> >
> > On Fri, Oct 25, 2019 at 04:12:11PM -0700, Daniel Povey wrote:
> > > IIRC, GridEngine is very picky about machines having a consistent
> > > hostname, e.g. that what hostname they think they have matches with
> > > how they were addressed.  I think this is because of SunRPC.  I think
> > > it may be hard to do what you want without an interruption  of some
> kind.
> > But I may be wrong.
> > >
> > > On Fri, Oct 25, 2019 at 3:37 PM Mun Johl  wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I need to update the domain names of our SGE servers.  What is the
> > > > easiest way to do that?  Can I simply update the domain name somehow
> > > > and have that propagate to hostgroups, queue specifications, etc.?
> > > >
> > > >
> > > >
> > > > Or do I have to delete the current hosts and add the new ones?
> > > > Which I think also implies setting up the hostgroups and queues
> > > > again as well for our implementation.
> > > >
> > > >
> > > >
> > > > Best regards,
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Mun
> > > > ___
> > > > users mailing list
> > > > users@gridengine.org
> > > > https://gridengine.org/mailman/listinfo/users
> > > >
> >
> > > ___
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
> > --
> > -- Skylar Thompson (skyl...@u.washington.edu)
> > -- Genome Sciences Department, System Administrator
> > -- Foege Building S046, (206)-685-7354
> > -- University of Washington School of Medicine
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
>


Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-25 Thread Daniel Povey
You may have to write a script to do that, but it could be something like

for exechost in $(qconf -sel); do
   qconf -se $exechost  | sed s/old_domain_name/new_domain_name/ > tmp
   qconf -de $exechost
   qconf -Ae tmp
done

but you might need to tweak that to get it to work, e.g. get rid of
load_values from the tmp file.
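
Something like this, folding in that caveat (still an untested sketch;
old_domain_name/new_domain_name are placeholders):

for exechost in $(qconf -sel); do
   # drop the transient load_values line before re-adding the host
   qconf -se "$exechost" | grep -v '^load_values' \
       | sed 's/old_domain_name/new_domain_name/' > tmp
   qconf -de "$exechost"
   qconf -Ae tmp
done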


On Fri, Oct 25, 2019 at 5:24 PM Mun Johl  wrote:

> Hi Daniel and Skylar,
>
> Thank you for your replies.
>
> > -Original Message-
> > I think it might depend on the setting of ignore_fqdn in the bootstrap
> file
> > (can't remember if this just tunes load reporting or also things like
> which
> > qmaster the execd's talk to). I wouldn't count on it working, though, and
> > agree with Daniel that you probably want to plan on an outage.
>
> [Mun] An outage is acceptable; but I'm not sure what is the best/easiest
> approach to take in order to change the domain names within SGE for all of
> the servers as well as update the hostgroups and queues.  I mean, I know I
> can delete the hosts and add them back in; and the same for the queue
> specifications, etc.  However, I'm not sure if that is an adequate solution
> or one that will cause problems for me.  I'm also not sure if that is the
> best approach to take for this task.
>
> Thanks,
>
> --
> Mun
>
>
> >
> > On Fri, Oct 25, 2019 at 04:12:11PM -0700, Daniel Povey wrote:
> > > IIRC, GridEngine is very picky about machines having a consistent
> > > hostname, e.g. that what hostname they think they have matches with
> > > how they were addressed.  I think this is because of SunRPC.  I think
> > > it may be hard to do what you want without an interruption  of some
> kind.
> > But I may be wrong.
> > >
> > > On Fri, Oct 25, 2019 at 3:37 PM Mun Johl  wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I need to update the domain names of our SGE servers.  What is the
> > > > easiest way to do that?  Can I simply update the domain name somehow
> > > > and have that propagate to hostgroupgs, queue specifications, etc.?
> > > >
> > > >
> > > >
> > > > Or do I have to delete the current hosts and add the new ones?
> > > > Which I think also implies setting up the hostgroups and queues
> > > > again as well for our implementation.
> > > >
> > > >
> > > >
> > > > Best regards,
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Mun
> > > > ___
> > > > users mailing list
> > > > users@gridengine.org
> > > > https://gridengine.org/mailman/listinfo/users
> > > >
> >
> > > ___
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
> > --
> > -- Skylar Thompson (skyl...@u.washington.edu)
> > -- Genome Sciences Department, System Administrator
> > -- Foege Building S046, (206)-685-7354
> > -- University of Washington School of Medicine
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-25 Thread Daniel Povey
IIRC, GridEngine is very picky about machines having a consistent hostname,
e.g. that what hostname they think they have matches with how they were
addressed.  I think this is because of SunRPC.  I think it may be hard to
do what you want without an interruption  of some kind.  But I may be wrong.

On Fri, Oct 25, 2019 at 3:37 PM Mun Johl  wrote:

> Hi,
>
>
>
> I need to update the domain names of our SGE servers.  What is the easiest
> way to do that?  Can I simply update the domain name somehow and have that
> propagate to hostgroups, queue specifications, etc.?
>
>
>
> Or do I have to delete the current hosts and add the new ones?  Which I
> think also implies setting up the hostgroups and queues again as well for
> our implementation.
>
>
>
> Best regards,
>
>
>
> --
>
> Mun
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened

2019-10-18 Thread Daniel Povey
Normally restarting the qmaster (e.g. systemctl restart gridengine-qmaster)
should be a very routine and harmless operation that should be invisible to
users except for a temporary inaccessibility of `qstat`.

On Fri, Oct 18, 2019 at 8:35 AM WALLIS Michael  wrote:

> Hi folks,
>
> Our instance of (quite old, 2011.11p1_155) SGE rolled over 10,000,000 jobs
> at the start of the
> month, and then started again at 1 as expected. About ten days later we
> started the qmaster
> a few times (it was segfaulting, originally we thought that a user was
> using newer qstat
> binaries to query an old qmaster) with JID nearing ~20k, only after each
> of the restarts the JID
> started at about 1100, not the number we were expecting. Because of this
> there's duplicate JID
> entries in accounting and it's causing a bit of a problem for people who
> monitor for failed jobs.
>
> Because of the nature of the workload the currently-running JIDs are now
> all over the place,
> with some JIDs in the queue still in the 9,99n,nnn range and some in four
> figures. If we need to
> restart the qmaster again, will the jobseqnum file be overwritten with the
> largest JID still in
> the queue (as suggested in
> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-January/028661.html)?
>
> Am aware that this is an old version of SGE and we're in the middle of
> transitioning to a
> much newer one, but this is a bit of an issue while we're still shifting
> workloads over.
>
> Thanks,
> Mike
> --
> Mike Wallis x503305
> University of Edinburgh, Research Services,
> Argyle House, 3 Lady Lawson Street,
> Edinburgh, EH3 9DR
>
>
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] limit CPU/slot resource to the number of reserved slots

2019-08-26 Thread Daniel Povey
I don't think it's supported in Son of GridEngine.  Ondrej Valousek (cc'd)
described in the first thread here
http://arc.liv.ac.uk/pipermail/sge-discuss/2019-August/thread.html
how he was able to implement it, but it required code changes, i.e. you
would need to figure out how to build and install SGE from source, which is
a task in itself.

Dan


On Mon, Aug 26, 2019 at 12:46 PM Dietmar Rieder 
wrote:

> Hi,
>
> thanks for your reply. This sounds promising.
> We are using Son of Grid Engine though. Can you point me to the right
> docs to get cgroup enabled in the exec host (CentOS 7). I must admit I
> have no experience with cgroups.
>
> Thanks again
>   Dietmar
>
> On 8/26/19 4:03 PM, Skylar Thompson wrote:
> > At least for UGE, you will want to use the CPU set integration, which
> will
> > assign the job to a cgroup that has one CPU per requested slot. Once you
> > have cgroups enabled in the exec host OS, you can then set these options
> in
> > sge_conf:
> >
> > cgroup_path=/cgroup
> > cpuset=1
> >
> > You can use this mechanism to have the m_mem_free request enforced as
> well.
> >
> > On Mon, Aug 26, 2019 at 02:15:22PM +0200, Dietmar Rieder wrote:
> >> Hi,
> >>
> >> may be this is a stupid question, but I'd like to limit the used/usable
> >> number of cores to the number of slots that were reserved for a job.
> >>
> >> We often see that people reserve 1 slot, e.g. "qsub -pe smp 1 [...]"
> >> but their program is then running in parallel on multiple cores. How can
> >> this be prevented? Is it possible that with reserving only one slot a
> >> process can not utilize more than this?
> >>
> >> I was told the this should be possible in slurm (which we don't have,
> >> and to which we don't want to switch to currently).
> >>
> >> Thanks
> >>   Dietmar
> >
>
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Institute of Bioinformatics
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-02 Thread Daniel Povey
Could it relate to when the daemons were started on those nodes?  I'm not
sure exactly at what point those limits are applied, and how they are
inherited by child processes.  If you changed those files recently it might
not have taken effect.
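
One quick check (a sketch; it assumes a single sge_execd per node, so that
pgrep -o, the oldest match, is the daemon itself):

# what the running execd -- and hence its child jobs -- inherited:
grep 'open files' /proc/$(pgrep -o sge_execd)/limits
# versus what a fresh login shell on the same node gets:
ulimit -Sn; ulimit -Hn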

On Tue, Jul 2, 2019 at 10:36 PM Derrick Lin  wrote:

> Hi guys,
>
> We have custom settings for user open files in /etc/security/limits.conf
> in all Compute Node. When checking if the configuration is effective with
> "ulimit -a" by SSH to each node, it reflects the correct settings.
>
> but when ran the same command through SGE (both qsub and qrsh), we found
> that some Compute Nodes do not reflects the correct settings but the rest
> are fine.
>
> I am wondering if this is SGE related? And idea is welcomed.
>
> Cheers,
> Derrick
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Daniel Povey
Sorry, but I think this conversation shouldn't continue.
This list is for system administrators, not for users with basic questions
about bash.  People will unsubscribe if it goes on much longer.

On Thu, Jun 13, 2019 at 2:49 PM VG  wrote:

> Hi Feng,
> I did something like this
>
> for i in *
> do
> if [ -d "$i" ]
> then cd "$i"
> a=$(ls *.tar.gz)
> echo $PWD/"$a"
> cd ..
> fi
> done
>
> This gave me the full path of my tar.gz files. Should I save this in a
> separate text file and then run an array script on it?
>
> Thanks
>
> Regards
> VARUN
>
> On Thu, Jun 13, 2019 at 1:46 PM Feng Zhang  wrote:
>
>> You can try to write the script to first scan all the files to get their
>> full path names and then run the Array jobs.
>>
>>
>> On Jun 13, 2019, at 1:20 PM, VG  wrote:
>>
>> HI Joshua,
>> I like the array job option because essentially it will still be 1 job
>> and it will run them in parallel.
>>
>> I have one issue though. I can create an array script, but here I
>> presented a simple problem. Actually my individual tar.gz files are under
>> respective directories
>> For example
>> dir1 has file1.tar.gz
>> dir2 has file2.tar.gz
>> dir3 has file3.tar.gz
>>
>> The way I was then submitting them was
>>
>> for i in *
>> do
>>   if [ -d "$i" ]
>>   then cd "$i"
>>   qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xf
>> *.tar"
>>   cd ..
>>   fi
>> done
>>
>> One way is I can pull out all the tar.gz in one folder and run array
>> script as you told, other wise is there a work around where everything runs
>> and also remains in the respective directories.
>>
>> Thanks for your help.
>>
>> Regards
>> Varun
>>
>>
>>
>>
>>
>> On Thu, Jun 13, 2019 at 1:11 PM Joshua Baker-LePain 
>> wrote:
>>
>>> On Thu, 13 Jun 2019 at 9:32am, VG wrote
>>>
>>> > I have a scripting question regarding submitting jobs to the cluster.
>>> > There is a limitation per user of 1000 jobs only.
>>> > Let's say I have 1200 tar.gz files
>>> >
>>> > I tried to submit all the jobs together but after 1000 jobs it gave me
>>> an
>>> > error message saying per user limit is 1000 and after that it did not
>>> > queued the remaining jobs.
>>> > I want to write a script where if the submitted jobs goes below
>>> > 1000(because they finished running), then next jobs are submitted in
>>> the
>>> > queue. How can I do that?
>>> > I have written something like this:
>>> >
>>> > for i in *tar.gz
>>> > do
>>> >   qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar
>>> -xzf $i"
>>> > done
>>>
>>> The right answer to this problem (not the scripting issue, but how to
>>> untar all 1200 files without running afoul of the 1000 job limit) is an
>>> array job.  You can submit 1 job with 1200 tasks to untar all the files.
>>> The relevant bits of the job script would include (assuming bash):
>>>
>>> targzs=(0 file1.tar.gz file2.tar.gz ... file1200.tar.gz)
>>> tar xzf ${targzs[$SGE_TASK_ID]}
>>>
>>> To submit it:
>>>
>>> qsub -t 1-1200 job.sh
>>>
>>> --
>>> Joshua Baker-LePain
>>> QB3 Shared Cluster Sysadmin
>>> UCSF
>>>
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Daniel Povey
By the way, un-tarring a file is an I/O bound process and it will usually
give you no benefit to run on more than about 4 machines.  Fastest and best
for the network would be to log into the file server if you have access,
and do it sequentially from there.


On Thu, Jun 13, 2019 at 12:44 PM Daniel Povey  wrote:

>
> for i in *tar.gz; do
>   while true; do
>     if [ $(qstat -u $USER | wc -l) -lt 900 ]; then break; fi
>     sleep 60
>   done
>   qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf $i"
> done
>
> On Thu, Jun 13, 2019 at 12:39 PM Skylar Thompson  wrote:
>
>> We've used resource quota sets to accomplish that on a per-queue or
>> per-project basis. I don't know that you can limit on jobs in RQSs but you
>> certainly can on slots; the sge_resource_quota(5) man page has some
>> examples.
>>
>> On Thu, Jun 13, 2019 at 12:32:51PM -0400, VG wrote:
>> >  I have a scripting question regarding submitting jobs to the cluster.
>> > There is a limitation per user of 1000 jobs only.
>> > Let's say I have 1200 tar.gz files
>> >
>> >
>> > I tried to submit all the jobs together but after 1000 jobs it gave me
>> an
>> > error message saying per user limit is 1000 and after that it did not
>> > queued the remaining jobs.
>> > I want to write a script where if the submitted jobs goes below
>> > 1000(because they finished running), then next jobs are submitted in the
>> > queue. How can I do that?
>> > I have written something like this:
>> >
>> > for i in *tar.gz
>> > do
>> >qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar
>> -xzf $i"
>> > done
>> >
>> > Hope to hear from you soon.
>> >
>> > Regards
>> > Varun
>>
>> > ___
>> > users mailing list
>> > users@gridengine.org
>> > https://gridengine.org/mailman/listinfo/users
>>
>>
>> --
>> -- Skylar Thompson (skyl...@u.washington.edu)
>> -- Genome Sciences Department, System Administrator
>> -- Foege Building S046, (206)-685-7354
>> -- University of Washington School of Medicine
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>


Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Daniel Povey
for i in *tar.gz; do
  while true; do
    if [ $(qstat -u $USER | wc -l) -lt 900 ]; then break; fi
    sleep 60
  done
  qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf $i"
done

On Thu, Jun 13, 2019 at 12:39 PM Skylar Thompson  wrote:

> We've used resource quota sets to accomplish that on a per-queue or
> per-project basis. I don't know that you can limit on jobs in RQSs but you
> certainly can on slots; the sge_resource_quota(5) man page has some
> examples.
>
> On Thu, Jun 13, 2019 at 12:32:51PM -0400, VG wrote:
> >  I have a scripting question regarding submitting jobs to the cluster.
> > There is a limitation per user of 1000 jobs only.
> > Let's say I have 1200 tar.gz files
> >
> >
> > I tried to submit all the jobs together but after 1000 jobs it gave me an
> > error message saying per user limit is 1000 and after that it did not
> > queued the remaining jobs.
> > I want to write a script where if the submitted jobs goes below
> > 1000(because they finished running), then next jobs are submitted in the
> > queue. How can I do that?
> > I have written something like this:
> >
> > for i in *tar.gz
> > do
> >qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf
> $i"
> > done
> >
> > Hope to hear from you soon.
> >
> > Regards
> > Varun
>
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
>
>
> --
> -- Skylar Thompson (skyl...@u.washington.edu)
> -- Genome Sciences Department, System Administrator
> -- Foege Building S046, (206)-685-7354
> -- University of Washington School of Medicine
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] jobs randomly die

2019-05-14 Thread Daniel Povey
I have observed apparently random failures when users had gids inside
`gid_range` (see below); gid_range should be set outside the range in which
users' gids fall.
But usually this kind of thing would be due to OOM.

qconf -sconf | grep  gid_range
gid_range                    5-51000


On Tue, May 14, 2019 at 10:42 AM Reuti  wrote:

> AFAICS the kill sent by SGE happens after a task has already returned with
> an error. SGE would in this case use the kill signal to be sure to kill all
> child processes. Hence the question would be: what was the initial command
> in the job script, and what output/error did it generate?
>
> -- Reuti
>
> > On 14.05.2019 at 11:36, hiller wrote:
> >
> > Dear all,
> > i have a problem that jobs sent to gridengine randomly die.
> > The gridengine version is 8.1.9
> > The OS is opensuse 15.0
> > The gridengine messages file says:
> > 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed -
> killing job
> > 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10
> assumedly after job because: job 635659.1 died through signal KILL (9)
> >
> > qacct -j 635659 says:
> > failed   100 : assumedly after job
> > exit_status  137  (Killed)
> >
> >
> > The was no kill triggered by the user. Also there are no other
> limitations, neither ulimit nor in the gridengine queue
> > The 'qconf -sq all.q' command gives:
> > s_rt  INFINITY
> > h_rt  INFINITY
> > s_cpu INFINITY
> > h_cpu INFINITY
> > s_fsize   INFINITY
> > h_fsize   INFINITY
> > s_dataINFINITY
> > h_dataINFINITY
> > s_stack   INFINITY
> > h_stack   INFINITY
> > s_coreINFINITY
> > h_coreINFINITY
> > s_rss INFINITY
> > h_rss INFINITY
> > s_vmemINFINITY
> > h_vmemINFINITY
> >
> > Years ago there were some threads about the same issue, but i did not
> find a solution.
> >
> > Does somebody have a hint what i can do or check/debug?
> >
> > With kind regards and many thanks for any help, ulrich
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] Limiting each user's slots across all nodes

2019-03-12 Thread Daniel Povey
When I see weird things like this (and it happens), my reaction is usually,
"It's probably a bug somewhere deep in the code.  Just change something
about your setup to make it go away".
In future I hope to switch to slurm.  It doesn't have great architecture
but I think it's better maintained, and it's certainly much newer / more
modern (e.g. none of GridEngine's low-level C linked-list code).

On Tue, Mar 12, 2019 at 1:02 PM David Trimboli  wrote:

> The "threads" PE is referenced by all hosts (as "@/") in the queue
> configuration. There are no user lists or restrictions in the PE.
> On 3/12/2019 12:56 PM, Ian Kaufman wrote:
>
> And do you define host groups in the PE?
>
> On Tue, Mar 12, 2019 at 9:53 AM David Trimboli  wrote:
>
>>
>> On 3/12/2019 12:05 PM, Ian Kaufman wrote:
>> > Are mynode{17-24} in a queue that is configured to use your "threads"
>> PE?
>>
>>
>> Yes. If you disable the limit, the submission works just fine. Jobs go
>> to the all.q queue, and that queue references the threads PE.
>>
>>
>
> --
> Ian Kaufman
> Research Systems Administrator
> UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] Different GDI version between client and qmaster

2019-02-26 Thread Daniel Povey
Presumably a combination of newer and older packages.
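
Since this is a Debian-style install (Ubuntu 14, per below), one quick way to
confirm would be to list the installed gridengine package versions on the
submit host and on the qmaster host and compare:

dpkg -l 'gridengine*' | awk '/^ii/ {print $2, $3}'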

On Tue, Feb 26, 2019 at 10:02 PM Radhouane Aniba  wrote:

> Hi everyone
>
> I am trying to run a python code on SGE but I am running through this issue
>
> *drmaa.errors.DrmCommunicationException: code 2: denied: client (xxx) uses
> old GDI version 6.2u5 while qmaster uses newer version 8.1.9*
>
> I am using ubuntu 14 and I am wondering if anyone went through this issue
> before. More importantly, I would like to understand why this happens in
> the first place.
>
> Thanks in advance
>
> Rad
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Daniel Povey
It may depend on specific features of those large job arrays.  You could
try deleting them and see if the problem disappears.
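
For example (qstat's -g d flag expands array jobs into their individual
tasks, so a huge task range shows up in the sheer line count):

qstat -u '*' -g d | wc -l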

On Sat, Jan 26, 2019 at 2:23 PM Joseph Farran  wrote:

> Hi Daniel.
>
> Yes I do have large job-arrays around 7k tasks BUT I have had larger job
> arrays of 500k without seeing this kind of slowdown.
>
> Joseph
>
>
> On 1/26/2019 10:16 AM, Daniel Povey wrote:
> > Check if there are any huge jobs in the queue. Sometimes very large task
> ranges, or large numbers of jobs, can make it slow.
> >
> On Sat, Jan 26, 2019 at 7:05 AM Reuti <re...@staff.uni-marburg.de> wrote:
> >
> > Hi,
> >
> > > On 26.01.2019 at 10:20, Joseph Farran <jfar...@uci.edu> wrote:
> > >
> > > Hi.
> > > Our Grid Engine is running very sluggish all of a sudden.
> sge_qmaster stays at 100% all the time where it used to be 100% for a few
> seconds every 30 seconds or so.
> > > I ran the qping command but not sure how to read it.  Any helpful
> insight much appreciated
> >
> > Did you try to stop and start the qmaster?
> >
> > -- Reuti
> >
> >
> > > qping -i 5 -info hpc-s 6444 qmaster 1
> > > 01/26/2019 01:12:18:
> > > SIRM version: 0.1
> > > SIRM message id:  1
> > > start time:   01/26/2019 01:10:13 (1548493813)
> > > run time [s]: 125
> > > messages in read buffer:  0
> > > messages in write buffer: 0
> > > no. of connected clients: 296
> > > status:   0
> > > info: MAIN: R (125.20) | signaler000: R
> (123.69) | event_master000: R (0.14) | timer000: R (4.52) | worker000: R
> (0.14) | worker001: R (3.44) | worker002: R (7.33) |
> > worker003: R (3.43) | worker004: R (3.08) | worker005: R (1.42) | OK
> > > malloc:   arena(34410496) |ordblks(9370) |
> smblks(164269) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(7726000) |
> uordblks(24248176) | fordblks(10162320) | keepcost(119856)
> > > Monitor:
> > > 01/26/2019 01:10:13 | MAIN: no monitoring data available
> > > 01/26/2019 01:10:14 | signaler000: no monitoring data available
> > > 01/26/2019 01:12:14 | event_master000: runs: 4.82r/s (clients:
> 1.00 mod: 0.02/s ack: 0.02/s blocked: 0.00 busy: 0.81 | events: 5.52/s
> added: 5.47/s skipt: 0.05/s) out: 0.00m/s APT: 0.0002s/m
> > idle: 99.89% wait: 0.00% time: 60.00s
> > > 01/26/2019 01:12:14 | timer000: runs: 0.47r/s (pending: 12.00
> executed: 0.45/s) out: 0.00m/s APT: 0.0002s/m idle: 99.99% wait: 0.00%
> time: 60.00s
> > > 01/26/2019 01:11:19 | worker000: runs: 0.68r/s (EXECD
> (l:0.32,j:0.28,c:0.32,p:0.00,a:0.00)/s GDI
> (a:0.25,g:1.08,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.82m/s APT: 0.0036s/m
> > idle: 99.75% wait: 0.00% time: 64.96s
> > > 01/26/2019 01:12:15 | worker001: runs: 0.81r/s (EXECD
> (l:0.02,j:0.02,c:0.02,p:0.00,a:0.00)/s GDI
> (a:0.00,g:1.92,m:0.08,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.81m/s APT: 0.0008s/m
> > idle: 99.93% wait: 0.00% time: 59.27s
> > > 01/26/2019 01:11:16 | worker002: runs: 0.73r/s (EXECD
> (l:0.28,j:0.23,c:0.26,p:0.00,a:0.00)/s GDI
> (a:0.34,g:1.13,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.71m/s APT: 0.0030s/m
> > idle: 99.78% wait: 0.17% time: 61.75s
> > > 01/26/2019 01:12:15 | worker003: runs: 0.75r/s (EXECD
> (l:0.03,j:0.02,c:0.03,p:0.00,a:0.00)/s GDI
> (a:0.02,g:1.23,m:0.07,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.73m/s APT: 0.0008s/m
> > idle: 99.94% wait: 0.02% time: 60.40s
> > > 01/26/2019 01:11:26 | worker004: runs: 0.68r/s (EXECD
> (l:0.23,j:0.21,c:0.23,p:0.00,a:0.00)/s GDI
> (a:0.27,g:1.69,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.65m/s APT: 0.0012s/m
> > idle: 99.92% wait: 0.00% time: 71.11s
> > > 01/26/2019 01:11:31 | worker005: runs: 0.56r/s (EXECD
> (l:0.25,j:0.24,c:0.25,p:0.00,a:0.00)/s GDI
> (a:0.20,g:1.05,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.55m/s APT: 0.0011s/m
> > idle: 99.94% wait: 0.00% time: 76.48s
> > >
> > > Joseph
> > >
> > >
> > > ___
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> > >
> >
> >
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> >
>


Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Daniel Povey
Check if there are any huge jobs in the queue.  Sometimes very large task
ranges, or large numbers of jobs, can make it slow.

On Sat, Jan 26, 2019 at 7:05 AM Reuti  wrote:

> Hi,
>
> > On 26.01.2019 at 10:20, Joseph Farran wrote:
> >
> > Hi.
> > Our Grid Engine is running very sluggish all of a sudden. sge_qmaster
> stays at 100% all the time where it used to be 100% for a few seconds every
> 30 seconds or so.
> > I ran the qping command but not sure how to read it.   Any helpful
> insight much appreciated
>
> Did you try to stop and start the qmaster?
>
> -- Reuti
>
>
> > qping -i 5 -info hpc-s 6444 qmaster 1
> > 01/26/2019 01:12:18:
> > SIRM version: 0.1
> > SIRM message id:  1
> > start time:   01/26/2019 01:10:13 (1548493813)
> > run time [s]: 125
> > messages in read buffer:  0
> > messages in write buffer: 0
> > no. of connected clients: 296
> > status:   0
> > info: MAIN: R (125.20) | signaler000: R (123.69) |
> event_master000: R (0.14) | timer000: R (4.52) | worker000: R (0.14) |
> worker001: R (3.44) | worker002: R (7.33) | worker003: R (3.43) |
> worker004: R (3.08) | worker005: R (1.42) | OK
> > malloc:   arena(34410496) |ordblks(9370) |
> smblks(164269) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(7726000) |
> uordblks(24248176) | fordblks(10162320) | keepcost(119856)
> > Monitor:
> > 01/26/2019 01:10:13 | MAIN: no monitoring data available
> > 01/26/2019 01:10:14 | signaler000: no monitoring data available
> > 01/26/2019 01:12:14 | event_master000: runs: 4.82r/s (clients: 1.00 mod:
> 0.02/s ack: 0.02/s blocked: 0.00 busy: 0.81 | events: 5.52/s added: 5.47/s
> skipt: 0.05/s) out: 0.00m/s APT: 0.0002s/m idle: 99.89% wait: 0.00% time:
> 60.00s
> > 01/26/2019 01:12:14 | timer000: runs: 0.47r/s (pending: 12.00 executed:
> 0.45/s) out: 0.00m/s APT: 0.0002s/m idle: 99.99% wait: 0.00% time: 60.00s
> > 01/26/2019 01:11:19 | worker000: runs: 0.68r/s (EXECD
> (l:0.32,j:0.28,c:0.32,p:0.00,a:0.00)/s GDI
> (a:0.25,g:1.08,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.82m/s APT: 0.0036s/m idle: 99.75% wait: 0.00% time: 64.96s
> > 01/26/2019 01:12:15 | worker001: runs: 0.81r/s (EXECD
> (l:0.02,j:0.02,c:0.02,p:0.00,a:0.00)/s GDI
> (a:0.00,g:1.92,m:0.08,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.81m/s APT: 0.0008s/m idle: 99.93% wait: 0.00% time: 59.27s
> > 01/26/2019 01:11:16 | worker002: runs: 0.73r/s (EXECD
> (l:0.28,j:0.23,c:0.26,p:0.00,a:0.00)/s GDI
> (a:0.34,g:1.13,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.71m/s APT: 0.0030s/m idle: 99.78% wait: 0.17% time: 61.75s
> > 01/26/2019 01:12:15 | worker003: runs: 0.75r/s (EXECD
> (l:0.03,j:0.02,c:0.03,p:0.00,a:0.00)/s GDI
> (a:0.02,g:1.23,m:0.07,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.73m/s APT: 0.0008s/m idle: 99.94% wait: 0.02% time: 60.40s
> > 01/26/2019 01:11:26 | worker004: runs: 0.68r/s (EXECD
> (l:0.23,j:0.21,c:0.23,p:0.00,a:0.00)/s GDI
> (a:0.27,g:1.69,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.65m/s APT: 0.0012s/m idle: 99.92% wait: 0.00% time: 71.11s
> > 01/26/2019 01:11:31 | worker005: runs: 0.56r/s (EXECD
> (l:0.25,j:0.24,c:0.25,p:0.00,a:0.00)/s GDI
> (a:0.20,g:1.05,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out:
> 0.55m/s APT: 0.0011s/m idle: 99.94% wait: 0.00% time: 76.48s
> >
> > Joseph
> >
> >
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> >
>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-17 Thread Daniel Povey
It seems to me that this likely isn't closely related to GridEngine itself,
but more about something going on on that node.  You'd have to debug by
looking at 'top' output, system logs, iostat, ifstat, to see if it's about
heavy usage by some existing job, or some kind of kernel hang.  But it's
likely a system issue, not a GridEngine issue.
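
For a first pass on the affected node, something like this (assuming the
sysstat and iproute packages are installed; "bad-node" is a placeholder):

ssh bad-node 'uptime; dmesg -T | tail -20; iostat -x 1 3; ss -s'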

Dan


On Thu, Jan 17, 2019 at 9:58 PM Derek Stephenson <
derek.stephen...@awaveip.com> wrote:

> Hello,
>
>
> I should preface this with: I've just recently started getting my head
> around grid engine and as such may not have all the information I should
> for administering the grid, but someone has to do it. Anyways...
>
>
> Our company came across an issue recently where one of the nodes seems to
> become very delayed in its response to grid submissions.  Whether it be a
> qsub, qrsh or qlogin submission, jobs can take anywhere from 30 s to 4-5 min
> to submit successfully. In particular, while users may complain that a qsub
> job looks like it has submitted but does nothing, doing a qlogin to the node
> in question will give the following:
>
>
> Your job 287104 ("QLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired while
> waiting on socket fd 7
>
> Now I've seen a series of forum articles bring up this message while
> searching through back logs, but there never seem to be any conclusions in
> those threads for me to start delving into on our end.
>
> Our past attempts to resolve the issue have only succeeded by rebooting
> the node in question, and not having any real idea why is becoming a
> general frustration.
>
> Any initial thoughts/pointers would be greatly appreciated
>
> Kind Regards,
>
> Derek
>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] Processes not exiting

2018-11-15 Thread Daniel Povey
Make sure the gid_range is set to a range in which none of your system's
users have group-ids.  Otherwise it will kill the wrong things.

On Thu, Nov 15, 2018 at 6:10 PM  wrote:

> Hay, William wrote on 11/14/18 04:21:
> > Do you have ENABLE_ADDGRP_KILL set?  Can be helpful in killing processes
> left behind when a job exits.
>
> We don't have that set yet.  I will try setting ENABLE_ADDGRP_KILL=TRUE
> in the execd_params for the global configuration and see if it helps.
> Thanks
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


[gridengine users] Alternatives to Son of GridEngine

2018-11-12 Thread Daniel Povey
Everyone,
I'm trying to understand the landscape of alternatives to Son of
GridEngine, since the maintenance situation isn't great right now and I'm
not sure that it has a long term future.
If you guys were to switch to something in the same universe of products,
what would it be to?  Univa GridEngine?  slurm?  Which of these, as far as
you know, is better maintained and has a better future?
I'm not interested in fancy new things like mesos that have a different
programming model or are too new.

Dan


Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
Sorry, what I wrote was confusing due to an errant paste.  Edited below.
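
[For the record, the port-blocking idea mentioned below might look something
like this -- untested, and assuming the stock qmaster port 6444:

iptables -I INPUT -p tcp --dport 6444 ! -s 127.0.0.1 -j DROP
# ... do the spool-db surgery, then remove the rule:
iptables -D INPUT -p tcp --dport 6444 ! -s 127.0.0.1 -j DROP

so that qsub/qstat clients can't reach the qmaster mid-maintenance.]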

On Sat, Nov 10, 2018 at 5:03 PM Daniel Povey  wrote:

> I was able to fix it, although I suspect that my fix may have been
> disruptive to the jobs.
>
> Firstly, I  believe the problem was that gridengine does not handle a
> deleted job (state 'dr') that is on a host that has been deleted, and it
> dies when it sees it.   Presumably the bug is in allowing it to be deleted
> in the first place.
>
> Anyway, my fix (after backing up the directory /var/spool/gridengine) was
> to move the file /var/spool/gridengine/spooldb/sge_job to a temporary
> location, restart the qmaster, add the host back with qconf -ah, stop the
> qmaster, restore the old database  /var/spool/gridengine/spooldb/sge_job,
> and restart the qmaster.
>
> Before doing that whole procedure, to stop the hosts getting confused I
> stopped all the gridengine-exec services.  That probably wasn't optimal
> because clients like qsub and qstat would still have been able to access
> the queue in the interim, and it definitely would have confused them and
> killed some processes.  Unfortunately I had to do this on short notice and
> wasn't sure how to use iptables to close off those ports from outside the
> qmaster while I did the maintenance-- that would have been a better
> solution.
>
> Also I encountered a hiccup that `systemctl stop gridengine-qmaster`
> didn't actually work the second time, the process was still running, with
> the old database, so I had to manually kill it and retry.
>
> Anyway this whole episode is making me think more seriously about moving
> to Univa GridEngine.  I've known for a long time that the free version has
> a lot of bugs, and I just don't have time to deal with this type of thing.
>
>
> On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <
> john.marsha...@canada.ca> wrote:
>
>> Hi,
>>
>> I've never seen this but I would start with:
>> 1) strace qmaster during restart to try to see at which point it is dying
>> (e.g.,
>> loading a config file)
>> 2) look for any reference to the name of the host you deleted in the spool
>> area and do some cleanup
>> 3) clean out the jobs spool area
>>
>> HTH,
>> John
>>
>> On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
>>
>> Has anyone found this error, and managed to fix it?
>> I am in a very difficult situation.
>> I deleted a host (qconf -de hostname) thinking that the machine no longer
>> existed, but it did exist, and there was a job in 'dr' state there.
>> After I attempted to force-delete that job (qdel -f job-id), the queue
>> master died with out-of-memory, and now I can't restart qmaster.
>>
>> So now I don't know how to fix it.  Am I just completely lost now?
>>
>> Dan
>>
>> ___
>>
>> users mailing list
>>
>> users@gridengine.org
>>
>> https://gridengine.org/mailman/listinfo/users
>>
>>
>>


Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
I was able to fix it, although I suspect that my fix may have been
disruptive to the jobs.

Firstly, I  believe the problem was that gridengine does not handle a
deleted job that is on a host that has been deleted, and it dies when it
sees it.   Presumably the bug is in allowing it to be deleted in the first
place.

Anyway, my fix (after backing up the directory /var/spool/gridengine) was
to move the file /var/spool/gridengine/spooldb/sge_job to a temporary
location, restart the qmaster, add the host back with qconf -ah, stop the
qmaster, restore the old database  /var/spool/gridengine/spooldb/sge_job,
and restart the qmaster.

Before doing that whole procedure, to stop the hosts getting confused I
stopped all the gridengine-exec services.  That probably wasn't optimal
because clients like qsub and qstat would still have been able to access
the queue in the interim, and it definitely would have confused them and
killed some processes.  Unfortunately I had to do this on short notice and
wasn't sure how to use iptables to close off those ports from outside the
qmaster while I did the maintenance-- that would have been a better
solution.

Also I encountered a hiccup that `systemctl stop gridengine-qmaster` didn't
actually work the second time, the process was still running, with the old
database, so I had to manually kill it and retry.

Anyway this whole episode is making me think more seriously about moving to
Univa GridEngine.  I've known for a long time that the free version has a
lot of bugs, and I just don't have time to deal with this type of thing.


On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <
john.marsha...@canada.ca> wrote:

> Hi,
>
> I've never seen this but I would start with:
> 1) strace qmaster during restart to try to see at which point it is dying
> (e.g.,
> loading a config file)
> 2) look for any reference to the name of the host you deleted in the spool
> area and do some cleanup
> 3) clean out the jobs spool area
>
> HTH,
> John
>
> On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
>
> Has anyone found this error, and managed to fix it?
> I am in a very difficult situation.
> I deleted a host (qconf -de hostname) thinking that the machine no longer
> existed, but it did exist, and there was a job in 'dr' state there.
> After I attempted to force-delete that job (qdel -f job-id), the queue
> master died with out-of-memory, and now I can't restart qmaster.
>
> So now I don't know how to fix it.  Am I just completely lost now?
>
> Dan
>
> ___
>
> users mailing list
>
> users@gridengine.org
>
> https://gridengine.org/mailman/listinfo/users
>
>
>


[gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
Has anyone found this error, and managed to fix it?
I am in a very difficult situation.
I deleted a host (qconf -de hostname) thinking that the machine no longer
existed, but it did exist, and there was a job in 'dr' state there.
After I attempted to force-delete that job (qdel -f job-id), the queue
master died with out-of-memory, and now I can't restart qmaster.

So now I don't know how to fix it.  Am I just completely lost now?

Dan


Re: [gridengine users] sge_execd dies

2018-11-08 Thread Daniel Povey
OK, well there's your problem.  You need to increase the start of gid_range
to a value larger than the largest 'real' group id on your system.
The name is a little confusing.  It needs to be a range that's disjoint
from the range of gids actually in use.
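
E.g. (the range here is hypothetical -- anything comfortably above every real
gid on the system works):

qconf -mconf
# then, in the editor that opens, set something like:
#   gid_range   20000-21000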


On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran  wrote:

> Hi Dan.
>
> Thank you for the suggestion.   Here is what I have:
>
> # qconf -sconf | grep gid_range
> gid_range                    200-70
>
> The highest gid is 3135.
> Best,
> Joseph
>
> On 11/8/2018 8:58 PM, Daniel Povey wrote:
>
> Do
> qconf -sconf | grep gid_range
> and check whether any of your users have group id's in that range.  That
> can lead to things being killed.
> Dan
>
>
> On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran  wrote:
>
>> Greetings.
>>
>> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>>
>> I am seeing job failures on nodes where the node's sge_execd
>> unexpectedly dies.
>>
>> I ran strace on the node's sge_execd and it's not of much help.   It
>> always ends with
>>
>> +++ killed by SIGKILL +++
>>
>> But I cannot tell what killed it.  Dmesg has nothing of segfault nor
>> memory issues.  The sge_qmaster on the head node is never affected and
>> it runs just fine.  The issue is on the client's sge_execd and 80% of nodes
>> are not affected, only some 20% of the nodes.
>>
>> Here are some sge settings:
>>
>> qmaster_params   MONITOR_TIME=0:1:00  LOG_Monitor_Message=0
>> execd_params ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>>  H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity,
>> \
>>  S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>>  H_MAXPROC=infinity,S_LOCKS=infinity, \
>>  H_LOCKS=infinity,
>> USE_SMAPS=yes,ENABLE_BINDING=TRUE
>>
>> max_aj_instances 2000
>> max_aj_tasks 0
>> max_u_jobs   90
>> max_jobs 90
>> max_advance_reservations 300
>>
>> I also tried playing with vm settings to:
>>
>> /sbin/sysctl vm.overcommit_ratio=100
>> /sbin/sysctl vm.overcommit_memory=2
>>
>> But it has not been of much help - sge_execd keeps dying.
>>
>> Any help on how I can track down what is causing the node client
>> sge_execd to die?
>>
>> Joseph
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>
>


Re: [gridengine users] sge_execd dies

2018-11-08 Thread Daniel Povey
Do
qconf -sconf | grep gid_range
and check whether any of your users have group id's in that range.  That
can lead to things being killed.
Dan


On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran  wrote:

> Greetings.
>
> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>
> I am seeing job failures on nodes where the node's sge_execd unexpectedly
> dies.
>
> I ran strace on the node's sge_execd and it's not of much help.   It
> always ends with
>
> +++ killed by SIGKILL +++
>
> But I cannot tell what killed it.  Dmesg has nothing of segfault nor
> memory issues.  The sge_qmaster on the head node is never affected and it
> runs just fine.  The issue is on the client's sge_execd and 80% of nodes
> are not affected, only some 20% of the nodes.
>
> Here are some sge settings:
>
> qmaster_params   MONITOR_TIME=0:1:00  LOG_Monitor_Message=0
> execd_params ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>  H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
>  S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>  H_MAXPROC=infinity,S_LOCKS=infinity, \
>  H_LOCKS=infinity,
> USE_SMAPS=yes,ENABLE_BINDING=TRUE
>
> max_aj_instances 2000
> max_aj_tasks 0
> max_u_jobs   90
> max_jobs 90
> max_advance_reservations 300
>
> I also tried playing with vm settings to:
>
> /sbin/sysctl vm.overcommit_ratio=100
> /sbin/sysctl vm.overcommit_memory=2
>
> But it has not been of much help - sge_execd keeps dying.
>
> Any help on how I can track down what is causing the node client sge_execd
> to die?
>
> Joseph
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>


Re: [gridengine users] Dave Love repository issue

2018-10-12 Thread Daniel Povey
There is an issue tracker here
https://arc.liv.ac.uk/trac
but it's not clear whether Dave Love still has access to it (he moved to
Manchester and for a while at least he did not have access; and he doesn't
seem to have been working on GridEngine lately anyway).  Also I couldn't
figure out where in the issue tracker you are supposed to make a new issue;
you probably have to create an account first.

I made an attempt to re-start a GitHub-based version of the repo, here
https://github.com/son-of-gridengine/sge
but the project is not exactly off the ground, partly due to Dave's
objections and also due to lack of clarity about whether he plans to
continue maintaining GridEngine.   You could create an issue on the github
if you want, but I don't promise that that project will necessarily live on.

If you look at the issues in the issue tracker
https://arc.liv.ac.uk/trac/SGE/query?status=!closed=3=priority
there are a rather scary number of existing, un-resolved issues.
To me it raises the question of whether GridEngine might be just too big,
to old, and too encumbered with features, to be maintainable as an
open-source project.  But I also don't know what the most viable
alternative is.

Dan

On Fri, Oct 12, 2018 at 1:59 PM Jerome  wrote:

> Dear all.
>
> I've been following the discussion about the future of SoGE.
> I've downloaded the git repository from Dave Love on GitLab
> (https://gitlab.com/loveshack/sge), and tried to use it on a Debian-based
> system.
> The problem is that I've got a segmentation fault in the sge_qmaster binary.
>
> Where can I report this issue? I admit that I'm a bit confused about
> reporting issues with SGE (SoGE).
>
> Regards
>
>
> --
> -- Jérôme
> La bêtise insiste toujours.
> (Albert Camus)
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Execution node no host information

2018-09-17 Thread Daniel Povey
Setting the hostname may depend slightly on the Linux flavor... I'd start
with editing /etc/hostname and /etc/hosts, and then, if systemctl is
installed, use `hostnamectl set-hostname your_hostname`.
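
For example, roughly (assuming a systemd-based box and that the node should
be called 'client'; the address below is a placeholder, use the real one):

  hostnamectl set-hostname client
  echo '192.168.1.20   client' >> /etc/hosts   # placeholder address
  hostname                                     # should now print 'client'
  service gridengine-exec restart              # re-register with the qmaster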


On Mon, Sep 17, 2018 at 9:24 AM linux khbuex  wrote:

> Also, on the client, gethostname returns master; gethostbyname &
> gethostbyaddr return the correct name/IP mappings.
> I think gethostname should return client on the client machine.  Where is
> this info stored?
>
> On Mon, Sep 17, 2018 at 12:43 PM, linux khbuex 
> wrote:
>
>> Hi,
>>
>> I have 2 machines master, client. qhost -q gives:
>> HOSTNAME  ARCH       NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global    -             -    -    -    -     -       -       -       -       -
>> master    lx-amd64      8    4    8    8  0.00   15.7G    1.1G    2.0G     0.0
>>    all.q  BIP        0/0/7
>> client    -             -    -    -    -     -       -       -       -       -
>>
>> This happens when running qhost both on master and on client.  There is no
>> host info for node client (and no queue info either, even though it is
>> configured the same as master).
>>
>> qconf -se client gives:
>> hostname              client
>> load_scaling          NONE
>> complex_values        ram_free=14G
>> load_values           NONE
>> processors            0
>> user_lists            NONE
>> xuser_lists           NONE
>> projects              NONE
>> xprojects             NONE
>> usage_scaling         NONE
>> report_variables      NONE
>>
>> Notice the 0 in processors.  qconf -sel gives:
>> master
>> client
>>
>> Not sure if it matters, but initially gridengine-master was also installed
>> on the client and then removed.
>> service --status-all gives:
>> ...
>>  [ + ]  gridengine-exec
>> so no gridengine-master here.
>>
>> ps -ax |grep sge does give:
>>  1039 ?Sl 0:15 /usr/lib/gridengine/sge_qmaster
>> 21821 ?Sl 0:00 /usr/lib/gridengine/sge_execd
>> 22436 pts/0S+ 0:00 grep --color=auto sge
>>
>> Not sure what that means.  I tried reinstalling gridengine-client and
>> gridengine-exec as well, and also did dpkg-reconfigure gridengine-client.
>> Any suggestions on what else should be checked?  Which other configurations?
>> Thanks.
>>
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] cpu usage calculation

2018-08-31 Thread Daniel Povey
This gets back to the issue of who is going to maintain GridEngine.
Dave Love briefly resurfaced (enough to dissuade me from forming a
group to maintain it, we were going to make this its home
https://github.com/son-of-gridengine/sge) but seems to have gone under
again.  And actually I'm not sure that I have the time to lead that
project.  Are there other people other than Dave who have a good
understanding of its internals?

Dan
On Fri, Aug 31, 2018 at 10:56 AM William Hay  wrote:
>
> On Fri, Aug 31, 2018 at 10:27:39AM +, Marshall2, John (SSC/SPC) wrote:
> >Hi,
> >When gridengine calculates cpu usage (based on wallclock) it uses:
> >cpu usage = wallclock * nslots
> >This does not account for the number of cpus that may be used for
> >each slot, which is problematic.
> >I have written up an article at:
> >
> > https://expl.info/display/MISC/Slot+Multiplier+for+Calculating+CPU+Usage+in+Gridengine
> >which explains the issue and provides a patch (against sge-8.1.9)
> >so that:
> >cpu usage = wallclock * nslots * ncpus_per_slot
> >This makes the usage information much more useful/accurate
> >when using the fair share.
> >Have others encountered this issue? Feedback is welcome.
> >Thanks,
> >John
>
> Used to do something similar (our magic variable was thr short for
> threads).  The one thing that moved us away from that was in 8.x grid
> engine binds cores to slots via -binding.
>
> Rather than adding support for another mechanism to specify cores (slots,
> -binding) it might be a better idea to support calculating cores per
> slot based on -binding.
>
> That said I'm not a huge fan of -binding.  If a job has exclusive access
> to a node then the job can handle its own core binding.  If the job
> doesn't have exclusive access then binding strategies other than linear
> don't seem likely to be successful.
>
> William
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] different results on terminal and on submission via qsub

2018-07-09 Thread Daniel Povey
Using single quotes would fix it.
But fundamentally this is a bash question-- this list is for GridEngine
questions and more on this topic would bore the subscribers.  Best to look
for a bash tutorial online and learn more about it that way.
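
For the record, the single-quoted version would look roughly like this --
the outer single quotes stop your login shell from expanding $i and from
fighting with the inner double quotes:

  qsub -l h_vmem=4G -cwd -j y -b y -N n_tr -R y 'for i in *_1.fastq.gz; do echo $i >> t.txt; zcat $i | grep "GCTGGCAGAAGGTAACATG" >> t.txt ; echo >> t.txt ; done'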


On Mon, Jul 9, 2018 at 1:42 PM, VG  wrote:

> Hi Daniel,
> I used a named script and it is working as I expected.  I am
> just trying to understand what is wrong here on the command line when I
> submit it to qsub. I know double quotes can be a real pain. I have one set
> of double quotes around the seq I want to grep and the other double quotes
> are the submission quotes to the qsub command.
>
> Regards
> Varun
>
>
> On Mon, Jul 9, 2018 at 1:30 PM, Daniel Povey  wrote:
>
>> You should put that in a bash script and invoke it by name.
>>
>> It's really a miracle that it gave you anything even remotely close to
>> what you intended when run like that, because of how bash interprets
>> double-quotes and variables.  Inside double quotes it will expand variables
>> with their values at the time the contents of the double-quotes are
>> evaluated.
>>
>> Best to sidestep all those issues by running it as a named script.
>>
>>
>>
>>
>>
>> On Mon, Jul 9, 2018 at 1:23 PM, VG  wrote:
>>
>>> Hi,
>>> I am trying to run a command on the terminal and also submit it to the
>>> cluster but I am getting different results.
>>>
>>> When I type on the terminal this :
>>>
>>> for i in *_1.fastq.gz; do echo $i >> t.txt; zcat $i | grep
>>> "GCTGGCAGAAGGTAACATG" >> t.txt ; echo >> t.txt ; done
>>>
>>> I get the output like this
>>>
>>> adrenal_4a_ERR315335_1.fastq.gz
>>> GCANAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>>
>>> adrenal_4a_ERR315452_1.fastq.gz
>>> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
>>> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>>>
>>> adrenal_4c_ERR315392_1.fastq.gz
>>>
>>> adrenal_4c_ERR315450_1.fastq.gz
>>>
>>> and so on..
>>>
>>>
>>> This is the output I expected.
>>>
>>> When I submit the same command to the hpc cluster via qsub I am getting
>>> a completely different result
>>>
>>> *qsub -l h_vmem=4G -cwd -j y -b y -N n_tr -R y "for i in *_1.fastq.gz;
>>> do echo $i >> t.txt; zcat $i | grep "GCTGGCAGAAGGTAACATG" >> t.txt
>>> ; echo >> t.txt ; done"*
>>>
>>> adrenal_4a_ERR315452_1.fastq.gz
>>> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
>>> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>>>
>>> adrenal_4a_ERR315452_1.fastq.gz
>>>
>>> appendix_4a_ERR315437_1.fastq.gz
>>> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
>>> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>>>
>>> adrenal_4a_ERR315452_1.fastq.gz
>>> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
>>> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>>>
>>> adrenal_4a_ERR315452_1.fastq.gz
>>> GGACTGCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGAAACTAT
>>> GTAGCATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGA
>>>
>>> appendix_4a_ERR315465_1.fastq.gz
>>> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
>>> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>>>
>>> adrenal_4a_ERR315452_1.fastq.gz
>>>
>>> appendix_4b_ERR315345_1.fastq.gz
>>> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
>>> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>>>
>>> adrenal_4a_ERR315452_1.fastq.gz
>>> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
>>> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>>> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
>>> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>>>
>>> What is it that I am doing wrong here?
>>>
>>> Thanks
>>>
>>> Regards
>>> Varun
>>>
>>> ___
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>>>
>>>
>>
>
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] different results on terminal and on submission via qsub

2018-07-09 Thread Daniel Povey
You should put that in a bash script and invoke it by name.

It's really a miracle that it gave you anything even remotely close to what
you intended when run like that, because of how bash interprets
double-quotes and variables.  Inside double quotes it will expand variables
with their values at the time the contents of the double-quotes are
evaluated.

Best to sidestep all those issues by running it as a named script.
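
A minimal sketch of what that looks like (the file name is arbitrary):

  $ cat find_seq.sh
  #!/bin/bash
  # search every *_1.fastq.gz in the working directory for the sequence
  for i in *_1.fastq.gz; do
    echo "$i" >> t.txt
    zcat "$i" | grep "GCTGGCAGAAGGTAACATG" >> t.txt
    echo >> t.txt
  done

  $ qsub -l h_vmem=4G -cwd -j y -N n_tr find_seq.sh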





On Mon, Jul 9, 2018 at 1:23 PM, VG  wrote:

> Hi,
> I am trying to run a command on the terminal and also submit it to the
> cluster but I am getting different results.
>
> When I type on the terminal this :
>
> for i in *_1.fastq.gz; do echo $i >> t.txt; zcat $i | grep
> "GCTGGCAGAAGGTAACATG" >> t.txt ; echo >> t.txt ; done
>
> I get the output like this
>
> adrenal_4a_ERR315335_1.fastq.gz
> GCANAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
>
> adrenal_4a_ERR315452_1.fastq.gz
> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>
> adrenal_4c_ERR315392_1.fastq.gz
>
> adrenal_4c_ERR315450_1.fastq.gz
>
> and so on..
>
>
> This is the output I expected.
>
> When I submit the same command to the hpc cluster via qsub I am getting a
> completely different result
>
> *qsub -l h_vmem=4G -cwd -j y -b y -N n_tr -R y "for i in *_1.fastq.gz; do
> echo $i >> t.txt; zcat $i | grep "GCTGGCAGAAGGTAACATG" >> t.txt ;
> echo >> t.txt ; done"*
>
> adrenal_4a_ERR315452_1.fastq.gz
> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>
> adrenal_4a_ERR315452_1.fastq.gz
>
> appendix_4a_ERR315437_1.fastq.gz
> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>
> adrenal_4a_ERR315452_1.fastq.gz
> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>
> adrenal_4a_ERR315452_1.fastq.gz
> GGACTGCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGAAACTAT
> GTAGCATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGA
>
> appendix_4a_ERR315465_1.fastq.gz
> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>
> adrenal_4a_ERR315452_1.fastq.gz
>
> appendix_4b_ERR315345_1.fastq.gz
> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>
> adrenal_4a_ERR315452_1.fastq.gz
> GCAAAGGCCAATGTTGGTGCTGGCAGAAGGTAACATGAAGGAACTATGTAGC
> ATAGTGTCTTAACACCTCAGTAAAGAGATCGGAAGAGCACA
> CAAGAACAGAATGAAGAAAGTCAGACTGCAAAGGCCAATGTTGGTGCTGGCA
> GAAGGTAACATGAAGAAACTATGTAGCATAGTGTCTT
>
> What is it that I am doing wrong here?
>
> Thanks
>
> Regards
> Varun
>
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
>
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Possible opportunity for development work

2018-05-14 Thread Daniel Povey
Thanks for that info RE the code.

FWIW, we have set up GPU resources the same way in our cluster, and
haven't run into that bug.

I wonder if deleting that execution host and adding it back again
might work around your issue.
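
If you try that, the usual dance is along these lines ('gpunode1' is a
stand-in for the affected host):

  qconf -se gpunode1 > gpunode1.conf   # save the execution host definition
  qconf -de gpunode1                   # delete the execution host
  qconf -Ae gpunode1.conf              # add it back from the saved file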

Dan


On Mon, May 14, 2018 at 12:52 PM, Joshua Baker-LePain <j...@salilab.org> wrote:
> On Sun, 13 May 2018 at 8:49pm, Daniel Povey wrote
>
>> Can you show the full output from when you do `qstat -j <jobid>` for
>> the job that's pending?
>
>
> Unfortunately I had to change our setup so that GPU jobs would actually flow
> through the queues -- we're no longer using a consumable gpu complex. Our
> current setup, though, is far from perfect, which is why we're looking to
> help get this fixed.
>
> In this <http://gridengine.org/pipermail/users/2018-April/010120.html>
> message from the previous thread I mention that the 'qstat -j' output is
> unremarkable.  It details all the queues the job can't run in (all for
> legitimate reasons).  It's also notable that 'qalter -w p' always said
> "verification: found possible assignment with 5 slots" when jobs got stuck
> in this state.
>
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
oops my bad, looks like it means 'job granted'.  sorry for the spam.

On Mon, May 14, 2018 at 12:17 AM, Daniel Povey <dpo...@gmail.com> wrote:
> And an interesting tidbit from the house of horrors that is SGE code:
>
> A bunch of variable names in sge_follow.c have JG in them, e.g. JG_qhostname.
> I thought: this must be some highly informative variable naming
> system, if only I could crack it (although for my own projects I don't
> allow naming systems that are this non-obvious).
>
> From grepping around, though, it actually seems to be the initials of
> an SGE developer, Joachim Graeber :-(
>
> If my students submitted code like this to me in a pull request, I not
> only wouldn't merge it, I would be angry at them for even thinking
> they might get it past me.  But I guess we don't get to choose when
> it's legacy code.
>
> Dan
>
>
> On Mon, May 14, 2018 at 12:04 AM, Daniel Povey <dpo...@gmail.com> wrote:
>> Can you please create an issue for this at the new github location?
>>   https://github.com/son-of-gridengine/sge/issues
>> I'm having a look at the code (sge_follow.c; search for
>> MSG_JOB_RESOURCESNOLONGERAVAILABLE_UU).
>>
>> It does look like it could potentially be a bug like you say, but I'm
>> having a hard time understanding the code.  It seems to have been
>> written before the days when documentation was expected in projects
>> like this.
>>
>>
>> On Sun, May 13, 2018 at 11:49 PM, Daniel Povey <dpo...@gmail.com> wrote:
>>> Can you show the full output from when you do `qstat -j <jobid>` for
>>> the job that's pending?
>>>
>>>
>>> On Sun, May 13, 2018 at 11:39 PM, Joshua Baker-LePain <j...@salilab.org> 
>>> wrote:
>>>> As I've mentioned on this list a few times, we are running SoGE 8.1.9 on a
>>>> small (but growing) cluster here.  With the addition of GPUs to the 
>>>> cluster,
>>>> we ran into what appears to be a resource scheduling bug (see
>>>> <http://gridengine.org/pipermail/users/2018-April/010109.html>).  We would
>>>> like to explore the possibility of sponsoring any existing SoGE 
>>>> contributors
>>>> to fix this issue.  If you're interested, please contact me off list.  We'd
>>>> be looking to contribute any fixes made back to the community (we're quite
>>>> happy to see the renewed interest in maintaining SoGE).  Thanks!
>>>>
>>>> --
>>>> Joshua Baker-LePain
>>>> QB3 Shared Cluster Sysadmin
>>>> UCSF
>>>> ___
>>>> users mailing list
>>>> users@gridengine.org
>>>> https://gridengine.org/mailman/listinfo/users
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
And an interesting tidbit from the house of horrors that is SGE code:

A bunch of variable names in sge_follow.c have JG in them, e.g. JG_qhostname.
I thought: this must be some highly informative variable naming
system, if only I could crack it (although for my own projects I don't
allow naming systems that are this non-obvious).

From grepping around, though, it actually seems to be the initials of
an SGE developer, Joachim Graeber :-(

If my students submitted code like this to me in a pull request, I not
only wouldn't merge it, I would be angry at them for even thinking
they might get it past me.  But I guess we don't get to choose when
it's legacy code.

Dan


On Mon, May 14, 2018 at 12:04 AM, Daniel Povey <dpo...@gmail.com> wrote:
> Can you please create an issue for this at the new github location?
>   https://github.com/son-of-gridengine/sge/issues
> I'm having a look at the code (sge_follow.c; search for
> MSG_JOB_RESOURCESNOLONGERAVAILABLE_UU).
>
> It does look like it could potentially be a bug like you say, but I'm
> having a hard time understanding the code.  It seems to have been
> written before the days when documentation was expected in projects
> like this.
>
>
> On Sun, May 13, 2018 at 11:49 PM, Daniel Povey <dpo...@gmail.com> wrote:
>> Can you show the full output from when you do `qstat -j <jobid>` for
>> the job that's pending?
>>
>>
>> On Sun, May 13, 2018 at 11:39 PM, Joshua Baker-LePain <j...@salilab.org> 
>> wrote:
>>> As I've mentioned on this list a few times, we are running SoGE 8.1.9 on a
>>> small (but growing) cluster here.  With the addition of GPUs to the cluster,
>>> we ran into what appears to be a resource scheduling bug (see
>>> <http://gridengine.org/pipermail/users/2018-April/010109.html>).  We would
>>> like to explore the possibility of sponsoring any existing SoGE contributors
>>> to fix this issue.  If you're interested, please contact me off list.  We'd
>>> be looking to contribute any fixes made back to the community (we're quite
>>> happy to see the renewed interest in maintaining SoGE).  Thanks!
>>>
>>> --
>>> Joshua Baker-LePain
>>> QB3 Shared Cluster Sysadmin
>>> UCSF
>>> ___
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
Can you please create an issue for this at the new github location?
  https://github.com/son-of-gridengine/sge/issues
I'm having a look at the code (sge_follow.c; search for
MSG_JOB_RESOURCESNOLONGERAVAILABLE_UU).

It does look like it could potentially be a bug like you say, but I'm
having a hard time understanding the code.  It seems to have been
written before the days when documentation was expected in projects
like this.


On Sun, May 13, 2018 at 11:49 PM, Daniel Povey <dpo...@gmail.com> wrote:
> Can you show the full output from when you do `qstat -j <jobid>` for
> the job that's pending?
>
>
> On Sun, May 13, 2018 at 11:39 PM, Joshua Baker-LePain <j...@salilab.org> 
> wrote:
>> As I've mentioned on this list a few times, we are running SoGE 8.1.9 on a
>> small (but growing) cluster here.  With the addition of GPUs to the cluster,
>> we ran into what appears to be a resource scheduling bug (see
>> <http://gridengine.org/pipermail/users/2018-April/010109.html>).  We would
>> like to explore the possibility of sponsoring any existing SoGE contributors
>> to fix this issue.  If you're interested, please contact me off list.  We'd
>> be looking to contribute any fixes made back to the community (we're quite
>> happy to see the renewed interest in maintaining SoGE).  Thanks!
>>
>> --
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> UCSF
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
Can you show the full output from when you do `qstat -j <jobid>` for
the job that's pending?


On Sun, May 13, 2018 at 11:39 PM, Joshua Baker-LePain  wrote:
> As I've mentioned on this list a few times, we are running SoGE 8.1.9 on a
> small (but growing) cluster here.  With the addition of GPUs to the cluster,
> we ran into what appears to be a resource scheduling bug (see
> <http://gridengine.org/pipermail/users/2018-April/010109.html>).  We would
> like to explore the possibility of sponsoring any existing SoGE contributors
> to fix this issue.  If you're interested, please contact me off list.  We'd
> be looking to contribute any fixes made back to the community (we're quite
> happy to see the renewed interest in maintaining SoGE).  Thanks!
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Son of GridEngine succession?

2018-05-13 Thread Daniel Povey
The consensus seems to be for 'Son of GridEngine' so I am keeping that name.
Right now I am working on importing Dave Love's repo (non-trivial
since 'git fsck' fails on it, see here
https://github.com/son-of-gridengine/sge/issues/1, but I'm mostly
done, it's just slow).
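
For the curious, the import amounts to roughly this; the fsck step is
where it currently falls over:

  git clone --mirror https://gitlab.com/loveshack/sge.git
  cd sge.git
  git fsck --full    # reports corrupt objects on this repo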

If you want to be involved, please let me know your github userid, and
please 'watch' the project.  My own experience of GridEngine is mainly
as a user, not a maintainer, so we need qualified people!

Dan


On Sat, May 12, 2018 at 6:28 PM, Daniel Povey <dpo...@gmail.com> wrote:
> Thanks for your responses to the poll!
> I have posted on the sge-discuss list at Liverpool to inform them of this
> http://arc.liv.ac.uk/pipermail/sge-discuss/2018-May/thread.html
> .. partly since the name "Son of GridEngine" seems to be leading in
> the poll, and I want to find out if there is anyone in that old team
> that would be upset if the name is re-used.
> Dan
>
>
> On Sat, May 12, 2018 at 5:01 PM, Daniel Povey <dpo...@gmail.com> wrote:
>> I've created a Doodle poll with the options named so far.
>>
>> https://doodle.com/poll/vcr2nasruzkg4r4g
>>
>> This is just so we can get a sense of peoples' opinions, it won't be binding!
>> Also maybe some people will respond that wouldn't want to reply to the
>> whole list.
>>
>>
>> On Sat, May 12, 2018 at 4:56 PM, Daniel Povey <dpo...@gmail.com> wrote:
>>> Thanks for the naming ideas!
>>>
>>> Since many of these ideas are family-related I had a look for synonyms
>>> for 'offspring' that begin with s (so that we have no difficulty
>>> explaining why the binaries still have 'sge' in the name).
>>>
>>>  https://www.merriam-webster.com/thesaurus/offspring
>>>
>>> There is 'spawn of GridEngine'.  (Also 'seed of GridEngine' but it
>>> seems a bit too sexual to me... maybe spawn would be better).  I feel
>>> like 'sister of GridEngine' maybe doesn't sit well with the fact that
>>> the project is a derivation of Son of GridEngine, not a parallel
>>> project.  'spawn' still sounds like it could be in a horror movie, but
>>> I'm OK with that.
>>>
>>> Dan
>>>
>>>
>>> On Sat, May 12, 2018 at 12:07 PM, Afif Elghraoui <a...@debian.org> wrote:
>>>>
>>>>
>>>> On May 11, 2018 7:46:39 PM EDT, Daniel Povey <dpo...@gmail.com> wrote:
>>>>>RE the name-- if we used the same name and the U. of Liverpool people
>>>>>were to become active again, or notice the duplication, I'm concerned
>>>>>that there might be confusion.  (Plus, a unique name is easier for
>>>>>google-search purposes, so there is no confusion about the location of
>>>>>the project).  I'd rather only use the same name with Dave Love's
>>>>>blessing, but that is dependent on being able to contact him.  However,
>>>>>if most others want to keep the same name, I'd be OK with it.
>>>>>
>>>>
>>>> I tried to keep myself out of proposing a name, but...if you wanted to 
>>>> keep the SGE abbreviation again like Dave tried to do, another name is 
>>>> "sister of grid engine"
>>>>
>>>> Afif
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Son of GridEngine succession?

2018-05-12 Thread Daniel Povey
Thanks for your responses to the poll!
I have posted on the sge-discuss list at Liverpool to inform them of this
http://arc.liv.ac.uk/pipermail/sge-discuss/2018-May/thread.html
.. partly since the name "Son of GridEngine" seems to be leading in
the poll, and I want to find out if there is anyone in that old team
that would be upset if the name is re-used.
Dan


On Sat, May 12, 2018 at 5:01 PM, Daniel Povey <dpo...@gmail.com> wrote:
> I've created a Doodle poll with the options named so far.
>
> https://doodle.com/poll/vcr2nasruzkg4r4g
>
> This is just so we can get a sense of peoples' opinions, it won't be binding!
> Also maybe some people will respond that wouldn't want to reply to the
> whole list.
>
>
> On Sat, May 12, 2018 at 4:56 PM, Daniel Povey <dpo...@gmail.com> wrote:
>> Thanks for the naming ideas!
>>
>> Since many of these ideas are family-related I had a look for synonyms
>> for 'offspring' that begin with s (so that we have no difficulty
>> explaining why the binaries still have 'sge' in the name).
>>
>>  https://www.merriam-webster.com/thesaurus/offspring
>>
>> There is 'spawn of GridEngine'.  (Also 'seed of GridEngine' but it
>> seems a bit too sexual to me... maybe spawn would be better).  I feel
>> like 'sister of GridEngine' maybe doesn't sit well with the fact that
>> the project is a derivation of Son of GridEngine, not a parallel
>> project.  'spawn' still sounds like it could be in a horror movie, but
>> I'm OK with that.
>>
>> Dan
>>
>>
>> On Sat, May 12, 2018 at 12:07 PM, Afif Elghraoui <a...@debian.org> wrote:
>>>
>>>
>>> On May 11, 2018 7:46:39 PM EDT, Daniel Povey <dpo...@gmail.com> wrote:
>>>>RE the name-- if we used the same name and the U. of Liverpool people
>>>>were to become active again, or notice the duplication, I'm concerned
>>>>that there might be confusion.  (Plus, a unique name is easier for
>>>>google-search purposes, so there is no confusion about the location of
>>>>the project).  I'd rather only use the same name with Dave Love's
>>>>blessing, but that is dependent on being able to contact him.  However,
>>>>if most others want to keep the same name, I'd be OK with it.
>>>>
>>>
>>> I tried to keep myself out of proposing a name, but...if you wanted to keep 
>>> the SGE abbreviation again like Dave tried to do, another name is "sister 
>>> of grid engine"
>>>
>>> Afif
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Son of GridEngine succession?

2018-05-12 Thread Daniel Povey
I've created a Doodle poll with the options named so far.

https://doodle.com/poll/vcr2nasruzkg4r4g

This is just so we can get a sense of peoples' opinions, it won't be binding!
Also maybe some people will respond that wouldn't want to reply to the
whole list.


On Sat, May 12, 2018 at 4:56 PM, Daniel Povey <dpo...@gmail.com> wrote:
> Thanks for the naming ideas!
>
> Since many of these ideas are family-related I had a look for synonyms
> for 'offspring' that begin with s (so that we have no difficulty
> explaining why the binaries still have 'sge' in the name).
>
>  https://www.merriam-webster.com/thesaurus/offspring
>
> There is 'spawn of GridEngine'.  (Also 'seed of GridEngine' but it
> seems a bit too sexual to me... maybe spawn would be better).  I feel
> like 'sister of GridEngine' maybe doesn't sit well with the fact that
> the project is a derivation of Son of GridEngine, not a parallel
> project.  'spawn' still sounds like it could be in a horror movie, but
> I'm OK with that.
>
> Dan
>
>
> On Sat, May 12, 2018 at 12:07 PM, Afif Elghraoui <a...@debian.org> wrote:
>>
>>
>> On May 11, 2018 7:46:39 PM EDT, Daniel Povey <dpo...@gmail.com> wrote:
>>>RE the name-- if we used the same name and the U. of Liverpool people
>>>were to become active again, or notice the duplication, I'm concerned
>>>that there might be confusion.  (Plus, a unique name is easier for
>>>google-search purposes, so there is no confusion about the location of
>>>the project).  I'd rather only use the same name with Dave Love's
>>>blessing, but that is dependent on being able to contact him.  However,
>>>if most others want to keep the same name, I'd be OK with it.
>>>
>>
>> I tried to keep myself out of proposing a name, but...if you wanted to keep 
>> the SGE abbreviation again like Dave tried to do, another name is "sister of 
>> grid engine"
>>
>> Afif
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Son of GridEngine succession?

2018-05-11 Thread Daniel Povey
It's great that there seems to be enthusiasm for this.
(BTW, Afif is the one who maintains the Debian packaging of GridEngine).

As far as I can tell,
https://gitlab.com/loveshack/sge
would probably be the natural repo to use as the jumping-off point for
the project.

Afif, are you aware of any more downstream repos, e.g. any repo that
you use to maintain the Debian package?  Or would it be more natural
to clone Dave Love's repo and apply any patches that you maintain to
it, where appropriate?

RE the name-- if we used the same name and the U. of Liverpool people
were to become active again, or notice the duplication, I'm concerned
that there might be confusion.  (Plus, a unique name is easier for
google-search purposes, so there is no confusion about the location of
the project).  I'd rather only use the same name with Dave Love's
blessing, but that is dependent on being able to contact him.  However,
if most others want to keep the same name, I'd be OK with it.


Dan


On Fri, May 11, 2018 at 7:08 PM, Jerome <jer...@ibt.unam.mx> wrote:
> On 11/05/2018 at 18:02, Christopher Heiny wrote:
>> On Fri, 2018-05-11 at 18:49 -0400, Daniel Povey wrote:
>>>
>>> I want to start a discussion about how to replace Son of GridEngine.
>>> As far as I can tell, Dave Love has had no online activity for a
>>> year,
>>> is not responding to emails, and my attempts to contact him
>>> indirectly
>>> via his workplace have come to nothing.  Even if he is still alive, I
>>> think it's clear that he's either unwilling or unable to continue to
>>> maintain the Son of GridEngine project.
>>>
>>> I am thinking we could create a repository on GitHub to replace the
>>> Liverpool-hosted Son of GridEngine project?  Maybe call it Grandson
>>> of
>>> GridEngine?  What do you think?
>>>
>>> I know there are people who have patches to contribute.  I do myself.
>>
>> Sounds good to me.  I'll happily contribute mine as well.
>>
>> I'd be fine with simply continuing to call it Son of Gridengine.  It's
>> just got different adoptive parents :-)
>>
>
> +1
>
> --
> -- Jérôme
> The wives of some make the happiness of others.
> (Gustave Flaubert)
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] Son of GridEngine succession?

2018-05-11 Thread Daniel Povey
Everyone,

I want to start a discussion about how to replace Son of GridEngine.
As far as I can tell, Dave Love has had no online activity for a year,
is not responding to emails, and my attempts to contact him indirectly
via his workplace have come to nothing.  Even if he is still alive, I
think it's clear that he's either unwilling or unable to continue to
maintain the Son of GridEngine project.

I am thinking we could create a repository on GitHub to replace the
Liverpool-hosted Son of GridEngine project?  Maybe call it Grandson of
GridEngine?  What do you think?

I know there are people who have patches to contribute.  I do myself.

Dan
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] Futex leap-second bug for GridEngine?

2012-07-13 Thread Daniel Povey
Has anyone noticed their sge_execd processes suddenly taking up a lot of
CPU, possibly since around July 2nd this year?
I think it might be to do with the Linux leap second bug, which affects
processes that use futexes.  It doesn't happen to all nodes on a queue,
just some.
The only way I know to resolve this is to reboot the machine.
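
A workaround that circulated at the time -- I haven't verified it myself,
so treat this as a sketch -- was to step the clock in place instead of
rebooting, which was reported to clear the stuck timers:

  /etc/init.d/ntp stop    # keep ntpd from immediately re-stepping the clock
  date -s "$(date)"       # re-setting the time resets the kernel timer state
  /etc/init.d/ntp start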


If you do strace -p pid on the sge_execd process, you'll see output
like the following.

BTW, I think I am not on this list right now.

Dan

futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276699, {1342229400, 776755000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276701, {1342229400, 776871000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276703, {1342229400, 776988000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276705, {1342229400, 777102000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276707, {1342229400, 777219000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276709, {1342229400, 777335000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276711, {1342229400, 777452000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276713, {1342229400, 777567000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276715, {1342229400, 777683000}, ) = -1 ETIMEDOUT (Connection
timed out)
futex(0x7fb381006b30, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fb381006b94, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2027276717, {1342229400, 99000}, ) = -1 ETIMEDOUT (Connection
timed out)
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] GridEngine and soft user limits not being respected.

2012-06-23 Thread Daniel Povey
We have a problem in our queue where GridEngine is not respecting the user
limits specified in /etc/security/limits.conf.
We have in that file
*   soft    as      2000
and nothing else.
When we log into a machine (Debian Linux) using ssh or qrsh, the limit
is applied, so if you type ulimit -a you get:

virtual memory  (kbytes, -v) 2000

as it should be, but when logging in using qlogin to the same machine, it
comes up as unlimited.
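
For anyone who wants to reproduce the comparison: my understanding is that
a builtin qlogin session inherits its limits from the sge_execd daemon
rather than from a PAM login, so (assuming Linux's /proc) it is worth
comparing:

  cat /proc/$(pgrep sge_execd)/limits | grep -i 'address space'  # the daemon's limit
  ulimit -v                                # inside a qlogin session, for contrast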

Any ideas?

Dan
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] error with qsub -sync y

2011-10-08 Thread Daniel Povey
I have been getting occasional errors when using qsub -sync y.  It prints
out the error message:

Unable to initialize environment because of error: range_list containes no
elements
Exiting.

This is not reproducible, but seems to occur in batches.  This is with GE
6.2R5.
Searching for this online turns up little useful information -- it seems
to be a bug in qsub.
Is anyone familiar with it?  What is the best way to debug this?  I don't
have root on the machines concerned.
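
One thing that works without root is turning up the client-side debug
tracing (a sketch, assuming the stock SGE_DEBUG_LEVEL mechanism of eight
space-separated levels; 'my_job.sh' is a made-up script name):

  export SGE_DEBUG_LEVEL="2 0 0 0 0 0 0 0"   # first field is the general layer
  qsub -sync y my_job.sh 2> qsub_debug.log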

Dan
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users