Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Reuti


Am 11.06.2020 um 22:44 schrieb Chris Dagdigian:

> 
> The root cause was strange so it's worth documenting here ...
> 
> I had created a new consumable and requestable resource called "gpu" 
> configured like this:
> 
> gpu    gpu    INT    <=    YES    YES    NONE    0
> 
> And on host A I had set "complex_values gpu=1" and on host B I set 
> "complex_values gpu=2" etc. etc. across the cluster. 
> 
> My mistake was setting the default value of the new complex entry to "NONE" 
> instead of "0" which is what you probably want when the attribute is of type 
> INT
> 
> But this was bizarre; basically I had a bad default value for a requestable 
> resource, and as soon as we set that value down at the execution host level it 
> instantly broke all of our parallel environments.  The SGE scheduler was treating 
> my mistake as if I had created a requestable resource of type FORCED or 
> something. 

Aha, a couple of days ago I got a request in PM where someone swore that the 
configuration "h_vmem …  YES YES 0 0" had been working fine all the time. Only after 
my suggestion to add h_vmem at the exechost level to avoid oversubscription did all 
the jobs crash, due to no memory being available (as h_vmem = 0 was then used 
as an automatically set limit).

Essentially: the default value in a complex definition is ignored, as long as 
there is nothing to consume from. If it's not ignored, then the type has to 
match.
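
For illustration, a minimal sketch of the corrected definition from this thread 
(the first line shows the `qconf -sc` column order; the host name and value in 
the qconf command are placeholders):

#name  shortcut  type  relop  requestable  consumable  default  urgency
gpu    gpu       INT   <=    YES          YES         0        0

$ qconf -mattr exechost complex_values gpu=2 nodeXY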

-- Reuti


> 
> Strange but resolved now. 
> 
> Regards
> Chris
> 
> 
> 
> 
> Reuti wrote on 6/11/20 4:17 PM:
>> Hi,
>> 
>> Any consumables in place like memory or other resource requests? Any output 
>> of `qalter -w v …` or "-w p"?
>> 
>> -- Reuti
>> 
>> 
>> 
>>> Am 11.06.2020 um 20:32 schrieb Chris Dagdigian :
>>> 
>>> Hi folks,
>>> 
>>> Got a bewildering situation I've never seen before with simple SMP/threaded 
>>> PE techniques
>>> 
>>> I made a brand new PE called threaded:
>>> 
>>> $ qconf -sp threaded
>>> pe_name            threaded
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    NONE
>>> stop_proc_args     NONE
>>> allocation_rule    $pe_slots
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>> qsort_args         NONE
>>> 
>>> 
>>> And I attached that to all.q on an IDLE grid and submitted a job with '-pe 
>>> threaded 1' argument
>>> 
>>> However all "qstat -j" data is showing this scheduler decision line:
>>> 
>>> cannot run in PE "threaded" because it only offers 0 slots
>>> 
>>> 
>>> I'm sort of lost on how to debug this because I can't figure out how to 
>>> probe where SGE is keeping track of PE specific slots.  With other stuff I 
>>> can look at complex_values reported by execution hosts or I can use an "-F" 
>>> argument to qstat to dump the live state and status of a requestable 
>>> resource but I don't really have any debug or troubleshooting ideas for 
>>> "how to figure out why SGE thinks there are 0 slots when the static PE on 
>>> an idle cluster has been set to contain 999 slots" 
>>> 
>>> Anyone seen something like this before?  I don't think I've ever seen this 
>>> particular issue with an SGE parallel environment before ...
>>> 
>>> 
>>> Chris
>>> 
>>> ___
>>> users mailing list
>>> 
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Reuti
Hi,

Any consumables in place like memory or other resource requests? Any output of 
`qalter -w v …` or "-w p"?

-- Reuti


> Am 11.06.2020 um 20:32 schrieb Chris Dagdigian :
> 
> Hi folks,
> 
> Got a bewildering situation I've never seen before with simple SMP/threaded 
> PE techniques
> 
> I made a brand new PE called threaded:
> 
> $ qconf -sp threaded
> pe_name            threaded
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
> 
> 
> And I attached that to all.q on an IDLE grid and submitted a job with '-pe 
> threaded 1' argument
> 
> However all "qstat -j" data is showing this scheduler decision line:
> 
> cannot run in PE "threaded" because it only offers 0 slots
> 
> 
> I'm sort of lost on how to debug this because I can't figure out how to probe 
> where SGE is keeping track of PE specific slots.  With other stuff I can look 
> at complex_values reported by execution hosts or I can use an "-F" argument 
> to qstat to dump the live state and status of a requestable resource but I 
> don't really have any debug or troubleshooting ideas for "how to figure out 
> why SGE thinks there are 0 slots when the static PE on an idle cluster has 
> been set to contain 999 slots" 
> 
> Anyone seen something like this before?  I don't think I've ever seen this 
> particular issue with an SGE parallel environment before ...
> 
> 
> Chris
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] How to export an X11 back to the client?

2020-05-12 Thread Reuti
Hi,

Am 12.05.2020 um 23:27 schrieb Mun Johl:

> Hi,
> 
> Just some additional testing results ...
> 
> Our IT guy turned off the firewall on a Submit Host and Execution Host for 
> experimental purposes.  That got me further but not all the way.  Here is the 
> verbose log from qrsh:
> 
> waiting for interactive job to be scheduled ...
> Your interactive job 460937 has been successfully scheduled.
> Establishing /usr/bin/ssh -X session to host sim.domain.com ...
> ssh_exchange_identification: Connection closed by remote host
> /usr/bin/ssh -X exited with exit code 255
> reading exit code from shepherd ... 129
> 
> We aren't yet able to get around the ssh -X error.  Any ideas?

But does a plain `ssh` to the nodes work? 

In case a different hostname must be used, there is an option 
"HostbasedUsesNameFromPacketOnly" in "sshd_config".


> But even if we could, we still need to figure out which ports of the firewall 
> need to be opened up.  Every time we ran an experiment, the port number that 
> was used for SSH was different.  I hope we don't have to open up too big a 
> range of ports.

Unfortunately the port is randomly chosen with any new connection.

But wouldn't it be possible to adjust the firewall to allow all ports only when 
connecting from the nodes in the cluster? (Are the nodes in a VLAN behind a head 
node, or are all submit machines and nodes also connected to the Internet?)

Also, in SSH itself it is possible to allow only certain users from certain 
nodes with the "Match" option in "sshd_config".

Nevertheless: maybe adding "-v" to the `ssh` command will output additional 
info; the messages of `sshd` might also be in some log file.
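
For illustration, a minimal sketch of such a "Match" block in sshd_config on the 
exec hosts (the address range and user names are placeholders, adjust to your site):

Match Address 10.0.0.0/24
    AllowUsers alice bob
    X11Forwarding yes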

-- Reuti


> Feedback would be welcomed.
> 
> Best regards,
> 
> -- 
> Mun
> 
> 
> 
>> -Original Message-
>> Hi William, et al.,
>> 
>>> On Mon, May 11, 2020 at 09:30:14PM +, Mun Johl wrote:
>>>> Hi William, et al.,
>>>> [Mun] Thanks for the tip; I'm still trying to get back to where I can 
>>>> launch qrsh again.  Even after I put the requisite
>> /etc/pam.d/sshd
>>> line at the head of the file I'm still getting the "Your "qrsh" request 
>>> could not be scheduled, try again later." message for some
>> reason.
>>> But I will continue to debug that issue.
>>> 
>>> The pam_sge-qrsh-setup.so shouldn't have anything to do with this since
>>> the message occurs before any attempt to launch the job.  You could try
>>> running a qrsh -w p and/or qrsh -w v to get a report on why the qrsh
>>> isn't being scheduled.  They aren't always easy to read and -w v doesn't
>>> reliably ignore exclusive vars in use but can nevertheless be helpful.
>> 
>> [Mun] With 'qrsh -w p' and 'qrsh -w v' I got the following output:
>> verification: found suitable queue(s)
>> 
>> I then replaced the -w option with -verbose which produced the following 
>> output:
>> 
>> waiting for interactive job to be scheduled ...timeout (54 s) expired while 
>> waiting on socket fd 4
>> Your "qrsh" request could not be scheduled, try again later.
>> 
>> I have no idea what is meant by "socket fd 4"; but that leads me to believe 
>> we have some sort of blocked port or something.
>> 
>> Are there any additional ports that need to be opened up in order to use 
>> 'qrsh & ssh -X' ?
>> 
>> One last noteworthy item that recently occurred to me is that when SGE was 
>> initially installed on our servers, we had a different
>> domain name.  Late last year we were acquired and our domain changed.  
>> However, our /etc/hosts still has the old domain simply
>> because SGE couldn't deal with the change in the domain--or rather, it was 
>> the easiest course of action for me to take and keep SGE
>> working.  I wonder if that is in some way interfering with 'qrsh & ssh -X'?
>> 
>> I am going to try and do some additional debug today and will report any 
>> progress.
>> 
>> Thank you and regards,
>> 
>> --
>> Mun


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] About cpu time.

2020-05-06 Thread Reuti
Hi,

It might be that the application is ignoring the set OMP_NUM_THREADS (or 
assumes a maximum value if unset) and is using all cores in the machine. How many 
cores are installed?
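
If the program honours the variable, a small sketch for the job script is to 
derive it from the granted slots (the PE name is taken from the qacct output 
quoted below; the program call is a placeholder):

#$ -pe thread 4
export OMP_NUM_THREADS=$NSLOTS
./my_solver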

-- Reuti


Am 07.05.2020 um 01:04 schrieb Jerome IBt:

> Dear all
> 
> I'm facing a strange problem with some parallel programs.
> I've run a job in a queue with a 24-hour time limit. The job
> 
> qacct reports this (4 cores):
> qname        all.q
> hostname     compute-0-3.local
> group        estudiante
> owner        xairarg
> project      NONE
> department   defaultdepartment
> jobname      RespBact
> jobnumber    1335842
> taskid       24
> account      sge
> priority     0
> qsub_time    Sat May  2 20:29:57 2020
> start_time   Sat May  2 22:19:00 2020
> end_time     Sun May  3 19:40:54 2020
> granted_pe   thread
> slots        4
> failed       0
> exit_status  0
> ru_wallclock 76914s
> ru_utime     1128016.632s
> ru_stime     2191.568s
> ru_maxrss    20.811MB
> ru_ixrss     0.000B
> ru_ismrss    0.000B
> ru_idrss     0.000B
> ru_isrss     0.000B
> ru_minflt    351497264
> ru_majflt    0
> ru_nswap     0
> ru_inblock   71047120
> ru_oublock   1087912
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     43908490
> ru_nivcsw    80940156
> cpu          1130208.200s
> mem          13575.491TBs
> io           2.022GB
> iow          0.000s
> maxvmem      20.813GB
> arid         undefined
> ar_sub_time  undefined
> category     -pe thread 4
> 
> 
> The job was running approximately 21h22. The problem is that qacct reports a
> cpu time of 1130208 seconds instead of 4*76914 = 307656. That is about 3
> times more, as if it was using 12 cores.
> 
> I remember someone speaking about this problem on the list.
> 
> What's wrong with this accounting?
> 
> Regards.
> 
> 
> -- 
> -- Jérôme
> -Ah évidemment j'en suis pas encore aux toiles de maître, mais enfin
> c'est un début
> -Oh c'est un début qui promet. Mais tu vois si j'étais chez moi comme tu
> le disais si gentiment,bah j'mettrai ça ailleurs.
> -Qu'est-ce que je disais, y s'rait mieux près de la fenêtre. Tu le
> verrais où toi ?
> -À la cave.
>   (Michel Audiard)
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] How to export an X11 back to the client?

2020-05-02 Thread Reuti

> Am 02.05.2020 um 00:15 schrieb Mun Johl :
> 
> Hi Reuti,
> 
> Thank you for your reply.
> Please see my comments below.
> 
>> Hi,
>> 
>> Am 01.05.2020 um 20:44 schrieb Mun Johl:
>> 
>>> Hi,
>>> 
>>> I am using SGE on RHEL6.  I am trying to launch a qsub job (a TCL script) 
>>> via grid that will result in a GUI application being opened on
>> the caller's display (which is a VNC session).
>>> 
>>> What I'm seeing is that if I set DISPLAY to the actual VNC display (e.g. 
>>> host1:4) in the wrapper script that invokes qsub, the GUI
>> application complains that it cannot make a connection.  On a side note, I 
>> noticed that when I use ssh -X to login to one of our grid
>> servers, my DISPLAY is set to something like localhost:10 .  Now, if I use 
>> localhost:10 (for example) in my grid wrapper script, the GUI
>> application _will_ open on my VNC display.
>> 
>> Yes, here X11 forwarding is provided by SSH. The forwarding of X11 is not 
>> built into SGE.
>> 
>> 
>>> Of course, with multiple users and multiple grid servers, I have no idea 
>>> what a particular qsub command's DISPLAY should be set to.
>> I must be missing something because I'm sure others have already solved this 
>> issue.
>> 
>> Inside the wrapper it should always be something like localhost:10 with a 
>> varying number. This is set by the login via SSH. Hence I'm
>> not sure what you are looking for to be set.
> 
> [Mun]  Bottom line is I need for the GUI application to open in the user's 
> VNC session.  It seems that unless I set DISPLAY to the "appropriate" 
> localhost:# from within the wrapper script which makes the qsub call, that I 
> cannot accomplish that goal.  Therefore, I need some way of setting the 
> DISPLAY env var correctly.  Unless there is some other way for me to 
> accomplish my goal?

As said: the X11 forwarding and automatic setting of $DISPLAY are not part of SGE, 
but are accomplished by setting up SSH as the communication method inside SGE. Once 
this is defined, your wrapper should work without any further adjustments.
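
A minimal sketch of that SSH setup in the cluster configuration (`qconf -mconf`; 
the paths are assumptions, see the remote_startup man page linked below):

rsh_command      /usr/bin/ssh -X
rsh_daemon       /usr/sbin/sshd -i
rlogin_command   /usr/bin/ssh -X
rlogin_daemon    /usr/sbin/sshd -i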

-- Reuti


> 
> Regards,
> 
> -- 
> Mun
> 
> 
>> Maybe you want to define in SGE to always use SSH -X?
>> 
>> https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
>> 
>> -- Reuti
>> 
>> 
>>> 
>>> Please advise.
>>> 
>>> Thank you and regards,
>>> 
>>> --
>>> Mun
>>> ___
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users

--
Philipps-University of Marburg
AG Berger / AG Tonner / AG Frenking / FB Chemie
Reuti
Admin
Hans-Meerwein-Straße 4
35032 Marburg (35043 for the delivery of goods)
Germany
FON +49-6421-28-25549
FAX +49-6421-28-21826
eMail re...@staff.uni-marburg.de


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] How to export an X11 back to the client?

2020-05-01 Thread Reuti
Hi,

Am 01.05.2020 um 20:44 schrieb Mun Johl:

> Hi,
>  
> I am using SGE on RHEL6.  I am trying to launch a qsub job (a TCL script) via 
> grid that will result in a GUI application being opened on the caller’s 
> display (which is a VNC session).
>  
> What I’m seeing is that if I set DISPLAY to the actual VNC display (e.g. 
> host1:4) in the wrapper script that invokes qsub, the GUI application 
> complains that it cannot make a connection.  On a side note, I noticed that 
> when I use ssh -X to login to one of our grid servers, my DISPLAY is set to 
> something like localhost:10 .  Now, if I use localhost:10 (for example) in my 
> grid wrapper script, the GUI application _will_ open on my VNC display.

Yes, here X11 forwarding is provided by SSH. The forwarding of X11 is not built 
into SGE.


> Of course, with multiple users and multiple grid servers, I have no idea what 
> a particular qsub command’s DISPLAY should be set to.  I must be missing 
> something because I’m sure others have already solved this issue.

Inside the wrapper it should always be something like localhost:10 with a 
varying number. This is set by the login via SSH. Hence I'm not sure what you 
are looking for to be set.

Maybe you want to define in SGE to always use SSH -X?

https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html

-- Reuti


>  
> Please advise.
>  
> Thank you and regards,
>  
> --
> Mun
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Job in error states

2020-03-07 Thread Reuti
Hi,

Is it always failing on one and the same node? Or are several nodes affected? 
One guess could be that the file system is full.

-- Reuti


> Am 05.03.2020 um 18:46 schrieb Jerome :
> 
> Dear all
> 
> I'm facing a strange error in SGE. One job is declared as in error, as I
> show in the following:
> 
> 
> ==
> job_number: 1311910
> exec_file:  job_scripts/1311910
> submission_time:Thu Mar  5 08:06:16 2020
> owner:  X
> 
> ../..
> 
> error reason  1:  03/05/2020 11:11:56 [6021:55928]:
> execvlp(/opt/gridengine/default/spool/compute-0-0/job_scripts/1311910,
> "/opt/gridengine/default/spool/compute-0-0/job_scripts/1311910") failed:
> No such file or directory
> 
> 
> It seems to be a problem during the copy of the script file to the
> node. But when I clear it with qmod -cj, the job comes back to the error
> state?
> 
> Could someone explain to me what could cause this error?
> 
> Thanks!

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] slots equals cores

2020-01-31 Thread Reuti


> Am 31.01.2020 um 18:23 schrieb Jerome IBt :
> 
> Le 31/01/2020 à 10:19, Reuti a écrit :
>> Hi Jérôme,
>> 
>> Personally I would prefer to keep the output of `qquota` short and use it 
>> only for users' limits, i.e. defining the slot limit on an exechost basis 
>> instead. This can also be done in a loop containing a command line like:
>> 
>> $ qconf -mattr exechost complex_values slots=16 node29
>> 
>> My experience is that sometimes RQS get screwed up, especially if used in 
>> combination with some load values (although $num_proc is of course fixed in 
>> your case).
>> 
>> -- Reuti
>> Dear Reuti,
> 
> If I understand correctly, you recommend that I disable the RQS for the
> cores and add a complex_value of slots for all of the compute
> nodes?

Exactly. Doing it on the command line within a loop is not so laborious and 
it's a fixed feature of a node which will never change during its lifetime.
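
For illustration, a minimal sketch of such a loop (assuming every node should 
get 16 slots; adjust the value per node as needed):

for h in $(qconf -sel); do
    qconf -mattr exechost complex_values slots=16 $h
done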

-- Reuti


> Thanks
> 
> -- 
> -- Jérôme
> Quand un arbre tombe, on l'entend ; quand la forêt pousse, pas un bruit.
>   (Proverbe africain)


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] slots equals cores

2020-01-31 Thread Reuti
Hi Jérôme,

Personally I would prefer to keep the output of `qquota` short and use it only 
for users' limits, i.e. defining the slot limit on an exechost basis instead. 
This can also be done in a loop containing a command line like:

$ qconf -mattr exechost complex_values slots=16 node29

My experience is that sometimes RQS get screwed up, especially if used in 
combination with some load values (although $num_proc is of course fixed in 
your case).

-- Reuti


> Am 31.01.2020 um 17:00 schrieb Jerome :
> 
> Dear all
> 
> I'm facing a new problem on my cluster with SGE. I haven't seen this
> before... Or maybe I never detected it.
> I have some nodes with 2 queues, one (named "all.q") to run jobs of no more
> than 24h, and another queue (named "lenta.q") to run jobs that need
> more than 24 h.
> I defined a resource quota as I read some time ago on this mailing list,
> as follows:
> 
> {
>   name slots_equals_cores
>   description  Prevent core over-subscription across queues
>   enabled  TRUE
>   limithosts {*} to slots=$num_proc
> }
> 
> 
> For now, I have a node with 64 cores, 40 cores for the normal queue,
> and 24 for the large queue.
> 
> 
> all.q@compute-2-0.local    BP    0/16/40    15.93    lx-amd64
> 
> lenta.q@compute-2-0.local  BP    0/0/24     15.93    lx-amd64
> 
> Some jobs with 2 cores don't get onto this node in the long-job queue,
> although there is no problem with memory or cores. qstat indicates
> this:
> 
> "compute-2-0/" in rule "slots_equals_cores/1"
>cannot run because it exceeds limit
> "compute-2-0/" in rule "slots_equals_cores/1"
>cannot run because it exceeds limit
> "compute-0-4/" in rule "slots_equals_cores/1"
>cannot run in PE "thread" because it only
> offers 0 slots
> 
> I really don't understand why the job is not running on this node, as
> in my opinion it is free for this.
> 
> Can someone help me with this?
> 
> Regards.
> 
> -- 
> -- Jérôme
> Le baiser est la plus sûre façon de se taire en disant tout.
>   (Guy de Maupassant)
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] QRSH/QRLOGIN ignores queue level h_rt limit

2020-01-30 Thread Reuti
Hi,

I never used SGE OGS/GE 2011.11p1, and for other derivatives it seems to work as 
intended. Is there any output in the messages file of the executing host where 
it mentions trying to kill the process due to an exhausted wallclock time?

-- Reuti


> Am 28.01.2020 um 03:50 schrieb Derrick Lin :
> 
> Hi Reuti
> 
> No, we haven't configured qlogin, rlogin specifically, so their settings are 
> all "builtin".
> 
> qlogin_command   builtin
> qlogin_daemonbuiltin
> rlogin_command   builtin
> rlogin_daemonbuiltin
> rsh_command  builtin
> rsh_daemon   builtin
> 
> Cheers,
> Derrick
> 
> On Fri, Jan 24, 2020 at 11:26 PM Reuti  wrote:
> Hi,
> 
> > Am 24.01.2020 um 04:26 schrieb Derrick Lin :
> > 
> > Hi guys,
> > 
> > We have set a h_rt limit to be 48 hours in the queue, it seems that this 
> > limit is applied on normal qsub job only. Now I am having few QRSh/QRLOGIN 
> > sessions live on the compute nodes for much longer than 48 hours.
> 
> Are you directing these commands to SSH?
> 
> -- Reuti
> 
> 
> > I am wondering if this is a known issue?
> > 
> > I am running open source version of SGE OGS/GE 2011.11p1
> > 
> > Cheers,
> > Derrick
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] QRSH/QRLOGIN ignores queue level h_rt limit

2020-01-24 Thread Reuti
Hi,

> Am 24.01.2020 um 04:26 schrieb Derrick Lin :
> 
> Hi guys,
> 
> We have set a h_rt limit to be 48 hours in the queue, it seems that this 
> limit is applied on normal qsub job only. Now I am having few QRSh/QRLOGIN 
> sessions live on the compute nodes for much longer than 48 hours.

Are you directing these commands to SSH?

-- Reuti


> I am wondering if this is a known issue?
> 
> I am running open source version of SGE OGS/GE 2011.11p1
> 
> Cheers,
> Derrick
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qsub -V doesn't set $PATH

2020-01-22 Thread Reuti



> Am 22.01.2020 um 16:55 schrieb Hay, William :
> 
> On Tue, Jan 21, 2020 at 03:51:01PM +, Skylar Thompson wrote:
>> -V strips out PATH and LD_LIBRARY_PATH for security reasons, since prolog
> 
> I don't think this is the case.  I've just experimented with one of our 8.1.9 
> clusters and I can set arbitrary PATHs, run qsub -V, and have the value I set
> show up in the environment of the job.  More likely the job is being run with
> a shell that is configured as a login shell and the init scripts for the shell
> are stomping on the value of PATH.

Another option could be an "adjustment" of the PATH variable by a JSV.

-- Reuti


> 
>> and epilog scripts run with the submission environment but possibly in the
>> context of a different user (i.e. a user could point a root-running prolog
>> script at compromised binaries or C library).
> 
> This is something slightly different. The prolog and epilog used to run with 
> the exact same environment as the job.  This opened up an attack vector, 
> especially if the prolog or epilog were run as a privileged user rather than
> the job owner.  The environment in which the prolog and epilog
> are run is now sanitised.
> 
> William
> 
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] finding out what jobs are using which PE

2020-01-22 Thread Reuti



> Am 22.01.2020 um 15:14 schrieb WALLIS Michael :
> 
> 
> From: Reuti 
> 
>>> (for the record, if the number of used_slots is higher than the number
>>> of slots, no jobs using that PE will run. Don't know how that's even
>>> possible.)
> 
>> You mean the setting of "slots" in the definition of a particular PE?
> 
> Hi Reuti,
> 
> Yes:
> 
> $ qconf -sp sharedmem
> pe_namesharedmem
> slots  5920
> used_slots 129996
> [...]
> 
> This was after a qmaster restart.

Usually there is no need to restart the qmaster, as all changes are live in the 
running process.

Was the number lowered after the jobs had already started, which would imply 
waiting until it drained to the lower value?


> An identical PE has been put in place which works, but qaltering everything 
> to use the new PE is time consuming.

And the same PE is attached in the same queues to the same set of machines? 
Often the changes are limited to adding the PE to a single queue and/or combining 
sets of hosts into a hostgroup, to again have a single assignment instead of 
listing all machines in the queue's definition.

-- Reuti


> Cheers,
> Mike
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336.


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] finding out what jobs are using which PE

2020-01-22 Thread Reuti


> Am 22.01.2020 um 14:31 schrieb WALLIS Michael :
> 
> Hi folks,
>  
> I'm trying to work out why our UGE instance is reporting that all of 
> the slots in a PE, a number considerably higher than the number of slots 
> available, are being used. Is there a way of finding out which PE is being 
> used by jobs that isn't qstatting each job and grepping for the PE?

Hi,

There is the command:

$ qstat -r -s r | grep -i granted
   Granted PE:   smp 8
   Granted PE:   smp 8
   Granted PE:   smp 4
   Granted PE:   smp 8
   Granted PE:   smp 8
   Granted PE:   smp 8

With `qstat -j "*" | grep "parallel environment:"` there is the problem that 
it can't be limited to running jobs only.
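
Building on the output above, a small sketch to sum up the granted slots per PE 
(assuming the "PE name / slots" columns shown there):

$ qstat -r -s r | awk '/Granted PE/ {s[$3]+=$4} END {for (p in s) print p, s[p]}'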


> (for the record, if the number of used_slots is higher than the number of 
> slots, no jobs using that PE will run. Don't know how that's even possible.)

You mean the setting of "slots" in the definition of a particular PE?

-- Reuti


>  Cheers,
>  
> Mike
>  
> -- 
> Mike Wallis x503305
> University of Edinburgh, Research Services,
> Argyle House, 3 Lady Lawson Street,
> Edinburgh, EH3 9DR
>  
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336. 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] CPU and Mem usage for interactive jobs

2019-12-09 Thread Reuti
There are patches around to attach the additional group id to the ssh daemon:

https://arc.liv.ac.uk/SGE/htmlman/htmlman8/pam_sge-qrsh-setup.html

rlogin is used for an interactive login by `qrsh`, rsh for `qrsh` with a 
command.
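
As a rough sketch, the man page linked above boils down to one PAM session entry 
near the top of /etc/pam.d/sshd on the exec hosts (the exact module path and 
placement depend on your installation):

session required pam_sge-qrsh-setup.so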

-- Reuti


> Am 09.12.2019 um 18:39 schrieb Korzennik, Sylvain 
> :
> 
> Hi Reuti,
> 
>  we are using the BUILTIN b/c we discovered that UGE's accounting is broken 
> otherwise when we switched (recently) from SGE to UGE.
> 
> % qconf -sconf
> ...
> qlogin_command   /cm/shared/apps/uge/var/cm/qlogin_wrapper
> qlogin_daemon/usr/sbin/sshd -i
> rlogin_command   /usr/bin/ssh -o LogLevel=ERROR
> rlogin_daemon/usr/sbin/sshd -i
> rsh_command  builtin
> rsh_daemon   builtin
> 
>   Cheers,
> Sylvain
> --
> 
> 
> On Mon, Dec 9, 2019 at 12:32 PM Reuti  wrote:
> Hi,
> 
> > Am 09.12.2019 um 18:17 schrieb Korzennik, Sylvain 
> > :
> > 
> > We're running UGE, but usually this list get me good answers:
> > qstat and qacct do not report CPU (and mem) usage for interactive jobs 
> > (qrsh) on our system. Is this a "feature" of GE, or do we need something 
> > different in the GE configuration to enable this?
> 
> Are you using the "builtin" method for the startup or SSH, i.e. for the 
> settings in rsh_daemon resp. rsh_command?
> 
> -- Reuti
> 
> 
> >   Cheers,
> > Sylvain
> > --
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] CPU and Mem usage for interactive jobs

2019-12-09 Thread Reuti
Hi,

> Am 09.12.2019 um 18:17 schrieb Korzennik, Sylvain 
> :
> 
> We're running UGE, but usually this list get me good answers:
> qstat and qacct do not report CPU (and mem) usage for interactive jobs (qrsh) 
> on our system. Is this a "feature" of GE, or do we need something different 
> in the GE configuration to enable this?

Are you using the "builtin" method for the startup or SSH, i.e. for the 
settings in rsh_daemon resp. rsh_command?

-- Reuti


>   Cheers,
> Sylvain
> --
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qsh not working

2019-11-19 Thread Reuti
Hi,

Am 19.11.2019 um 23:41 schrieb Korzennik, Sylvain:

> While qrsh and qlogin works fine, qsh fails on
> % qsh

`qsh` uses the old-style invocation only, i.e. it will make a direct connection 
to the submission host, where access has to be granted by `xhost +` beforehand 
(and ports 6000 upwards have to be open on the client; the port increases for each 
new window). Instead of "localhost:15.0" there should be the machine's name 
noted. Essentially this is judged as being unsafe nowadays. The various 
settings in "sge_conf" are not used.

The built-in version for the communication of the daemons/clients for `qrsh` … 
in SGE has no support for X11 forwarding. Hence the approach by `qrsh xterm` 
should give a suitable result when set up to use "/usr/bin/ssh -X -Y".
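
A minimal sketch of that setup (in `qconf -mconf`, paths assumed) and its use:

rsh_command   /usr/bin/ssh -X -Y
rsh_daemon    /usr/sbin/sshd -i

$ qrsh xterm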

-- Reuti


> Your job 2657108 ("INTERACTIVE") has been submitted
> waiting for interactive job to be scheduled ...
> Could not start interactive job (could be some network/firewall related 
> problem)
> 
> from the same prompt/login node, I can
> % ssh -X compute-64-16 /usr/bin/xterm
> or
> % ssh -Y compute-64-16 /usr/bin/xterm
> but not
> % ssh compute-64-16 /usr/bin/xterm
> 
> I do not want to run these 'out-of-band'. I use ssh -Y in the conf (qconf 
> -sconf), yet it fails and I can trace this to:
> 
> 11/19/2019 17:18:19.296800 [10541:143099]: closing all filedescriptors from 
> fd 0 to fdmax=1024
> 11/19/2019 17:18:19.296828 [10541:143099]: further messages are in "error" 
> and "trace"
> 11/19/2019 17:18:19.299121 [10541:143099]: now running with uid=10541, 
> euid=10541
> 11/19/2019 17:18:19.299172 [10541:143099]: execvp(/usr/bin/xterm, 
> "/usr/bin/xterm" "-display" "localhost:15.0" "-n" "SGE Interactive
>  Job 2657108 on compute-64-16.cm.cluster in Queue qrsh.iq" "-e" "/bin/csh")
> 11/19/2019 17:18:19.303787 [446:143093]: wait3 returned 143099 (status: 256; 
> WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 1, WTERMSIG: 0)
> 11/19/2019 17:18:19.303843 [446:143093]: job exited with exit status 1
> 11/19/2019 17:18:19.303872 [446:143093]: reaped "job" with pid 143099
> 11/19/2019 17:18:19.303893 [446:143093]: job exited not due to signal
> 11/19/2019 17:18:19.303914 [446:143093]: job exited with status 1
> 
> What magic is needed for the GE to start xterm right? Is this some xauth 
> problem?
> 
>   Thanks,
> Sylvain
> --
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-28 Thread Reuti


Am 28.10.2019 um 22:18 schrieb Mun Johl:

> Hi all,
>  
> I do have a follow-up question: When I am specifying hostnames for the 
> execution hosts, admin hosts, etc.; do I need to use the FQDN?  Or can I 
> simply use the hostname in order for grid to operate correctly?  That is, do 
> I have to use hostname.domain.com (as I am currently doing)?  Or is it 
> sufficient to simply use “hostname”?

It's sufficient to use hostnames. The queue names then get shorter in `qstat` 
too:

queue
-
common@node25
common@node29
ramdisk@node19   
common@node27
common@node23
common@node23
common@node28

-- Reuti


>  
> Regards,
>  
> --
> Mun
>  
>  
> From: Mun Johl  
> Sent: Friday, October 25, 2019 5:42 PM
> To: dpo...@gmail.com
> Cc: Skylar Thompson ; users@gridengine.org
> Subject: RE: [gridengine users] What is the easiest/best way to update our 
> servers' domain name?
>  
> Hi Daniel,
>  
> Thank you for your reply.
>  
> From: Daniel Povey 
> 
> You may have to write a script to do that, but it could be something like
>  
> for exechost in $(qconf -sel); do
>    qconf -se $exechost | sed s/old_domain_name/new_domain_name/ > tmp
>    qconf -de $exechost
>    qconf -Ae tmp
> done
>  
> but you might need to tweak that to get it to work, e.g. get rid of 
> load_values from the tmp file.
>  
> [Mun] Understood.  Since we have a fairly small set of servers currently, I 
> may just update them by hand via “qconf -me ”; and then address the 
> queues via “qconf -mq ”.  Oh, and I just noticed I can modify 
> hostgroups via “qconf -mhgrp @name”.
>  
> After that I can re-start the daemons and I “should” be good to go, right?
>  
> Thanks again Daniel.
>  
> Best regards,
>  
> --
> Mun
>  
>  
> On Fri, Oct 25, 2019 at 5:24 PM Mun Johl  wrote:
> Hi Daniel and Skylar,
> 
> Thank you for your replies.
> 
> > -Original Message-
> > I think it might depend on the setting of ignore_fqdn in the bootstrap file
> > (can't remember if this just tunes load reporting or also things like which
> > qmaster the execd's talk to). I wouldn't count on it working, though, and
> > agree with Daniel that you probably want to plan on an outage.
> 
> [Mun] An outage is acceptable; but I'm not sure what is the best/easiest 
> approach to take in order to change the domain names within SGE for all of 
> the servers as well as update the hostgroups and queues.  I mean, I know I 
> can delete the hosts and add them back in; and the same for the queue 
> specifications, etc.  However, I'm not sure if that is an adequate solution 
> or one that will cause problems for me.  I'm also not sure if that is the 
> best approach to take for this task.
> 
> Thanks,
> 
> -- 
> Mun
> 
> 
> > 
> > On Fri, Oct 25, 2019 at 04:12:11PM -0700, Daniel Povey wrote:
> > > IIRC, GridEngine is very picky about machines having a consistent
> > > hostname, e.g. that what hostname they think they have matches with
> > > how they were addressed.  I think this is because of SunRPC.  I think
> > > it may be hard to do what you want without an interruption  of some kind.
> > But I may be wrong.
> > >
> > > On Fri, Oct 25, 2019 at 3:37 PM Mun Johl  wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I need to update the domain names of our SGE servers.  What is the
> > > > easiest way to do that?  Can I simply update the domain name somehow
> > > > and have that propagate to hostgroupgs, queue specifications, etc.?
> > > >
> > > >
> > > >
> > > > Or do I have to delete the current hosts and add the new ones?
> > > > Which I think also implies setting up the hostgroups and queues
> > > > again as well for our implementation.
> > > >
> > > >
> > > >
> > > > Best regards,
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Mun
> > > > ___
> > > > users mailing list
> > > > users@gridengine.org
> > > > https://gridengine.org/mailman/listinfo/users
> > > >
> > 
> > > ___
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> > 
> > 
> > --
> > -- Skylar Thompson (skyl...@u.washington.edu)
> > -- Genome Sciences Department, System Administrator
> > -- Foege Building S046, (206)-685-7354
> > -- University of Washington School of Medicine
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-25 Thread Reuti
Hi,

Am 26.10.2019 um 00:37 schrieb Mun Johl:

> I need to update the domain names of our SGE servers.  What is the easiest 
> way to do that?  Can I simply update the domain name somehow and have that 
> propagate to hostgroupgs, queue specifications, etc.?
>  
> Or do I have to delete the current hosts and add the new ones?  Which I think 
> also implies setting up the hostgroups and queues again as well for our 
> implementation.
>  

Are all machines on a single network and using the FQDN? And/or does the qmaster 
machine have two network interfaces where only the external name changes, while 
the internal one stays the same?

-- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] jobs stuck in transitioning state

2019-09-27 Thread Reuti
Hi,

Am 27.09.2019 um 22:21 schrieb berg...@merctech.com:

> We're having a problem with submit scripts not being transferred to exec
> nodes and jobs being stuck in the [t]ransitioning state.

Did this issue start out of the blue?


> The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS7.

But are these separate clusters, or are you using both versions in one and the 
same cluster, or did you just try both on one cluster?


> We are using classic spooling. On the compute nodes, the spool directory
>   /var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME/
> exists, is owned by user 'sge' (running the execd), is writeable, and
> has space.

Is the execd running as sge or initially as root? It must be started as root to be 
able to switch to any user, but it switches to the admin user:

$ ps -e f -o user,ruser,group,rgroup,command
…
sgeadmin root gridware root /usr/sge/bin/lx24-em64t/sge_execd
root root root root  \_ /bin/sh /usr/sge/cluster/tmpspace.sh
sgeadmin root gridware root  \_ sge_shepherd-311391 -bg



> There is successful communication between the qmaster and execd hosts:
>   
>   qping works in both directions
> 
>   jobs submitted as binaries (-b y) run correctly
> 
>   directives from the master to the execd (for example, to delete jobs) 
> work
> 
> If I read the qmaster debug logs correctly, it looks like the qmaster isn't 
> able to send the submit script to the compute node:
> 
> 1 worker001 debiting 8589934592.00 of h_vmem on host 
> 2115fmn001.foobar.local for 1 slots
> 2 worker001 debiting 40.00 of tmpfree on host 
> 2115fmn001.foobar.local for 1 slots
> 3 worker001 debiting 1.00 of jobs on queue all.q for 1 slots
> 4 worker001 debiting 1.00 of slots on queue all.q for 1 slots
> 5 worker001 user doesn't match
> 6 worker001 user doesn't match
> 7 worker001 queue doesn't match
> 8 worker001 queue doesn't match
> 9 worker001 user doesn't match
>10 worker001 user doesn't match
>11 worker001 spooling job 9899430.1 
>12 worker001 Making dir "jobs/00/0989/9430/1-4096/1"
>13 worker001 retval = 0
>14 worker001 spooling job 9899430.1 
>15 worker001 Making dir "jobs/00/0989/9430"
>16 worker001 retval = 0
>17 worker001 TRIGGER JOB RESEND 9899430/1 in 300 seconds
>18 worker001 successfully handed off job "9899430" to queue 
> "all.q@2115fmn001.foobar.local"
>19 worker001 NO TICKET DELIVERY
> 
> 
> We don't see corresponding log messages on the client.
> 
> 
> What mechanism is used by SGE to transfer submit scripts (something
> specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?

It uses its own protocol. No SSH inside the cluster is necessary.


> What are the system-level requirements for successfully sending the
> submit scripts (for example: same UID for sge across the cluster, same
> UID<->username for the user submitting the job across the cluster, etc)?

Yes.
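
A quick sketch to spot UID mismatches across the nodes (assuming password-less 
SSH for the admin; SSH is only used for this check, not by SGE itself):

for h in $(qconf -sel); do echo -n "$h: "; ssh $h id sge; done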

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] preventing certain jobs from being suspended (subordinated)

2019-09-05 Thread Reuti


> Am 05.09.2019 um 13:57 schrieb Tina Friedrich :
> 
> We had this problem lots, and I can't quite remember how I solved it - I 
> think it might've been either a JSV or a qsub wrapper that shoves all 
> GPU jobs into the superordinate queue.
> 
> Now that I'm thinking about this again - does the subordinate queue 
> setting accept 'queue@@hostgroup' syntax like everything else? Don't 
> remember if I ever tried that.

Yes, one can limit it to be available on certain machines only:

subordinate_list  NONE,[@intel2667v4=short]

-- Reuti


> Tina
> 
> On 04/09/2019 21:52, Reuti wrote:
>> 
>> Am 04.09.2019 um 21:58 schrieb berg...@merctech.com:
>> 
>>> Our SoGE (8.1.6) configuration has essentially two queues: one for "all"
>>> jobs and one for "short jobs". The all.q is subordinate to the short.q,
>>> and short jobs can suspend a job in the general queue. At the moment, the
>>> all.q has nodes with & without GPU resources (not ideal, not permanent,
>>> probably to be replaced in the future with multiple queues, but it's
>>> what we have now).
>>> 
>>> Our GPU jobs do not stop or free resources when suspended (OK, the CPU
>>> portion may respond correctly to SIGSTOP, but the GPU portion keeps
>>> running).
>>> 
>>> Is there any way, with our current number of queues, to exempt jobs
>>> using a GPU resource complex (-l gpu) from being suspended by short jobs?
>> 
>> Not that I'm aware of. Almost 10 years ago I had a similar idea:
>> 
>> https://arc.liv.ac.uk/trac/SGE/ticket/735
>> 
>> -- Reuti
>> 
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>> 
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] preventing certain jobs from being suspended (subordinated)

2019-09-04 Thread Reuti


Am 04.09.2019 um 21:58 schrieb berg...@merctech.com:

> Our SoGE (8.1.6) configuration has essentially two queues: one for "all"
> jobs and one for "short jobs". The all.q is subordinate to the short.q,
> and short jobs can suspend a job in the general queue. At the moment, the
> all.q has nodes with & without GPU resources (not ideal, not permanent,
> probably to be replaced in the future with multiple queues, but it's
> what we have now).
> 
> Our GPU jobs do not stop or free resources when suspended (OK, the CPU
> portion may respond correctly to SIGSTOP, but the GPU portion keeps
> running).
> 
> Is there any way, with our current number of queues, to exempt jobs
> using a GPU resource complex (-l gpu) from being suspended by short jobs?

Not that I'm aware of. Almost 10 years ago I had a similar idea:

https://arc.liv.ac.uk/trac/SGE/ticket/735

-- Reuti

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] limit CPU/slot resource to the number of reserved slots

2019-08-27 Thread Reuti
[sorry for the (possibly) double post, they changed the mail server at the 
university and now my mail application gets confused and often uses a wrong SMTP 
server – not the one it claims to use]

Hi,

> Am 26.08.2019 um 14:15 schrieb Dietmar Rieder :
> 
> Hi,
> 
> Maybe this is a stupid question, but I'd like to limit the used/usable
> number of cores to the number of slots that were reserved for a job.
> 
> We often see that people reserve 1 slot, e.g. "qsub -pe smp 1 [...]"
> but their program is then running in parallel on multiple cores. How can
> this be prevented? Is it possible that with reserving only one slot a
> process can not utilize more than this?

Can't you just kill their jobs? They will learn to comply with the site's 
policy.

This can of course also happen by accident, as applications like Gaussian or 
ORCA take the number of cores to be used from the input file (although 
Gaussian nowadays has a command line option for it too).

We use a job generator which at runtime also copies an "adjusted" input file 
for the job to $TMPDIR and uses this one. Hence whatever the users put inside 
the input file: the number of cores in the copy is corrected to the number of 
granted slots. Let me know if you would like to get them.
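
As a rough sketch of that idea for a Gaussian-style input (the %NProcShared 
directive, GNU sed and the g16 call are assumptions; adjust for other codes):

cp input.com "$TMPDIR/input.com"
sed -i "s/^%nprocshared=.*/%NProcShared=$NSLOTS/I" "$TMPDIR/input.com"
g16 "$TMPDIR/input.com"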

-- Reuti


> I was told that this should be possible in Slurm (which we don't have,
> and to which we don't want to switch to currently).
> 
> Thanks
> Dietmar
> 
> -- 
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Institute of Bioinformatics
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
> 
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Sorting qhost and choosing qstat columns

2019-08-01 Thread Reuti
Hi,

> Am 01.08.2019 um 15:58 schrieb David Trimboli :
> 
> When I run qhost, the output is sorted alphabetically — which means 
> "cluster10" appears before "cluster2," and so on.
> 
> Before I go writing bash functions to manually sort this, which might lead to 
> output side-effects, is there any way to change the sort to a natural number 
> sort, so that "cluster2" would appear before "cluster10," etc.?
> When I run qstat, the normal output wraps to a second line in my terminal set to 120 
> columns. I could fix that by eliminating the "jclass" column, which doesn't 
> contain any information, but I can only find ways to add columns, not take 
> them away. Is there a way to make this column go away?

Besides `cut -b`: what type of output are you looking for? There is a `qstatus` 
AWK script in SoGE which displays the relevant columns including a longer job 
name.
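
For the qhost ordering itself, one possible sketch (assuming GNU sort, whose -V 
option gives a natural/version ordering, and the usual three header lines) keeps 
the header and sorts the rest:

$ qhost | head -n 3; qhost | tail -n +4 | sort -V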

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Automatically creating home directories before job execution

2019-07-30 Thread Reuti
Hi Ilya, 

Am 31.07.2019 um 00:55 schrieb Ilya M:

> Hi Reuti,
>  
> So /home is not mounted via NFS as it's usually done?
> Correct. 
>  
> 
> How exactly is your setup? I mean: you want to create some kind of pseudo 
> home directory on the nodes (hence "-b y" can't be used with user binaries) 
> and the staged job script (by SGE) will then execute the job and/or copy some 
> files thereto too? Afterwards this directory would have to be removed too I 
> guess (and the results copied back beforehand).
> 
> I want to create a normal home directory for the user if this directory does 
> not exist yet. /sbin/mkhomedir_helper can do exactly that (as it does if I 
> run prolog manually). I do not need to remove home directory after job 
> completes: next time the same user's job lands on the same host, there will 
> be no need to create the directory again.

I would fear that remains of old jobs will accumulate there and fill the disk 
over time.

With this setup the users also have no way to access their "home" on the nodes 
from the head node of the cluster?


> Is there any configurable process that SGE will start before the job and that 
> will not try to cd to the user's home directory before starting? I had hopes for 
> prolog (global), but apparently it is not a viable candidate.

What you can try: switch to a known directory with `qsub -wd /tmp …` or a 
similar existing path where SGE should `chdir` to, and then switch in the 
job script to the user's home directory after it was created in the prolog (the 
prolog runs as a child process, so you can't `chdir` therein for the real job – 
but in combination with a starter_method it might work to automate this). The 
-wd could be added in a JSV.
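
A minimal sketch of that combination (the prolog from above creates the 
directory; the payload is a placeholder):

$ qsub -wd /tmp job.sh

# job.sh
cd "$HOME" || exit 1
./run_analysis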

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Automatically creating home directories before job execution

2019-07-30 Thread Reuti
Hi,

Am 30.07.2019 um 23:12 schrieb Ilya M:

> Hello,
> 
> I am setting up a new SGE cluster (with old SGE) with local users' home 
> directories on the nodes. Those directories might not exist at the time of 
> job execution, so I need to have SGE create them before jobs are executed.

So /home is not mounted via NFS as it's usually done?


> I added the following lines to my prolog script:
> 
> HOMEDIR=$(eval echo ~${USER});
> if [[ ! -d ${HOMEDIR} ]]; then
>   sudo /sbin/mkhomedir_helper ${USER}
> fi
> 
> I tried to run prolog both as a 'sgegrid' user that has sudo privileges to 
> execute the command, and as root. However, neither of the attempts worked: 
> the job still failed: 
> 
> 07/30/2019 20:36:28 [973525570:34481]: error: can't chdir to /home/ilya: No 
> such file or directory

Yes, as it first wants to change to the user's home before the prolog is 
started.


> I also tried to run this as a global prolog and as a queue-level prolog to no 
> avail.
> 
> Running the prolog script manually on the node creates home directory without 
> a problem, so the syntax and logic seem to be correct.
> 
> Furthermore, I have the following set at the top on prolog script to allow 
> some logging:
> 
> set -x
> 
> LOG=/tmp/prolog_${1}.log
> exec 6>&1
> exec > $LOG 2>$LOG
> 
> However, the log file is not getting created, which makes me think that the 
> failure happens before prolog starts to execute.

Yep.


> Shell settings for the queue are as follows:
> shell /bin/sh
> shell_start_mode  posix_compliant
> 
> starter_method is not set.
> 
> Would appreciate any suggestions for making this work or at least getting 
> meaningful debug output.

How exactly is your setup? I mean: you want to create some kind of pseudo home 
directory on the nodes (hence "-b y" can't be used with user binaries) and the 
staged job script (by SGE) will then execute the job and/or copy some files 
thereto too? Afterwards this directory would have to be removed too I guess 
(and the results copied back beforehand).

-- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Adding a requirement to limit to certain execd hosts

2019-07-26 Thread Reuti


Am 26.07.2019 um 07:10 schrieb Simon Matthews:

> I want to add a requirement for a specific OS version for some jobs.
> 
> I already use "-l arch=lx-amd64" to some jobs, but I would like to
> force jobs to a specific, defined OS, such as CentOS 6 or CentOS 7.
> 
> I would like to do this without adding another queue, but instead use
> a complex and request it.
> 
> I followed the instructions to add the complex to the global config,
> but I can't see how to add the complex to specific execd hosts. Can
> anyone suggest how to do this?

You can add them interactively on an exechost level (not the global config) 
with:

qconf -me nodeXY

and edit complex_values there or on the command line (e.g. to use a loop):

qconf -mattr exechost complex_values distribution=centos6 nodeXY

===

Alternatively one could also use a hostgroup and attach the complex to all 
queues:

qconf -mq all.q

with the entry:

complex_values 
NONE,[@centos6nodes=distribution=centos6],[@centos7nodes=distribution=centos7]

or also here on the command line:

qconf -mattr queue complex_values distribution=centos6 nodeXY

===

As it's a feature of the nodes, I would attach it to the exechosts, although 
attaching it to a queue might be shorter and gives a central location for all 
definitions.
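
Jobs would then request it like this (a sketch, assuming "distribution" was 
added in `qconf -mc` as a requestable STRING/CSTRING complex):

$ qsub -l distribution=centos7 job.sh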

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Having issues getting sun grid engine running on new frontend

2019-07-25 Thread Reuti
Hi,

Am 25.07.2019 um 16:44 schrieb Pat Haley:

> 
> Hi All,
> 
> We have been trying to install Rocks 7 on a new frontend machine, using a 
> restore roll from our old front-end (running Rocks 6.2) to bring over our 
> users, groups and various customizations (more details are available in 
> https://marc.info/?l=npaci-rocks-discussion&m=154514980222760&w=2 ).  Our 
> latest issue is that the Sun Grid Engine service does not start.
> 
>  systemctl status -l sgemaster.mseas
> ● sgemaster.mseas.service - LSB: start Grid Engine qmaster, shadowd
>Loaded: loaded (/etc/rc.d/init.d/sgemaster.mseas; bad; vendor preset: 
> disabled)
>Active: failed (Result: exit-code) since Fri 2019-07-19 12:26:46 EDT; 
> 32min ago
>  Docs: man:systemd-sysv-generator(8)
>   Process: 355124 ExecStart=/etc/rc.d/init.d/sgemaster.mseas start 
> (code=exited, status=1/FAILURE)
> 
> Jul 19 12:25:44 mseas.mit.edu systemd[1]: Starting LSB: start Grid Engine 
> qmaster, shadowd...
> Jul 19 12:25:45 mseas.mit.edu sgemaster.mseas[355124]: Starting Grid Engine 
> qmaster
> Jul 19 12:26:46 mseas.mit.edu sgemaster.mseas[355124]: sge_qmaster start 
> problem
> Jul 19 12:26:46 mseas.mit.edu sgemaster.mseas[355124]: sge_qmaster didn't 
> start!
> Jul 19 12:26:46 mseas.mit.edu systemd[1]: sgemaster.mseas.service: control 
> process exited, code=exited status=1
> Jul 19 12:26:46 mseas.mit.edu systemd[1]: Failed to start LSB: start Grid 
> Engine qmaster, shadowd.
> Jul 19 12:26:46 mseas.mit.edu systemd[1]: Unit sgemaster.mseas.service 
> entered failed state.
> Jul 19 12:26:46 mseas.mit.edu systemd[1]: sgemaster.mseas.service failed.
> 
> 
> in poking around, we see 2 entries for sge in /etc/passwd on the new system
> 
> grep -in sge /etc/passwd
> 44:sge:x:990:985:GridEngine  System account:/opt/gridengine:/bin/true
> 64:sge:x:400:400:GridEngine:/opt/gridengine:/bin/true

It's definitely wrong to have two entries for one and the same account. First 
remove the first one, which also points to an unknown group. Do you have a group 
with ID 985?

Then: are the files in /opt/gridengine owned by this (leftover) user?

But some files inside need to be setuid root:

$ find . -perm /u+s
./utilbin/lx24-amd64/testsuidroot
./utilbin/lx24-amd64/rlogin
./utilbin/lx24-amd64/rsh
./utilbin/lx24-amd64/authuser
./bin/lx24-amd64/sgepasswd

There is the script /opt/sge/util/setfileperm.sh to correct this.


> and only one on the old system
> 
> grep -in sge /etc/passwd
> 37:sge:x:400:400:GridEngine:/opt/gridengine:/bin/true
> 
> looking at /etc/group both systems only show the old group id
> 
> grep -in sge /etc/group
> 49:sge:x:400:
> 
> looking at the qmaster logs in 
> /opt/gridengine/default/spool/qmaster/messages 
>  
> we’ve found the following message:
> error opening file "/opt/gridengine/default/spool/qmaster/./sharetree" for 
> reading: No such file or directory

Did you transfer the old configuration or does this pop up in a freshly installed 
system?

Unfortunately the procedure might be changed by the ROCKS distribution compared 
to the original sources.

-- Reuti


> However, we do not see that file on the old frontend either. 
> 
> Can anyone suggest what we can do to either correct or debug this issue?
> 
> Pat
> 
> -- 
> 
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> Pat Haley  Email:  
> pha...@mit.edu
> 
> Center for Ocean Engineering   Phone:  (617) 253-6824
> Dept. of Mechanical EngineeringFax:(617) 253-8125
> MIT, Room 5-213
> http://web.mit.edu/phaley/www/
> 
> 77 Massachusetts Avenue
> Cambridge, MA  02139-4301
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-16 Thread Reuti

> Am 16.07.2019 um 02:33 schrieb Derrick Lin :
> 
> Thanks guys,
> 
> >> Correct. The limits in place when sgeexecd is started are used (i.e. the 
> >> one of the root user).
> I tried to simply restart the sgeexecd but it does not change anything.
> 
> In my /etc/security/limits.conf I have:
> * soft nofile 18000
> * hard nofile 2
> 
> That should apply to every account? the SGE daemons are run under user "sge".

They appear to run under sge, but they run under the root account (and should be 
started by root):

$ ps -e f -o user,ruser,command
…
sgeadmin root /usr/sge/bin/lx24-em64t/sge_qmaster


> >> Several ulimits can be set in the queue configuration, and can so 
> >> different for each queue or exechost.
> 
> We don't have any ulimits setting inside queue or other SGE parts, 
> limits.conf is the only place of the config. 
> 
> It is so weird that most of the Compute Nodes pick up the settings correctly, 
> only a few fail to pick up.

Do you log in by SSH to the node? Then you have to restart the SSH daemon 
too, as the login process inherits the values the SSH daemon got.

The changes of the "nofile" setting should be visible in the shell when you log 
in too.
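
A quick way to verify what Skylar suggested, directly on an affected node (a sketch; pgrep -o picks the oldest matching process):

$ grep -i 'open files' /proc/$(pgrep -o sge_execd)/limits
$ grep -i 'open files' /proc/$(pgrep -o sge_shepherd)/limits    # only while a job is running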

-- Reuti


> Currently, my only workaround is to rebuild the Compute Node (reinstall OS 
> etc) so that it corrects this issue.
> 
> >> Can you check the limits that are set in the sge_execd and sge_shepherd
> processes (/proc/<pid>/limits)?
> 
> I tried to look it up, but I could not find the <pid> directory which is
> corresponding to the sgeexecd.
> 
> Cheers,
> Derrick 
> 
> 
> On Thu, Jul 4, 2019 at 12:09 AM Skylar Thompson  wrote:
> Can you check the limits that are set in the sge_execd and sge_shepherd
> processes (/proc/<pid>/limits)? It's possible that the user who ran the
> execd init script had limits applied, which would carry over to the execd
> process.
> 
> On Wed, Jul 03, 2019 at 12:36:00PM +1000, Derrick Lin wrote:
> > Hi guys,
> > 
> > We have custom settings for user open files in /etc/security/limits.conf in
> > all Compute Node. When checking if the configuration is effective with
> > "ulimit -a" by SSH to each node, it reflects the correct settings.
> > 
> > but when ran the same command through SGE (both qsub and qrsh), we found
> > that some Compute Nodes do not reflects the correct settings but the rest
> > are fine.
> > 
> > I am wondering if this is SGE related? And idea is welcomed.
> > 
> > Cheers,
> > Derrick
> 
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 
> 
> -- 
> -- Skylar Thompson (skyl...@u.washington.edu)
> -- Genome Sciences Department, System Administrator
> -- Foege Building S046, (206)-685-7354
> -- University of Washington School of Medicine


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-02 Thread Reuti
Hi,

> Am 03.07.2019 um 04:39 schrieb Daniel Povey :
> 
> Could it relate to when the daemons were started on those nodes?  I'm not 
> sure exactly at what point those limits are applied, and how they are 
> inherited by child processes.

Correct. The limits in place when sgeexecd is started are used (i.e. the one of 
the root user).


>  If you changed those files recently it might not have taken effect.
> 
> On Tue, Jul 2, 2019 at 10:36 PM Derrick Lin  wrote:
> Hi guys,
> 
> We have custom settings for user open files in /etc/security/limits.conf in 
> all Compute Node. When checking if the configuration is effective with 
> "ulimit -a" by SSH to each node, it reflects the correct settings.
> 
> but when ran the same command through SGE (both qsub and qrsh), we found that 
> some Compute Nodes do not reflects the correct settings but the rest are fine.

Several ulimits can be set in the queue configuration, and they can thus differ for 
each queue or exechost.
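
The relevant fields show up in the queue definition (a sketch; the values below are just the defaults). As far as I know there is no per-queue equivalent of the nofile limit, so that particular one always comes from the environment of sge_execd:

$ qconf -sq all.q | egrep '^[sh]_'
s_rt   INFINITY
h_rt   INFINITY
s_cpu  INFINITY
...
h_vmem INFINITY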

-- Reuti


> I am wondering if this is SGE related? And idea is welcomed.
> 
> Cheers,
> Derrick
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] h_vmem / m_mem_free

2019-06-27 Thread Reuti


> Am 27.06.2019 um 13:46 schrieb Dan Whitehouse :
> 
> First off, I am running UGE as opposed to SGE.
> We've got a couple of systems, one running 8.5.4 and the other 8.6.5.
> Users request memory resources in their job scripts by passing:
> 
> "-l h_vmem=1G" (for example).
> 
> We make use of a JSV and when this is set, what actually gets passed to 
> the scheduler is:
> 
> "h_vmem=1G,m_mem_free=1G"
> 
> We set "h_vmem_limit=true" in cgroups_params so it is enforced by cgroups.
> 
> The thing that I am not entirely sure about is what we are actually 
> limiting here!
> 
> If I write a program to malloc memory in a loop, then cgroups kills it 
> when it has allocated over 400G of ram (on a machine with about 24G).
> 
> Looking at the output of qacct, it has used ~1G to do so. So my 
> assumption here is that cgroups is killing on memory used as opposed to 
> virtual memory allocated. Which of the two settings (h_vmem / 
> m_mem_free) is responsible for this, and what is the other one for?
> 
> I'm sure this isn't the first time this has been asked, and for that I 
> apologise but I can't seem to find a clear explanation of this.

We don't use cgroups for now. But the allocation of memory is often delayed 
until you really access the allocated space (even without cgroups in 
place). You could fill the allocated area with data and test what happens then.
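
One possible test (a sketch; it assumes perl is available on the exechost, and 2 GiB is just an example well above the 1 GiB request):

$ cat touch_mem.sh
#!/bin/sh
# allocate ~2 GiB, actually write to it, then idle for a moment
perl -e '$x = "A" x (2 * 1024 ** 3); sleep 30'
$ qsub -l h_vmem=1G touch_mem.sh

Because the memory is really written to, the resident set grows accordingly and the limit should now trigger, unlike a bare malloc() loop that never touches its pages.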

h_vmem will be enforced by the kernel, even without cgroups. In case you make 
h_vmem consumable and attach a sensible value to each exechost, SGE will also 
keep track of this and disallow further submissions.

With h_vmem requested at job submission, either the kernel or SGE will then notice 
that your job and the sum of all of its processes passed this limit, and kill 
the job. SGE can do this by using the additional group ID which is attached to 
all processes of a particular job, while the kernel might only watch a certain 
process.

With an assigned cgroup, even the kernel can keep track of the overall 
memory consumption of several processes in this cgroup.

-- Reuti


> Thanks!
> 
> -- 
> Dan Whitehouse
> Research Systems Administrator, IT Services
> Queen Mary University of London
> Mile End
> E1 4NS
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] parallel vs single job

2019-06-24 Thread Reuti
Hi,

Am 24.06.2019 um 12:59 schrieb Semi :
> 
> Hi 
> 
> I have a question, one of our users A. submit parallel jobs with 144 slots 
> wit high priority
> Other user B. submit a lot of single jobs with low priority.
> 
> How can I define that job A will be executed before B?
> In this case there is no 144 free slots for now.
> 
> I know that single job with low priority always will be executed before 
> parallel job,
> that will wait till 144 free slots.

Did you set up some slot reservation? I.e. submitting the large parallel jobs 
with `qsub -R y …` and enabling the reservation in SGE’s scheduler 
configuration:

$ qconf -ssconf
…
max_reservation   20
default_duration  8760:00:00

Usually this should avoid job starvation, and reserve the slots for the 
parallel job. If a proper amount of wallclock time (h_rt) is specified for the 
small serial jobs, backfilling might apply (instead of slots being idle) and 
they may run although slots are reserved – but they are supposed to end before 
any of the already running jobs will end.

Note the "default_duration" was set to a real value. The default "infinity" 
would lead to the effect, that infinity is always judged of being smaller than 
infinity and (endless) backfilling will occur all of the time. Even already 
running jobs must get this values applied (h_rt), as they would otherwise keep 
the queue open for the small jobs. Or just wait until all running jobs without 
this constraint were drained.
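
As a sketch (PE name, script names and run times are just examples):

$ qconf -msconf      # set max_reservation to e.g. 20 and a finite default_duration
$ qsub -R y -pe mpi 144 -l h_rt=24:00:00 parallel_job.sh
$ qsub -l h_rt=0:30:00 serial_job.sh     # short enough to be backfilled without delaying the reservation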

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qmon finished jobs

2019-06-18 Thread Reuti
Hi,

> Am 18.06.2019 um 15:47 schrieb David Trimboli :
> 
> I know qmon is deprecated and all that, but I was just wondering if someone 
> could tell me where to customize the number of finished jobs visible in Job 
> Control | Finished Jobs.
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

Inside `qmon` it's "Cluster Configuration", there the entry "global" => "Modify", and 
then in "General Settings" the field "Finished Jobs" in the top right.

But it won't retrieve jobs which were already drained from the listing. Also 
after a restart of the `qmaster`, the list will initially be empty.
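
The same value can also be set on the command line; it is the finished_jobs parameter of the global cluster configuration (a sketch; 100 is just an example):

$ qconf -mconf global
...
finished_jobs 100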

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] jobs randomly die

2019-05-14 Thread Reuti
AFAICS the KILL sent by SGE happens after a task already returned with an 
error. SGE would in this case use the kill signal to be sure to kill all child 
processes. Hence the question would be: what was the initial command in the 
job script, and what output/error did it generate?
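
If the job's own output/error files do not reveal it, one debugging aid (a sketch; KEEP_ACTIVE is only meant for temporary debugging) is to keep the shepherd's spool directory of finished jobs for inspection:

$ qconf -mconf
...
execd_params KEEP_ACTIVE=TRUE

and then to look on the exechost, e.g. under $SGE_ROOT/default/spool/<node>/active_jobs/<job_id>.<task_id>/ (the exact spool location depends on your installation), for the trace and error files of the failed task.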

-- Reuti

> Am 14.05.2019 um 11:36 schrieb hiller :
> 
> Dear all,
> i have a problem that jobs sent to gridengine randomly die.
> The gridengine version is 8.1.9
> The OS is opensuse 15.0
> The gridengine messages file says:
> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - 
> killing job
> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 
> assumedly after job because: job 635659.1 died through signal KILL (9)
> 
> qacct -j 635659 says:
> failed   100 : assumedly after job
> exit_status  137  (Killed)
> 
> 
> The was no kill triggered by the user. Also there are no other limitations, 
> neither ulimit nor in the gridengine queue
> The 'qconf -sq all.q' command gives:
> s_rt  INFINITY
> h_rt  INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize   INFINITY
> h_fsize   INFINITY
> s_dataINFINITY
> h_dataINFINITY
> s_stack   INFINITY
> h_stack   INFINITY
> s_coreINFINITY
> h_coreINFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmemINFINITY
> h_vmemINFINITY
> 
> Years ago there were some threads about the same issue, but i did not find a 
> solution.
> 
> Does somebody have a hint what i can do or check/debug?
> 
> With kind regards and many thanks for any help, ulrich
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] I need a decoder ring for the qacct output

2019-04-25 Thread Reuti


> Am 25.04.2019 um 17:41 schrieb Mun Johl :
> 
> Hi Skyler, Reuti,
> 
> Thank you for your reply.
> Please see my comments below.
> 
> On Thu, Apr 25, 2019 at 08:03 AM PDT, Reuti wrote:
>> Hi,
>> 
>>> Am 25.04.2019 um 16:53 schrieb Mun Johl :
>>> 
>>> Hi,
>>> 
>>> I'm using 'qacct -P' in the hope of tracking metrics on a per project
>>> basis.  I am getting data out of qacct, however I don't fully comprehend
>>> what the data is trying to tell me.
>>> 
>>> I've searched the man pages and web for definitions of the output of
>>> qacct, but I have not been able to find a complete reference (just bits
>>> and pieces here and there).
>>> 
>>> Can anyone point me to a complete reference so that I can better
>>> understand the output of qacct?
>> 
>> There is a man page about it:
>> 
>> man accounting
> 
> Well, I _did_ look at that prior to posting but I guess I just didn't
> see the keywords I was looking for.  So maybe I'll just ask the specific
> questions regarding my confusion.
> 
> WALLCLOCK is pretty well defined by ru_wallclock.  So that's basically
> the total wall clock time the job was on the execution host.
> 
> UTIME is user time used.
> STIME is system time used.
> 
> Should (UTIME + STIME) >= WALLCLOCK?  It isn't in my case and is mainly
> why I am confused.  Or perhaps process wait time is not included?

You mean in case of a parallel application? You set "accounting_summary" to 
"true" and get only a single record back?

This depends on how the used CPU time is acquired by the OS (and whether all 
created processes are taken into account, even if they jump out of the process 
tree [like with `setsid`]). More reliable is the CPU time collected by SGE via 
the additional group ID.
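
As a rough cross-check (a sketch; the job id is an example): for a parallel job compare cpu with ru_wallclock × slots rather than with ru_wallclock alone:

$ qacct -j 12345 | egrep '^(slots|ru_wallclock|cpu|mem|maxvmem) '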

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] I need a decoder ring for the qacct output

2019-04-25 Thread Reuti
Hi,

> Am 25.04.2019 um 16:53 schrieb Mun Johl :
> 
> Hi,
> 
> I'm using 'qacct -P' in the hope of tracking metrics on a per project
> basis.  I am getting data out of qacct, however I don't fully comprehend
> what the data is trying to tell me.
> 
> I've searched the man pages and web for definitions of the output of
> qacct, but I have not been able to find a complete reference (just bits
> and pieces here and there).
> 
> Can anyone point me to a complete reference so that I can better
> understand the output of qacct?

There is a man page about it:

man accounting

-- Reuti


> 
> Thank you,
> 
> -- 
> Mun
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Best way to restrict a user to a specific exec host?

2019-04-09 Thread Reuti


Am 09.04.2019 um 21:08 schrieb Mun Johl:

> Hi Reuti,
> 
> One clarification question below ...
> 
> On Tue, Apr 09, 2019 at 09:05 AM PDT, Reuti wrote:
>>> Am 09.04.2019 um 17:43 schrieb Mun Johl :
>>> 
>>> Hi Reuti,
>>> 
>>> Thank you for your reply!
>>> Please see my comments below.
>>> 
>>> On Mon, Apr 08, 2019 at 10:27 PM PDT, Reuti wrote:
>>>> Hi,
>>>> 
>>>>> Am 09.04.2019 um 05:37 schrieb Mun Johl :
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> My company is hiring a contractor for some development work.  As such, I
>>>>> need to modify our grid configuration so that he only has access to a
>>>>> single execution host.  That particular host (let's call it serverA)
>>>>> will not have all of our data disks mounted.
>>>>> 
>>>>> NOTE: We are running SGE v8.1.9 on systems running Red Hat Enterprise 
>>>>> Linux v6.8 .
>>>>> 
>>>>> I'm not really sure how to proceed.  I'm thinking of perhaps creating a
>>>>> new queue which only resides on serverA.
>>>> 
>>>> There is no need for an additional queue. You can add him to the 
>>>> xuser_lists of all other queues. But a special queue with a limited number 
>>>> of slots might give the contractor more priority to check his development 
>>>> faster. Depends on personal taste whether this one is preferred. This 
>>>> queue could have a forced complex with a high urgency, which he always 
>>>> have to request (or you use JSV to add this to his job submissions).
>>> 
>>> How would I proceed if I did not create an additional queue?  You have
>>> me intrigued.  That is, if I add him to the xuser_lists of all queues,
>>> he wouldn't be able to submit a job, would he?  Perhaps I'm confused.
>> 
>> All entries in the (cluster) queue definition allow a list of different 
>> characteristics (similar to David's setup in the recent post):
>> 
>> $ qconf -sq all.q
>> ...
>> user_lists   NONE,[development_machine=banned_users]
>> xuser_lists   NONE,[@ordinary_hosts=banned_users]
> 
> I created a host group of servers only accessible by employees (not the
> contractor).  And then I created an ACL named "contractors" which
> contains the contractor's username.
> 
> So if I want to forbid the "contractors" from accessing the @EmpOnly
> servers on a given queue, would I simply modify the following
> xuser_lists line in the queue file as shown below?
> 
> xuser_lists   NONE,[@EmpOnly=contractors]

Yes.

If you don't want to do it in an editor, you can also use the command line:

$ qconf -aattr queue xuser_lists contractors your_qname_here@@EmpOnly

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Best way to restrict a user to a specific exec host?

2019-04-09 Thread Reuti

> Am 09.04.2019 um 17:43 schrieb Mun Johl :
> 
> Hi Reuti,
> 
> Thank you for your reply!
> Please see my comments below.
> 
> On Mon, Apr 08, 2019 at 10:27 PM PDT, Reuti wrote:
>> Hi,
>> 
>>> Am 09.04.2019 um 05:37 schrieb Mun Johl :
>>> 
>>> Hi all,
>>> 
>>> My company is hiring a contractor for some development work.  As such, I
>>> need to modify our grid configuration so that he only has access to a
>>> single execution host.  That particular host (let's call it serverA)
>>> will not have all of our data disks mounted.
>>> 
>>> NOTE: We are running SGE v8.1.9 on systems running Red Hat Enterprise Linux 
>>> v6.8 .
>>> 
>>> I'm not really sure how to proceed.  I'm thinking of perhaps creating a
>>> new queue which only resides on serverA.
>> 
>> There is no need for an additional queue. You can add him to the xuser_lists 
>> of all other queues. But a special queue with a limited number of slots might 
>> give the contractor more priority to check his development faster. Depends on 
>> personal taste whether this one is preferred. This queue could have a forced 
>> complex with a high urgency, which he always have to request (or you use JSV 
>> to add this to his job submissions).
> 
> How would I proceed if I did not create an additional queue?  You have
> me intrigued.  That is, if I add him to the xuser_lists of all queues,
> he wouldn't be able to submit a job, would he?  Perhaps I'm confused.

All entries in the (cluster) queue definition allow a list of different 
characteristics (similar to David's setup in the recent post):

$ qconf -sq all.q
…
user_lists   NONE,[development_machine=banned_users]
xuser_lists   NONE,[@ordinary_hosts=banned_users]

to keep him away from certain machines only. You don't need both entries; it 
depends whether there are machines for development use only, for ordinary users 
only, and a pool of machines for mixed use. Sure, one would rename it to 
"contractor_team" instead of "banned_users" if it's used in "user_lists" too.


> 
>>> We would ask the contractor to
>>> specify this new queue for his jobs.  Furthermore, I would add the
>>> contractor to the xuser_lists of all other queues.
>>> 
>>> Does that sound reasonable
>> 
>> Yes.
>> 
>> 
>>> or is there an easier method for
>>> accomplishing this task within SGE?
>>> 
>>> IF it makes sense to proceed in this manner, what is the easiest way to
>>> add the username of the contractor to the xuser_lists parameter?  Can I
>>> simply add his username?  Or do I need to create a new access list for him?
>> 
>> Yes.
>> 
>> $ qconf -au john_doe banned_users
> 
> Okay, so to confirm: I create the banned_users ACL and add that ACL to
> all queues for which john_joe is banned.  Correct?
> 
> Thanks again for your time and knowledge!

Either this or create a hostlist to shorten the number of machines for the 
above setup.

===

Even a forced complex could be bound this way to a hostgroup only:

$ qconf -sq all.q
…
complex_values    NONE,[@ordinary_hosts=contractor=TRUE]

and the BOOL complex "contractor" with a high urgency.
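
The complex itself would then be defined roughly like this (a sketch; name, shortcut and urgency value are examples):

$ qconf -mc
#name        shortcut  type  relop  requestable  consumable  default  urgency
contractor   ctr       BOOL  ==     FORCED       NO          FALSE    1000

With requestable set to FORCED, jobs can only land on the hosts where the complex is defined if they request it explicitly with `-l contractor` (or get it added by a JSV).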

-- Reuti


> Best regards,
> 
> -- 
> Mun
> 
> 
>>> Any and all examples of how to implement this type of configuration
>>> would be greatly appreciated since I am not an SGE expert by any stretch
>>> of the imagination.
>>> 
>>> By the way, would the contractor only need an account on serverA in
>>> order to utilize SGE?  Or would he need an account on the grid master as
>>> well?
>> 
>> Are you not using a central user administration by NIS or LDAP?
>> 
>> AFAICS he needs an entry only on the execution host (and on the submission 
>> host of course).
>> 
>> -- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Best way to restrict a user to a specific exec host?

2019-04-08 Thread Reuti
Hi,

> Am 09.04.2019 um 05:37 schrieb Mun Johl :
> 
> Hi all,
> 
> My company is hiring a contractor for some development work.  As such, I
> need to modify our grid configuration so that he only has access to a
> single execution host.  That particular host (let's call it serverA)
> will not have all of our data disks mounted.
> 
> NOTE: We are running SGE v8.1.9 on systems running Red Hat Enterprise Linux 
> v6.8 .
> 
> I'm not really sure how to proceed.  I'm thinking of perhaps creating a
> new queue which only resides on serverA.

There is no need for an additional queue. You can add him to the xuser_lists of 
all other queues. But a special queue with a limited number of slots might give 
the contractor more priority to check his development faster. It depends on 
personal taste whether this one is preferred. This queue could have a forced 
complex with a high urgency, which he always has to request (or you use a JSV to 
add this to his job submissions).


>  We would ask the contractor to
> specify this new queue for his jobs.  Furthermore, I would add the
> contractor to the xuser_lists of all other queues.
> 
> Does that sound reasonable

Yes.


> or is there an easier method for
> accomplishing this task within SGE?
> 
> IF it makes sense to proceed in this manner, what is the easiest way to
> add the username of the contractor to the xuser_lists parameter?  Can I
> simply add his username?  Or do I need to create a new access list for him?

Yes.

$ qconf -au john_doe banned_users


> Any and all examples of how to implement this type of configuration
> would be greatly appreciated since I am not an SGE expert by any stretch
> of the imagination.
> 
> By the way, would the contractor only need an account on serverA in
> order to utilize SGE?  Or would he need an account on the grid master as
> well?

Are you not using a central user administration by NIS or LDAP?

AFAICS he needs an entry only on the execution host (and on the submission host 
of course).

-- Reuti

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Limiting users' access to nodes

2019-04-08 Thread Reuti
Hi,

> Am 20.03.2019 um 17:19 schrieb David Trimboli :
> 
> Something I can't quite find in the manuals...
> 
> I have a couple of hosts in my cluster on which some users don't have 
> accounts. I want to configure Grid Engine to reject attempts by those users 
> to send jobs to those hosts. I've been trying to figure out how to set up a 
> host-based whitelist (it's shorter than a blacklist), but I can't quite get 
> it. How can I do this?

Please have a look at the user_lists entry. Each entry has to be an ACL, i.e. 
even a single user would need to be in an ACL with only his own name as a 
single entry. You can use them either on a per host basis (`qconf -me …`), or 
on a queue level (`qconf -mq …`). Which one you prefer depends on the number of 
hosts or queues you want to allow.

On a queue level you could use:

$ qconf -sq all.q
…
user_lists   NONE,[@special_hosts=allowed_users]

and the whitelisted users have to be in the allowed_users ACL, and @special_hosts 
is the hostgroup of machines that ordinary users are banned from.
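
A sketch of the full setup (ACL, hostgroup and user names are examples):

$ qconf -au alice,bob allowed_users       # create/extend the ACL
$ qconf -ahgrp @special_hosts             # add the restricted hosts in the editor
$ qconf -mattr queue user_lists 'NONE,[@special_hosts=allowed_users]' all.q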

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Limiting each user's slots across all nodes

2019-03-12 Thread Reuti

> Am 12.03.2019 um 15:55 schrieb David Trimboli :
> 
> 
> On 3/5/2019 12:34 PM, David Trimboli wrote:
>> 
>> On 3/5/2019 12:18 PM, Reuti wrote:
>>>> Am 05.03.2019 um 18:06 schrieb David Trimboli 
>>>> :
>>>> 
>>>> I'm looking at SGE limits, and I'm not sure when something applies to all 
>>>> users or each user individually. I want to find out how to limit each user 
>>>> to a certain number of slots across the entire cluster (just one queue).
>>>> 
>>>> I feel like this isn't it:
>>>> 
>>>> {
>>>> Name   limit-user-slots
>>>> descriptionLimit each user to 10 slots
>>>> enabledtrue
>>>> limit  users * queues {all.q} to slots=10
>>>> 
>>> limit users {*} queues all.q to slots=10
>>> 
>>> In principle {all.q} wouldn't hurt as it means "for each entry in the 
>>> list", and the only entry is all.q. But to lower the impact I would leave 
>>> this out.
>>> 
>> Ohhh! I didn't realize that {} meant to apply to each entry in the list. 
>> That gives me everything I need. Thanks to you and Bernd.
> 
> Now a followup question. I implemented this rule to ensure that no single 
> user takes more than 90% of our available slots:
> {
> namelimit90percent
> descriptionNONE
> enabledTRUE
> limitusers {*} to slots=536
> }
> 
> (Our cluster has a total of 596 slots.) This worked fine until someone tried 
> to submit a parallel environment job with the -pe option. On 16 out of our 24 
> nodes, it still worked. But if they sent a job hard-queued to one of the 
> upper nodes 17–24, it would never run, with this in the scheduling info:

What was the submission command? A plain '-q upper'? There was/is an issue 
where you have to specify instead '-q "*@@upper"' for a hostgroup named @upper. 
Or one can try to have a dedicated PE only for the upper nodes and request this 
PE (i.e. in the queue configuration "pe_list …,[@upper=upper]").
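
For illustration (a sketch; "threads_upper" is a made-up PE name):

$ qconf -sq all.q
...
pe_list   threads,[@upper=threads_upper]

$ qsub -pe threads_upper 16 -R y job.sh      # lands only where that PE is offered
$ qsub -pe threads 16 -q '*@@upper' job.sh   # the hostgroup-qualified -q form mentioned above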

-- Reuti


> cannot run because it exceeds limit "trimboli/" in rule "limit90percent/1"
> cannot run in PE "threads" because it only offers 0 slots
> 
> (My username is trimboli.) Now, it's quite possible that the upper nodes are 
> set up differently than the lower nodes. The upper eight nodes were installed 
> later than the others and have been treated differently in the past. I'd like 
> to find what setting in the upper nodes is making this limit say that there 
> are 0 slots when a PE job is run. Where can I look to find the culprit?


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] A Virtual GridEngine Cluster in a cluster

2019-03-08 Thread Reuti
While the original idea was to use some workflow for low core count jobs in a 
SLURM cluster, it ended up with a setup of a Virtual Cluster in (possibly) any 
queuing system. Although it might depend on the site policy to allow and use 
such a set up, it's at least a working scenario and might add features to any 
actual installation, which are not available or not set up. On the other hand 
this provides some kind of micro-scheduling inside the given allocation which 
is not available otherwise.

We got access to a SLURM-equipped cluster where one always gets complete nodes 
and is asked to avoid single serial jobs, or to pack them by scripting to fill 
the nodes. With the additional need for a workflow application (kinda DRMAA) 
and array job dependencies, I got the idea to run a GridEngine instance as a 
Virtual Cluster in a SLURM cluster to solve this.

Essentially it's quite easy, as GridEngine offers:

- one can start SGE as normal user (for a single user setup per Virtual Cluster 
it's exactly appropriate)
- SGE supports independent configurations, i.e. each Virtual Cluster is an 
SGE_CELL
- configuration files can be plain text files (classic), and hence are easily 
adjustable

After an untar of SGE somewhere like 
/YOUR/PATH/HERE/VC-common-installation/opt/sge (no need to install anything 
here), we need a planchet of a "classic" configuration put there named 
"__SGE_PLANCHET__", and like the /tmp directory everyone should be able to 
write at this level besides the "__SGE_PLANCHET__" (`chmod go=rwx,+t 
/YOUR/PATH/HERE/VC-common-installation/opt/sge`). To the planchet you can add 
items as needed, e.g. more PEs, complexes, queues,…

The enclosed script `multi-spawn.sh` gives an idea what has to be done then to 
start a virtual cluster, even several ones per user, i.e.:

$ sbatch multi-spawn.sh

Regarding DRMAA one doesn't need to run this on the login node or a dedicated 
job, instead the workflow application is already part of the (SLURM) job itself 
(to be put in the application section in `multi-spawn.sh`).

===

While the planchet was created still with 6.2u5, there are only a few steps 
necessary to create one for your version of SGE:

Run each install_* for qmaster and execd once. Essentially this will create 
only a configuration and choose "classic" for the spooling method (no need to 
add any exechost when you are asked for, in fact: remove the one which was 
added afterwards, and in the @allhosts hostgroup too). Then rename this created 
"default" configuration to "__SGE_PLANCHET__" and look in my planchet with 
`grep` for entries like __FOO__ (i.e. strings enclosed by a double underscore). 
These have to be replaced then there accordingly. The `multi-spawn.sh` will 
then change these in a copy of the planchet to the names and location of the 
actual SGE instance; i.e. each SGE_CELL has also its own spool directory.

Notably it's in sgemaster and sgeexecd:

SGE_ROOT=/usr/sge; export SGE_ROOT
SGE_CELL=default; export SGE_CELL

to:

SGE_ROOT=__SGE_INSTALLATION__; export SGE_ROOT
SGE_CELL=__SGE_CELL__; export SGE_CELL
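
Not the actual multi-spawn.sh (which is attached), but a sketch of the kind of substitution it has to perform per Virtual Cluster, with made-up variable names following the description above:

CELL=vc_${SLURM_JOB_ID}
SGE_INSTALLATION=/YOUR/PATH/HERE/VC-common-installation/opt/sge
cp -r ${SGE_INSTALLATION}/__SGE_PLANCHET__ ${SGE_INSTALLATION}/${CELL}
grep -rl '__SGE_' ${SGE_INSTALLATION}/${CELL} | while read f; do
    sed -i "s|__SGE_CELL__|${CELL}|g; s|__SGE_INSTALLATION__|${SGE_INSTALLATION}|g" "$f"
done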

===

You might need passphraseless `ssh` between the nodes, unless you start remote 
daemons by `srun`. If this is not working too, a pseudo MPI application whose 
only duty is to start the sgeexecd on each involved node should do.

===

In case you want to login to one of the nodes which were granted for your 
Virtual Cluster interactively, you need to:

$ source 
/YOUR/PATH/HERE/VC-common-installation/opt/sge/SGE_/common/settings.sh

there to gain access to the SGE commands in the interactive shell for this 
particular Virtual Cluster. Therefore two mini functions `sge-set 
` and `sge-done` are included to ease this.

While this works on the nodes instantly, it's necessary to add the head node(s) 
of the SLURM cluster in the planchet beforehand as submit and/or admin hosts.

===

In case one wants to send emails, note that the default for GridEngine is the 
account of the login node, which is in this case an exechost for SLURM. Either 
a special set up there is necessary to receive email on an exechost, or provide 
always an absolute eMail address with the option "-M" to GridEngine.

===

As every VC starts with job id 1, it might be helpful to create scratch 
directories (in a global prolog/epilog) consisting of 
"${SLURM_JOB_ID}_$(basename ${TMPDIR})". If you are getting always full nodes, 
you won't have this problem on a local scratch directory for $TMPDIR though.

===

BTW: did I mention it: no need to be root anywhere.

-- Reuti



multi-spawn.sh
Description: Binary data


__SGE_PLANCHET__.tgz
Description: Binary data


cluster.tgz
Description: Binary data
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Priority?

2019-03-06 Thread Reuti


> Am 07.03.2019 um 07:08 schrieb Simon Matthews :
> 
> I have a small grid running SoGE 8.1.8,
> 
> Prioritization doesn't seem to work. Jobs seem to run purely in the
> order they are submitted (after allowing for prerequisite jobs).
> Changing priority of the jobs doesn't seem to change the order in
> which they run.

You mean the value you set with "-p"?

To which value did you change this for certain jobs?

$ qstat -pri

might give a hint about the overall values which are assigned to a job in 
column "npprior".

-- Reuti


> 
> Can anyone else confirm this behaviour? Is there anything I might be
> doing wrong.
> 
> Simon
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Limiting each user's slots across all nodes

2019-03-05 Thread Reuti
Hi,

> Am 05.03.2019 um 18:06 schrieb David Trimboli :
> 
> I'm looking at SGE limits, and I'm not sure when something applies to all 
> users or each user individually. I want to find out how to limit each user to 
> a certain number of slots across the entire cluster (just one queue).
> 
> I feel like this isn't it:
> 
> {
> Name   limit-user-slots
> descriptionLimit each user to 10 slots
> enabledtrue
> limit  users * queues {all.q} to slots=10

limit users {*} queues all.q to slots=10

In principle {all.q} wouldn't hurt as it means "for each entry in the list", 
and the only entry is all.q. But to lower the impact I would leave this out.
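
So the complete rule set would look like this (a sketch, using the numbers from your example):

{
   name         limit-user-slots
   description  Limit each user to 10 slots
   enabled      TRUE
   limit        users {*} queues all.q to slots=10
}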

-- Reuti


> }
> 
> I get the feeling that will limit the number of slots that all users can 
> collectively use simultaneously to 10. I want Bob to have no more than 10 
> slots, Joe to have no more than 10 slots, etc.
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Fair share policy

2019-02-27 Thread Reuti
Hi,

> Am 27.02.2019 um 22:07 schrieb Kandalaft, Iyad (AAFC/AAC) 
> :
> 
> HI Reuti
> 
> I'm implementing only a share-tree.

Then you can set:

policy_hierarchy  S

The past usage is stored in the user object, hence auto_user_delete_time 
should be zero (and in all the entries which were already created, the 
delete_time should be zero as well: qconf -suserl). The fshare value set therein 
shouldn't be honored in case you set up only the share-tree policy.
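
For an already auto-created user that would mean (a sketch; the user name is an example):

$ qconf -muser alice
...
delete_time 0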

-- Reuti


>  The docs somewhere state something along the lines of use one or the other.
> I've seen the man page as  It explains most of the math but leaves out some 
> key elements.  For example, how are "tickets" handed out and in what quantity 
> (i.e. why do some job get 2 tickets based on my configuration below).  
> Also, the normalization function puts the values between 0 and 1 but based on 
> what?
>  Number of tickets issued to the job divided by the total?
> 
> Thanks for your help.
> 
> Iyad Kandalaft
> 
> -Original Message-
> From: Reuti  
> Sent: Wednesday, February 27, 2019 4:00 PM
> To: Kandalaft, Iyad (AAFC/AAC) 
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Fair share policy
> 
> Hi,
> 
> there is a man page "man sge_priority". Which policy do you intend to use: 
> share-tree (honors past usage) or functional (current use), or both?
> 
> -- Reuti
> 
> 
>> Am 25.02.2019 um 15:03 schrieb Kandalaft, Iyad (AAFC/AAC) 
>> :
>> 
>> Hi all,
>> 
>> I recently implemented a fair share policy using share tickets.  I’ve been 
>> monitoring the cluster for a couple of days using qstat -pri -ext -u “*” in 
>> order to see how the functional tickets are working and it seems to have the 
>> intended effect.  There are some anomalies where some running jobs have 0 
>> tickets but still get scheduled since there’s free resources; I assume this 
>> is normal.
>> 
>> I’ll admit that I don’t fully understand the scheduling as it’s somewhat 
>> complex.  So, I’m hoping someone can review the configuration to see if they 
>> can find any glaring issues such as conflicting options.
>> 
>> I created a share-tree and gave all users an equal value of 10:
>> $ qconf -sstree
>> id=0
>> name=Root
>> type=0
>> shares=1
>> childnodes=1
>> id=1
>> name=default
>> type=0
>> shares=10
>> childnodes=NONE
>> 
>> I modified the scheduling by setting the weight_tickets_share to 100. I 
>> reduced the weight_waiting_time weight_priority weight_urgency to well below 
>> the weight_ticket (what are good values?).
>> $ qconf -ssconf
>> algorithm default
>> schedule_interval 0:0:15
>> maxujobs  0
>> queue_sort_method seqno
>> job_load_adjustments  np_load_avg=0.50
>> load_adjustment_decay_time0:7:30
>> load_formula  np_load_avg
>> schedd_job_info   false
>> flush_submit_sec  0
>> flush_finish_sec  0
>> paramsnone
>> reprioritize_interval 0:0:0
>> halftime  168
>> usage_weight_list cpu=0.70,mem=0.20,io=0.10
>> compensation_factor   5.00
>> weight_user   0.25
>> weight_project0.25
>> weight_department 0.25
>> weight_job0.25
>> weight_tickets_functional 0
>> weight_tickets_share  100
>> share_override_ticketsTRUE
>> share_functional_shares   TRUE
>> max_functional_jobs_to_schedule   200
>> report_pjob_tickets   TRUE
>> max_pending_tasks_per_job 50
>> halflife_decay_list   none
>> policy_hierarchy  OFS
>> weight_ticket 0.50
>> weight_waiting_time   0.10
>> weight_deadline   360.00
>> weight_urgency0.01
>> weight_priority   0.01
>> max_reservation   0
>> default_duration  INFINITY
>> 
>> I modified all the users to set the fshare to 1000 $ qconf -muser XXX
>> 
>> I modified the general conf to auto_user_fsahre 1000 and 
>> auto_user_delete_time 7776000 (90 days).  Halftime is set to the default 7 
>> days (I assume I should increase this).  I don’t know if 

Re: [gridengine users] Fair share policy

2019-02-27 Thread Reuti
Hi,

there is a man page "man sge_priority". Which policy do you intend to use: 
share-tree (honors past usage) or functional (current use), or both?

-- Reuti


> Am 25.02.2019 um 15:03 schrieb Kandalaft, Iyad (AAFC/AAC) 
> :
> 
> Hi all,
>  
> I recently implemented a fair share policy using share tickets.  I’ve been 
> monitoring the cluster for a couple of days using qstat -pri -ext -u “*” in 
> order to see how the functional tickets are working and it seems to have the 
> intended effect.  There are some anomalies where some running jobs have 0 
> tickets but still get scheduled since there’s free resources; I assume this 
> is normal.
>  
> I’ll admit that I don’t fully understand the scheduling as it’s somewhat 
> complex.  So, I’m hoping someone can review the configuration to see if they 
> can find any glaring issues such as conflicting options.
>  
> I created a share-tree and gave all users an equal value of 10:
> $ qconf -sstree
> id=0
> name=Root
> type=0
> shares=1
> childnodes=1
> id=1
> name=default
> type=0
> shares=10
> childnodes=NONE
>  
> I modified the scheduling by setting the weight_tickets_share to 100. I 
> reduced the weight_waiting_time weight_priority weight_urgency to well below 
> the weight_ticket (what are good values?).
> $ qconf -ssconf
> algorithm default
> schedule_interval 0:0:15
> maxujobs  0
> queue_sort_method seqno
> job_load_adjustments  np_load_avg=0.50
> load_adjustment_decay_time0:7:30
> load_formula  np_load_avg
> schedd_job_info   false
> flush_submit_sec  0
> flush_finish_sec  0
> paramsnone
> reprioritize_interval 0:0:0
> halftime  168
> usage_weight_list cpu=0.70,mem=0.20,io=0.10
> compensation_factor   5.00
> weight_user   0.25
> weight_project0.25
> weight_department 0.25
> weight_job0.25
> weight_tickets_functional 0
> weight_tickets_share  100
> share_override_ticketsTRUE
> share_functional_shares   TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets   TRUE
> max_pending_tasks_per_job 50
> halflife_decay_list   none
> policy_hierarchy  OFS
> weight_ticket 0.50
> weight_waiting_time   0.10
> weight_deadline   360.00
> weight_urgency0.01
> weight_priority   0.01
> max_reservation   0
> default_duration  INFINITY
>  
> I modified all the users to set the fshare to 1000
> $ qconf -muser XXX
>  
> I modified the general conf to auto_user_fsahre 1000 and 
> auto_user_delete_time 7776000 (90 days).  Halftime is set to the default 7 
> days (I assume I should increase this).  I don’t know if 
> auto_user_delete_time even matters.
> $ qconf -sconf
> #global:
> execd_spool_dir  /opt/gridengine/default/spool
> mailer   /opt/gridengine/default/commond/mail_wrapper.py
> xterm/usr/bin/xterm
> load_sensor  none
> prolog   none
> epilog   none
> shell_start_mode posix_compliant
> login_shells sh,bash
> min_uid  100
> min_gid  100
> user_lists   none
> xuser_lists  none
> projects none
> xprojectsnone
> enforce_project  false
> enforce_user auto
> load_report_time 00:00:40
> max_unheard  00:05:00
> reschedule_unknown   00:00:00
> loglevel log_info
> administrator_mail   none
> set_token_cmdnone
> pag_cmd  none
> token_extend_timenone
> shepherd_cmd none
> qmaster_params   none
> execd_params ENABLE_BINDING=true ENABLE_ADDGRP_KILL=true \
>  H_DESCRIPTORS=16K
> reporting_params accounting=true reporting=true \
>  flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs100
> gid_range2-20100
> qlogin_command   /opt/gridengine/bin/rocks-qlogin.sh

Re: [gridengine users] Accessing qacct accounting file from login/compute nodes

2019-02-19 Thread Reuti
Hi,

> Am 20.02.2019 um 05:31 schrieb Derrick Lin :
> 
> Hi guys,
> 
> On our SGE cluster, the accounting file stored on the qmaster node and is not 
> accessible outside. qmaster node is not accessible by any user either.
> 
> Now we have users request to obtain accounting info via qacct. I am wondering 
> what is the common way to achieve this without giving access to the qmaster 
> node?

You mean, $SGE_ROOT is not shared in your cluster?
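
If sharing the whole $SGE_ROOT is not wanted, one workaround (a sketch; paths are examples, and the accounting file of the default cell lives under $SGE_ROOT/default/common/) is to publish a read-only copy and let users point qacct at it:

# on the qmaster, e.g. from cron:
rsync -a $SGE_ROOT/default/common/accounting /shared/sge/accounting

# on a login node:
$ qacct -o $USER -f /shared/sge/accounting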

-- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] starting a new gridengine accounting file

2019-01-29 Thread Reuti
Hi,

> Am 29.01.2019 um 17:09 schrieb John Young :
> 
> The gridengine accounting file on our cluster has gotten
> rather large.  I have looked around in the Gridengine docs
> for information on how to close it and start another file
> but if it is there, I missed it.
> 
> Does anyone know how to do this?

Just rename it, SGE should start a new one.

There is even a script in $SGE_ROOT/util/logchecker.sh to be run as a cron-job, 
which renames and compresses the file(s) keeping certain versions as backup.
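
Done by hand that would be something like (a sketch; the path assumes the default cell name):

$ cd $SGE_ROOT/default/common
$ mv accounting accounting.$(date +%Y%m%d)
$ gzip accounting.$(date +%Y%m%d)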

The compressed ones can later be used by e.g.:

$ qacct -f <(zcat accounting.0.gz)

-- Reuti



> -- 
>   JY
> --
> "All ideas and opinions expressed in this communication are
> those of the author alone and do not necessarily reflect the
> ideas and opinions of anyone else."
> 
> -- 
>   JY
> 
> John E. Young NASA LaRC B1148/R226
> Analytical Mechanics Associates, Inc. (757) 864-8659
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Reuti
Hi,

> Am 26.01.2019 um 10:20 schrieb Joseph Farran :
> 
> Hi.
> Our Grid Engine is running very sluggish all of a sudden. Sqe_qmaster stays 
> at 100% all the time where is used to be 100% for a few seconds every 30 
> seconds or so.
> I ran the qping command but not sure how to read it.   Any helpful insight 
> much appreciated

Did you try to stop and start the qmaster?
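
For example (a sketch; it assumes the usual init script below the cell directory, run as root):

# $SGE_ROOT/default/common/sgemaster stop
# $SGE_ROOT/default/common/sgemaster start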

-- Reuti


> qping -i 5 -info hpc-s 6444 qmaster 1
> 01/26/2019 01:12:18:
> SIRM version: 0.1
> SIRM message id:  1
> start time:   01/26/2019 01:10:13 (1548493813)
> run time [s]: 125
> messages in read buffer:  0
> messages in write buffer: 0
> no. of connected clients: 296
> status:   0
> info: MAIN: R (125.20) | signaler000: R (123.69) | 
> event_master000: R (0.14) | timer000: R (4.52) | worker000: R (0.14) | 
> worker001: R (3.44) | worker002: R (7.33) | worker003: R (3.43) | worker004: 
> R (3.08) | worker005: R (1.42) | OK
> malloc:   arena(34410496) |ordblks(9370) | smblks(164269) | 
> hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(7726000) | uordblks(24248176) | 
> fordblks(10162320) | keepcost(119856)
> Monitor:
> 01/26/2019 01:10:13 | MAIN: no monitoring data available
> 01/26/2019 01:10:14 | signaler000: no monitoring data available
> 01/26/2019 01:12:14 | event_master000: runs: 4.82r/s (clients: 1.00 mod: 
> 0.02/s ack: 0.02/s blocked: 0.00 busy: 0.81 | events: 5.52/s added: 5.47/s 
> skipt: 0.05/s) out: 0.00m/s APT: 0.0002s/m idle: 99.89% wait: 0.00% time: 
> 60.00s
> 01/26/2019 01:12:14 | timer000: runs: 0.47r/s (pending: 12.00 executed: 
> 0.45/s) out: 0.00m/s APT: 0.0002s/m idle: 99.99% wait: 0.00% time: 60.00s
> 01/26/2019 01:11:19 | worker000: runs: 0.68r/s (EXECD 
> (l:0.32,j:0.28,c:0.32,p:0.00,a:0.00)/s GDI 
> (a:0.25,g:1.08,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 
> 0.82m/s APT: 0.0036s/m idle: 99.75% wait: 0.00% time: 64.96s
> 01/26/2019 01:12:15 | worker001: runs: 0.81r/s (EXECD 
> (l:0.02,j:0.02,c:0.02,p:0.00,a:0.00)/s GDI 
> (a:0.00,g:1.92,m:0.08,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 
> 0.81m/s APT: 0.0008s/m idle: 99.93% wait: 0.00% time: 59.27s
> 01/26/2019 01:11:16 | worker002: runs: 0.73r/s (EXECD 
> (l:0.28,j:0.23,c:0.26,p:0.00,a:0.00)/s GDI 
> (a:0.34,g:1.13,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 
> 0.71m/s APT: 0.0030s/m idle: 99.78% wait: 0.17% time: 61.75s
> 01/26/2019 01:12:15 | worker003: runs: 0.75r/s (EXECD 
> (l:0.03,j:0.02,c:0.03,p:0.00,a:0.00)/s GDI 
> (a:0.02,g:1.23,m:0.07,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 
> 0.73m/s APT: 0.0008s/m idle: 99.94% wait: 0.02% time: 60.40s
> 01/26/2019 01:11:26 | worker004: runs: 0.68r/s (EXECD 
> (l:0.23,j:0.21,c:0.23,p:0.00,a:0.00)/s GDI 
> (a:0.27,g:1.69,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 
> 0.65m/s APT: 0.0012s/m idle: 99.92% wait: 0.00% time: 71.11s
> 01/26/2019 01:11:31 | worker005: runs: 0.56r/s (EXECD 
> (l:0.25,j:0.24,c:0.25,p:0.00,a:0.00)/s GDI 
> (a:0.20,g:1.05,m:0.00,d:0.00,c:0.00,t:0.00,p:0.00)/s OTHER (ql:0)) out: 
> 0.55m/s APT: 0.0011s/m idle: 99.94% wait: 0.00% time: 76.48s
> 
> Joseph
> 
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Installing man pages

2019-01-25 Thread Reuti


> Am 25.01.2019 um 15:24 schrieb David Triimboli :
> 
> On 1/24/2019 5:25 PM, Fred Youhanaie wrote:
>> 
>> I can see "Permission denied" errors such as this one in the trace:
>> 
>> openat(AT_FDCWD, "/opt/sge/man", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) 
>> = -1 EACCES (Permission denied)
>> 
>> I think it's worth revisiting the directory permissions, for each path 
>> component individually
>> 
>> cd /opt
>> cd /opt/sge
>> cd /opt/sge/man
> 
> 
> Everything looks correct to me:
> 
> trimboli@ubuntuclient2:~$ ls -ld /opt
> drwxr-xr-x 4 root root 4096 Jan 23 16:55 /opt
> trimboli@ubuntuclient2:~$ cd /opt
> trimboli@ubuntuclient2:/opt$ ls -ld sge
> drwxr-xr-x 14 sgeadmin sgeadmin 4096 Jan 24 13:28 sge
> trimboli@ubuntuclient2:/opt$ cd sge
> trimboli@ubuntuclient2:/opt/sge$ ls -ld man
> drwxr-xr-x 10 sgeadmin sgeadmin 4096 Jan 23 16:03 man
> trimboli@ubuntuclient2:/opt/sge$ cd man
> trimboli@ubuntuclient2:
> 
>> 
>> There are also lines like the following
>> 
>> stat("/opt/sge/bin/qhost", 0x7fff16f8c7a0) = -1 EACCES (Permission denied)
> 
> 
> That's a strange line, because qhost lives in /opt/sge/bin/lx-amd64. But the 
> following line also has a permission denied message, and it points to the 
> correct location of qhost. But of course,
> 
> trimboli@ubuntuclient2:/opt/sge$ ls -l /opt/sge/bin/lx-amd64/qhost
> -rwxr-xr-x 1 root root 1941408 Feb 28  2016 /opt/sge/bin/lx-amd64/qhost

In principle this could be masked by an ACL on the exporting machine. It's even 
possible to mount the file system on the exechost without honoring the ACL and 
wonder why one has no access (as the plain permission bits look fine). Essentially 
the exporting NFS machine will deny the access according to the ACL, despite the 
permission bits. Does the line from above contain a plus sign on the exporting 
machine, like:

-rwxr-xr-x+ 1 root root 1941408 Feb 28  2016 /opt/sge/bin/lx-amd64/qhost
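
To check that (a sketch; run the first command on the NFS server, the second on the client):

$ getfacl /opt/sge/bin/lx-amd64/qhost     # shows any extended ACL entries
$ nfsstat -m                              # shows the mount options actually in effect (e.g. noacl)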

-- Reuti


> 
>> 
>> As already explored by Reuti and yourself, it looks like NFS related in the 
>> /opt/sge components. Anything in the system log?
> 
> 
> Hmm. Possibly, but it's beyond my ability to interpret. Here are a couple of 
> interesting things I found:
> 
> audit: type=1400 audit(1548425695.819:52): apparmor="DENIED" 
> operation="sendmsg" profile="/usr/bin/man" pid=3534 comm="man" laddr=10.0.2.4 
> lport=878 faddr=10.0.2.15 fport=2049 family="inet" sock_type="stream" 
> protocol=6 requested_mask="send" denied_mask="send"
> 
> nfs: RPC call returned error 13
> 
> 
> I tried stopping the AppArmor service and running "man qhost" again, but it 
> made no difference.
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti


> Am 24.01.2019 um 20:29 schrieb David Triimboli :
> 
> On 1/24/2019 2:05 PM, Reuti wrote:
>> Do the permissions for the directories include the x flag and not only r?
>> 
>> drwxr-xr-x 2 root root 4.0K Jan 13  2010 man1
>> drwxr-xr-x 2 root root 4.0K Jan 13  2010 man3
>> drwxr-xr-x 2 root root 4.0K Jan 13  2010 man5
>> drwxr-xr-x 2 root root 4.0K Jan 13  2010 man8
> 
> 
> Yes. They are owned by sgeadmin, not root, but the directories all have 
> drwxr-xr-x.

This could be masked by the exec or noexec option of the NFS mount. But as the 
SGE applications themselves are working, I would assume it's mounted executable?
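
A quick check of the options in effect (a sketch):

$ findmnt -T /opt/sge
$ mount | grep /opt/sge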
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti

> Am 24.01.2019 um 19:28 schrieb David Triimboli :
> 
> On 1/24/2019 1:14 PM, Reuti wrote:
>>> Am 24.01.2019 um 19:10 schrieb David Triimboli :
>>> 
>>> On 1/24/2019 12:44 PM, Reuti wrote:
>>>> Hi,
>>>> 
>>>>> Am 24.01.2019 um 18:28 schrieb David Triimboli :
>>>>> 
>>>>> This is just a silly question. Using Son of Grid Engine 8.1.9, I 
>>>>> installed a master and execution host on one machine. The man pages work 
>>>>> fine. I installed just an execution host on another. The man pages aren't 
>>>>> recognized; "man qhost" says "No manual entry for qhost." My $MANPATH 
>>>>> includes $SGE_ROOT/man on both machines, and $SGE_ROOT on the execution 
>>>>> host is just $SGE_ROOT on the master host through NFS.
>>>>> 
>>>>> Why doesn't the execution host recognize the man pages? How do I get it 
>>>>> to do so?
>>>> What is the output of the command:
>>>> 
>>>> manpath
>>>> 
>>>> on both machines?
>>> 
>>> Both output:
>>> 
>>> manpath: warning: $MANPATH set, ignoring /etc/manpath.config
>>> /opt/sge/man:/usr/share/man:/usr/local/share/man
>>> 
>>> Both machines can see everything in /opt/sge/man just fine. Every user has 
>>> read permissions to the files there.
>> Does the manpage open when you specify the complete path like:
>> 
>> man /opt/sge/man/man1/qhost.1
> 
> 
> On the master host, yes. On the execution host, no. It returns:
> 
> man: /opt/sge/man/man1/qhost.1: Permission denied
> No manual entry for /opt/sge/man/man/qhost.1

Do the permissions for the directories include the x flag and not only r?

drwxr-xr-x 2 root root 4.0K Jan 13  2010 man1
drwxr-xr-x 2 root root 4.0K Jan 13  2010 man3
drwxr-xr-x 2 root root 4.0K Jan 13  2010 man5
drwxr-xr-x 2 root root 4.0K Jan 13  2010 man8


> It fails like this even if I run the command as root. I have no trouble 
> "cat"ting that file and seeing its contents on the execution host.
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti


> Am 24.01.2019 um 19:28 schrieb David Triimboli :
> 
> On 1/24/2019 1:14 PM, Reuti wrote:
>>> Am 24.01.2019 um 19:10 schrieb David Triimboli :
>>> 
>>> On 1/24/2019 12:44 PM, Reuti wrote:
>>>> Hi,
>>>> 
>>>>> Am 24.01.2019 um 18:28 schrieb David Triimboli :
>>>>> 
>>>>> This is just a silly question. Using Son of Grid Engine 8.1.9, I 
>>>>> installed a master and execution host on one machine. The man pages work 
>>>>> fine. I installed just an execution host on another. The man pages aren't 
>>>>> recognized; "man qhost" says "No manual entry for qhost." My $MANPATH 
>>>>> includes $SGE_ROOT/man on both machines, and $SGE_ROOT on the execution 
>>>>> host is just $SGE_ROOT on the master host through NFS.
>>>>> 
>>>>> Why doesn't the execution host recognize the man pages? How do I get it 
>>>>> to do so?
>>>> What is the output of the command:
>>>> 
>>>> manpath
>>>> 
>>>> on both machines?
>>> 
>>> Both output:
>>> 
>>> manpath: warning: $MANPATH set, ignoring /etc/manpath.config
>>> /opt/sge/man:/usr/share/man:/usr/local/share/man
>>> 
>>> Both machines can see everything in /opt/sge/man just fine. Every user has 
>>> read permissions to the files there.
>> Does the manpage open when you specify the complete path like:
>> 
>> man /opt/sge/man/man1/qhost.1
> 
> 
> On the master host, yes. On the execution host, no. It returns:
> 
> man: /opt/sge/man/man1/qhost.1: Permission denied

Man pages for other applications are working?


> No manual entry for /opt/sge/man/man/qhost.1
> 
> It fails like this even if I run the command as root. I have no trouble 
> "cat"ting that file and seeing its contents on the execution host.



___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti


> Am 24.01.2019 um 19:10 schrieb David Triimboli :
> 
> On 1/24/2019 12:44 PM, Reuti wrote:
>> Hi,
>> 
>>> Am 24.01.2019 um 18:28 schrieb David Triimboli :
>>> 
>>> This is just a silly question. Using Son of Grid Engine 8.1.9, I installed 
>>> a master and execution host on one machine. The man pages work fine. I 
>>> installed just an execution host on another. The man pages aren't 
>>> recognized; "man qhost" says "No manual entry for qhost." My $MANPATH 
>>> includes $SGE_ROOT/man on both machines, and $SGE_ROOT on the execution 
>>> host is just $SGE_ROOT on the master host through NFS.
>>> 
>>> Why doesn't the execution host recognize the man pages? How do I get it to 
>>> do so?
>> What is the output of the command:
>> 
>> manpath
>> 
>> on both machines?
> 
> 
> Both output:
> 
> manpath: warning: $MANPATH set, ignoring /etc/manpath.config
> /opt/sge/man:/usr/share/man:/usr/local/share/man
> 
> Both machines can see everything in /opt/sge/man just fine. Every user has 
> read permissions to the files there.

Does the manpage open when you specify the complete path like:

man /opt/sge/man/man1/qhost.1
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Installing man pages

2019-01-24 Thread Reuti
Hi,

> Am 24.01.2019 um 18:28 schrieb David Triimboli :
> 
> This is just a silly question. Using Son of Grid Engine 8.1.9, I installed a 
> master and execution host on one machine. The man pages work fine. I 
> installed just an execution host on another. The man pages aren't recognized; 
> "man qhost" says "No manual entry for qhost." My $MANPATH includes 
> $SGE_ROOT/man on both machines, and $SGE_ROOT on the execution host is just 
> $SGE_ROOT on the master host through NFS.
> 
> Why doesn't the execution host recognize the man pages? How do I get it to do 
> so?

What is the output of the command:

manpath

on both machines?

-- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-22 Thread Reuti

> Am 18.01.2019 um 18:03 schrieb Derek Stephenson 
> :
> 
> There are 32 cores on the machine and it use is split between interactive and 
> non-interactive jobs. This mix is similar on other nodes as well that we 
> don't experience this issue. The split is doen as our interactive jobs tend 
> to be memory intensive but CPU light and the non-interactive tend to be CPU 
> heavy and memory light. So there are other process running on the node that 
> are inside SGE. But only root related system processes are running outside of 
> SGE.
> 
> I did find a few processes that were left behind but cleaning those out has 
> no impact. 
> 
> The gid_range is the default:
> gid_range2-20100

This is fine. I thought that SGE might be waiting for a free GID to start the new job.

Is there anything left behind in memory, e.g. shared memory segments listed by `ipcs`, 
so that the node starts to swap?
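
A quick way to check this (just a sketch; `ipcrm` needs the IDs shown by `ipcs`):

$ ipcs -m        # shared memory segments left behind
$ ipcs -s        # semaphore arrays
$ free -m        # is swap already in use?

Segments owned by users of already finished jobs could then be removed with
`ipcrm -m <shmid>`.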

-- Reuti


> 
> Regards,
> 
> Derek
> -Original Message-
> From: Reuti  
> Sent: January 18, 2019 11:26 AM
> To: Derek Stephenson 
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Dilemma with exec node reponsiveness degrading
> 
> 
>> Am 18.01.2019 um 16:26 schrieb Derek Stephenson 
>> :
>> 
>> Hi Reuti,
>> 
>> I don't believe anyone has adjusted the scheduler from defaults but I see:
>> schedule_interval 00:00:04
>> flush_submit_sec  1
>> flush_finish_sec  1
> 
> With a schedule interval of 4 seconds I would set the flush values to zero to 
> avoid a too high load on the qmaster. But this shouldn't be related to the 
> behavior you observe. Are you running jobs with only a few seconds runtime? 
> Otherwise even a larger schedule interval would do.
> 
> 
>> For the qlogin side, I've confirmed that there is no firewall and previously 
>> a reboot alleviated all issues we were seeing for atleast some time, though 
>> the duration seems to be getting smaller... we had to reboot the server 3 
>> weeks ago for the same issue.
> 
> Was there anything else running on the node – inside or outside SGE?
> 
> Were any processes left behind by a former interactive session?
> 
> What is the value of:
> 
> $ qconf -sconf
> …
> gid_range2-20100
> 
> and how many cores are available per node?
> 
> -- Reuti
> 
> 
>> Regards,
>> 
>> Derek
>> -Original Message-
>> From: Reuti 
>> Sent: January 18, 2019 4:51 AM
>> To: Derek Stephenson 
>> Cc: users@gridengine.org
>> Subject: Re: [gridengine users] Dilemma with exec node reponsiveness 
>> degrading
>> 
>> 
>>> Am 18.01.2019 um 03:57 schrieb Derek Stephenson 
>>> :
>>> 
>>> Hello,
>>> 
>>> I should preface this with I've just recently started getting my head 
>>> around grid engine and as such may not have all the information I should 
>>> for administering the grid but someone's has to do it. Anyways...
>>> 
>>> Our company across an issue recently where a one of the nodes seems to 
>>> become very delayed in its response to grid submissions.  Whether it be a 
>>> qsub, qrsh or qlogin submission jobs can take anywhere from 30s to 4-5min 
>>> to successfully submit. In particular, while users may complain a qsub job 
>>> looks like it has submitted but do nothing, doing a qlogin to the node in 
>>> question will give the following:
>> 
>> This might at least for `qsub` jobs depend on when it was submitted inside 
>> the defined scheduling interval. What is the setting of:
>> 
>> $ qconf -ssconf
>> ...
>> schedule_interval 0:2:0
>> ...
>> flush_submit_sec  4
>> flush_finish_sec  4
>> 
>> 
>>> Your job 287104 ("QLOGIN") has been submitted waiting for interactive 
>>> job to be scheduled ...timeout (3 s) expired while waiting on socket 
>>> fd 7
>> 
>> For interactive jobs: any firewall in place, blocking the communication 
>> between the submission host and the exechost - maybe switched on at a later 
>> point in time? SGE will use a random port for the communication. After the 
>> reboot it worked instantly again?
>> 
>> -- Reuti
>> 
>> 
>>> Now I've seen  a series of forum articles bring up this message while 
>>> seaching through back logs but there never seems to be any conclusions in 
>>> those threads for me to start delving into on our end. 
>>> 
>>> Our past attempts to resolve the issue have only succeeded by rebooting the 
>>> node in question, and not having any real ideas on why is becoming a 
>>> general frustration.  
>>> 
>>> Any initial thoughts/pointers would be greatly appreciated
>>> 
>>> Kind Regards,
>>> 
>>> Derek
>>> 
>>> ___
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> 
> 
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Starting out

2019-01-21 Thread Reuti

> Am 18.01.2019 um 18:06 schrieb David Triimboli :
> 
> On 1/18/2019 11:49 AM, Reuti wrote:
>>> Am 18.01.2019 um 17:41 schrieb David Triimboli :
>>> 
>>> On 1/18/2019 11:22 AM, Reuti wrote:
>>>> Hi,
>>>> 
>>>>> Am 18.01.2019 um 17:09 schrieb David Triimboli :
>>>>> 
>>>>> Hi, all. I've got a twenty-four-node cluster running versions of CentOS 5 
>>>>> and Sun Grid Engine. This cluster desperately needs its node OSes 
>>>>> upgraded to be able to install newer software packages, a job I've been 
>>>>> tasked with. The users want to put Ubuntu on the nodes.
>>>>> 
>>>>> I've been working in virtual machines, trying to get some form of grid 
>>>>> engine to work. My understanding is that the old Sun Grid Engine simply 
>>>>> won't work in any modern Linux kernel. Ubuntu 18.04 has a bunch of Son of 
>>>>> Grid Engine packages available through apt-get, but I haven't been able 
>>>>> to get these to work — the services won't run. All instructions I have 
>>>>> found on the web seem to be old and just don't work. I'm even willing to 
>>>>> consider Univa Grid Engine — but they never responded to my request for 
>>>>> trial software.
>>>>> 
>>>>> How should I proceed? What grid engine can I install that will work on a 
>>>>> modern Ubuntu distribution? What tricks do I need to know to get it to 
>>>>> work. Can someone point me to something to get me started?
>>>> I would assume that most likely the `arch` script inside SGE isn't 
>>>> prepared for your actual kernel, i.e. a case for 4.* kernels is missing. 
>>>> What does:
>>>> 
>>>> $ $SGE_ROOT/util/arch
>>>> 
>>>> return?
>>> 
>>> If I install the packages available through apt-get, 
>>> /usr/share/gridengine/util/arch returns: lx-amd64.
>> This is fine.
>> 
>> If the startup fails, there are usually some message in file in /tmp called 
>> qmasterd.$PID or beginning with execd.$PID alike. Can you spot anything 
>> there?
> 
> 
> There are no such files in /tmp, indeed, no files called qmasterd* anywhere 
> in the filesystem.
> 
> The grid engine logging seems to happen in /var/spool/gridengine. In 
> qmaster/messages, I have:
> 
> ---BEGIN QUOTE---
> 01/18/2019 11:37:41|  main|ubuntuclient1|W|local configuration ubuntuclient1 
> not defined - using global configuration
> 01/18/2019 11:37:41|  main|ubuntuclient1|E|global configuration not defined
> 01/18/2019 11:37:41|  main|ubuntuclient1|C|setup failed
> 01/18/2019 11:38:14|  main|ubuntuclient1|W|local configuration ubuntuclient1 
> not defined - using global configuration
> 01/18/2019 11:38:14|  main|ubuntuclient1|E|global configuration not defined
> 01/18/2019 11:38:14|  main|ubuntuclient1|C|setup failed
> ---END QUOTE---

I had a brief look at the Debian package. It seems that they provide a setup 
procedure of their own, which replaces the usual one. Maybe this wasn't triggered 
here, and hence no configuration is available at all. The setup used by SoGE 
itself is handled by the two scripts install_qmaster and install_execd (somewhere 
in /var/lib/gridengine in Debian).

These two scripts should prepare the necessary setup, in case the Debian one was 
skipped.
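
A rough outline of running them by hand (the /var/lib/gridengine location is only 
my assumption about the Debian layout; the cell name "default" is the usual default):

# on the qmaster host
$ cd /var/lib/gridengine
$ ./install_qmaster

# on every execution host
$ cd /var/lib/gridengine
$ ./install_execd

Both scripts ask a few questions interactively and create the cell directory 
(e.g. default/common with settings.sh) below $SGE_ROOT.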

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Starting out

2019-01-18 Thread Reuti

> Am 18.01.2019 um 17:41 schrieb David Triimboli :
> 
> On 1/18/2019 11:22 AM, Reuti wrote:
>> Hi,
>> 
>>> Am 18.01.2019 um 17:09 schrieb David Triimboli :
>>> 
>>> Hi, all. I've got a twenty-four-node cluster running versions of CentOS 5 
>>> and Sun Grid Engine. This cluster desperately needs its node OSes upgraded 
>>> to be able to install newer software packages, a job I've been tasked with. 
>>> The users want to put Ubuntu on the nodes.
>>> 
>>> I've been working in virtual machines, trying to get some form of grid 
>>> engine to work. My understanding is that the old Sun Grid Engine simply 
>>> won't work in any modern Linux kernel. Ubuntu 18.04 has a bunch of Son of 
>>> Grid Engine packages available through apt-get, but I haven't been able to 
>>> get these to work — the services won't run. All instructions I have found 
>>> on the web seem to be old and just don't work. I'm even willing to consider 
>>> Univa Grid Engine — but they never responded to my request for trial 
>>> software.
>>> 
>>> How should I proceed? What grid engine can I install that will work on a 
>>> modern Ubuntu distribution? What tricks do I need to know to get it to 
>>> work. Can someone point me to something to get me started?
>> I would assume that most likely the `arch` script inside SGE isn't prepared 
>> for your actual kernel, i.e. a case for 4.* kernels is missing. What does:
>> 
>> $ $SGE_ROOT/util/arch
>> 
>> return?
> 
> 
> If I install the packages available through apt-get, 
> /usr/share/gridengine/util/arch returns: lx-amd64.

This is fine.

If the startup fails, there are usually some messages in a file in /tmp called 
qmasterd.$PID or one beginning with execd.$PID alike. Can you spot anything there?
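
A quick look could be (file names as mentioned above; the spool path is an 
assumption and may differ in the Debian packaging):

$ ls -ltr /tmp/qmasterd.* /tmp/execd.* 2>/dev/null
$ tail -n 20 /var/spool/gridengine/qmaster/messages

resp. the messages file in the execd spool directory on the execution host.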

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-18 Thread Reuti

> Am 18.01.2019 um 16:26 schrieb Derek Stephenson 
> :
> 
> Hi Reuti,
> 
> I don't believe anyone has adjusted the scheduler from defaults but I see:
> schedule_interval 00:00:04
> flush_submit_sec  1
> flush_finish_sec  1

With a schedule interval of 4 seconds I would set the flush values to zero to 
avoid too high a load on the qmaster. But this shouldn't be related to the 
behavior you observe. Are you running jobs with only a few seconds of runtime? 
Otherwise even a larger schedule interval would do.
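
As a sketch, the change could look like this (keeping your 4 second interval; the 
exact values are of course up to you):

$ qconf -msconf
...
schedule_interval                 0:0:4
flush_submit_sec                  0
flush_finish_sec                  0
...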


> For the qlogin side, I've confirmed that there is no firewall and previously 
> a reboot alleviated all issues we were seeing for atleast some time, though 
> the duration seems to be getting smaller... we had to reboot the server 3 
> weeks ago for the same issue.

Was there anything else running on the node – inside or outside SGE?

Were any processes left behind by a former interactive session?

What is the value of:

$ qconf -sconf
…
gid_range                         2-20100

and how many cores are available per node?

-- Reuti


> Regards,
> 
> Derek
> -Original Message-
> From: Reuti  
> Sent: January 18, 2019 4:51 AM
> To: Derek Stephenson 
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Dilemma with exec node reponsiveness degrading
> 
> 
>> Am 18.01.2019 um 03:57 schrieb Derek Stephenson 
>> :
>> 
>> Hello,
>> 
>> I should preface this with I've just recently started getting my head around 
>> grid engine and as such may not have all the information I should for 
>> administering the grid but someone's has to do it. Anyways...
>> 
>> Our company across an issue recently where a one of the nodes seems to 
>> become very delayed in its response to grid submissions.  Whether it be a 
>> qsub, qrsh or qlogin submission jobs can take anywhere from 30s to 4-5min to 
>> successfully submit. In particular, while users may complain a qsub job 
>> looks like it has submitted but do nothing, doing a qlogin to the node in 
>> question will give the following:
> 
> This might at least for `qsub` jobs depend on when it was submitted inside 
> the defined scheduling interval. What is the setting of:
> 
> $ qconf -ssconf
> ...
> schedule_interval 0:2:0
> ...
> flush_submit_sec  4
> flush_finish_sec  4
> 
> 
>> Your job 287104 ("QLOGIN") has been submitted waiting for interactive 
>> job to be scheduled ...timeout (3 s) expired while waiting on socket 
>> fd 7
> 
> For interactive jobs: any firewall in place, blocking the communication 
> between the submission host and the exechost - maybe switched on at a later 
> point in time? SGE will use a random port for the communication. After the 
> reboot it worked instantly again?
> 
> -- Reuti
> 
> 
>> Now I've seen  a series of forum articles bring up this message while 
>> seaching through back logs but there never seems to be any conclusions in 
>> those threads for me to start delving into on our end. 
>> 
>> Our past attempts to resolve the issue have only succeeded by rebooting the 
>> node in question, and not having any real ideas on why is becoming a general 
>> frustration.  
>> 
>> Any initial thoughts/pointers would be greatly appreciated
>> 
>> Kind Regards,
>> 
>> Derek
>> 
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
> 
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Starting out

2019-01-18 Thread Reuti
Hi,

> Am 18.01.2019 um 17:09 schrieb David Triimboli :
> 
> Hi, all. I've got a twenty-four-node cluster running versions of CentOS 5 and 
> Sun Grid Engine. This cluster desperately needs its node OSes upgraded to be 
> able to install newer software packages, a job I've been tasked with. The 
> users want to put Ubuntu on the nodes.
> 
> I've been working in virtual machines, trying to get some form of grid engine 
> to work. My understanding is that the old Sun Grid Engine simply won't work 
> in any modern Linux kernel. Ubuntu 18.04 has a bunch of Son of Grid Engine 
> packages available through apt-get, but I haven't been able to get these to 
> work — the services won't run. All instructions I have found on the web seem 
> to be old and just don't work. I'm even willing to consider Univa Grid Engine 
> — but they never responded to my request for trial software.
> 
> How should I proceed? What grid engine can I install that will work on a 
> modern Ubuntu distribution? What tricks do I need to know to get it to work. 
> Can someone point me to something to get me started?

I would assume that most likely the `arch` script inside SGE isn't prepared for 
your actual kernel, i.e. a case for 4.* kernels is missing. What does:

$ $SGE_ROOT/util/arch

return?

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-18 Thread Reuti


> Am 18.01.2019 um 03:57 schrieb Derek Stephenson 
> :
> 
> Hello,
> 
> I should preface this with I've just recently started getting my head around 
> grid engine and as such may not have all the information I should for 
> administering the grid but someone's has to do it. Anyways...
> 
> Our company across an issue recently where a one of the nodes seems to become 
> very delayed in its response to grid submissions.  Whether it be a qsub, qrsh 
> or qlogin submission jobs can take anywhere from 30s to 4-5min to 
> successfully submit. In particular, while users may complain a qsub job looks 
> like it has submitted but do nothing, doing a qlogin to the node in question 
> will give the following:

This might, at least for `qsub` jobs, depend on when it was submitted inside the 
defined scheduling interval. What is the setting of:

$ qconf -ssconf
…
schedule_interval 0:2:0
…
flush_submit_sec  4
flush_finish_sec  4


> Your job 287104 ("QLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired while 
> waiting on socket fd 7

For interactive jobs: any firewall in place, blocking the communication between 
the submission host and the exechost – maybe switched on at a later point in 
time? SGE will use a random port for the communication. After the reboot it 
worked instantly again?
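
To rule out a basic network problem, one could test an arbitrary port in both 
directions (a sketch; host name and port are placeholders, and depending on the 
netcat flavor `nc -l -p 5050` may be needed):

# on the execution host
$ nc -l 5050

# on the submission host
$ echo hello | nc exec-node 5050

and then the same the other way round, as the random port may be opened on either side.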

-- Reuti


> Now I've seen  a series of forum articles bring up this message while 
> seaching through back logs but there never seems to be any conclusions in 
> those threads for me to start delving into on our end. 
> 
> Our past attempts to resolve the issue have only succeeded by rebooting the 
> node in question, and not having any real ideas on why is becoming a general 
> frustration.  
> 
> Any initial thoughts/pointers would be greatly appreciated
> 
> Kind Regards,
> 
> Derek
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-11 Thread Reuti

> Am 11.01.2019 um 00:30 schrieb Derrick Lin :
> 
> Hi Reuti
> 
> Thanks for the input. But how does this help on troubleshooting the prolog 
> script?

You asked for the meaning of the "-i" option, and I tried to outline its 
behavior.

-- Reuti


> I will also troubleshooting the prolog script line by line and see which line 
> is causing the problem.
> 
> Cheers,
> Derrick
> 
> On Thu, Jan 10, 2019 at 7:42 PM Reuti  wrote:
> Hi,
> 
> Am 09.01.2019 um 23:35 schrieb Derrick Lin:
> 
> > Hi Reuti,
> > 
> > I have to say I am still not familiar with the "-i" in qsub after reading 
> > the man page, what does it do?
> 
> It will be feed as stdin to the jobscript. Hence:
> 
> $ qsub -i myfile foo.sh
> 
> is like:
> 
> $ foo.sh < myfile
> 
> but in batch.
> 
> -- Reuti
> 
> 
> > There is no useful/interesting output in qmaster message or exec node 
> > message log. The only information I could find is from job's trace file:
> > 
> > [root@zeta-4-12 381.1]# ls
> > config  environment  error  exit_status  pe_hostfile  pid  trace
> > [root@zeta-4-12 381.1]# cat trace
> > 01/10/2019 09:12:07 [997:307578]: shepherd called with uid = 0, euid = 997
> > 01/10/2019 09:12:07 [997:307578]: qlogin_daemon = builtin
> > 01/10/2019 09:12:07 [997:307578]: starting up 8.1.9
> > 01/10/2019 09:12:07 [997:307578]: setpgid(307578, 307578) returned 0
> > 01/10/2019 09:12:07 [997:307578]: do_core_binding: "binding" parameter not 
> > found in config file
> > 01/10/2019 09:12:07 [997:307578]: calling fork_pty()
> > 01/10/2019 09:12:07 [997:307578]: parent: forked "prolog" with pid 307579
> > 01/10/2019 09:12:07 [997:307578]: using signal delivery delay of 120 seconds
> > 01/10/2019 09:12:07 [997:307578]: parent: prolog-pid: 307579
> > 01/10/2019 09:12:07 [997:307579]: child: starting son(prolog, 
> > root@/opt/gridengine/default/common/prolog_exec.sh, 0, 1);
> > 01/10/2019 09:12:07 [997:307579]: pid=307579 pgrp=307579 sid=307579 old 
> > pgrp=307579 getlogin()=
> > 01/10/2019 09:12:07 [997:307579]: reading passwd information for user 'root'
> > 01/10/2019 09:12:07 [997:307579]: setting limits
> > 01/10/2019 09:12:07 [997:307579]: setting environment
> > 01/10/2019 09:12:07 [997:307579]: Initializing error file
> > 01/10/2019 09:12:07 [997:307579]: switching to intermediate/target user
> > 01/10/2019 09:12:07 [997:307579]: setting additional gid=0
> > 01/10/2019 09:12:07 [6782:307579]: closing all filedescriptors
> > 01/10/2019 09:12:07 [6782:307579]: further messages are in "error" and 
> > "trace"
> > 01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up). 
> > Unregister the FD.
> > 01/10/2019 09:12:07 [6782:307579]: using "/bin/bash" as shell of user "root"
> > 01/10/2019 09:12:07 [0:307579]: now running with uid=0, euid=0
> > 01/10/2019 09:12:07 [0:307579]: 
> > execvlp(/opt/gridengine/default/common/prolog_exec.sh, 
> > "/opt/gridengine/default/common/prolog_exec.sh")
> > ### The process just stuck in the line above
> > 
> > Here is the trace file for a qsub/batch job, apparently the prolog script 
> > got executed and the process proceeded:
> > 
> > [root@zeta-4-12 383.1]# ls
> > addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  
> > pid  trace
> > [root@zeta-4-12 383.1]# cat trace
> > 01/10/2019 09:20:22 [997:315329]: shepherd called with uid = 0, euid = 997
> > 01/10/2019 09:20:22 [997:315329]: starting up 8.1.9
> > 01/10/2019 09:20:22 [997:315329]: setpgid(315329, 315329) returned 0
> > 01/10/2019 09:20:22 [997:315329]: do_core_binding: "binding" parameter not 
> > found in config file
> > 01/10/2019 09:20:22 [997:315329]: parent: forked "prolog" with pid 315330
> > 01/10/2019 09:20:22 [997:315329]: using signal delivery delay of 120 seconds
> > 01/10/2019 09:20:22 [997:315329]: parent: prolog-pid: 315330
> > 01/10/2019 09:20:22 [997:315330]: child: starting son(prolog, 
> > root@/opt/gridengine/default/common/prolog_exec.sh, 0, 1);
> > 01/10/2019 09:20:22 [997:315330]: pid=315330 pgrp=315330 sid=315330 old 
> > pgrp=315329 getlogin()=
> > 01/10/2019 09:20:22 [997:315330]: reading passwd information for user 'root'
> > 01/10/2019 09:20:22 [997:315330]: setting limits
> > 01/10/2019 09:20:22 [997:315330]: setting environment
> > 01/10/2019 09:20:22 [997:315330]: Initializing error file
> > 01/10/2019 09:20:22 [997:315330]: switching to intermediate/target user
>

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-10 Thread Reuti

> Am 09.01.2019 um 23:39 schrieb Derrick Lin :
> 
> Hi Reuti and Iyad,
> 
> Here is my prolog script, it just does one thing, setting quota on the XFS 
> volume for each job:
> 
> The prolog_exec_xx_xx.log file was generated, so I assumed the first exec 
> command got executed. 
> 
> Since the generated log file is empty, I think nothing was executed after 
> that.
> 
> Cheers
> 
> [root@zeta-4-12 common]# cat prolog_exec.sh
> #!/bin/sh

Are the shells the same, i.e. the same version? Maybe you can also use the full 
path /bin/bash here, as /bin/sh will switch on some kind of compatibility mode to 
mimic the original sh in case bash is invoked by this name.

-- Reuti

> 
> exec >> /tmp/prolog_exec_"$JOB_ID"_"$SGE_TASK_ID".log
> exec 2>&1
> 
> SGE_TMP_ROOT="/scratch_local"
> 
> pe_num=$(cat $PE_HOSTFILE | grep $HOSTNAME | awk '{print $2}')
> 
> tmp_req_var=$(echo "$tmp_requested" | grep -o -E '[0-9]+')
> tmp_req_unit=$(echo "$tmp_requested" | sed 's/[0-9]*//g')
> 
> if [ -z "$pe_num" ]; then
> quota=$tmp_requested
> else
> quota=$(expr $tmp_req_var \* $pe_num)$tmp_req_unit
> fi
> 
> echo "# [$HOSTNAME PROLOG] - JOB_ID:$JOB_ID 
> TASK_ID:$SGE_TASK_ID #"
> echo "`date` [$HOSTNAME PROLOG]: xfs_quota -x -c 'project -s -p $TMP $JOB_ID' 
> $SGE_TMP_ROOT"
> echo "`date` [$HOSTNAME PROLOG]: xfs_quota -x -c 'limit -p bhard=$quota 
> $JOB_ID' $SGE_TMP_ROOT"
> 
> xfs_quota_rc=0
> 
> /usr/sbin/xfs_quota -x -c "project -s -p $TMP $JOB_ID" $SGE_TMP_ROOT
> ((xfs_quota_rc+=$?))
> 
> /usr/sbin/xfs_quota -x -c "limit -p bhard=$quota $JOB_ID" $SGE_TMP_ROOT
> ((xfs_quota_rc+=$?))
> 
> if [ $xfs_quota_rc -eq 0 ]; then
> exit 0
> else
> exit 100 # Put job in error state
> fi
> 
> 
> On Wed, Jan 9, 2019 at 7:36 PM Reuti  wrote:
> Hi,
> 
> > Am 09.01.2019 um 01:14 schrieb Derrick Lin :
> > 
> > Hi guys,
> > 
> > I just brought up a new SGE cluster, but somehow the qrsh session does not 
> > work:
> > 
> > tester@login-gpu:~$ qrsh
> > ^Cerror: error while waiting for builtin IJS connection: "got select 
> > timeout"
> > 
> > after I hit entered, the session just stuck there forever instead of bring 
> > me to a compute node. I have to entered Crtl+c to terminate and it gave the 
> > above error.
> > 
> > I noticed, the SGE did send my qrsh request to a compute node as I could 
> > tell from qstat:
> > 
> > -
> > short.q@zeta-4-15.localBIP   0/1/80 0.01 lx-amd64
> >  15 0.55500 QRLOGINtester   r01/09/2019 10:47:13 1
> > 
> > We have a prolog script configured globally, the script deals with local 
> > disk quota and keep all output to a log file for each job. So I went to 
> > that compute node, and check, found that a log file was created but it was 
> > empty. 
> > 
> > So my thinking so far is, my qrsh stuck because the prolog script is not 
> > fully executed.
> 
> Is there any statement in the prolog, which could wait for stdin – and in a 
> batch job there is just no stdin, hence it continues? Could be tested with 
> "-i" to a batch job.
> 
> -- Reuti
> 
> 
> > qsub job are working fine.
> > 
> > Any idea will be appreciated 
> > 
> > Cheers,
> > Derrick
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-10 Thread Reuti
Hi,

Am 09.01.2019 um 23:35 schrieb Derrick Lin:

> Hi Reuti,
> 
> I have to say I am still not familiar with the "-i" in qsub after reading the 
> man page, what does it do?

It will be fed as stdin to the jobscript. Hence:

$ qsub -i myfile foo.sh

is like:

$ foo.sh < myfile

but in batch.

-- Reuti


> There is no useful/interesting output in qmaster message or exec node message 
> log. The only information I could find is from job's trace file:
> 
> [root@zeta-4-12 381.1]# ls
> config  environment  error  exit_status  pe_hostfile  pid  trace
> [root@zeta-4-12 381.1]# cat trace
> 01/10/2019 09:12:07 [997:307578]: shepherd called with uid = 0, euid = 997
> 01/10/2019 09:12:07 [997:307578]: qlogin_daemon = builtin
> 01/10/2019 09:12:07 [997:307578]: starting up 8.1.9
> 01/10/2019 09:12:07 [997:307578]: setpgid(307578, 307578) returned 0
> 01/10/2019 09:12:07 [997:307578]: do_core_binding: "binding" parameter not 
> found in config file
> 01/10/2019 09:12:07 [997:307578]: calling fork_pty()
> 01/10/2019 09:12:07 [997:307578]: parent: forked "prolog" with pid 307579
> 01/10/2019 09:12:07 [997:307578]: using signal delivery delay of 120 seconds
> 01/10/2019 09:12:07 [997:307578]: parent: prolog-pid: 307579
> 01/10/2019 09:12:07 [997:307579]: child: starting son(prolog, 
> root@/opt/gridengine/default/common/prolog_exec.sh, 0, 1);
> 01/10/2019 09:12:07 [997:307579]: pid=307579 pgrp=307579 sid=307579 old 
> pgrp=307579 getlogin()=
> 01/10/2019 09:12:07 [997:307579]: reading passwd information for user 'root'
> 01/10/2019 09:12:07 [997:307579]: setting limits
> 01/10/2019 09:12:07 [997:307579]: setting environment
> 01/10/2019 09:12:07 [997:307579]: Initializing error file
> 01/10/2019 09:12:07 [997:307579]: switching to intermediate/target user
> 01/10/2019 09:12:07 [997:307579]: setting additional gid=0
> 01/10/2019 09:12:07 [6782:307579]: closing all filedescriptors
> 01/10/2019 09:12:07 [6782:307579]: further messages are in "error" and "trace"
> 01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up). Unregister 
> the FD.
> 01/10/2019 09:12:07 [6782:307579]: using "/bin/bash" as shell of user "root"
> 01/10/2019 09:12:07 [0:307579]: now running with uid=0, euid=0
> 01/10/2019 09:12:07 [0:307579]: 
> execvlp(/opt/gridengine/default/common/prolog_exec.sh, 
> "/opt/gridengine/default/common/prolog_exec.sh")
> ### The process just stuck in the line above
> 
> Here is the trace file for a qsub/batch job, apparently the prolog script got 
> executed and the process proceeded:
> 
> [root@zeta-4-12 383.1]# ls
> addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  
> trace
> [root@zeta-4-12 383.1]# cat trace
> 01/10/2019 09:20:22 [997:315329]: shepherd called with uid = 0, euid = 997
> 01/10/2019 09:20:22 [997:315329]: starting up 8.1.9
> 01/10/2019 09:20:22 [997:315329]: setpgid(315329, 315329) returned 0
> 01/10/2019 09:20:22 [997:315329]: do_core_binding: "binding" parameter not 
> found in config file
> 01/10/2019 09:20:22 [997:315329]: parent: forked "prolog" with pid 315330
> 01/10/2019 09:20:22 [997:315329]: using signal delivery delay of 120 seconds
> 01/10/2019 09:20:22 [997:315329]: parent: prolog-pid: 315330
> 01/10/2019 09:20:22 [997:315330]: child: starting son(prolog, 
> root@/opt/gridengine/default/common/prolog_exec.sh, 0, 1);
> 01/10/2019 09:20:22 [997:315330]: pid=315330 pgrp=315330 sid=315330 old 
> pgrp=315329 getlogin()=
> 01/10/2019 09:20:22 [997:315330]: reading passwd information for user 'root'
> 01/10/2019 09:20:22 [997:315330]: setting limits
> 01/10/2019 09:20:22 [997:315330]: setting environment
> 01/10/2019 09:20:22 [997:315330]: Initializing error file
> 01/10/2019 09:20:22 [997:315330]: switching to intermediate/target user
> 01/10/2019 09:20:22 [997:315330]: setting additional gid=0
> 01/10/2019 09:20:22 [6782:315330]: closing all filedescriptors
> 01/10/2019 09:20:22 [6782:315330]: further messages are in "error" and "trace"
> 01/10/2019 09:20:22 [6782:315330]: using "/bin/bash" as shell of user "root"
> 01/10/2019 09:20:22 [6782:315330]: using stdout as stderr
> 01/10/2019 09:20:22 [0:315330]: now running with uid=0, euid=0
> 01/10/2019 09:20:22 [0:315330]: 
> execvlp(/opt/gridengine/default/common/prolog_exec.sh, 
> "/opt/gridengine/default/common/prolog_exec.sh")
> 01/10/2019 09:20:22 [997:315329]: wait3 returned 315330 (status: 0; 
> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 01/10/2019 09:20:22 [997:315329]: prolog exited with exit status 0
> 01/10/2019 09:20:22 [997:315329]: reaped "prolog" with pid 315330

Re: [gridengine users] qrsh session failed to execute prolog script?

2019-01-09 Thread Reuti
Hi,

> Am 09.01.2019 um 01:14 schrieb Derrick Lin :
> 
> Hi guys,
> 
> I just brought up a new SGE cluster, but somehow the qrsh session does not 
> work:
> 
> tester@login-gpu:~$ qrsh
> ^Cerror: error while waiting for builtin IJS connection: "got select timeout"
> 
> after I hit entered, the session just stuck there forever instead of bring me 
> to a compute node. I have to entered Crtl+c to terminate and it gave the 
> above error.
> 
> I noticed, the SGE did send my qrsh request to a compute node as I could tell 
> from qstat:
> 
> -
> short.q@zeta-4-15.localBIP   0/1/80 0.01 lx-amd64
>  15 0.55500 QRLOGINtester   r01/09/2019 10:47:13 1
> 
> We have a prolog script configured globally, the script deals with local disk 
> quota and keep all output to a log file for each job. So I went to that 
> compute node, and check, found that a log file was created but it was empty. 
> 
> So my thinking so far is, my qrsh stuck because the prolog script is not 
> fully executed.

Is there any statement in the prolog which could wait for stdin? In a batch job 
there is just no stdin, hence it continues. This could be tested by feeding stdin 
with "-i" to a batch job.
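
A minimal test could be (job script and input file are just placeholders):

$ qsub -i input.txt test.sh

This feeds input.txt as stdin to the batch job, which may help to check whether 
something in the prolog tries to read from stdin.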

-- Reuti


> qsub job are working fine.
> 
> Any idea will be appreciated 
> 
> Cheers,
> Derrick
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] batch array jobs are executed on interactive queue

2019-01-08 Thread Reuti
Hi,

> Am 08.01.2019 um 20:54 schrieb Kandalaft, Iyad (AAFC/AAC) 
> :
> 
> Hi all,
>  
> A problem popped up on a Rocks 7 HPC deployment where batch array jobs are 
> being executed on our interactive queue (interactive.q) as well as our batch 
> queue (all.q).
> This is odd behaviour since our the configuration for the interactive q is 
> set to “qtype INTERACTIVE” and the batch qeueue is “qtype 
> BATCH”.  Generally, a qlogin sessions only gets assigned to an 
> interactive.q slots and qsub jobs get assigned to all.q.  Where should I 
> start looking for information on this?

qtype INTERACTIVE

refers more to the "immediate" behavior of a job (its -now setting) than to true interactivity. Hence:

`qsub -now y …` will go to the interactive.q

`qlogin -now n …` will go to the batch.q

Also the assigned parallel environment will allow a batch job to run in an 
interactive.q; maybe you can remove the PE smp there, unless you want to use it 
interactively too.
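
As a sketch of both points (names as in your configuration):

$ qsub -now y job.sh        # batch job with "immediate" scheduling, may land in interactive.q
$ qlogin -now n             # interactive session without "immediate" scheduling, may land in all.q

$ qconf -mq interactive.q   # then remove "smp" from the pe_list line, if not needed there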

-- Reuti


> $ qconf -sp smp
> pe_namesmp
> slots  999
> user_lists NONE
> xuser_listsNONE
> start_proc_argsNONE
> stop_proc_args NONE
> allocation_rule$pe_slots
> control_slaves TRUE
> job_is_first_task  TRUE
> urgency_slots  min
> accounting_summary TRUE
> qsort_args NONE
>  
> $ qconf -sq all.q
> qname all.q
> hostlist  @mnbat @lnbat
> seq_no0
> load_thresholds   np_load_avg=1
> suspend_thresholdsNONE
> nsuspend  1
> suspend_interval  00:05:00
> priority  0
> min_cpu_interval  00:05:00
> processorsUNDEFINED
> qtype BATCH
> ckpt_list NONE
> pe_list   make smp mpi orte
> rerun TRUE
> slots 0,[@mnbat=80],[@lnbat=128]
> tmpdir/scratch
> shell /bin/bash
> prologNONE
> epilogNONE
> shell_start_mode  posix_compliant
> starter_methodNONE
> suspend_methodNONE
> resume_method NONE
> terminate_method  NONE
> notify00:00:60
> owner_listNONE
> user_listsNONE
> xuser_lists   NONE
> subordinate_list  NONE
> complex_valuesNONE
> projects  NONE
> xprojects NONE
> calendar  NONE
> initial_state default
> s_rt  INFINITY
> h_rt  5256800
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize   INFINITY
> h_fsize   INFINITY
> s_dataINFINITY
> h_dataINFINITY
> s_stack   INFINITY
> h_stack   INFINITY
> s_coreINFINITY
> h_coreINFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmemINFINITY
> h_vmemINFINITY
>  
> $ qconf -sq interactive.q
> qname interactive.q
> hostlist  @mnint @lnint
> seq_no0
> load_thresholds   np_load_avg=1
> suspend_thresholdsNONE
> nsuspend  1
> suspend_interval  00:05:00
> priority  0
> min_cpu_interval  00:05:00
> processorsUNDEFINED
> qtype INTERACTIVE
> ckpt_list NONE
> pe_list   make smp
> rerun FALSE
> slots 0,[@mnint=80],[@lnint=128]
> tmpdir/scratch
> shell /bin/bash
> prologNONE
> epilogNONE
> shell_start_mode  posix_compliant
> starter_methodNONE
> suspend_methodNONE
> resume_method NONE
> terminate_method  NONE
> notify00:00:60
> owner_listNONE
> user_listsNONE
> xuser_lists   NONE
> subordinate_list  NONE
> complex_valuesNONE
> projects  NONE
> xprojects NONE
> calendar  NONE
> initial_state default
> s_rt  INFINITY
> h_rt  604800
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize   INFINITY
> h_fsize   INFINITY
> s_dataINFINITY
> h_dataINFINITY
> s_stack   INFINITY
> h_stack   INFINITY
> s_coreINFINITY
> h_coreINFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmemINFINITY
> h_vmemINFINITY
>  
> Thank you for your assistance,
>  
> Iyad K
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Fwd: Request: JOB_ID, QUEUE, etc. variables in a QLOGIN session

2018-12-12 Thread Reuti
Hi,

> Am 12.12.2018 um 17:50 schrieb Gowtham :
> 
> Greetings.
> 
> I am wondering if there's a way to access JOB_ID, QUEUE and other such SGE 
> variables from within a QLOGIN session. For example, 
>   • I type 'qlogin' and gain access to one of the compute nodes.
>   • Running 'qstat -u ${USER}' lists this QLOGIN session with a 'job-ID'
>   • The command, echo ${JOB_ID}, returns blank from within that QLOGIN 
> session instead of showing the number displayed in #2. 
> Please let me know if there's a way to achieve this.

The shell you get performed a fresh startup and does not know anything about the 
environment variables formerly set by the sge_execd.

I have the snippet below; please put it in your ~/.bash_profile resp. ~/.profile, 
whichever you prefer and use. The number of MYPARENT assignments depends on the 
method the session was started with: rsh, ssh or built-in. IIRC the last 
`if [ -n "$MYJOBID" ];` section had only the purpose of displaying a message which 
was set with "-ac" during submission, and might not be necessary here.

-- Reuti

MYPARENT=`ps -p $$ -o ppid --no-header`
#MYPARENT=`ps -p $MYPARENT -o ppid --no-header`
#MYPARENT=`ps -p $MYPARENT -o ppid --no-header`
MYSTARTUP=`ps -p $MYPARENT -o command --no-header`

if [ "${MYSTARTUP:0:13}" = "sge_shepherd-" ]; then
   echo "Running inside SGE" 
   MYJOBID=${MYSTARTUP:13}
   MYJOBID=${MYJOBID% -bg}
   echo "Job $MYJOBID"

   while read LINE; do export $LINE; done < /var/spool/sge/${HOSTNAME%%.*}/active_jobs/$MYJOBID.1/environment
   unset HISTFILE

   if [ -n "$MYJOBID" ]; then
      . /usr/sge/default/common/settings.sh
      qstat -j $MYJOBID | sed -n -e "/^context/s/^context: *//p" | tr "," "\n" | sed -n -e "s/^MESSAGE=//p"
   fi
fi

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] TMPDIR is missing from prolog script (CentOS 7 SGE 8.1.9)

2018-12-08 Thread Reuti


Am 07.12.2018 um 21:33 schrieb Derrick Lin:

> Reuti,
> 
> My further tests confirm that $TMP is set inside PROLOG, $TMPDIR is not.
> 
> Both $TMPDIR and $TMP are set in job's environment.
> 
> So technically my problem is solved by switching to $TMP.
> 
> But I am still wondering if this is an issue for SGE 8.1.9

Maybe it was changed in this procedure:

http://gridengine.org/pipermail/users/2013-April/005981.html
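
If the prolog should keep working on both the old OGS/GE and the new 8.1.9 
cluster, a small fallback is possible (just a sketch; SCRATCH is a name I made up):

SCRATCH=${TMPDIR:-$TMP}    # 8.1.9 provides $TMP in the prolog, the old setup also provided $TMPDIR

and then use $SCRATCH wherever the script referred to $TMPDIR before.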

-- Reuti


> 
> Cheers,
> Derrick
> 
> On Sat, Dec 8, 2018 at 3:49 AM Reuti  wrote:
> 
> > Am 06.12.2018 um 23:52 schrieb Derrick Lin :
> > 
> > Hi all,
> > 
> > We are switching to a cluster of CentOS7 with SGE 8.1.9 installed.
> > 
> > We have a prolog script that does XFS disk space allocation according to 
> > TMPDIR.
> > 
> > However, the prolog script does not receive TMPDIR which should be created 
> > by the scheduler.
> 
> Is $TMP set?
> 
> -- Reuti
> 
> 
> > 
> > Other variables such as JOB_ID, PE_HOSTFILE are available though.
> > 
> > We have been using the same script on the CentOS6 cluster with OGS/GE 
> > 2011.11p1 without an issue.
> > 
> > Thanks in advance.
> > 
> > Cheers,
> > Derrick
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] TMPDIR is missing from prolog script (CentOS 7 SGE 8.1.9)

2018-12-07 Thread Reuti


> Am 06.12.2018 um 23:52 schrieb Derrick Lin :
> 
> Hi all,
> 
> We are switching to a cluster of CentOS7 with SGE 8.1.9 installed.
> 
> We have a prolog script that does XFS disk space allocation according to 
> TMPDIR.
> 
> However, the prolog script does not receive TMPDIR which should be created by 
> the scheduler.

Is $TMP set?

-- Reuti


> 
> Other variables such as JOB_ID, PE_HOSTFILE are available though.
> 
> We have been using the same script on the CentOS6 cluster with OGS/GE 
> 2011.11p1 without an issue.
> 
> Thanks in advance.
> 
> Cheers,
> Derrick
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] $TMPDIR With MPI Jobs

2018-12-06 Thread Reuti
I found my entry about this:

https://arc.liv.ac.uk/trac/SGE/ticket/570

-- Reuti


> Am 06.12.2018 um 19:03 schrieb Reuti :
> 
> Hi,
> 
>> Am 06.12.2018 um 18:36 schrieb Dan Whitehouse :
>> 
>> Hi,
>> I've been running some MPI jobs and I expected that when the job started
>> a $TMPDIR would be created on all of the nodes, however with our (UGE)
>> configuration that does not appear to be the case.
>> 
>> It appears that while on the "master" node a $TMPDIR is created and
>> persists for the duration of the job, for "slave" execution hosts, the
>> directory is only created when MPI processes run and is immediately
>> reaped when they exit. Is there a way to change this behaviour such that
>> the directory persists for the entire duration of the job?
> 
> Your observations are correct. I saw a need for it some time ago: 
> https://arc.liv.ac.uk/trac/SGE/ticket/1290
> 
> One can create persistent scratch directories e.g. in a job prolog (just make 
> the list of nodes unique and issue `qrsh -inherit ...` for each nodes `mkdir 
> $TMPDIR-persistent` Curley braces are optional here, as the dash can't be a 
> character in an environment variable).
> 
> There is one pitfall: in case of a job abort one can't issue `qrsh -inherit 
> ...` in the epilog any longer to remove all the directories on the nodes in 
> turn – the job was already canceld. My solution was to submit a "cleaner.sh" 
> in the prolog too – one for each node (hence they run serial) and get the 
> name of the directory they should remove as argument after the script name 
> (this is known in the prolog). The job were supposed to run in a dedicated 
> cleaner.q only with no limits regarding slots (hence they started as soon as 
> they were eligible tun start), but got a job hold on the actual job which 
> submitted them to wait until it finished.
> 
> -- Reuti
> 
> 
>> 
>> --
>> Dan Whitehouse
>> Research Systems Administrator, IT Services
>> Queen Mary University of London
>> Mile End
>> E1 4NS
>> 
>> ___
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
> 



signature.asc
Description: Message signed with OpenPGP
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] $TMPDIR With MPI Jobs

2018-12-06 Thread Reuti
Hi,

> Am 06.12.2018 um 18:36 schrieb Dan Whitehouse :
> 
> Hi,
> I've been running some MPI jobs and I expected that when the job started 
> a $TMPDIR would be created on all of the nodes, however with our (UGE) 
> configuration that does not appear to be the case.
> 
> It appears that while on the "master" node a $TMPDIR is created and 
> persists for the duration of the job, for "slave" execution hosts, the 
> directory is only created when MPI processes run and is immediately 
> reaped when they exit. Is there a way to change this behaviour such that 
> the directory persists for the entire duration of the job?

Your observations are correct. I saw a need for it some time ago: 
https://arc.liv.ac.uk/trac/SGE/ticket/1290

One can create persistent scratch directories e.g. in a job prolog (just make the 
list of nodes unique and issue `qrsh -inherit ... mkdir $TMPDIR-persistent` for 
each node; curly braces are optional here, as the dash can't be part of an 
environment variable name).

There is one pitfall: in case of a job abort one can't issue `qrsh -inherit ...` 
in the epilog any longer to remove all the directories on the nodes in turn – the 
job was already cancelled. My solution was to submit a "cleaner.sh" from the prolog 
too – one for each node (hence they run serially), which gets the name of the 
directory it should remove as an argument after the script name (this is known in 
the prolog). These jobs were supposed to run in a dedicated cleaner.q with no 
limits regarding slots (hence they started as soon as they were eligible to 
start), but got a job hold on the actual job which submitted them, to wait until 
it finished.
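
A rough sketch of the prolog part (only a sketch, error handling omitted; 
cleaner.sh and cleaner.q are the helper script and queue mentioned above, and the 
node must be allowed to submit jobs):

# prolog: create a persistent scratch directory on every node of the job
# and submit one cleaner job per node, held until the actual job finished
for HOST in $(cut -d' ' -f1 "$PE_HOSTFILE" | sort -u); do
    qrsh -inherit "$HOST" mkdir -p "$TMPDIR-persistent"
    qsub -hold_jid "$JOB_ID" -q cleaner.q@"$HOST" cleaner.sh "$TMPDIR-persistent"
done

cleaner.sh itself would then just remove the directory it got as its first argument.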

-- Reuti


> 
> -- 
> Dan Whitehouse
> Research Systems Administrator, IT Services
> Queen Mary University of London
> Mile End
> E1 4NS
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti

> Am 06.12.2018 um 16:59 schrieb Dimar Jaime González Soto 
> :
> 
> qconf -sconf shows:
> 
> #global:
> execd_spool_dir  /var/spool/gridengine/execd
> ...
> ax_aj_tasks 75000

So, this is fine too. Next place: is the amount of overall slots limited:

$ qconf -se global

especially the line "complex_values".

And next: any RQS?

$ qconf -srqs

-- Reuti


> El jue., 6 dic. 2018 a las 12:55, Reuti () 
> escribió:
> 
> > Am 06.12.2018 um 15:19 schrieb Dimar Jaime González Soto 
> > :
> > 
> > qconf -se ubuntu-node2 :
> >  
> > hostname  ubuntu-node2
> > load_scaling  NONE
> > complex_valuesNONE
> > load_values   arch=lx26-amd64,num_proc=16,mem_total=48201.960938M, \
> >   
> > swap_total=95746.996094M,virtual_total=143948.957031M, \
> >   load_avg=3.74,load_short=4.00, \
> >   load_medium=3.74,load_long=2.36, \
> >   mem_free=47376.683594M,swap_free=95746.996094M, \
> 
> Although it's unrelated to the main issue: the swap size can be limited to 2 
> GB nowadays (which is the default in openSUSE). RedHat suggests a little bit 
> more, e.g. here:
> 
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-swapspace
> 
> 
> 
> >   virtual_free=143123.679688M,mem_used=825.277344M, \
> >   swap_used=0.00M,virtual_used=825.277344M, \
> >   cpu=25.00,m_topology=NONE,m_topology_inuse=NONE, \
> >   m_socket=0,m_core=0,np_load_avg=0.233750, \
> >   np_load_short=0.25,np_load_medium=0.233750, \
> >   np_load_long=0.147500
> > processors16
> > user_listsNONE
> > xuser_lists   NONE
> > projects  NONE
> > xprojects NONE
> > usage_scaling NONE
> > report_variables  NONE
> > 
> > El jue., 6 dic. 2018 a las 11:17, Dimar Jaime González Soto 
> > () escribió:
> > qhost :
> > 
> > HOSTNAMEARCH NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  
> > SWAPUS
> > ---
> > global  -   - -   -   -   - 
> >   -
> > ubuntu-frontend lx26-amd64 16  4.13   31.4G1.2G 0.0 
> > 0.0
> > ubuntu-node11   lx26-amd64 16  4.55   47.1G  397.5M   93.5G 
> > 0.0
> > ubuntu-node12   lx26-amd64 16  3.64   47.1G1.0G   93.5G 
> > 0.0
> > ubuntu-node13   lx26-amd64 16  4.54   47.1G  399.9M   93.5G 
> > 0.0
> > ubuntu-node2lx26-amd64 16  3.67   47.1G  818.5M   93.5G 
> > 0.0
> 
> This looks fine. So we have other settings to investigate:
> 
> $ qconf -sconf
> #global:
> execd_spool_dir  /var/spool/sge
> ...
> max_aj_tasks 75000
> 
> Is max_aj_tasks  limited in your setup?
> 
> 
> 
> -- Reuti
> 
> 
> > 
> > El jue., 6 dic. 2018 a las 11:13, Reuti () 
> > escribió:
> > 
> > > Am 06.12.2018 um 15:07 schrieb Dimar Jaime González Soto 
> > > :
> > > 
> > >  qalter -w p doesn't shows anything, qstat shows 16 processes and not 60:
> > > 
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node21 1
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node12   1 2
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node13   1 3
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node11   1 4
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node11   1 5
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node13   1 6
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node12   1 7
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node21 8
> > > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > > main.q@ubuntu-node2  

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti

> Am 06.12.2018 um 15:19 schrieb Dimar Jaime González Soto 
> :
> 
> qconf -se ubuntu-node2 :
>  
> hostname  ubuntu-node2
> load_scaling  NONE
> complex_valuesNONE
> load_values   arch=lx26-amd64,num_proc=16,mem_total=48201.960938M, \
>   swap_total=95746.996094M,virtual_total=143948.957031M, \
>   load_avg=3.74,load_short=4.00, \
>   load_medium=3.74,load_long=2.36, \
>   mem_free=47376.683594M,swap_free=95746.996094M, \

Although it's unrelated to the main issue: the swap size can be limited to 2 GB 
nowadays (which is the default in openSUSE). RedHat suggests a little bit more, 
e.g. here:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-swapspace



>   virtual_free=143123.679688M,mem_used=825.277344M, \
>   swap_used=0.00M,virtual_used=825.277344M, \
>   cpu=25.00,m_topology=NONE,m_topology_inuse=NONE, \
>   m_socket=0,m_core=0,np_load_avg=0.233750, \
>   np_load_short=0.25,np_load_medium=0.233750, \
>   np_load_long=0.147500
> processors16
> user_listsNONE
> xuser_lists   NONE
> projects  NONE
> xprojects NONE
> usage_scaling NONE
> report_variables  NONE
> 
> El jue., 6 dic. 2018 a las 11:17, Dimar Jaime González Soto 
> () escribió:
> qhost :
> 
> HOSTNAMEARCH NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  
> SWAPUS
> ---
> global  -   - -   -   -   -   
> -
> ubuntu-frontend lx26-amd64 16  4.13   31.4G1.2G 0.0 
> 0.0
> ubuntu-node11   lx26-amd64 16  4.55   47.1G  397.5M   93.5G 
> 0.0
> ubuntu-node12   lx26-amd64 16  3.64   47.1G1.0G   93.5G 
> 0.0
> ubuntu-node13   lx26-amd64 16  4.54   47.1G  399.9M   93.5G 
> 0.0
> ubuntu-node2lx26-amd64 16  3.67   47.1G  818.5M   93.5G 
> 0.0

This looks fine. So we have other settings to investigate:

$ qconf -sconf
#global:
execd_spool_dir  /var/spool/sge
...
max_aj_tasks 75000

Is max_aj_tasks  limited in your setup?



-- Reuti


> 
> El jue., 6 dic. 2018 a las 11:13, Reuti () 
> escribió:
> 
> > Am 06.12.2018 um 15:07 schrieb Dimar Jaime González Soto 
> > :
> > 
> >  qalter -w p doesn't shows anything, qstat shows 16 processes and not 60:
> > 
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node21 1
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node12   1 2
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node13   1 3
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node11   1 4
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node11   1 5
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node13   1 6
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node12   1 7
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node21 8
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node21 9
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node12   1 10
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node13   1 11
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node11   1 12
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node11   1 13
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node13   1 14
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node12   1 15
> > 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> > main.q@ubuntu-node21 16
> > 250 0.5 OMAcbuach   qw12/06/2018 11:04:02   
>

Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti

> Am 06.12.2018 um 15:07 schrieb Dimar Jaime González Soto 
> :
> 
>  qalter -w p doesn't shows anything, qstat shows 16 processes and not 60:
> 
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node21 1
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node12   1 2
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node13   1 3
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node11   1 4
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node11   1 5
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node13   1 6
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node12   1 7
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node21 8
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node21 9
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node12   1 10
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node13   1 11
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node11   1 12
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node11   1 13
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node13   1 14
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node12   1 15
> 250 0.5 OMAcbuach   r 12/06/2018 11:04:15 
> main.q@ubuntu-node21 16
> 250 0.5 OMAcbuach   qw12/06/2018 11:04:02 
>1 17-60:1

Aha, so they are already running on remote nodes – fine. As the slots setting in 
the queue configuration is per host, it should allow more processes per node than 
just four.

Is there a setting for the exechosts:

qconf -se ubuntu-node2

limiting the slots to 4 in complex_values? Can you please also provide the 
`qhost` output.
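
If it does, the limit can be raised or removed there (a sketch):

$ qconf -me ubuntu-node2     # then adjust or delete the slots entry in complex_values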

-- Reuti



> 
> El jue., 6 dic. 2018 a las 10:59, Reuti () 
> escribió:
> 
> > Am 06.12.2018 um 09:47 schrieb Hay, William :
> >
> > On Wed, Dec 05, 2018 at 03:29:23PM -0300, Dimar Jaime Gonz??lez Soto wrote:
> >>   the app site is https://omabrowser.org/standalone/ I tried to make a
> >>   parallel environment but it didn't work.
> > The website indicates that an array job should work for this.
> > Has the load average spiked to the point where np_load_avg>=1.75?
> 
> Yes, I noticed this too. Hence we need no parallel environement at all, as 
> OMA will just start several serial jobs as long as slots are available AFAICS.
> 
> What does `qstat` show for a running job. There should be a line for each 
> executing task while the waiting once are abbreviated in one line.
> 
> -- Reuti
> 
> 
> >
> > I would try running qalter -w p  against the job id to see what it says.
> >
> > William
> >
> >
> >
> >>
> >>> Am 05.12.2018 um 19:10 schrieb Dimar Jaime Gonzalez Soto
> >> :
> >>>
> >>> Hi everyone I'm trying to run OMA standalone on a grid engine setup
> >> with this line:
> >>>
> >>> qsub -v NR_PROCESSES=60 -b y -j y -t 1-60 -cwd /usr/local/OMA/bin/OMA
> >>>
> >>> it works but only execute 4 processes  per node, there are 4 nodes
> >> with 16 logical threads.  My main.q configuration is:
> >>>
> >>> qname main.q
> >>> hostlist  @allhosts
> >>> seq_no0
> >>> load_thresholds   np_load_avg=1.75
> >>> suspend_thresholdsNONE
> >>> nsuspend  1
> >>> suspend_interval  00:05:00
> >>> priority  0
> >>> min_cpu_interval  00:05:00
> >>> processorsUNDIFINED
> >>> qtype BATCH INTERACTIVE
> >>> ckpt_list NONE
> >>> pe_list   make
> >>> rerun FALSE
> >>> slots 16
> 
> 
> 
> --
> Atte.
> 
> Dimar González Soto
> Ingeniero Civil en Informática
> Universidad Austral de Chile
> 
> 



signature.asc
Description: Message signed with OpenPGP using GPGMail
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] problem with concurrent jobs

2018-12-06 Thread Reuti


> Am 06.12.2018 um 09:47 schrieb Hay, William :
> 
> On Wed, Dec 05, 2018 at 03:29:23PM -0300, Dimar Jaime Gonz??lez Soto wrote:
>>   the app site is https://omabrowser.org/standalone/ I tried to make a
>>   parallel environment but it didn't work.
> The website indicates that an array job should work for this.
> Has the load average spiked to the point where np_load_avg>=1.75?

Yes, I noticed this too. Hence we need no parallel environment at all, as OMA 
will just start several serial jobs as long as slots are available AFAICS.

What does `qstat` show for a running job? There should be a line for each 
executing task, while the waiting ones are abbreviated in one line.

-- Reuti


> 
> I would try running qalter -w p  against the job id to see what it says.
> 
> William
> 
> 
> 
>> 
>>> Am 05.12.2018 um 19:10 schrieb Dimar Jaime Gonzalez Soto
>> :
>>> 
>>> Hi everyone I'm trying to run OMA standalone on a grid engine setup
>> with this line:
>>> 
>>> qsub -v NR_PROCESSES=60 -b y -j y -t 1-60 -cwd /usr/local/OMA/bin/OMA
>>> 
>>> it works but only execute 4 processes  per node, there are 4 nodes
>> with 16 logical threads.  My main.q configuration is:
>>> 
>>> qname main.q
>>> hostlist  @allhosts
>>> seq_no0
>>> load_thresholds   np_load_avg=1.75
>>> suspend_thresholdsNONE
>>> nsuspend  1
>>> suspend_interval  00:05:00
>>> priority  0
>>> min_cpu_interval  00:05:00
>>> processorsUNDIFINED
>>> qtype BATCH INTERACTIVE
>>> ckpt_list NONE
>>> pe_list   make
>>> rerun FALSE
>>> slots 16


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] problem with concurrent jobs

2018-12-05 Thread Reuti
Hi,

> Am 05.12.2018 um 19:10 schrieb Dimar Jaime González Soto 
> :
> 
> Hi everyone I'm trying to run OMA standalone on a grid engine setup with this 
> line:

I assume OMA is the name of your application. Do you have any link to its 
website?

> 
> qsub -v NR_PROCESSES=60 -b y -j y -t 1-60 -cwd /usr/local/OMA/bin/OMA
> 
> it works but only execute 4 processes  per node, there are 4 nodes with 16 
> logical threads.  My main.q configuration is:
> 
> qname main.q
> hostlist  @allhosts
> seq_no0
> load_thresholds   np_load_avg=1.75
> suspend_thresholdsNONE
> nsuspend  1
> suspend_interval  00:05:00
> priority  0
> min_cpu_interval  00:05:00
> processorsUNDIFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list   make
> rerun FALSE
> slots 16
> tmpdir/tmp
> shell /bin/csh
> prologNONE
> epilogNONE
> shell_start_mode  posix_compliant
> starter_methodNONE
> suspend_methodNONE
> resume_method NONE
> terminate_method  NONE
> notify00:00:60
> owner_listNONE
> user_listsNONE
> xuser_lists   NONE
> subordinate_list  NONE
> complex_valuesNONE
> projects  NONE
> xprojects NONE
> calendar  NONE
> initial_state default
> s_rt  INFINITY
> h_rt  INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize   INFINITY
> 
> I want to run 60 processes at the same time, any advice?

Did you define any Parallel Environment for the application? There are none 
specified in your submission command or the queue configuration.

Does the application support running across nodes by MPI or other means of 
communication? The PE would deliver a hostlist to the application, which can 
then be used to start processes on other nodes too. Some MPI libraries even 
discover the list of granted nodes on their own when they are tightly 
integrated with SGE.
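
Just for reference, in case a suitable PE turns out to be needed after all: a 
single-node multi-threaded PE could be created and used roughly like this (the 
PE name "smp" and the wrapper script are made up, main.q is your queue):

$ qconf -ap smp                           # editor opens; e.g. allocation_rule $pe_slots
$ qconf -aattr queue pe_list smp main.q   # attach the new PE to main.q
$ qsub -pe smp 16 … ./wrapper.sh          # the job then gets 16 slots on one node

For MPI jobs spanning several nodes one would rather use allocation_rule 
$fill_up or $round_robin together with a tightly integrated MPI library.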

BTW: as you wrote "16 logical threads": often it's advisable for HPC to 
disable Hyperthreading and use only the physically available cores.

-- Reuti


> -- 
> Atte.
> 
> Dimar González Soto
> Ingeniero Civil en Informática
> Universidad Austral de Chile
>  
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Issue with permissions on new servers added to cluster

2018-12-03 Thread Reuti
Hi,

> Am 03.12.2018 um 15:19 schrieb srinivas.chakrava...@wipro.com:
> 
> Hi,
> 
> We are receiving a strange permissions issue while submitting jobs to new 
> hosts added to our clusters. While submitting jobs with normal permissions to 
> user directories the jobs invariably go into error state. 
> 
> While checking the logs, we find information as below: 
> 12/03/2018 14:16:55|worker||W|job 210165.1 failed on host  
> general opening input/output file because: 12/03/2018 15:16:54 [899:26827]: 
> error: can't open output file "/test.sh.o210165": Permission denied
> 12/03/2018 14:16:55|worker||W|rescheduling job 210165.1
> 
> The strange thing is, while we provide full permissions (777) to a directory 
> and run under it, the job runs fine, but output and error files are created 
> on behalf of "sgeadmin" user with 744 permissions.

Were the sge_execd on these new machines started by root or sgeadmin?

$ ps -e f -o user,ruser,command | grep sge
sgeadmin root /usr/sge/bin/lx24-em64t/sge_execd
root root  \_ /bin/sh /usr/sge/cluster/tmpspace.sh

-- Reuti


>  
> 
> The user directories, job directories and SGE_ROOT folder are all NFS volumes 
> mounted on all hosts similarly. There is no issue on hosts that are already 
> present in the cluster and jobs run fine on them. 
> 
> Can anyone please suggest what might be wrong here?
> 
> Thanks and regards,
> Srinivas.
> 
> 
> The information contained in this electronic message and any attachments to 
> this message are intended for the exclusive use of the addressee(s) and may 
> contain proprietary, confidential or privileged information. If you are not 
> the intended recipient, you should not disseminate, distribute or copy this 
> e-mail. Please notify the sender immediately and destroy all copies of this 
> message and any attachments. WARNING: Computer viruses can be transmitted via 
> email. The recipient should check this email and any attachments for the 
> presence of viruses. The company accepts no liability for any damage caused 
> by any virus transmitted by this email. www.wipro.com 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Processes not exiting

2018-11-14 Thread Reuti
Hi,

> Am 14.11.2018 um 01:06 schrieb ad...@genome.arizona.edu:
> 
> We have a cluster with gridengine 6.5u2 and noticing a strange behavior when 
> running MPI jobs.  Our application will finish, yet the processes continue to 
> run and use up the CPU.  We did configure a parallel environment for MPI as 
> follows:
> 
> pe_namempi
> slots  500
> user_lists NONE
> xuser_listsNONE
> start_proc_argsNONE
> stop_proc_args NONE
> allocation_rule$round_robin
> control_slaves TRUE
> job_is_first_task  FALSE
> urgency_slots  min
> accounting_summary FALSE
> 
> Then we have run our application "Maker" like this,
> qsub -cwd -N  -b y -V -pe mpi  /opt/mpich-install/bin/mpiexec  
> maker 

Which version of MPICH are you using? Maybe it's not tightly integrated.
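
A quick check on one of the compute nodes might be (the exact process tree may 
differ):

$ ps -e f -o pid,ppid,command | grep -B2 maker

With a tight integration the remote maker/mpiexec processes are started via 
`qrsh -inherit` and show up as children of an sge_shepherd; if they hang 
directly below init (PPID 1) after the job finished, SGE never knew about them 
and cannot clean them up.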

-- Reuti


> It seems to run fine and qstat will show it running.  Once it has completed, 
> qstat is empty again and we have the desired output. However, the "maker" 
> process have continued to run on the compute nodes until I login to each node 
> and "kill -9" the processes.  We did not have this problem when running 
> mpiexec directly with Maker, or running Maker in stand-alone mode (without 
> MPI), so I guess it is a problem with our qsub command or parallel 
> environment?  Any Ideas?
> 
> Thanks,
> -- 
> Chandler / Systems Administrator
> Arizona Genomics Institute
> www.genome.arizona.edu
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Email warning for s_rt ?

2018-10-23 Thread Reuti
Hi,

> Am 23.10.2018 um 20:31 schrieb Dj Merrill :
> 
> Hi Reuti,
>   Thank you for your response.  I didn't describe our environment very
> well, and I apologize.  We only have one queue.  We've had a few
> instances of people forgetting they ran a job that doesn't apparently
> have any stopping conditions, and am trying to come up with a way to
> gently remind folks when they've left something running.
> 
>   Current thoughts are to have the "sge_request" file contain:
> -soft -l s_rt=720:0:0
> 
>   We can tell them to use qalter to extend the time if they want, or they
> can contact us to do it.

This won't work in SGE. The limits are set when the job starts. The only way 
to extend a runtime limit is to softstop the execd on the particular node (with 
the side effect that no more jobs will be scheduled to it until the execd is 
restarted), and to restart the execd once the job that was granted to run 
longer than estimated has come to an end.
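
Roughly (the location and name of the generated rc script depend on your 
installation):

# on the node in question, as root:
$SGE_ROOT/$SGE_CELL/common/sgeexecd softstop   # execd exits, running jobs keep running
# later, once the overlong job has finished:
$SGE_ROOT/$SGE_CELL/common/sgeexecd start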


>   It would be nice if we could somehow parse the current s_rt on a job,
> and 5 days before that time send out an email notification.  If they
> extend it to longer, we'd like it to again send out the notification 5
> days before the new limit.  In other words, something along the lines of
> running a cron script every night that parses the running jobs, gets the
> relevant info, and sends out an email notification if necessary.
> 
>   In fact, we might not even need the s_rt limit set at all and an email
> reminder at set intervals might be enough for our purposes, although
> being able to have it auto terminate the job would save some manual effort.

I would suggest storing such arbitrary information in a job context like 
"qsub -ac ESTIMATED_RUNTIME=720". Reading your complete description of the 
setup, I get the impression that we are speaking here of jobs running for days 
or weeks. Hence a cronjob on the master node of the cluster could do it all 
once per hour or every 10 minutes:

- read the job context and grep for the currently set maximum duration
- generate an email when a certain limit is passed, and store the information 
that the email was already sent in the job context too*
- a job that passed the limit will be killed

*) This additional context variable "WARNED_FOR=…" could simply get the same 
value as the just passed limit. As long as "ESTIMATED_RUNTIME" equals 
"WARNED_FOR" no additional email is generated. But if the user changes the 
"ESTIMATED_RUNTIME" we can detect this, and an email can be sent if the adjusted 
"ESTIMATED_RUNTIME" is about to be reached again. It might be easier to have a 
wrapper that converts hh:mm:ss to plain seconds, or even to advise the user to 
specify the limit in minutes or hours only as a general requirement, so that no 
further conversion is necessary in the script.

I wonder how we can pull all the information in one `qstat` call. The context 
variables of the running jobs can be obtained with `qstat -s r -j "*"`, but the 
actual start time of a job is output only by a plain `qstat -s r` or `qstat -s 
r -r`. To lower the impact on the qmaster, we should avoid looping over all 
currently running jobs one by one.
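
A very rough sketch of such a cronjob (not production quality: the qstat column 
positions, the date format parsed by GNU date and the context variable names 
ESTIMATED_RUNTIME/WARNED_FOR are assumptions, and for clarity it loops with one 
`qstat -j` per running job, i.e. exactly the qmaster load mentioned above):

#!/bin/sh
export PATH=/usr/local/bin:/bin:/usr/bin
. /usr/sge/default/common/settings.sh

NOW=$(date +%s)
WARN_AHEAD=$((5 * 24 * 3600))

# a plain `qstat -s r` delivers the actual start date/time (fields 6 and 7 here)
qstat -s r | awk 'NR>2 { print $1, $6, $7 }' | sort -u | while read JOB SDATE STIME; do
    INFO=$(qstat -j "$JOB")
    CONTEXT=$(echo "$INFO" | sed -n -e "/^context/{s/^context: *//;s/,/\n/g;p}")
    LIMIT=$(echo "$CONTEXT" | sed -n -e "/^ESTIMATED_RUNTIME=/s///p")   # in hours
    WARNED=$(echo "$CONTEXT" | sed -n -e "/^WARNED_FOR=/s///p")
    OWNER=$(echo "$INFO" | awk '/^owner:/ { print $2 }')
    [ -z "$LIMIT" ] && continue

    START=$(date -d "$SDATE $STIME" +%s)
    END=$((START + LIMIT * 3600))

    if [ "$NOW" -ge "$END" ]; then
        qdel "$JOB"
    elif [ "$NOW" -ge $((END - WARN_AHEAD)) ] && [ "$WARNED" != "$LIMIT" ]; then
        echo "Job $JOB will reach its estimated runtime in less than 5 days." | \
            mail -s "SGE runtime warning" "$OWNER"
        qalter -ac WARNED_FOR="$LIMIT" "$JOB"
    fi
done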

-- Reuti

> 
>   What I'm asking for might not even be practical, but I thought it worth
> a try to ask.
> 
> Thanks,
> 
> -Dj
> 
> 
> 
> On 10/20/2018 05:02 AM, Reuti wrote:
>> Hi,
>> 
>> Am 19.10.2018 um 22:44 schrieb Dj Merrill:
>> 
>>> Hi all,
>>> Assuming a soft run time limit for a queue, is there a way to send an
>>> email warning when the job is about to hit the limit?
>>> 
>>> For example, for a job with "-soft -l s_rt=720:0:0" giving a 30 day run
>> 
>> You are aware, that this is a soft-soft limit. Means: I prefer a queue with 
>> a s_rt of 720:0:0, and if I get only 360:0:0 it's also fine.
>> 
>> 
>>> time, is there a way to send an email at the 25 day mark to let the
>>> person know the job will be forced to end in 5 days?
>> 
>> The s_rt will have already the purpose to send a signal (SIGUSR1) before 
>> h_rt is reached. Please have a look at "RESOURCE LIMITS" in `man 
>> queue_conf`. So I wonder, whether the combined usage of s_rt and h_rt (both 
>> with the default -hard option) could already provide what you want to 
>> implement.
>> 
>> Sure, the SIGUSR1 must be caught in the script and masked out in the called 
>> binary to avoid that it's killed by the SIGUSR1 default behavior. I use a 
>> subshell for it:
>> 
>> trap "echo Foo" SIGUSR1
>> (trap - SIGUSR1; my_binary)
>> 
>> as the SIGUSR1 is send to the comple

Re: [gridengine users] Email warning for s_rt ?

2018-10-20 Thread Reuti
Hi,

Am 19.10.2018 um 22:44 schrieb Dj Merrill:

> Hi all,
>   Assuming a soft run time limit for a queue, is there a way to send an
> email warning when the job is about to hit the limit?
> 
>   For example, for a job with "-soft -l s_rt=720:0:0" giving a 30 day run

You are aware that this is a soft-soft limit? It means: I prefer a queue with 
an s_rt of 720:0:0, and if I get only 360:0:0 that's also fine.


> time, is there a way to send an email at the 25 day mark to let the
> person know the job will be forced to end in 5 days?

The purpose of s_rt is already to send a signal (SIGUSR1) before h_rt is 
reached. Please have a look at "RESOURCE LIMITS" in `man queue_conf`. So I 
wonder whether the combined usage of s_rt and h_rt (both with the default 
-hard option) could already provide what you want to implement.

Sure, the SIGUSR1 must be caught in the script and masked out in the called 
binary to avoid it being killed by the SIGUSR1 default behavior. I use a 
subshell for it:

trap "echo Foo" SIGUSR1
(trap - SIGUSR1; my_binary)

as the SIGUSR1 is sent to the complete process tree of the job. The "echo Foo" 
could be replaced by `mail -s Warning …`.
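
In practice this just means requesting both limits together, e.g. for the 
30-day case of this thread (600:0:0 being the 25-day mark you mentioned):

qsub -l s_rt=600:0:0,h_rt=720:0:0 your_job_script.sh

so that the script receives the SIGUSR1 five days before the job would be 
killed by h_rt.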


>   I've thought about trying to draft a script to do this, but thought I'd
> ask first if anyone else has come up with something.

A completely different approach: use a checkpoint interface to send an email 
warning. The interval given to `qsub -c 600:0:0 -ckpt mailer_only …` represents 
the 25 days, and the checkpointing interface "mailer_only" does not do any real 
checkpointing, but has a script defined for "ckpt_command" which sends an email 
(i.e. "interface application-level" must be used).

There is an introduction to use the checkpoint interface here: 
https://arc.liv.ac.uk/SGE/howto/checkpointing.html
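
From memory, such a checkpointing object could look roughly like this (please 
check `man checkpoint` for the exact fields and pseudo variables; "mailer_only" 
and the script path are made up):

$ qconf -ackpt mailer_only
ckpt_name          mailer_only
interface          application-level
ckpt_command       /usr/sge/cluster/ckpt_mail.sh $job_id $job_owner
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           /tmp
signal             NONE
when               sx

The queue's ckpt_list then has to contain mailer_only, and ckpt_mail.sh does 
nothing but send the reminder email.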

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Change the format of sge delivered mail

2018-10-17 Thread Reuti
Am 17.10.2018 um 19:50 schrieb Nelson Kick:

> Is there any way to add variables to the standard “beas” mail?  Would like to 
> add
> additional info to this list… if possible, like SGE_O_PATH or other env 
> variables.

Please find below the complete mail-wrapper script. By default, we are 
interested in the context variables set by our script generator:

#$ -ac COMMAND=
#$ -ac OUTPUT=
#$ -ac MAIL_ATTACHMENT=
#$ -ac MAIL_RECIPIENTS_LIST=

The content of OUTPUT= and MAIL_ATTACHMENT= is most likely the same, but the 
users wanted a way to get the name of the output file without it being attached 
to the email all the time. Hence these two context variables, which have a 
similar purpose. A flag could have done the same. The mail-wrapper script also 
tries to change the suffix of the attached file from *.log or *.out to *.txt 
(only in the attachment). This way even mail applications on a smartphone 
should be able to display the attachment without complaining.

The entries in MAIL_RECIPIENTS_LIST= are alternate or additional receivers of 
the email. If the list starts with a "+", these will be added to the original 
receiver
(bug/feature alert: if the list in SGE has several receivers already (added by 
-M in `qsub`), the mail-wrapper script will be called for each of them once – 
hence the additional receivers will get duplicate emails). Without the "+" the 
list will replace the original receiver. The list entries must be delimited by 
";", as the "," will already be used to separate all the fields in SGE's output 
of the context variables.

(The context script I posted already will convert the ";" in ",", as in the 
context file it creates the entries are already on separates lines. The mail 
wrapper in turn will replace the "," with a " ", as this is what our mail 
application expects if several receivers are specified as delimiter.)
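
As an illustration (file name and addresses made up), a submission generated 
this way could carry:

qsub -ac MAIL_ATTACHMENT=$HOME/runs/foo.out -ac "MAIL_RECIPIENTS_LIST=+colleague@example.com;boss@example.com" …

i.e. the output file gets attached and the two extra addresses receive the 
email in addition to the original receiver.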

One variable is created by the prolog itself with the line:

qalter -ac NODES=…

to get a list of used nodes for this particular job.

Please let me know, if something is unclear. I hope it will work in a general 
case too.

-- Reuti

 mail-wrapper script follows 

#!/bin/sh

#
# Assemble an email and attach an output file.
# Note: SGE will call this routine for each and every recipient specified
# with the -M option. It won't use the feature of the mail application
# to honor a list of recipients. This wrapper will do.
#

export PATH=/usr/local/bin:/bin:/usr/bin
. /usr/sge/default/common/settings.sh

line() {
if [ -f /var/spool/sge/context/$JOB_ID -a -r /var/spool/sge/context/$JOB_ID 
]; then
echo $(sed -n -e "/^${1}=/s///p" /var/spool/sge/context/$JOB_ID)
fi
}
 
entry() {
RESULT=$(line ${1})
if [ -n "${RESULT}" ]; then
echo "${1}: ${RESULT}"
else
echo "${1}: [none recorded]"
fi
}

command_line() {
COMMAND_LINE=$(qstat -j ${JOB_ID} | sed -n -e "/^context/{s/^context: 
*//;s/,/\n/g;s/;/,/g;p}" | sed -n -e "/^COMMAND=/s///p")
echo
if [ -n "${COMMAND_LINE}" ]; then
echo "COMMAND: ${COMMAND_LINE}"
else
echo "COMMAND: [none recorded]"
fi
}

context() {
echo
entry COMMAND
entry OUTPUT
entry NODES
}

assemble_recipients_list() {
if [ -n "${MAIL_RECIPIENTS_LIST}" ]; then
if [ "${MAIL_RECIPIENTS_LIST:0:1}" = "+" ]; then
MAIL_RECIPIENTS_LIST="$1,${MAIL_RECIPIENTS_LIST:1}"
fi
else
MAIL_RECIPIENTS_LIST="${1}"
fi
MAIL_RECIPIENTS_LIST="${MAIL_RECIPIENTS_LIST//,/ }"
}

check_and_prepare_attachment() {

tmpdir="/tmp"

if [ -n "${MAIL_ATTACHMENT}" ]; then
   if [ ! -f "${MAIL_ATTACHMENT}" ]; then
   ATTACHMENT_ERROR="The requested attachment file 
\`${MAIL_ATTACHMENT}' does not exist."
   return 1
   elif [ ! -r "${MAIL_ATTACHMENT}" ]; then
   ATTACHMENT_ERROR="The requested attachment file 
\`${MAIL_ATTACHMENT}' couldn't be read."
   return 1
   fi

   filename=$(basename "${MAIL_ATTACHMENT}")
   suffix=${filename##*.}
   if [ "${suffix}" = "${filename}" ]; then
   suffix=""
   fi
   filename="${filename%.*}"
   if [ $(stat --printf="%s" "${MAIL_ATTACHMENT}") -gt 1048576 ]; then
   ATTACHMENT_WARNING1="The requested attachment file 
\`${MAIL_ATTACHMENT}' has a size > 1 MiB. The attachment in this email was 
truncated and contains the last MiB only."
   filename="${filename}_last_MiB"
   mangle_attachment="1"
   fi

   if [ -z "${suffix}" -o "${suffi

Re: [gridengine users] Change the format of sge delivered mail

2018-10-17 Thread Reuti
Hi,

> Am 17.10.2018 um 19:50 schrieb Nelson Kick :
> 
> Is there any way to add variables to the standard “beas” mail?  Would like to 
> add
> additional info to this list… if possible, like SGE_O_PATH or other env 
> variables.
>  
> Job 114054 (STDIN) Complete
> User = user
> Queue= queue
> Host = node01
> Start Time   = 10/17/2018 10:46:39
> End Time = 10/17/2018 10:48:19
> User Time= 00:00:00
> System Time  = 00:00:00
> Wallclock Time   = 00:01:40
> CPU  = 00:00:00
> Max vmem = 237.992M
> Exit Status  = 0

Yes, it's possible. But there are several steps involved. First one has to 
realize that these emails are sent from the exechost, after the job has already 
left the exechost. Hence nothing is left there to peek at. For me the most 
important items were the context variables, but feel free to add additional 
ones (they will simply appear as context variables when you add "foobar=baz" to 
the generated file). So, the steps are:

a) create a directory /var/spool/sge/context where the sgeadmin can write to. 
For me this location was the logical choice, as we have local spool directories 
to lower the NFS traffic anyway, but it could also be the default location of 
the spool directories.
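
I.e. something like this on every exechost (owner and permissions to be 
adjusted to your site):

mkdir -p /var/spool/sge/context
chown sgeadmin /var/spool/sge/context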


b) we need a script, which will run at the start of a job as prolog and under 
the sgeadmin account:

#!/bin/sh -p
#
# This script saves some information in the /var/spool/sge/context area to be
# retrieved later on when generating the email about the job completion.
#

export PATH=/usr/local/bin:/bin:/usr/bin
. /usr/sge/default/common/settings.sh

if [ "${SGE_TASK_ID}" != "undefined" ]; then
JOB_ID=${JOB_ID}.${SGE_TASK_ID}
fi

#
# Attach the list of granted nodes to the job context.
# (This only works, if it's not an array job.)
#

if [ "${SGE_TASK_ID}" = "undefined" ]; then
if [ "${NSLOTS:-1}" -eq 1 ]; then
   qalter -ac NODES=$(hostname):1 $JOB_ID > /dev/null
else
   qalter -ac NODES=$(awk '{slots[$1]+=$2} END { for (host in slots) { 
printf entry?"+":""; printf host":"slots[host]; entry=1 }}' ${PE_HOSTFILE}) 
${JOB_ID} > /dev/null
fi
fi

#
# Record all context variables in the /var/spool/sge/context/$JOB_ID file to
# have access to it also after the job has finished.
#

if [ -d /var/spool/sge/context -a -w /var/spool/sge/context ]; then
qstat -j ${JOB_ID} | sed -n -e "/^context/{s/^context: 
*//;s/,/\n/g;s/;/,/g;p}" > /var/spool/sge/context/${JOB_ID}
fi

#
# Be sure to exit with 0, even when the grep wasn't successful.
#

exit 0


c) this script must be defined in the SGE configuration with: qconf -mconf

prolog   sgeadmin@/usr/sge/cluster/busybox env -u \
 LD_LIBRARY_PATH -u LD_PRELOAD -u IFS \
 /usr/sge/cluster/context.sh

Using busybox here is a safety measure, as a sneaky user could set some 
variables to get something executed as sgeadmin this way. As you might notice, 
I store all the custom scripts for SGE in /usr/sge/cluster, but this could be 
any location.


d) we need a custom mail-wrapper script, which will read the just written file 
and retrieve the necessary information

I will post this shortly. I wonder whether to post two scripts, as in its 
current version ours also includes the ability to attach the last 1 MiB of the 
output file to the sent email, so that the users can check the result of the 
computation even without logging into the cluster.


e) this mail-wrapper script needs again to be defined with: qconf -mconf

mailer   /usr/sge/cluster/mailer.sh

-- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Bring up execd nodes without explicit configuration

2018-09-15 Thread Reuti
Hi,

The new machine must at least be put into a hostgroup (which is in turn 
referenced by a queue, e.g. @allhosts), and it must be an administration host. 
When the execd starts, it will automatically be added as an exechost in the 
qmaster.

What do you mean by OS? If it's just Linux, Windows or AIX, this can be 
referenced with a request for a particular “arch” value (i.e. $ARC in the job 
script). Other flavors could be requested by a custom RESTRING complex, where 
you could put any string you like to reference a particular OS. Essentially all 
of this can be done with a single queue.
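
A minimal sequence could hence be (hostgroup, complex and arch value are just 
examples):

qconf -ah newnode                                  # make it an administration host
qconf -aattr hostgroup hostlist newnode @allhosts  # add it to the hostgroup of the queue
# then run install_execd (or start sge_execd) on newnode

In the job you could then request e.g. `-l arch=lx-amd64` (the exact value 
depends on your build, see `qhost`) or, with a custom RESTRING complex like 
"osversion" set per host via complex_values, `-l osversion=centos7`.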

-- Reuti




Sent from my iPhone
> Am 15.09.2018 um 19:15 schrieb Simon Matthews :
> 
> Is there any way to bring up an execd node, without explicitly
> configuring it at the qmaster? Perhaps it could come up and be added
> to a default queue?
> 
> If it is possible to do this, is it possible to specify a specific OS
> version for the execd when submitting a job? Obviously, this can be
> done by assigning execd hosts to specific queues and submitting the
> job to the appropriate queue, but I was wondering if there was some
> way to submit to one queue, but specify the OS via some other
> parameter.
> 
> I am using SoGE.
> 
> Simon
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] cpu usage calculation

2018-08-31 Thread Reuti
Hi John,

> Am 31.08.2018 um 12:27 schrieb Marshall2, John (SSC/SPC) 
> :
> 
> Hi,
> 
> When gridengine calculates cpu usage (based on wallclock) it uses:
> cpu usage = wallclock * nslots
> 
> This does not account for the number of cpus that may be used for
> each slot, which is problematic.

What was the motivation for implementing it this way? I mean: usually one SLOT 
represents one CORE in GridEngine. Hence, to get a proper allocation and 
accounting while not oversubscribing the nodes, you have to request the overall 
number of cores in case you want to combine processes (like for MPI) and 
threads (like for OpenMP).

In your case, it looks to me as if you assume that the necessary cores are 
available, independent of the actual usage of each node?

-- Reuti

PS: I assume with CPUS you refer to CORES.


> I have written up an article at:
> https://expl.info/display/MISC/Slot+Multiplier+for+Calculating+CPU+Usage+in+Gridengine
> 
> which explains the issue and provides a patch (against sge-8.1.9)
> so that:
> cpu usage = wallclock * nslots * ncpus_per_slot
> 
> This makes the usage information much more useful/accurate
> when using the fair share.
> 
> Have others encountered this issue? Feedback is welcome.
> 
> Thanks,
> John
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Gridengine: error: commlib error: got select error (connection refused)

2018-08-27 Thread Reuti
Hi,

> Am 27.08.2018 um 16:01 schrieb Omri Safren :
> 
> I just installed gridengine & getting error when doing `qstat`:
> 
> error: commlib error: got select error (Connection refused)
> error: unable to send message to qmaster using port 6444 on host 
> "MyHost-VirtualBox": got send error
> 
> `cat /var/spool/gridengine/qmaster/messages` gives:
> 
> main|"MyHost-VirtualBox"|W|local configuration "MyHost-VirtualBox" not 
> defined - using global configuration
> main|"MyHost-VirtualBox"|E|global configuration not defined
> main|"MyHost-VirtualBox"|C|setup failed
> 
> setting `export SGE_ROOT` and running `sudo service 
> /etc/init.d/gridengine-master start` didn't help. I think the service isn't 
> running. Should I setup more env variables or a setup file?
> 
> Running on Ubuntu. Installed by `sudo apt-get install gridengine-master 
> gridengine-client` and accepted all defaults.

I have no clue about the Ubuntu issue. But usually you have to run a setup 
beforehand twice - once for the master and once for the execution host. Do you 
have any file "install_qmaster" in $SGE_ROOT? It won't install anything, but 
mainly configures the GridEngine.
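
I.e. something like:

cd $SGE_ROOT
./install_qmaster    # on the master host
./install_execd      # afterwards on every execution host (here the same VM)

Both are interactive and write the cell configuration (settings.sh etc.) below 
$SGE_ROOT/default/common/.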

-- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] User job fails silently

2018-08-08 Thread Reuti
Hi,

Am 09.08.2018 um 01:26 schrieb Derrick Lin:

> >  What state of the job you see in this line? Is it just hanging there and 
> > doing nothing? They do not appear in `top`? And it never vanishes 
> > automatically but you have to kill the job by hand? 
> 
> Sorry for the confusion. The job state is "r" according to SGE, but as you 
> mentioned qstat output is not related to any process. 
> 
> The line I coped is what it shown in top/htop. So basically, all his jobs 
> became:
> 
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671 
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187677
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187690
> 
> Each of this scripts does copy & untar a file to the local XFS file system, 
> then a python script is called to operate on these untared files.
> 
> The job log shows that untaring is done, but the python script has never 
> started and the job process stuck as shown above.
> 
> We don't see any storage related contention.
> 
> I am more interested in knowing where this process  bash 
> /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671 come from?

Unless the submitted job is marked as a binary, the jobscript is copied to 
SGE's internal database. At this point it would even be possible to change the 
jobscript on disk, while the submitted one keeps its content. On the exechost, 
this stored jobscript is then saved at the start of the job in the job_scripts/ 
directory below the execd's spool directory for that host and executed. As a 
consequence, this would also work without a shared file system if that spool 
directory is local on the exechost (like in /var/spool/sge).

If this spool directory is shared (as it seems to be in your case), the 
jobscript is first transferred by SGE's protocol to the node, where the execd 
then writes it into the shared space, which is on the headnode again.

If you peek into the given file, you will hence find the original jobscript of 
the user. Does the jobscript try to modify itself, which the user (of course) 
can't do at this location?
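
E.g. on the node in question:

less /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671

should show exactly what the user submitted.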

-- Reuti


> Cheers,
> 
> 
> On Wed, Aug 8, 2018 at 6:53 PM, Reuti  wrote:
> 
> > Am 08.08.2018 um 08:15 schrieb Derrick Lin :
> > 
> > Hi guys,
> > 
> > I have a user reported his jobs stuck running for much longer than usual.
> > 
> > So I go to the exec host, check the process and all processes owned by that 
> > user look like:
> > 
> > `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671
> 
> What state of the job you see in this line? Is it just hanging there and 
> doing nothing? They do not appear in `top`? And it never vanishes 
> automatically but you have to kill the job by hand?
> 
> 
> > In qstat, it still shows job is in running state.
> 
> The `qstat`output is not really related to any running process. It's just 
> what SGE granted and think it is running or granted to run. Especially with 
> parallel jobs across nodes, the might or might not be any process on one of 
> the granted slave nodes.
> 
> 
> > The user resubmitted the jobs and they ran and completed without an problem.
> 
> Could it be a race condition with the shared file system?
> 
> -- Reuti
> 
> 
> > I am wondering what may has caused this situation in general?
> > 
> > Cheers,
> > Derrick
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] User job fails silently

2018-08-08 Thread Reuti


> Am 08.08.2018 um 08:15 schrieb Derrick Lin :
> 
> Hi guys,
> 
> I have a user reported his jobs stuck running for much longer than usual.
> 
> So I go to the exec host, check the process and all processes owned by that 
> user look like:
> 
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671

What state of the job do you see in this line? Is it just hanging there doing 
nothing? Do the processes not appear in `top`? And does it never vanish 
automatically, so that you have to kill the job by hand?


> In qstat, it still shows job is in running state.

The `qstat` output is not really related to any running process. It's just 
what SGE granted and thinks is running or is granted to run. Especially with 
parallel jobs across nodes, there might or might not be any process on one of 
the granted slave nodes.


> The user resubmitted the jobs and they ran and completed without an problem.

Could it be a race condition with the shared file system?

-- Reuti


> I am wondering what may has caused this situation in general?
> 
> Cheers,
> Derrick
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Request: Value of -p in qsub doesn't go below -26

2018-08-06 Thread Reuti
Hi,

You can try to have a look at the extended output of `qstat`:

$ qstat -ext

$ qstat -pri

In addition, the way the priority is honored and essentially computed is 
outlined here:

$ man sge_priority

Maybe this will shed some light on it and point to the cause of it.

-- Reuti

PS: You may also want to switch on the output of the computed tickets:

$ qconf -ssconf
…
report_pjob_tickets   TRUE


Am 06.08.2018 um 19:18 schrieb Gowtham:

> Greetings.
> 
> I am using Rocks Cluster Distribution 6.1 and Grid Engine 2011.11p1. All our 
> simulations are submitted to the queue using the following command format:
> 
> qsub -p N SUBMISSION_SCRIPT.sh
> 
> N is a negative integer ranging from -1 through -60 (we consider this the 
> "priority" of a research group). 
> 
> Until about a week or so ago, everything worked fine. Upon noticing some 
> simulations waiting in queue for longer than normal periods of time (for 
> e.g., my own group's priority is -41), I submitted 60 simulations with 
> priority values -1, -2, -3, ..., -60.
> 
> I noticed that simulations with priority up to -26 ran just fine. Those with 
> -p value -27 and below just stay in 'qw' mode. The usual 'qstat -j SIM_ID' 
> command does not have information as to why it's not running (please see 
> below the output for a simulation with priority -27). Processors/slots are 
> free and available in long.q.
> 
> As far as I know and understand Grid Engine documentation, -p values range 
> from -1024 through 1023 and non operators/admins are restricted to 0 through 
> -1024. 
> 
> Any help in debugging/identifying the cause of this problem will be greatly 
> appreciated.
> 
> 
> job_number: 481703
> exec_file:  job_scripts/481703
> submission_time:Mon Aug  6 12:48:07 2018
> owner:  john
> uid:38025
> group:  jane-users
> gid:506
> sge_o_home: /home/john
> sge_o_log_name: john
> sge_o_path: 
> :/bin:/usr/bin:/usr/kerberos/bin:/share/apps/bin:/share/apps/sbin:/usr/X11R6/bin:/usr/java/latest/bin:/sbin:/usr/sbin:/usr/kerberos/sbin:/opt/gridengine/bin/lx26-amd64:/opt/gridengine/bin/linux-x64:/home/john/bin:/opt/ganglia/bin:/opt/rocks/bin:/opt/rocks/sbin
> sge_o_shell:/bin/bash
> sge_o_tz:   America/Detroit
> sge_o_workdir:  /misc/research/john/test_runs
> sge_o_host: login-0-2
> account:sge
> cwd:/misc/research/john/test_runs
> merge:  y
> hard resource_list: mem_free=2G
> mail_list:  john@login-0-1.local
> notify: TRUE
> job_name:   test_p27.sh
> priority:   -27
> jobshare:   0
> hard_queue_list:long.q
> shell_list: NONE:/bin/bash
> env_list:   
> script_file:test_p27.sh
> scheduling info:queue instance "long.q@compute-0-48.local" 
> dropped because it is disabled
> queue instance "long.q@compute-0-66.local" 
> dropped because it is disabled
> queue instance "long.q@compute-0-65.local" 
> dropped because it is disabled
> queue instance "long.q@compute-0-20.local" 
> dropped because it is disabled
> queue instance "long.q@compute-0-64.local" 
> dropped because it is disabled
> queue instance "repair.q@compute-0-36.local" 
> dropped because it is disabled
> queue instance "long.q@compute-0-63.local" 
> dropped because it is full
> queue instance "long.q@compute-0-50.local" 
> dropped because it is full
> ...
> queue instance "long.q@compute-0-33.local" 
> dropped because it is full
> queue instance "long.q@compute-0-31.local" 
> dropped because it is full
> queue instance "long.q@compute-0-35.local" 
> dropped because it is full
> queue instance "long.q@compute-0-10.local" 
> dropped because it is full
> queue instance "long.q@compute-0-43.local" 
> dropped because it is full
> queue instance "sh

Re: [gridengine users] Start jobs on exec host in sequential order

2018-08-01 Thread Reuti

> Am 01.08.2018 um 03:06 schrieb Derrick Lin :
> 
> HI Reuti,
> 
> The prolog script is set to run by root indeed. The xfs quota requires root 
> privilege.
> 
> I also tried the 2nd approach but it seems that the addgrpid file has not 
> been created when the prolog script executed:
> 
> /opt/gridengine/default/common/prolog_exec.sh: line 21: 
> /opt/gridengine/default/spool/omega-1-27/active_jobs/1187086.1/addgrpid: No 
> such file or directory

I must admit: I wasn't aware of this. It is essentially only available during 
the execution of the job itself.

But this has the side effect that anything done in a prolog or epilog while it 
runs (especially under the user's account) can't be traced or accounted (or a 
`qdel` might fail). This is somewhat surprising.

Do you set the quota with a shell script or a binary? Another idea could be to 
use a starter_method in the queue configuration. At that point the addgrpid 
exists (I checked it), and you could call a binary with SUID root therein (the 
starter_method will eventually call the user's script and runs under his 
account only). The SUID won't work for scripts, hence the final call to a 
binary with the SUID bit set.

#!/bin/sh
# the additional group ID assigned to this job by SGE:
export ADDGRPID=$(< $SGE_JOB_SPOOL_DIR/addgrpid)

# call some script/binary (SUID root) here to set the quota

exec "${@}"

-- Reuti


> Maybe some of my scheduler conf is not correct?
> 
> Regards,
> Derrick
> 
> On Mon, Jul 30, 2018 at 7:35 PM, Reuti  wrote:
> 
> > Am 30.07.2018 um 02:31 schrieb Derrick Lin :
> > 
> > Hi Reuti,
> > 
> > The approach sounds great.
> > 
> > But the prolog script seems to be run by root, so this is what I got:
> > 
> > XFS_PROJID:uid=0(root) gid=0(root) groups=0(root),396(sfcb)
> 
> This is quite unusual. Do you run the prolog as root by intention? I assume 
> so to set the limits:
> 
> $ qconf -sq my.q
> …prolog/some/script
> 
> Do you have here "root:" to change the user (in the global `qconf -sconf`) 
> under which it is run? Please note that this my open some root doors, 
> depending on environment variable setting. I have here "sgeadmin:" for some 
> special handling and use:
> 
> sgeadmin@/usr/sge/cluster/busybox env -u LD_LIBRARY_PATH -u LD_PRELOAD -u IFS 
> /usr/sge/cluster/context.sh
> 
> Nevertheless: the second approach to get the additional group ID from the 
> job's spool area should work.
> 
> -- Reuti
> 
> 
> > 
> > Maybe I am still missing something or prolog script is the wrong place for 
> > getting the group ID generated by SGE?
> > 
> > Cheers,
> > D
> > 
> > On Sat, Jul 28, 2018 at 11:53 AM, Reuti  wrote:
> > 
> > > Am 28.07.2018 um 03:00 schrieb Derrick Lin :
> > > 
> > > Thanks Reuti,
> > > 
> > > I know little about group ID created by SGE, and also pretty much 
> > > confused with the Linux group ID.
> > 
> > Yes, SGE assigns a conventional group ID to each job to track the CPU and 
> > memory consumption. This group ID is in the range you defined in:
> > 
> > $ qconf -sconf
> > …
> > gid_range    20000-20100
> > 
> > and this will be unique per node. First approach could be either `sed`:
> > 
> > $ id
> > uid=25000(reuti) gid=25000(ourgroup) 
> > groups=25000(ourgroup),10(wheel),1000(operator),20052,24000(common),26000(anothergroup)
> > $ id | sed -e "s/.*),\([0-9]*\),.*/\1/"
> > 20052
> > 
> > or:
> > 
> > ADD_GRP_ID=$(< $SGE_JOB_SPOOL_DIR/addgrpid)
> > echo $ADD_GRP_ID
> > 
> > -- Reuti
> > 
> > 
> > > I assume that "ïd" is called inside the prolog script, typically what the 
> > > output looks like?
> > > 
> > > Cheers,
> > > 
> > > On Fri, Jul 27, 2018 at 4:12 PM, Reuti  wrote:
> > > 
> > > Am 27.07.2018 um 03:14 schrieb Derrick Lin:
> > > 
> > > > We are using $JOB_ID as xfs_projid at the moment, but this approach 
> > > > introduces problem to array jobs whose tasks have the same $JOB_ID 
> > > > (with different $TASK_ID).
> > > > 
> > > > Also it is possible that tasks from two different array jobs run on the 
> > > > same node contain the same $TASK_ID, thus the uniqueness of the 
> > > > $TASK_ID on the same host cannot be maintained.
> > > 
> > > So the number you are looking for needs to be unique per node only?
> > > 
> > > What about using the additional group ID then which SGE creates – this 
> > > will be unique per node.
> >

Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-30 Thread Reuti

> Am 30.07.2018 um 02:31 schrieb Derrick Lin :
> 
> Hi Reuti,
> 
> The approach sounds great.
> 
> But the prolog script seems to be run by root, so this is what I got:
> 
> XFS_PROJID:uid=0(root) gid=0(root) groups=0(root),396(sfcb)

This is quite unusual. Do you run the prolog as root by intention? I assume so 
to set the limits:

$ qconf -sq my.q
…
prolog    /some/script

Do you have here "root:" to change the user (in the global `qconf -sconf`) 
under which it is run? Please note that this my open some root doors, depending 
on environment variable setting. I have here "sgeadmin:" for some special 
handling and use:

sgeadmin@/usr/sge/cluster/busybox env -u LD_LIBRARY_PATH -u LD_PRELOAD -u IFS 
/usr/sge/cluster/context.sh

Nevertheless: the second approach to get the additional group ID from the job's 
spool area should work.

-- Reuti


> 
> Maybe I am still missing something or prolog script is the wrong place for 
> getting the group ID generated by SGE?
> 
> Cheers,
> D
> 
> On Sat, Jul 28, 2018 at 11:53 AM, Reuti  wrote:
> 
> > Am 28.07.2018 um 03:00 schrieb Derrick Lin :
> > 
> > Thanks Reuti,
> > 
> > I know little about group ID created by SGE, and also pretty much confused 
> > with the Linux group ID.
> 
> Yes, SGE assigns a conventional group ID to each job to track the CPU and 
> memory consumption. This group ID is in the range you defined in:
> 
> $ qconf -sconf
> …
> gid_range    20000-20100
> 
> and this will be unique per node. First approach could be either `sed`:
> 
> $ id
> uid=25000(reuti) gid=25000(ourgroup) 
> groups=25000(ourgroup),10(wheel),1000(operator),20052,24000(common),26000(anothergroup)
> $ id | sed -e "s/.*),\([0-9]*\),.*/\1/"
> 20052
> 
> or:
> 
> ADD_GRP_ID=$(< $SGE_JOB_SPOOL_DIR/addgrpid)
> echo $ADD_GRP_ID
> 
> -- Reuti
> 
> 
> > I assume that "ïd" is called inside the prolog script, typically what the 
> > output looks like?
> > 
> > Cheers,
> > 
> > On Fri, Jul 27, 2018 at 4:12 PM, Reuti  wrote:
> > 
> > Am 27.07.2018 um 03:14 schrieb Derrick Lin:
> > 
> > > We are using $JOB_ID as xfs_projid at the moment, but this approach 
> > > introduces problem to array jobs whose tasks have the same $JOB_ID (with 
> > > different $TASK_ID).
> > > 
> > > Also it is possible that tasks from two different array jobs run on the 
> > > same node contain the same $TASK_ID, thus the uniqueness of the $TASK_ID 
> > > on the same host cannot be maintained.
> > 
> > So the number you are looking for needs to be unique per node only?
> > 
> > What about using the additional group ID then which SGE creates – this will 
> > be unique per node.
> > 
> > This can be found in the `id` command's output or in location of the spool 
> > directory for the execd_spool_dir in 
> > ${HOSTNAME}/active_jobs/${JOB_ID}.${TASK_ID}/addgrpid
> > 
> > -- Reuti
> > 
> > 
> > > That's why I am trying to implement the xfs_projid to be independent from 
> > > SGE.
> > > 
> > > 
> > > 
> > > On Thu, Jul 26, 2018 at 9:27 PM, Reuti  wrote:
> > > Hi,
> > > 
> > > > Am 26.07.2018 um 06:01 schrieb Derrick Lin :
> > > > 
> > > > Hi all,
> > > > 
> > > > I am working on a prolog script which setup xfs quota on disk space per 
> > > > job basis.
> > > > 
> > > > For setting up xfs quota in sub directory, I need to provide project ID.
> > > > 
> > > > Here is how I did for generating project ID:
> > > > 
> > > > XFS_PROJID_CF="/tmp/xfs_projid_counter"
> > > > 
> > > > echo $JOB_ID >> $XFS_PROJID_CF
> > > > xfs_projid=$(wc -l < $XFS_PROJID_CF)
> > > 
> > > The xfs_projid is then the number of lines in the file? Why not using 
> > > $JOB_ID directly? Is there a limit in max. project ID and the $JOB_ID 
> > > might be larger?
> > > 
> > > -- Reuti
> > > 
> > > 
> > > > My test shows, when there are multiple jobs start on the same exec host 
> > > > at the same time, the prolog script is executed almost the same time, 
> > > > results multiple jobs share the same xfs_projid, which is no good.
> > > > 
> > > > I am wondering if I can configure the scheduler to start the jobs in a 
> > > > sequential way (probably has a interval in between).
> > > > 
> > > > 
> > > > Cheers,
> > > > Derrick
> > > > ___
> > > > users mailing list
> > > > users@gridengine.org
> > > > https://gridengine.org/mailman/listinfo/users
> > > 
> > > 
> > 
> > 
> 
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-27 Thread Reuti

> Am 28.07.2018 um 03:00 schrieb Derrick Lin :
> 
> Thanks Reuti,
> 
> I know little about group ID created by SGE, and also pretty much confused 
> with the Linux group ID.

Yes, SGE assigns a conventional group ID to each job to track the CPU and 
memory consumption. This group ID is in the range you defined in:

$ qconf -sconf
…
gid_range    20000-20100

and this will be unique per node. First approach could be either `sed`:

$ id
uid=25000(reuti) gid=25000(ourgroup) 
groups=25000(ourgroup),10(wheel),1000(operator),20052,24000(common),26000(anothergroup)
$ id | sed -e "s/.*),\([0-9]*\),.*/\1/"
20052

or:

ADD_GRP_ID=$(< $SGE_JOB_SPOOL_DIR/addgrpid)
echo $ADD_GRP_ID

-- Reuti


> I assume that "ïd" is called inside the prolog script, typically what the 
> output looks like?
> 
> Cheers,
> 
> On Fri, Jul 27, 2018 at 4:12 PM, Reuti  wrote:
> 
> Am 27.07.2018 um 03:14 schrieb Derrick Lin:
> 
> > We are using $JOB_ID as xfs_projid at the moment, but this approach 
> > introduces problem to array jobs whose tasks have the same $JOB_ID (with 
> > different $TASK_ID).
> > 
> > Also it is possible that tasks from two different array jobs run on the 
> > same node contain the same $TASK_ID, thus the uniqueness of the $TASK_ID on 
> > the same host cannot be maintained.
> 
> So the number you are looking for needs to be unique per node only?
> 
> What about using the additional group ID then which SGE creates – this will 
> be unique per node.
> 
> This can be found in the `id` command's output or in location of the spool 
> directory for the execd_spool_dir in 
> ${HOSTNAME}/active_jobs/${JOB_ID}.${TASK_ID}/addgrpid
> 
> -- Reuti
> 
> 
> > That's why I am trying to implement the xfs_projid to be independent from 
> > SGE.
> > 
> > 
> > 
> > On Thu, Jul 26, 2018 at 9:27 PM, Reuti  wrote:
> > Hi,
> > 
> > > Am 26.07.2018 um 06:01 schrieb Derrick Lin :
> > > 
> > > Hi all,
> > > 
> > > I am working on a prolog script which setup xfs quota on disk space per 
> > > job basis.
> > > 
> > > For setting up xfs quota in sub directory, I need to provide project ID.
> > > 
> > > Here is how I did for generating project ID:
> > > 
> > > XFS_PROJID_CF="/tmp/xfs_projid_counter"
> > > 
> > > echo $JOB_ID >> $XFS_PROJID_CF
> > > xfs_projid=$(wc -l < $XFS_PROJID_CF)
> > 
> > The xfs_projid is then the number of lines in the file? Why not using 
> > $JOB_ID directly? Is there a limit in max. project ID and the $JOB_ID might 
> > be larger?
> > 
> > -- Reuti
> > 
> > 
> > > My test shows, when there are multiple jobs start on the same exec host 
> > > at the same time, the prolog script is executed almost the same time, 
> > > results multiple jobs share the same xfs_projid, which is no good.
> > > 
> > > I am wondering if I can configure the scheduler to start the jobs in a 
> > > sequential way (probably has a interval in between).
> > > 
> > > 
> > > Cheers,
> > > Derrick
> > > ___
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> > 
> > 
> 
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-26 Thread Reuti


Am 27.07.2018 um 03:14 schrieb Derrick Lin:

> We are using $JOB_ID as xfs_projid at the moment, but this approach 
> introduces problem to array jobs whose tasks have the same $JOB_ID (with 
> different $TASK_ID).
> 
> Also it is possible that tasks from two different array jobs run on the same 
> node contain the same $TASK_ID, thus the uniqueness of the $TASK_ID on the 
> same host cannot be maintained.

So the number you are looking for needs to be unique per node only?

What about using the additional group ID then which SGE creates – this will be 
unique per node.

This can be found in the `id` command's output, or below the execd_spool_dir 
of the node in 
${HOSTNAME}/active_jobs/${JOB_ID}.${TASK_ID}/addgrpid

-- Reuti


> That's why I am trying to implement the xfs_projid to be independent from SGE.
> 
> 
> 
> On Thu, Jul 26, 2018 at 9:27 PM, Reuti  wrote:
> Hi,
> 
> > Am 26.07.2018 um 06:01 schrieb Derrick Lin :
> > 
> > Hi all,
> > 
> > I am working on a prolog script which setup xfs quota on disk space per job 
> > basis.
> > 
> > For setting up xfs quota in sub directory, I need to provide project ID.
> > 
> > Here is how I did for generating project ID:
> > 
> > XFS_PROJID_CF="/tmp/xfs_projid_counter"
> > 
> > echo $JOB_ID >> $XFS_PROJID_CF
> > xfs_projid=$(wc -l < $XFS_PROJID_CF)
> 
> The xfs_projid is then the number of lines in the file? Why not using $JOB_ID 
> directly? Is there a limit in max. project ID and the $JOB_ID might be larger?
> 
> -- Reuti
> 
> 
> > My test shows, when there are multiple jobs start on the same exec host at 
> > the same time, the prolog script is executed almost the same time, results 
> > multiple jobs share the same xfs_projid, which is no good.
> > 
> > I am wondering if I can configure the scheduler to start the jobs in a 
> > sequential way (probably has a interval in between).
> > 
> > 
> > Cheers,
> > Derrick
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 
> 


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Start jobs on exec host in sequential order

2018-07-26 Thread Reuti
Hi,

> Am 26.07.2018 um 06:01 schrieb Derrick Lin :
> 
> Hi all,
> 
> I am working on a prolog script which setup xfs quota on disk space per job 
> basis.
> 
> For setting up xfs quota in sub directory, I need to provide project ID.
> 
> Here is how I did for generating project ID:
> 
> XFS_PROJID_CF="/tmp/xfs_projid_counter"
> 
> echo $JOB_ID >> $XFS_PROJID_CF
> xfs_projid=$(wc -l < $XFS_PROJID_CF)

The xfs_projid is then the number of lines in the file? Why not use $JOB_ID 
directly? Is there a limit on the maximum project ID which the $JOB_ID might 
exceed?

-- Reuti


> My test shows, when there are multiple jobs start on the same exec host at 
> the same time, the prolog script is executed almost the same time, results 
> multiple jobs share the same xfs_projid, which is no good.
> 
> I am wondering if I can configure the scheduler to start the jobs in a 
> sequential way (probably has a interval in between).
> 
> 
> Cheers,
> Derrick
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] running failed array jobs in UGE

2018-06-12 Thread Reuti
Hi,

> Am 12.06.2018 um 18:11 schrieb VG :
> 
> Hi Everyone,
> I submitted 200 array jobs on the cluster using -t option. My command line 
> looks like this:
> 
> qsub -t 1-200 -cwd -j y -b y -N jobs -l h_vmem=30G ./script.sh
> 
> After this, job numbered 3,10,45,56,98,134 failed to finish.
> 
> What can I do to only run the failed job now? Can I use -t option in anyway 
> or do I have to submit it one by one?

You have to submit them one by one, possibly in a `for` loop, but you can at 
least use -t with a single number to specify the index to be rerun.
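
E.g. with the indices you listed:

for i in 3 10 45 56 98 134; do
    qsub -t $i -cwd -j y -b y -N jobs -l h_vmem=30G ./script.sh
done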

-- Reuti


> 
> Thanks
> 
> Regards
> Varun
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Automatic job rescheduling. Only one rescheduling is happening

2018-06-11 Thread Reuti


> Am 11.06.2018 um 18:43 schrieb Ilya M <4ilya.m+g...@gmail.com>:
> 
> Hello,
> 
> Thank you for the suggestion, Reuti. Not sure if my users' pipelines can deal 
> with multiple job ids, perhaps they will be willing to modify their code.

Other commands in SGE like `qdel` also allow using the job name to deal with 
such a configuration.


> On Mon, Jun 11, 2018 at 9:23 AM, Reuti  wrote:
> Hi,
> 
> 
> I wouldn't be surprised if the execd remembers that the job was already 
> warned, hence it must be the hard limit now. Would your workflow allow:
> 
> This is happening on different nodes, so each execd cannot know any history 
> by itself, the master must be providing this information.

Aha, you are correct.

-- Reuti


> Can't help wondering if this is a configurable option.
> 
> Ilya.
> 
> 
>  
> . /usr/sge/default/common/settings.sh
> trap "qresub $JOB_ID; exit 4;" SIGUSR1
> 
> Well, you get several job numbers this way. For the accounting with `qacct` 
> you could use the job name instead of the job number to get all the runs 
> listed though.
> 
> -- Reuti
> 
> 
> > This is my test script:
> > 
> > #!/bin/bash
> > 
> > #$ -S /bin/bash
> > #$ -l s_rt=0:0:5,h_rt=0:0:10
> > #$ -j y
> > 
> > set -x
> > set -e
> > set -o pipefail
> > set -u
> > 
> > trap "exit 99" SIGUSR1
> > 
> > trap "exit 2" SIGTERM
> > 
> > echo "hello world"
> > 
> > sleep 15
> > 
> > It should reschedule itself indefinitely when s_rt lapses. Yet, what is 
> > happening is that rescheduling happens only once. On the second run the job 
> > receives only SIGTERM and exits. Here is the script's output:
> > 
> > node140
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > User defined signal 1
> > ++ exit 99
> > node069
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > Terminated
> > ++ exit 2
> > 
> > Execd logs confirms that for the second time the jobs was killed for 
> > exceeding h_rt:
> > 
> > 06/08/2018 21:20:15|  main|node140|W|job 8030395.1 exceeded soft wallclock 
> > time - initiate soft notify method
> > 06/08/2018 21:20:59|  main|node140|E|shepherd of job 8030395.1 exited with 
> > exit status = 25
> > 
> > 06/08/2018 21:21:45|  main|node069|W|job 8030395.1 exceeded hard wallclock 
> > time - initiate terminate method
> > 
> > And here is the accounting information:
> > 
> > ==
> > qnameshort.q 
> > hostname node140
> > groupeveryone
> > ownerilya
> > project  project.p  
> > department   defaultdepartment   
> > jobname  reshed_test.sh  
> > jobnumber8030395 
> > taskid   undefined
> > account  sge 
> > priority 0   
> > qsub_timeFri Jun  8 21:19:40 2018
> > start_time   Fri Jun  8 21:20:09 2018
> > end_time Fri Jun  8 21:20:15 2018
> > granted_pe   NONE
> > slots1   
> > failed   25  : rescheduling
> > exit_status  99  
> > ru_wallclock 6
> > ...
> > ==
> > qnameshort.q 
> > hostname node069
> > groupeveryone
> > ownerilya
> > project  project.p  
> > department   defaultdepartment   
> > jobname  reshed_test.sh  
> > jobnumber8030395 
> > taskid   undefined
> > account  sge 
> > priority 0   
> > qsub_timeFri Jun  8 21:19:40 2018
> > start_time   Fri Jun  8 21:21:39 2018
> > end_time Fri Jun  8 21:21:50 2018
> > granted_pe   NONE
> > slots1   
> > failed   0
> > exit_status  2   
> > ru_wallclock 11   
> > ...
> > 
> > 
> > Is there anything in the configuration I could be missing. Running 6.2u5.
> > 
> > Thank you,
> > Ilya.
> > 
> > ___
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> 
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Automatic job rescheduling. Only one rescheduling is happening

2018-06-11 Thread Reuti
Hi,

> Am 08.06.2018 um 23:46 schrieb Ilya M <4ilya.m+g...@gmail.com>:
> 
> Hello,
> 
> I found an unexpected behavior when setting a hard and soft time limits and 
> doing automatic rescheduling by SIGUSR1.

I wouldn't be surprised if the execd remembers that the job was already warned, 
hence it must be the hard limit now. Would your workflow allow:

. /usr/sge/default/common/settings.sh
trap "qresub $JOB_ID; exit 4;" SIGUSR1

Well, you get several job numbers this way. For the accounting with `qacct` you 
could use the job name instead of the job number to get all the runs listed 
though.
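
E.g. for the test job below it would be:

qacct -j reshed_test.sh

which lists the accounting records of all runs under that job name.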

-- Reuti


> This is my test script:
> 
> #!/bin/bash
> 
> #$ -S /bin/bash
> #$ -l s_rt=0:0:5,h_rt=0:0:10
> #$ -j y
> 
> set -x
> set -e
> set -o pipefail
> set -u
> 
> trap "exit 99" SIGUSR1
> 
> trap "exit 2" SIGTERM
> 
> echo "hello world"
> 
> sleep 15
> 
> It should reschedule itself indefinitely when s_rt lapses. Yet, what is 
> happening is that rescheduling happens only once. On the second run the job 
> receives only SIGTERM and exits. Here is the script's output:
> 
> node140
> + set -e
> + set -o pipefail
> + set -u
> + trap 'exit 99' SIGUSR1
> + trap 'exit 2' SIGTERM
> + echo 'hello world'
> hello world
> + sleep 15
> User defined signal 1
> ++ exit 99
> node069
> + set -e
> + set -o pipefail
> + set -u
> + trap 'exit 99' SIGUSR1
> + trap 'exit 2' SIGTERM
> + echo 'hello world'
> hello world
> + sleep 15
> Terminated
> ++ exit 2
> 
> Execd logs confirms that for the second time the jobs was killed for 
> exceeding h_rt:
> 
> 06/08/2018 21:20:15|  main|node140|W|job 8030395.1 exceeded soft wallclock 
> time - initiate soft notify method
> 06/08/2018 21:20:59|  main|node140|E|shepherd of job 8030395.1 exited with 
> exit status = 25
> 
> 06/08/2018 21:21:45|  main|node069|W|job 8030395.1 exceeded hard wallclock 
> time - initiate terminate method
> 
> And here is the accounting information:
> 
> ==
> qnameshort.q 
> hostname node140
> groupeveryone
> ownerilya
> project  project.p  
> department   defaultdepartment   
> jobname  reshed_test.sh  
> jobnumber8030395 
> taskid   undefined
> account  sge 
> priority 0   
> qsub_timeFri Jun  8 21:19:40 2018
> start_time   Fri Jun  8 21:20:09 2018
> end_time Fri Jun  8 21:20:15 2018
> granted_pe   NONE
> slots1   
> failed   25  : rescheduling
> exit_status  99  
> ru_wallclock 6
> ...
> ==
> qnameshort.q 
> hostname node069
> groupeveryone
> ownerilya
> project  project.p  
> department   defaultdepartment   
> jobname  reshed_test.sh  
> jobnumber8030395 
> taskid   undefined
> account  sge 
> priority 0   
> qsub_timeFri Jun  8 21:19:40 2018
> start_time   Fri Jun  8 21:21:39 2018
> end_time Fri Jun  8 21:21:50 2018
> granted_pe   NONE
> slots1   
> failed   0
> exit_status  2   
> ru_wallclock 11   
> ...
> 
> 
> Is there anything in the configuration I could be missing. Running 6.2u5.
> 
> Thank you,
> Ilya.
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] How to force project base jobs to be scheduled only to project's queue ?

2018-06-11 Thread Reuti
Hi,

> Am 11.06.2018 um 12:04 schrieb Jakub pl :
> 
> Dear all,
> 
> I have a problem with my grid engine. I have a queue with an assigned project 
> name. When I submit a job with -P and the project name, and there are free 
> slots in the queue, the job is scheduled as expected. But when there are no 
> free slots, the job - instead of waiting - is scheduled to the default queue 
> (with no project assigned). Is this normal? How can I prevent this?

You could add this project (or all of them) to the "xprojects" entry in the 
default queue's definition. Without any projects in "projects" or "xprojects", 
any project may run there. Have a look at `man queue_conf`, section "projects".
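
For example (an untested sketch; "myproject" and "all.q" are placeholders for 
your project and default queue names):

qconf -mattr queue xprojects myproject all.q

(`-mattr` overwrites the xprojects list; `-aattr` would add to an existing 
one.) Afterwards jobs submitted with `-P myproject` can no longer spill over 
into all.q and will wait for a free slot in the project's own queue.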

-- Reuti


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Scheduling maintenance and using advance reservation

2018-06-06 Thread Reuti


On 06.06.2018 at 23:37, Ilya M wrote:

> Thank you, Mark.
> My hosts belong to multiple queues, but all queues have the same projects. So 
> perhaps I can just remove all projects from queues' configuration.

Often a calendar is also used to disable certain queues. Once the calendar is 
defined, it can even be attached to a queue on the command line with `qconf 
-aattr queue calendar wartung common` where "wartung" is the name of the 
calendar and "common" the addressed queue. 

In your case, admittedly, for the sake of simplicity you would need a hostgroup 
containing all GPU nodes:

qconf -mattr queue calendar wartung common@@myGPUgroup
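
If such a hostgroup does not exist yet, it could be created along these lines 
(sketch; the node names are placeholders):

$ qconf -ahgrp @myGPUgroup      # opens an editor
group_name @myGPUgroup
hostlist gpunode01 gpunode02

Note the double "@@" above: queue instances are addressed as queue@host or 
queue@@hostgroup, and the hostgroup name itself already starts with "@".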

-- Reuti


> Ilya.
> 
> 
> On Wed, Jun 6, 2018 at 2:41 AM, Mark Dixon  wrote:
> On Tue, 5 Jun 2018, Ilya M wrote:
> ...
> Is there a way to submit AR when there are projects attached to queues? I am 
> using SGE 6.2u5.
> ...
> 
> Hi Ilya,
> 
> I've run into this, too: I'm afraid that there isn't. I logged it here:
> 
> https://arc.liv.ac.uk/trac/SGE/ticket/1466
> 
> I started to fix it but ran out of time. I ended up reproducing the 
> queue/project functionality in a way that works with ARs by:
> 
> 0) Removed the queue's project setting
> 1) Added the project's ACL to the queue's user_lists
> 2) Added a new complex called 'project'
> 3) On every host, set a default value for complex 'project'
> 4) On your gpu nodes, set an alternative value for complex 'project'
> 5) Added "-l project=" to $SGE_ROOT/$SGE_CELL/common/sge_request
> 6) Used a JSV to rewrite the job to change its project complex request
>    if the job requests particular projects.
> 
> It's a bit of a faff, but it means you can do it without fixing/upgrading 
> your scheduler. Only works if each host only belongs to one queue.
> 
> Does that help?
> 
> Mark
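
(A rough, untested sketch of steps 2)-5) above, with purely illustrative names 
and values -- Mark's actual settings are not shown in the thread:)

# 2) add a requestable, non-consumable STRING complex via `qconf -mc`,
#    i.e. a line such as:
#      project   pr   STRING   ==   YES   NO   NONE   0
# 3)/4) give every host a default value and the GPU nodes a different one
qconf -aattr exechost complex_values project=general node001
qconf -aattr exechost complex_values project=gpu gpunode001
# 5) request the default value for all jobs in
#    $SGE_ROOT/$SGE_CELL/common/sge_request:
#      -l project=general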
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] SGE accounting file getting too big...

2018-05-18 Thread Reuti

> On 18.05.2018 at 17:23, Noel Benitez wrote:
> 
> Hi guys,
>  
> The "accounting" file on our sge master has a filesize of 20Gb.
>  
> Is there a recommended way of purging this file short of using "cat /dev/null 
> > accounting"  ?

Yes:

: > accounting

Jokes aside, you can either use logrotate or the supplied script:

$SGE_ROOT/logchecker.sh

Usually I copy it to /usr/sge/default/common/ to make my changes and then use 
cron jobs on the nodes and the qmaster.

qmaster:

$ cat /etc/cron.d/gridengine
#
# Special cron-job to reduce the size of the SGE messages/accounting files.
#
5 1 * * 0   sgeadmin   /usr/sge/default/common/logchecker.sh -action_on 1

nodes:

$ cat /etc/cron.d/gridware
5 1 * * 0   sgeadmin   /usr/sge/default/common/logchecker.sh -action_on 2 -execd_spool /var/spool/sge

The defaults in the script itself I set to:

UNCONFIGURED=no
ACTION_ON=2
ACTIONSIZE=1024
KEEPOLD=3

Note: to read old accounting files in `qacct` on-the-fly you can use:

$ qacct -o reuti -f <(zcat /usr/sge/default/common/accounting.1.gz)
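
If logrotate is preferred, a minimal sketch could be (assuming $SGE_ROOT is 
/usr/sge with the default cell; "copytruncate" lets the qmaster keep appending 
while the file is rotated):

/usr/sge/default/common/accounting {
    weekly
    rotate 4
    compress
    copytruncate
    missingok
}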

-- Reuti


>  
> Thanks for any help.
>  
>  
>  
>  
>  
> -Noel Benitez, Salk iT Dept.
>  
>  
>  
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users

