Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Chris Dagdigian


The root cause was strange so it's worth documenting here ...

I had created a new consumable and requestable resource called "gpu" 
configured like this:


gpu gpu    INT   <=    YES YES    NONE    0

And on host A I had set "complex_values gpu=1" and on host B I set 
"complex_values gpu=2" etc. etc. across the cluster.


My mistake was setting the default value of the new complex entry to 
"NONE" instead of "0", which is what you probably want when the 
attribute is of type INT.


But this was bizarre: basically I had a bad default value for a 
requestable resource, and as soon as we set that value down at the 
execution host level it instantly broke all of our parallel 
environments. The SGE scheduler was treating my mistake as if I had 
created a requestable resource of type FORCED or something.
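For anyone hitting the same thing, a sketch of the fix (adjust the attribute name to your site; `qconf -mc` opens the complex list in an editor):

```shell
# Inspect the current definition of the gpu complex
qconf -sc | grep '^gpu'

# Edit the complex list and change the "default" column for the gpu
# attribute from NONE to 0, so the line ends up roughly as:
#
#   gpu   gpu   INT   <=   YES   YES   0   0
#
qconf -mc
```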


Strange but resolved now.

Regards
Chris




Reuti wrote on 6/11/20 4:17 PM:

Hi,

Any consumables in place like memory or other resource requests? Any output of `qalter -w 
v …` or "-w p"?

-- Reuti



On 11.06.2020 at 20:32, Chris Dagdigian wrote:

Hi folks,

Got a bewildering situation I've never seen before with simple SMP/threaded PE 
techniques

I made a brand new PE called threaded:

$ qconf -sp threaded
pe_name            threaded
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE


And I attached that to all.q on an IDLE grid and submitted a job with the 
'-pe threaded 1' argument

However all "qstat -j" data is showing this scheduler decision line:

cannot run in PE "threaded" because it only offers 0 slots


I'm sort of lost on how to debug this because I can't figure out how to probe where SGE 
keeps track of PE-specific slots. With other resources I can look at complex_values 
reported by execution hosts, or I can use the "-F" argument to qstat to dump the live 
state and status of a requestable resource, but I don't really have any debug or 
troubleshooting ideas for "how to figure out why SGE thinks there are 0 slots when the 
static PE on an idle cluster has been set to contain 999 slots"

Anyone seen something like this before?  I don't think I've ever seen this 
particular issue with an SGE parallel environment before ...
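For anyone debugging a similar "offers 0 slots" message, a few standard SGE starting points (command names are stock Grid Engine; job id is a placeholder):

```shell
# Confirm the PE is actually listed in the queue's pe_list
qconf -sq all.q | grep pe_list

# Ask the scheduler to explain its placement decision for a pending job
qalter -w p <job_id>   # "poke": validate against the current cluster state
qalter -w v <job_id>   # "verify": validate against an idealized empty cluster

# Trigger a monitored scheduling run; the reasoning lands in schedd_runlog
qconf -tsm
```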


Chris

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users




[gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Chris Dagdigian

Hi folks,

Got a bewildering situation I've never seen before with simple 
SMP/threaded PE techniques


I made a brand new PE called threaded:

$ qconf -sp threaded
pe_name            threaded
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE


And I attached that to all.q on an IDLE grid and submitted a job with 
the '-pe threaded 1' argument


However all "qstat -j" data is showing this scheduler decision line:

cannot run in PE "threaded" because it only offers 0 slots


I'm sort of lost on how to debug this because I can't figure out how to 
probe where SGE keeps track of PE-specific slots. With other resources 
I can look at complex_values reported by execution hosts, or I can use an 
"-F" argument to qstat to dump the live state and status of a 
requestable resource, but I don't really have any debug or 
troubleshooting ideas for "how to figure out why SGE thinks there are 0 
slots when the static PE on an idle cluster has been set to contain 999 
slots"


Anyone seen something like this before?  I don't think I've ever seen 
this particular issue with an SGE parallel environment before ...



Chris



Re: [gridengine users] Alternatives to Son of GridEngine

2018-11-12 Thread Chris Dagdigian

My $.02

The commercial version of GE from univa is excellent. I'm working with 
it now. New features and excellent support as always


For non-GE options the trend seems to be moving towards Slurm -- at 
least from what I can see in my particular industry niche


Chris



Taras Shapovalov 
November 12, 2018 at 12:41 PM
Hi Daniel,

There are 4 alternatives remaining: Slurm, UGE, PBS Pro and LSF. They are 
all pretty similar and do their job pretty well.


Best regards,

Taras



Daniel Povey 
November 12, 2018 at 12:05 PM
Everyone,
I'm trying to understand the landscape of alternatives to Son of 
GridEngine, since the maintenance situation isn't great right now and 
I'm not sure that it has a long term future.
If you guys were to switch to something in the same universe of 
products, what would it be to?  Univa GridEngine?  slurm?  Which of 
these, as far as you know, is better maintained and has a better future?
I'm not interested in fancy new things like mesos that have a 
different programming model or are too new.


Dan






Re: [gridengine users] Son of GridEngine succession?

2018-05-12 Thread Chris Dagdigian
+1 for both the idea as well as the DOGE name; "Daughter of Grid Engine" 
is pretty awesome.



Simon Matthews 
May 11, 2018 at 10:14 PM
You could call it "the DOGE" (Daughter of Grid Engine).

Simon
Daniel Povey 
May 11, 2018 at 6:49 PM
Everyone,

I want to start a discussion about how to replace Son of GridEngine.
As far as I can tell, Dave Love has had no online activity for a year,
is not responding to emails, and my attempts to contact him indirectly
via his workplace have come to nothing. Even if he is still alive, I
think it's clear that he's either unwilling or unable to continue to
maintain the Son of GridEngine project.

I am thinking we could create a repository on GitHub to replace the
Liverpool-hosted Son of GridEngine project? Maybe call it Grandson of
GridEngine? What do you think?

I know there are people who have patches to contribute. I do myself.

Dan




Re: [gridengine users] Scheduler node rebooted - what happens to running jobs?

2018-04-20 Thread Chris Dagdigian
Running jobs continue to run. The only jobs affected would be any running 
on the master node itself, or jobs that depend on a critical resource the 
master node provides, such as an NFS file share that the running jobs need. 
If the master node simply bounces and nothing on that host is 
required for the jobs, then they'll keep on running. The jobs will finish 
as expected, and the sge_execd and shepherd daemons will hang out until 
they can report job status back to the qmaster when it comes online again





Noel Benitez 
April 20, 2018 at 10:33 AM
Hi all.
If the master scheduler node is shut down or gets rebooted,  are 
currently running jobs on the other nodes affected at all? Or will 
they simply continue running?

Thanks for any info.
NBS






Re: [gridengine users] SGE 8.1.9. email notification

2018-03-20 Thread Chris Dagdigian
Does /bin/mail exist? On a lot of systems it may actually be /usr/bin/mail 
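A quick sanity check along those lines (paths vary by distro; the recipient address below is a placeholder):

```shell
# See which mail binary actually exists and where it lives
ls -l /bin/mail /usr/bin/mail 2>/dev/null

# Test the exact mailer SGE is configured to use, outside of SGE,
# to rule out a mail-relay problem before blaming Grid Engine
echo "test body" | /bin/mail -s "SGE mailer test" you@example.com
```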

Regards,
Chris

/* Sent via phone - apologies for typos & terseness */


> On Mar 20, 2018, at 11:31 AM, Peter Sigl  wrote:
> 
> Hi All,
> 
> I am trying to receive email notification for job start and job completion 
> with SGE.
> 
> I added the lines to the runjob.sh file
> 
> #$ -M u...@domain.de
> #$ -m beas
> 
> However, I did not receive an email.
> 
> Also, I get an email from qdel about canceling the job from root:
> Job 31 (ps-ser1-out) was killed by psigl@obelix.local
>  
> On the command line with mail -s I get emails.
> 
> In SGE 8.1.9, /bin/mail is configured under the cluster configuration.
> 
> #global:
> execd_spool_dir  /opt/gridengine/default/spool
> mailer   /bin/mail
> xterm/usr/bin/xterm
> load_sensor  none
> prolog   none
> epilog   none
> shell_start_mode posix_compliant
> login_shells sh,bash,ksh,csh,tcsh
> min_uid  0
> min_gid  0
> user_lists   none
> xuser_lists  none
> projects none
> xprojectsnone
> enforce_project  false
> enforce_user auto
> load_report_time 00:00:40
> max_unheard  00:05:00
> reschedule_unknown   01:00:00
> loglevel log_warning
> administrator_mail   none
> set_token_cmdnone
> pag_cmd  none
> token_extend_timenone
> shepherd_cmd none
> qmaster_params   none
> execd_params none
> reporting_params accounting=true reporting=true \
>  flush_time=00:00:15 joblog=true sharelog=00:00:00
> finished_jobs100
> gid_range2-20100
> qlogin_command   builtin
> qlogin_daemonbuiltin
> rlogin_command   /usr/bin/ssh
> rlogin_daemonbuiltin
> rsh_command  /usr/bin/ssh
> rsh_daemon   builtin
> max_aj_instances 2000
> max_aj_tasks 75000
> max_u_jobs   0
> max_jobs 0
> max_advance_reservations 0
> auto_user_oticket0
> auto_user_fshare 0
> auto_user_default_projectnone
> auto_user_delete_time86400
> delegated_file_staging   false
> reprioritize false
> jsv_url  none
> jsv_allowed_mod  ac,h,i,e,o,j,M,N,p,w
> 
> There is a link in Centos 7 set to mailx.
> I think under SGE 6.2 postfix was used.
> Is not that the case in Rocks 7 / SGE?
> 
> I configured the relayhost under /etc/postfix/main.cf.
> 
> 
> Is it necessary to configure something else before attempting to receive 
> these emails?
> 
> 
> 
> 
> Peter Sigl
> 


Re: [gridengine users] RUNNING GROMACS SIMULATIONS THROUGH SCRIPT FILE

2017-08-10 Thread Chris Dagdigian


Simply put, the system can't find your gromacs binary; that is what 
"gmx: command not found" means.


Just edit your submit script to pass the full path to gmx and you should 
be fine
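A sketch of the fixed submit script. The gmx path shown is an assumption (running `which gmx` in a working interactive shell will reveal the real location on your cluster):

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N smp1
#$ -l h_vmem=1G

# Either call the binary by its full path (example path, adjust to taste) ...
/usr/local/gromacs/bin/gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm eq

# ... or source the GROMACS environment script first, if GMXRC is installed:
# source /usr/local/gromacs/bin/GMXRC
# gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm eq
```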


-Chris



Subashini K 
August 10, 2017 at 7:59 AM
Hi sun grid engine users,

I am new to scripting.

I want to run GROMACS MD simulations in sun grid engine through qsub 
command.


#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N smp1
#$ -l h_vmem=1G
gmx mdrun -ntmpi 1 -ntomp 8 -v -deffnm eq


The above contents were in submit.sh file.

When I gave, qsub submit.sh,

I got the following error

/opt/gridengine/default/spool/compute-0-31/job_scripts/4423509: line 
8: gmx: command not found



What am I supposed to do? I intend to do single processor serial job.

How to rectify it?


Thanks,
Subashini.K




Re: [gridengine users] new error I've never seen before! ("sge_shepherd won't run -- dynamic library missing?")

2017-08-09 Thread Chris Dagdigian


To answer my own question:


/opt/sge/bin/lx-amd64/sge_shepherd
/opt/sge/bin/lx-amd64/sge_shepherd: error while loading shared 
libraries: libhwloc.so.5: cannot open shared object file: No such file 
or directory




The answer is the hwloc-1.5-3.el6_5.x86_64 RPM ...
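For anyone else hitting this, the usual diagnose-and-fix sequence (package names vary by distro and version; CentOS/RHEL 6 shown):

```shell
# Identify exactly which shared libraries the binary can't resolve
ldd /opt/sge/bin/lx-amd64/sge_shepherd | grep 'not found'

# Find which package provides the missing library, then install it
yum provides '*/libhwloc.so.5'
yum install -y hwloc
```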

Chris





Chris Dagdigian
August 9, 2017 at 12:47 PM
Sorry this is exciting, I've been using SGE forever and rarely see 
something new.


However I see this now on AWS trying to run a new exechost on a centos 
box that I bound to an aws cfncluster grid:


Starting execution daemon. Please wait ...
sge_shepherd won't run -- dynamic library missing?


Anyone seen this before? Google is no help as to what library may 
actually be missing ...



Chris






[gridengine users] new error I've never seen before! ("sge_shepherd won't run -- dynamic library missing?")

2017-08-09 Thread Chris Dagdigian
Sorry this is exciting, I've been using SGE forever and rarely see 
something new.


However I see this now on AWS trying to run a new exechost on a centos 
box that I bound to an aws cfncluster grid:


Starting execution daemon. Please wait ...
sge_shepherd won't run -- dynamic library missing?


Anyone seen this before? Google is no help as to what library may 
actually be missing ...



Chris




Re: [gridengine users] (resend) dealing with AD usernames that contain "@" character

2017-08-02 Thread Chris Dagdigian


Yeah short names are guaranteed unique in my environment.  The new patch 
for SSSD allows one to define an AD domain search/preference order and I 
think the implication there is that if a dupe shortname is detected it 
will assume that the shortname belongs to the 1st domain listed in the 
ordering.


I'm learning far more SSSD subsystem stuff than I want to with this 
SGE/HPC + AD project!


This is one of those massive global companies where you will be laughed 
out of the room if you propose a schema change or something like SFU in 
their primary domain controller environment. It takes days to get one of 
the AD gurus to even agree to a phone call, heh.


So all of our AD integration is via a RHEL IDM server aka  Free-IPA 
master that has a 1-way trust to the top domain of COMPANY.COM. The 
1-way trust allows Free-IPA and RHEL IDM to traverse the transitive 
trust relationships to resolve and enumerate users and groups who are in 
child domains like NAFTA.COMPANY.COM  and EAME.COMPANY.COM etc.


-dag


Ian Kaufman wrote:
If you support multiple domains, are you able to guarantee unique 
short names? It seems to me that could be a problem. If it is a case 
of multiple AD domains, but all coming form the same entity, thus 
guaranteeing unique short names, why not see if Services for UNIX is 
enabled in the domain, and use LDAP to query against it?


Ian





Re: [gridengine users] (resend) dealing with AD usernames that contain "@" character

2017-08-02 Thread Chris Dagdigian


Thanks Reuti!

I can't use the trick in that tip because we have more than one AD 
domain to support and that "default_ad_domain_suffix=" setting only 
works for one AD domain


The real solution is for us to wait for the next SSSD patch to come out 
- they've added features that should allow universal short names coming 
from any AD domain, transitive trust or child domain.


The current plan for now is to make local accounts that match the AD 
short name while stealing the UID and GID values from the remote AD 
integration server. We'll run that way until the SSSD patch shows up in 
the various Linux repos
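A sketch of that interim workaround, with placeholder names and UID/GID values (assumes a host where SSSD/IPA resolution already works, so getent can report the IDs that AD integration assigns):

```shell
# On a host with working AD/IPA resolution, look up the UID and GID
# that the integration layer assigns to the user (placeholder name)
getent passwd shortname@child.company.com

# On the SGE hosts, create a matching local account that reuses those
# numeric IDs, so file ownership and SGE accounting stay consistent
groupadd -g 1234567 shortname
useradd -u 1234567 -g 1234567 -m shortname
```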


-Chris



Reuti wrote:

A similar question was already on the list before. IMO it's not a valid user 
name in Linux and doesn't conform to POSIX, where only certain characters are 
allowed. There was this hint:

http://arc.liv.ac.uk/pipermail/gridengine-users/2010-August/031881.html

-- Reuti




[gridengine users] (resend) dealing with AD usernames that contain "@" character

2017-08-01 Thread Chris Dagdigian


oops. Sent last email in HTML format which likely got stripped. 
Resending 



Hi folks,

Has anyone used FreeIPA or RHEL IDM to integrate an SGE cluster into a 
complex active directory environment?


I've got an issue where the AD integration is working fine across a 
pretty complex set of Active Directory domains and transitive trusts but 
the structure of our AD usernames is utterly breaking SGE ...


Example:

On an IDM integrated host my username is really long:

 $ whoami
 ux...@nafta.company.org

And the "@" character utterly freaks out qlogin, qrsh and qsub ...

$ qsub -cwd -N test /opt/sge/examples/jobs/simple.sh
Unable to run job: At ('@') not allowed in objectname
Exiting.
[ux...@nafta.company.org@usae-hpc ~]$


Has anyone dealt with this before?   This is a new one for me!

-Chris




[gridengine users] Dealing with AD integration and usernames that contain "@" ...

2017-08-01 Thread Chris Dagdigian

  
Hi folks,

Has anyone used FreeIPA or RHEL IDM to integrate an SGE cluster into a 
complex active directory environment?

I've got an issue where the AD integration is working fine across a 
pretty complex set of Active Directory domains and transitive trusts but 
the structure of our AD usernames is utterly breaking SGE ...

Example:

On an IDM integrated host my username is really long:

  $ whoami
  ux...@nafta.company.org

And the "@" character utterly freaks out qlogin, qrsh and qsub ...

  $ qsub -cwd -N test /opt/sge/examples/jobs/simple.sh
  Unable to run job: At ('@') not allowed in objectname
  Exiting.
  [ux...@nafta.company.org@usae-hpc ~]$

Has anyone dealt with this before?   This is a new one for me!

-Chris




Re: [gridengine users] move SoGE from berkeleydb to classic

2017-07-19 Thread Chris Dagdigian


Changing the spooling method is usually a "destroy and rebuild" 
operation in my experience.
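A hedged sketch of that rebuild path: SoGE ships save/load helper scripts under $SGE_ROOT/util/upgrade_modules that can carry the cluster configuration across a reinstall (script names from memory; verify they exist in your build before relying on them):

```shell
# 1. Dump the whole cluster configuration to flat files
$SGE_ROOT/util/upgrade_modules/save_sge_config.sh /tmp/sge_backup

# 2. Reinstall the qmaster, choosing classic spooling when prompted
cd $SGE_ROOT && ./inst_sge -m

# 3. Reload the saved configuration into the fresh install
$SGE_ROOT/util/upgrade_modules/load_sge_config.sh /tmp/sge_backup
```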



Roberto Nunnari 
July 19, 2017 at 10:56 AM
Hello.

A couple of months ago I installed SoGE-8.1.9 building it with 
-spool-berkeleydb


Now I would like to move to -spool-classic

Do I need to rebuild, install and export + import SoGE configuration 
or is it possible to change a settings in the existing installation?


Thank you and best regards.
Roberto




Re: [gridengine users] Fwd: eqw for qsub jobs

2016-09-28 Thread Chris Dagdigian


I think the "queue instance dropped because ... full" is not related to 
your user/job problem. The dropped message is a sign from the job 
placement process that the queue instance was skipped during the active 
host select-and-job-dispatch round because it had no more job slots free 
to take new work. This would be a normal status alert on an active 
cluster with lots of jobs in 'qw' state. No big deal basically unless 
you think a resource, quota or some other thing is interfering.


State "Eqw" is usually a sign that something went badly wrong with a 
job. Its usually a sign of a significant issue like the UID/GID of the 
user not existing on the execution host or similar or it could be as 
simple as user error in a script (permission denied, path not found, etc.).


What does "qstat -j " tell you about the jobs in Eqw state? Any 
interesting spool lots from the compute nodes or qmaster?
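The typical Eqw triage commands, for reference (standard SGE; job id is a placeholder):

```shell
# Show the scheduler/execd error text recorded for a stuck job
qstat -j <job_id> | grep -A2 error

# Once the underlying cause is fixed (permissions, paths, UID/GID),
# clear the error state so the job drops back to plain "qw" and can
# be scheduled again
qmod -cj <job_id>
```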


Chris




Dan Hyatt wrote:


I am trying to narrow down what would cause this. I searched google 
and the sge resources and could not find a reason for


  queue instance "VeryHighMem@blade5-5-8" dropped because it is full
  queue instance "HighMem@blade5-1-4" dropped because it is full

This is that one user almost every shop has who is incredible at their 
work, but causes about 90% of the technical problems because of bad 
choices.



Why would sge queue the jobs for everyone else but with this user 
suddenly drop jobs "because its full"


I have lots of jobs went to "eqw" as shown in the follow:
1144122 0.55500 sas64  username   Eqw   09/27/2016 22:54:45   
 1
1144125 0.55500 sas64  username   Eqw   09/27/2016 22:55:35   
 1
1144127 0.55500 sas64  username   Eqw   09/27/2016 22:56:25   
 1
1144130 0.55500 sas64  username   Eqw   09/27/2016 22:57:15   
 1
1144134 0.55500 sas64  username   Eqw   09/27/2016 22:58:05   
 1
1144139 0.55500 sas64  username   Eqw   09/27/2016 22:58:55   
 1
1144142 0.55500 sas64  username   Eqw   09/27/2016 22:59:46   
 1
1144145 0.55500 sas64  username   Eqw   09/27/2016 23:00:36   
 1
1144151 0.55500 sas64  username   Eqw   09/27/2016 23:01:26   
 1
1144156 0.55500 sas64  username   Eqw   09/27/2016 23:02:16   
 1
1144161 0.55500 sas64  username   Eqw   09/27/2016 23:03:06   
 1
1144165 0.55500 sas64  username   Eqw   09/27/2016 23:03:56   
 1
1144169 0.55500 sas64  username   Eqw   09/27/2016 23:04:46   
 1
1144174 0.55500 sas64  username   Eqw   09/27/2016 23:05:36   
 1
1144177 0.55500 sas64  username   Eqw   09/27/2016 23:06:26   
 1
1144182 0.55500 sas64  username   Eqw   09/27/2016 23:07:17   
 1
1144186 0.55500 sas64  username   Eqw   09/27/2016 23:08:07   
 1
1144196 0.55500 sas64  username   Eqw   09/27/2016 23:08:57   
 1
1144204 0.55500 sas64  username   Eqw   09/27/2016 23:09:47   
 1
1144212 0.55500 sas64  username   Eqw   09/27/2016 23:10:37   
 1
1144217 0.55500 sas64  username   Eqw   09/27/2016 23:11:27   
 1
1144221 0.55500 sas64  username   Eqw   09/27/2016 23:12:17   
 1
1144224 0.55500 sas64  username   Eqw   09/27/2016 23:13:08   
 1
1144225 0.55500 sas64  username   Eqw   09/27/2016 23:13:58   
 1
1144227 0.55500 sas64  username   Eqw   09/27/2016 23:14:48   
 1
1144232 0.55500 sas64  username   Eqw   09/27/2016 23:15:38   
 1
1144236 0.55500 sas64  username   Eqw   09/27/2016 23:16:28   
 1
1144244 0.55500 sas64  username   Eqw   09/27/2016 23:17:18   
 1
1144255 0.55500 sas64  username   Eqw   09/27/2016 23:18:09   
 1
1144265 0.55500 sas64  username   Eqw   09/27/2016 23:18:59   
 1
1144276 0.55500 sas64  username   Eqw   09/27/2016 23:19:49   
 1
1144286 0.55500 sas64  username   Eqw   09/27/2016 23:20:39   
 1
1144295 0.55500 sas64  username   Eqw   09/27/2016 23:21:29   
  

Re: [gridengine users] Hardware thoughts?

2016-07-20 Thread Chris Dagdigian


In environments where you do tens of thousands of jobs per day, or tons 
of really short jobs, or a constant flow of jobs always active, you may 
need a master node that is somewhat beefy. If you've never seen your 
head node get slammed then you can downsize. If there is a chance that 
your workload could change significantly then keep the size as is.


I'm in favor of massive login nodes. They are often used by users who 
are prototyping job scripts and we can't always train them to 'qlogin' 
or 'qrsh' into a remote node for testing. All you need is a couple of 
people running large R or Matlab tasks plus some other people doing a 
massive set of array job prep combined with a couple of people who 
constantly "qstat" and you can run the login node out of resources 
pretty quickly.


The cost of CPU and RAM at this scale is dirt cheap. Effectively noise 
relative to cost of networking and storage so I also tend to make login 
and interactive nodes larger than strictly necessary.


My $.02!

Chris



Notorious Biggles wrote:

Hi all,

I have some money available to replace the infrastructure nodes of one 
of my company's grid engine clusters and I wanted a sanity check 
before I order anything new.


Initially we contacted the company we originally bought the cluster 
from and they quoted us for a combined login/storage/master node with 
loads of everything and a hefty price tag. I feel an aversion to 
combining login nodes with storage and master nodes - we already have 
that on one of the clusters and a user being able to crash the entire 
cluster seems a bad thing to me and it happened often enough.


I read Rayson's blog post about scaling grid engine to 10k nodes at 
http://blogs.scalablelogic.com/2012/11/running-1-node-grid-engine-cluster.html 
and it seems that 4 cores and 1 GB of memory is more than enough to 
run a grid engine master. Given that I'd be lucky to have 100 nodes to 
a master, can anybody see a reason to spec a high powered master node? 
I look at my existing master nodes with 8+ cores and 24+ GB of memory 
and in Ganglia all I see is acres of green from memory being used as 
cache and buffers. It seems rather a waste.


The other thing I was curious about is what kind of spec seems 
reasonable to you for a login node. My one cluster with separate login 
nodes has similar specs to the master nodes - 8 cores, 24 GB memory 
and it seems wasted. I can see an argument for these nodes to be more 
than just a low end box, especially if anybody is trying to do some 
kind of visualization on them, but I've never had complaints about 
them being under-powered yet.


Any thoughts you might have are appreciated.

Thanks
Biggles





[gridengine users] prolog execution location and behavior?

2016-06-08 Thread Chris Dagdigian


Hey folks -- need my brain refreshed on prolog behavior ...

Trying to figure out if a prolog script would be suitable for 
dramatically changing the execution environment -- doing things like NFS 
filesystem unmounts or chroot actions so that an incoming job would 
execute in the changed environment.


I can see the prolog running as 'me' and as a child of the sge_shepherd 
daemon but I don't have enough of a test lab setup to confirm that the 
prolog is running on the execution host and if the parent/child process 
relationship is such that chroot jail actions performed by a prolog 
would be where the jobscript ends up running


Anyone have a quick answer? Thanks!
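For context, the prolog is configured per queue (or cluster-wide in the global configuration) and is launched on the execution host by sge_shepherd just before the job script. A minimal sketch of wiring one up (script path is hypothetical; the "root@" prefix runs it as root, which chroot/umount actions would require):

```shell
# Attach a prolog script to all.q, running as root on the exec host
qconf -mattr queue prolog "root@/opt/sge/scripts/prolog.sh" all.q

# Verify the setting took
qconf -sq all.q | grep prolog
```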

Chris





Re: [gridengine users] docker under GE

2016-05-30 Thread Chris Dagdigian


This may not be a Univa mailing list but it is also not your mailing list.

Univa folk are welcome here, as always.

It's a tough line for Univa to walk between their commercial interests 
and the open source forks of GE that we know and love. By my view Univa 
has done all the right things on this mailing list. They participate 
freely, share knowledge extensively and  help troubleshoot GE problems 
users are having without caring what version is in use. They don't shill 
hard, spread FUD or otherwise pollute the list with too much marketing 
etc. All things considered we are *way* better off having Univa 
engineers, developers and support people participating here.



Best,
Chris


Paolo Francesco Lenti 
May 30, 2016 at 3:33 AM
sorry,

but UGE <> SGE/SoGE (no more), and this isn't an Univa ml.

best
p.







Re: [gridengine users] All queues dropped because of overload or full

2016-05-25 Thread Chris Dagdigian


Something is fundamentally broken with Grid Engine.  An empty "qconf 
-sql" means that SGE is unaware of *any* cluster queues -- at the very 
least you should see the default all.q show up


And also this is clear via blank "qstat -f' output -- SGE simply does 
not think that any compute nodes or SGE cluster queues even exist


Sad to say though that the root cause and real fix is likely via ROCKS. 
SGE does not break this way naturally -- something went sideways during 
the ROCKS upgrade or one of the ROCKS specific upgrade or autoinstall 
scripts.


You may need to ask the ROCKS people how to force a reinstall of SGE -- 
anything manual that we propose on this list would likely not persist 
since ROCKS likes to do a lot of automated provisioning and service 
management behind the scenes.


Chris



Pat Haley wrote:


It looks similar but one big difference is when I run "qconf -sh" I 
see all my compute nodes listed along with my frontend.  However 
"qconf -sql" is empty.


Thanks




Re: [gridengine users] All queues dropped because of overload or full

2016-05-25 Thread Chris Dagdigian


I'd be willing to bet the output of "qstat -f -u '*'  " shows that all 
your compute nodes are in 'au' state


If there is no sge_execd process running on each compute node then Grid 
Engine won't work and it can't dispatch "work" to those nodes.


The errors you see and the jobs pending forever in wait state is just a 
symptom of the real problem -- you have no functional grid in which to 
dispatch the jobs.


Basically your compute nodes fell over; if you can restart SGE on those 
nodes and monitor via 'qstat -f' to confirm that the 'au' state goes 
away then your jobs should start flowing again
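The recovery steps, roughly (the init script name and location vary by install; SoGE typically drops one per cell):

```shell
# On each compute node: restart the execution daemon
$SGE_ROOT/$SGE_CELL/common/sgeexecd start   # or the /etc/init.d/sgeexecd.* script

# From any submit host: watch the 'au' (alarm/unreachable) state clear
qstat -f | grep -E 'queuename|au'
```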


Chris



Pat Haley wrote:


We have also noticed that there are no sge deamons running on any of 
the execution nodes (I don't know if that is normal or not).  We have 
also collected the information below from qconf.  Any help in 
resolving this would be greatly appreciated. 




Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

2015-12-16 Thread Chris Dagdigian


This looks and feels like an MPI job launching failure

Especially as it fails exactly when it tries to cross the threshold from 
single chassis to multiple boxes


The #1 debugging advice in this scenario is this:

 -- Can you definitively run on more than 12 cores OUTSIDE of grid engine?

My experience with failures similar to this is that you first need to 
see if the problem is with the app or if the problem is with Grid 
Engine. Testing to see if your "hello world" example works beyond 12 
cores WITHOUT grid engine will be a valuable datapoint and 
troubleshooting step. When MPI is involved this is doubly true.
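A minimal outside-of-SGE check along those lines. The hostnames, slot counts, and binary name are placeholders, and the hostfile syntax shown is Open MPI style (Intel MPI uses -machinefile with a slightly different format):

```shell
# Build a hostfile spanning two chassis and launch 24 ranks by hand.
# If this also hangs at the 12 -> 13 core boundary, the problem is the
# MPI stack or the network, not Grid Engine.
cat > hosts.txt <<EOF
compute-0-0 slots=12
compute-0-1 slots=12
EOF
mpirun -np 24 --hostfile hosts.txt ./hello_world_mpi
```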


-Chris


Gowtham 
December 16, 2015 at 1:53 PM

Dear fellow Grid Engine users,

Over the past few days, I have had to re-install compute nodes (12 
cores each) in an existing cluster running Rocks 6.1 and Grid Engine 
2011.11p1. I ensured the extend-*.xml files had no error in them using 
the xmllint command before rebuilding the distribution. All six 
compute nodes installed successfully, and so did running several test 
"Hello, World!" cases up to 72 cores. I can SSH into any one of these 
nodes, and SSH between any two compute nodes just fine.


As of this morning all submitted jobs that require more than 12 cores 
(i.e., spanning more than one compute node) fail about a minute after 
starting successfully. However, all jobs with 12 or less cores within 
the a given compute node run just fine. The error message for failed 
job is as follows:


  error: got no connection within 60 seconds. "Timeout occured while 
waiting for connection"

  Ctrl-C caught... cleaning up processes

"Hello, World!" and one other program, both compiled with Intel 
Cluster Studio 2013.0.028, display the same behavior. The line 
corresponding to the failed job from 
/opt/gridengine/default/spool/qmaster/messages is as follows:


  12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 
6129.1 task 1.compute-0-1 failed - killing job


I'd appreciate any insight or help to resolve this issue. If you need 
additional information from my end, please let me know.


Thank you for your time and help.

Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

P: (906) 487-3593
F: (906) 487-2787
http://it.mtu.edu
http://hpc.mtu.edu





Re: [gridengine users] Queue configurations still stored in text files?

2015-11-03 Thread Chris Dagdigian


This is true but only when classic-mode spooling is in use. I've been 
able to resurrect totally busted environments via simple text edits in 
the past. The useful, fixable config data in text form is why I still 
tend to configure the majority of SGE clusters with classic spooling, even 
on large clusters. We go binary or other methods only when the 
storage system can't keep up or the job flow rate is insanely high.


Chris

Lane, William wrote:

Is it still true that:
"The queue configurations are stored as text files in the directory 
$SGE_ROOT/$SGE_CELL/spool/qmaster/cqueues/, e.g.:"


from:
http://www.softpanorama.org/HPC/Grid_engine/sge_queues.shtml#Listing_all_existing_queues

That would be great news for me if it were still true.

-Bill L




Re: [gridengine users] At what point does the network overhead of adding additional nodes to a queue offset the benefit?

2015-09-24 Thread Chris Dagdigian


SGE is fine on 1GbE fabrics and I don't know of anyone who uses 10GbE for 
SGE unless it's a combined network fabric that carries storage and 
application traffic along with SGE traffic on the same links, or unless you 
are running all-new gear with 10GbE for everything and maybe a 1GbE NIC 
held back for ILO/DRAC/IPMI/provisioning use.


The commonly accepted point at which you'd hit a scaling limit on an 
ethernet network would most likely be determined, not by Grid Engine 
traffic but by:

 - Network filesystem traffic for shared storage
 - Application message passing traffic

I don't see SGE native traffic as a huge consumer of bandwidth or 
network resources in most cases. It's the "other stuff" that blows out 
the network.


And there is no one size fits all answer there as people's HPC 
footprints vary wildly by how they are used and what they are 
architected for.


SGE can run at massive scale over a 1GbE fabric without issues.  
The only time a 1GbE network becomes the bottleneck is when you try to 
stuff NFS and application traffic down the same pipe, and even then 
you'd hit performance and job-throughput problems before you hit a 
scaling-limit wall.  If you've got SGE running on a mostly free 1GbE 
fabric (maybe it's your admin or provisioning network, etc.) you'd be 
fine at even large scale.


The sorts of tuning you'd do to run "big SGE" on a 1GbE fabric would be to:

 - Tune the qmaster host to handle the number of endpoints expected
 - Make darn sure application traffic and storage traffic are on a 
different network
 - If you have to share the 1GbE with other traffic, then configure SGE 
for local spooling. The danger here is performance impact, not scaling


My $.02





Lane, William 
September 24, 2015 at 6:04 PM
If a cluster is running on a relatively slow-speed networking backbone 
(say gigabit ethernet or
10 GbE as opposed to InfiniBand), is there any commonly 
accepted point at which increasing the number
of nodes in a queue negatively affects the performance of the queue? 
Is there any general
rule about how many nodes to have in a queue based on a given network 
backbone?


-Bill L.





[gridengine users] quick question re formatting of complex_values in global exec host

2015-08-11 Thread Chris Dagdigian


I've got a consumable resource called "foo" that I manage via an entry 
in the global exechost object ("qconf -me global; set foo=100")


But now due to a networking issue only 50% of my compute nodes are 
capable of running jobs that request this resource


I'd rather not pin the complex entry to individual hosts - is there a 
way I can reference hostgroups in an entry for "complex_values=" line?


Something like:

complex_values  foo@@workingNodes=10

or similar?

-Chris




Re: [gridengine users] command runs in grid engine but does not complete.

2015-06-08 Thread Chris Dagdigian


The most common scenarios when "it works from the command line" but "it does 
not work in grid engine" are:


- Different shell environment between command-line and batch execution 
(especially if SGE is running in POSIX_COMPLIANT mode)

- Different ENV variables between CLI and batch environment
- Different PATH definition between CLI and batch environment

Usually when faced with this problem I submit a shell script that dumps 
PATH, ENV and other interesting info and then I compare that to the 
environment where my command line example is working.
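
A minimal sketch of that diagnostic script (the script and report 
filenames are arbitrary): submit it with qsub, run it once from your 
login shell, and diff the two reports.

```shell
#!/bin/sh
# probe-env.sh -- dump the execution environment to a report file.
# Submit with "qsub -cwd probe-env.sh", then run it once from a login
# shell, and diff the two reports.
{
    echo "== user/host ==";   id; hostname
    echo "== shell ==";       echo "SHELL=$SHELL arg0=$0"
    echo "== PATH ==";        echo "$PATH" | tr ':' '\n'
    echo "== environment =="; env | sort
} > env_report.txt 2>&1
```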


You also have two different user IDs shown in the output. Is the userID 
that is running the job trying to create a data file that may already 
exist and perhaps be owned by a different user?





Dan Hyatt 
June 8, 2015 at 2:42 PM

We are running a binary program called metaanalysis, which the user 
says was working prior to a grid reconfiguration.



qsub -cwd -b y /dsg_cent/bin/metal < c22srcfile.txt > c22SBP.log

This starts, runs, creates the logs, and then fails to create the data 
files

qsub -cwd -b y  /dsg_cent/bin/metal < c22srcfile.txt > c22SBP.log

-rw-rw-r-- 1 aldi   genetics 8523209 Jun  8 09:53 c22GENOA.SBP.EA.M1.csv
-rw-rw-r-- 1 aldi   genetics 8660667 Jun  8 09:53 c22FamHS.SBP.ea.M1.csv
-rw-rw-r-- 1 aldi   genetics 6025412 Jun  8 09:53 
c22HYPERGEN.SBP.EA.M1.csv

-rw-rw-r-- 1 aldi   genetics2061 Jun  8 09:53 c22srcfile.txt
-rw-rw-r-- 1 dhyatt genetics  43 Jun  8 13:40 c22SBP.log
-rw-r--r-- 1 dhyatt genetics   0 Jun  8 13:40 metal.e1043
-rw-r--r-- 1 dhyatt genetics2743 Jun  8 13:40 metal.o1043
[dhyatt@blade5-2-1 c22

 the control/output file indicates everything runs there are .o and .e 
files, but no data



The command line works fine, and creates the data files. But I need to 
run large jobs on the queue


-rw-rw-r-- 1 aldi   genetics  8523209 Jun  8 09:53 c22GENOA.SBP.EA.M1.csv
-rw-rw-r-- 1 aldi   genetics  8660667 Jun  8 09:53 c22FamHS.SBP.ea.M1.csv
-rw-rw-r-- 1 aldi   genetics  6025412 Jun  8 09:53 
c22HYPERGEN.SBP.EA.M1.csv

-rw-rw-r-- 1 aldi   genetics 2061 Jun  8 09:53 c22srcfile.txt
-rw-rw-r-- 1 dhyatt genetics  8177082 Jun  8 13:39 METAANALYSIS1.TBL
-rw-rw-r-- 1 dhyatt genetics 1054 Jun  8 13:39 METAANALYSIS1.TBL.info
-rw-rw-r-- 1 dhyatt genetics 10487038 Jun  8 13:39 METAANALYSIS2.TBL
-rw-rw-r-- 1 dhyatt genetics 1316 Jun  8 13:39 METAANALYSIS2.TBL.info
-rw-rw-r-- 1 dhyatt genetics 5030 Jun  8 13:39 c22SBP.log

any thoughts?

Dan


Re: [gridengine users] sanity check on usage of "-p" priority value: per-user effect or global across waitlist?

2015-04-30 Thread Chris Dagdigian


Yep, I think we are going to try this and monitor it. Gut feeling is that 
the fairshare-by-user policy has so much more impact/weight on the 
global job waitlist that if we just have a single user using 
"-p -10" and "-p -1" to distinguish between her own jobs, it might 
actually do close to what we want without taking too much of a global hit.


Thanks all!

Chris


Fritz Ferstl wrote:
Nah, the weight_priority won't help. It just determines how much 
influence the -p has vs things like job wait time or urgency. If you 
have none of those then all being equal it would have the same effect 
as if you left it untouched. And if you have influence from waiting 
time or urgency or others then setting the prio weight low would make 
it just totally insignificant.



[gridengine users] sanity check on usage of "-p" priority value: per-user effect or global across waitlist?

2015-04-30 Thread Chris Dagdigian


GE and UGE man pages are not clear about the scope of "-p" priority 
values when a user uses it. It's been a long time since I needed this 
and I wanted to confirm the scope of the behavior ..


Use case:

 - I need to submit 100 personal jobs as "dag" with 10 jobs being 
slightly more important than others

 - I'm not an admin so I can't use priority values higher than 0

What I'd like to do:

 - Submit 90 jobs with "-p -100" since I can't use a value higher than zero
 - Submit 10 jobs with "-p -10" to give priority to my 10 special tasks

My question:

 - Does my use of "-p" to send lower-than-zero values for my submitted 
jobs affect just MY jobs and the order in which they get dispatched or 
will I end up penalizing myself globally because all the other jobs from 
other users on the cluster are running with default "-p" values of 0 
assigned to them?


-dag




Re: [gridengine users] load grpah

2015-04-17 Thread Chris Dagdigian


There are tons of systems for measuring and displaying load on a grid. 
Ganglia would give you pretty graphs of CPU usage etc while tools like 
php-qstat would be able to show/display info about SGE usage, queue 
state and pending jobs etc.



Jacques Foucry 
April 17, 2015 at 4:12 AM
Hello folks,

Is there a way to have (may be on qmaster) some load graph of the grid ?

Thanks for you help.

Jacques



[gridengine users] Anyone have scripts for detecting users who bypass grid engine?

2015-04-09 Thread Chris Dagdigian


I'm one of the people who has been arguing for years that technological 
methods for stopping abuse of GE systems never work in the long term, 
because motivated users always have more time and interest than 
overworked admins, so it's kind of embarrassing to ask this but ...


Does anyone have a script that runs on a node and prints out all the 
userland processes that are not explicitly children of an sge_shepherd daemon?


I'm basically looking for a lightweight way to scan a node just to see if 
there are users/tasks running outside the awareness of the SGE 
qmaster.  Back in the day when we talked about this, it seemed that one 
easy method was just looking for user processes that were not children 
of an SGE daemon process.
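
A sketch of that scan, assuming the daemon is named `sge_shepherd` in 
`ps` output and that "userland" means UID >= 1000 (both are 
site-specific assumptions to adjust):

```shell
#!/bin/sh
# rogue-procs.sh -- print processes owned by regular users that are NOT
# descendants of any sge_shepherd, i.e. likely started outside SGE.

shepherds=$(pgrep -x sge_shepherd || true)

descendants() {   # print PID $1 and, recursively, all of its children
    echo "$1"
    for c in $(pgrep -P "$1"); do
        descendants "$c"
    done
}

managed=$(for p in $shepherds; do descendants "$p"; done | sort -u)

# Flag every process owned by a regular user that is not SGE-managed.
ps -eo pid=,uid=,user=,args= | while read -r pid uid user args; do
    [ "$uid" -ge 1000 ] || continue
    echo "$managed" | grep -qx "$pid" || \
        echo "ROGUE pid=$pid user=$user cmd=$args"
done
```

Run it from cron or a node health check and alert on any ROGUE lines.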


The funny thing is that it's not the HPC end users who do this. As the 
grid(s) get closer and closer to the enterprise I'm starting to see 
software developers and others trying to play games and then plead 
ignorance when asked "why did you SSH to a compute node and start a 
tomcat service out of your home directory?". heh.


-chris





Re: [gridengine users] best way to start new exechost in disabled (d) state during template driven install?

2015-04-08 Thread Chris Dagdigian


THANK YOU! That is where I remember initial_state from -- the queue 
config and not the autoInstall template


-dag


Reuti wrote:

What is the value of "initial_state" in the queue definition right now?

-- Reuti



[gridengine users] best way to start new exechost in disabled (d) state during template driven install?

2015-04-08 Thread Chris Dagdigian


Been a while since I needed this so I'm being lazy and asking the list 
first, heh ...


We are doing some elastic grid engine building in Amazon using methods 
other than the StarCluster suite due to some unique scaling and security 
requirements.


I recall from memory that there was a way to configure "default queue 
instance state" such that new nodes came into the grid in disabled state 
. This would allow us to complete our provisioning and config management 
work and avoid the race condition where the new exechost joins the 
cluster and starts trying to take jobs before it is 100% configured.


Is there a way in the autoinstall template to set exechost state? Am I 
imagining things that this used to be a feature? Or is this perhaps a 
global qmaster setting instead?


Thanks!

Chris




Re: [gridengine users] Anyone using S-GAE reporting app with Univa grid engine?

2015-04-08 Thread Chris Dagdigian


Fantastic timing! Thank you very much. I'm going to install and try it 
out right now.


Regards,
Chris



RDlab 
April 8, 2015 at 6:06 AM
Hello,

My name is Gabriel and I am the IT manager at RDlab, we are the S-GAE 
guys :)


A new version of S-GAE compliant and tested for UNIVA logs has been 
released today v1.1.8. You can download it for free at its homepage:


http://rdlab.cs.upc.edu/s-gae

Please, do not hesitate to contact us!

Best regards,

Gabriel




Re: [gridengine users] Anyone using S-GAE reporting app with Univa grid engine?

2015-03-03 Thread Chris Dagdigian


I'll give some impressions of S-GAE since I have it installed in a lot 
of places ...


- It's a good basic reporting tool for monthly metrics.
- I don't use all of the features; mainly the full cluster "view"
- In the full cluster view there are 4-6 PNG graphics that I just 
generate and copy/embed into a written document


The basic metrics that I like are:

 - Job count shown as a percentage of success/failed jobs (job success 
% is a great top-line metric)

 - Cluster exec time (bar graph showing longest / shortest / avg job info)
 - Slots-per-job graph (a great way to show that only 1% of jobs use MPI 
or the threaded PE hack)

 - Top ten users by memory consumption
 - Top ten users by raw job count
 - Top ten users by absolute exec time

Generic observations:

 - It's not super fast at ingest; it runs qacct on every job in the 
accounting file, parses the data and loads it into the database; I usually 
let it cook overnight on ingest


 - It can be tuned for ingest with various memory, mysql and ramdisk 
methods


 - It's not fast at viewing - tons of temporary mysql tables are made 
in $TMP just to show the front cluster view page


 - It can take 10 minutes just to render the HTML main page after we've 
loaded metrics for the month; lots of action in /tmp with temporary 
mysql files


 - By default it will reject jobs for which the username does not exist 
on localhost - this is crappy for situations where I take someone's 
accounting file and run it through my own S-GAE server running on AWS 
cloud or elsewhere. I had to make scripts that parse the accounting file 
for usernames, generate a uniq list and then make fake dummy accounts on 
the local system. This is problematic if you don't pay attention to the logs


 - Errors in the logs about being unable to ingest or create summary 
views may make you think at first about SQL or database problems but 99% 
of the time it means that the system ran /tmp to 100% full and just 
bombed out trying to execute a procedure


 - There are certain things that can ONLY be done in the web interface, 
which kills me when I repeatedly set up and rebuild a metrics 
system. You can't configure the known queues or other parameters via a 
script or a config file; each time you install or reinstall you need to 
step through the web page. Multiple point-and-click events are 
required to register each cluster queue, which is painful on big systems 
where I may be destroying and rebuilding the S-GAE system multiple 
times. It's a human-interaction/UI hassle, basically



Tuning:

 - S-GAE needs huge /tmp space and may fail subtly unless you are 
careful about watching the logs
 - For a cluster that does between 1 and 2 million jobs a month we need a 
100GB /tmp partition to run metrics



For fixed installs that run metrics monthly I just configure the server 
to use a big /tmp partition and decide if I can get away with turning on 
the in-memory accounting file handling on a given system.


When running on the Amazon cloud doing a one-off analysis on an accounting 
file from a client, I've found that I could make things go far faster by:


 - Running on a spot node with lots of memory
 - Carving out a ramdisk out of some of the ram and mounting it as /ramdisk
 - Relocating the mysql database data/table files into /ramdisk
 - Applying some of the mysql tuning advice from google to the 
mysql.conf file

 - Keeping the accounting file in /ramdisk/ path






Re: [gridengine users] Anyone using S-GAE reporting app with Univa grid engine?

2015-03-02 Thread Chris Dagdigian


Ooh, the various MoD ("metrics on demand") tools look pretty interesting.   
Would love to chat about how people have made XDMoD and other variants 
work with Grid Engine(s) -- can we get a little thread going on best 
practices and recommendations for 3rd-party reporting/metrics tools? 
Suspect there would be decent interest in this ...


-Chris



Tina Friedrich 
March 2, 2015 at 11:37 AM
Yes, there's an additional field - job_class.

I'm not using S-GAE, so got nothing for you I'm afraid; I had a 
similar problem with UBMoD (which I'm still running), where I had to 
make (probably similar) changes to make it work (keep it working, 
rather).


Tina







[gridengine users] Anyone using S-GAE reporting app with Univa grid engine?

2015-03-02 Thread Chris Dagdigian


Hey folks,

I'm a big fan of the php based S-GAE grid engine reporting tool that the 
fine folks over at 
http://rdlab.cs.upc.edu/index.php/en/services/s-gae.html have put together


However it looks like S-GAE is falling over on a cluster where we 
recently converted from open source grid engine to the commercial Univa 
version.


I suspect Univa has new/different fields in the accounting file that 
S-GAE's qacct.php parser can't deal with. I also suspect that they have 
converters or tools to handle this, but I figured it would be worth 
canvassing the user community to see if people have run into this before. 
I'll write a new parser or converter if I have to.   Don't want to 
reinvent the wheel if I don't have to ...


Regards,
Chris



Re: [gridengine users] Can SGE handle job dependencies?

2015-02-22 Thread Chris Dagdigian


Yes SGE can handle dependencies between jobs and even dependencies 
between tasks in a job array.


The job dependency syntax depends on job naming in the most common use 
case, here is a simple example:


  qsub -N DataStagerTask ./my-SGE-job.sh

  qsub -hold_jid DataStagerTask ./my-analytic-job.sh


The "-hold_jid <job name>" argument is what makes the second job wait for 
the first job to exit before it will run
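
For the task-level case, `-hold_jid_ad` (array dependency) gates task N 
of the downstream array on task N of the upstream array; both jobs must 
use matching task ranges. A sketch with invented job names and scripts, 
guarded so it is harmless to paste on a non-SGE host:

```shell
if command -v qsub >/dev/null 2>&1; then
    # Stage 1: a 100-task array job
    qsub -N StageOne -t 1-100 ./stage-one.sh
    # Stage 2: task N starts as soon as StageOne task N exits,
    # without waiting for the whole StageOne array to finish
    qsub -N StageTwo -t 1-100 -hold_jid_ad StageOne ./stage-two.sh
else
    echo "qsub not on PATH -- run this from an SGE submit host" >&2
fi
```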


-Chris




Peng Yu 
February 22, 2015 at 3:26 PM
Hi Ed,

I am wondering if SGE allows users to specify dependencies between
jobs. For example, I may need job1 to finish before job2 starts,
even though there might be enough resources to run job2 at a given
time.

Would you please let me know if SGE can do so? Thanks.




Re: [gridengine users] suggestions on setting up queues

2015-01-16 Thread Chris Dagdigian


Queues are just one piece of the puzzle when it comes to handling resource 
allocation on a multi-user system. What (if any) scheduling policies 
and resource quotas are you currently using?


That said you are using the queue methods in a good way. There are 
certain things that can only be really done on a per-queue basis and top 
of the list would be ACL protection and the ability to impose hard or 
soft wallclock limits.


A fairshare-by-user policy with the queue structure you set up would be 
a decent starting point from which you can gather more data and user 
feedback.


Thoughts:

 - A resource quota would perfectly handle the "only N jobs per user can 
run in the long-job.q cluster queue ..." requirement

 - I've had little success putting wallclock limits on interactive 
queues; there are legit business/scientific reasons in many cases for a 
long running interactive session. You might want to poll the users or 
collect data on this. In a few different environments I've had decent 
success by leaving interactive queue slots unrestricted but putting a 
resource quota around how many slots a single user can consume. It's 
also pretty easy to set up tools that would allow you to dynamically 
adjust the size/count of the interactive slot pool to account for 
changing demand - it's particularly easy when used with SGE hostgroup 
objects.
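
A sketch of the dynamic-resizing idea; the queue name is invented, and 
the command is guarded so it is harmless off-cluster:

```shell
# Bump the default slot count on an (assumed) interactive.q in one shot;
# per-hostgroup overrides can be edited with "qconf -mq interactive.q".
if command -v qconf >/dev/null 2>&1; then
    qconf -mattr queue slots 4 interactive.q
else
    echo "qconf not on PATH -- run this from an SGE admin host" >&2
fi
```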


My $.02






Stephen Spencer 
January 16, 2015 at 2:50 PM
Good morning.

With the number of users on our clusters growing, it's becoming less 
realistic to say "play fair 'cause you're not the only user of the 
cluster."


I'm looking for suggestions on setting up queues, both the "why" and 
"how," that will allow more of our users access to the cluster.


What I'm thinking of is a multi-queue approach:

  * some limited number of "interactive" slots (and they'd be
time-limited)
  * a queue for jobs with short time duration - the "express" queue
  * a queue for jobs that will run longer... but only so many of these
per user

Any and all suggestions are welcome.

Thank you!

Best,
--
Stephen Spencer
spen...@cs.washington.edu 


Re: [gridengine users] SGE and NFS

2014-11-12 Thread Chris Dagdigian

my $.02

SGE can run 100% local without NFS - the main thing (in my experience) 
that you lose in this config is the easy troubleshooting ability of going 
into a central $SGE_ROOT/$SGE_CELL/ and seeing all of the various node 
spool and message files. It's annoying but not a dealbreaker, especially 
after seeing what you are experiencing.


That said, I do a ton of SGE work with classic spooling on EMC Isilon 
storage - some environments that do close to 1 million jobs/month in 
throughput and we've never seen a catastrophic loss of jobs or spool 
data. Most are without Bright although I know of at least one group 
running Bright on 1000 cores sitting on top of Isilon storage and 
they've not seen anything like this either.


If you go 100% local my recommendation would just be to put the whole 
$SGE_ROOT out on the local nodes. The time it would take to winnow down 
to the minimal file set is not worth it relative to the size of the 
whole thing.


-Chris



Peskin, Eric 
November 12, 2014 at 8:26 AM
All,

Does SGE have to use NFS or can it work locally on each node?
If parts of it have to be on NFS, what is the minimal subset?
How much of this changes if you want redundant masters?

We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and 
SGE 2011.11. Specifically, SGE is provided by a Bright package: 
sge-2011.11-360_cm6.0.x86_64


Twice, we have lost all the running SGE jobs when the cluster failed 
over from one head node to the other. =( Not supposed to happen.
Since then, we have also had many individual jobs get lost. The latter 
situation correlates with messages in the system logs saying



That file lives on an NFS mount on our Isilon storage.
Surely, the executables don't have to be on NFS?
Interesting, we are using local spooling, the spool directory on each 
node is /cm/local/apps/sge/var/spool , which is, indeed local.

But the $SGE_ROOT , /cm/shared/apps/sge/2011.11 lives on NFS.
Does any of it need to?
Maybe just the var part would need to: /cm/shared/apps/sge/var ?

Thanks,
Eric





[gridengine users] wiki is back, now hopefully with far more resiliency

2014-07-25 Thread Chris Dagdigian
Hi folks,

http://wiki.gridengine.info/wiki/index.php/Main_Page

... is back online although it's running an old version of mediawiki
that needs some attention eventually so it may go down for upgrades at
some point.

That site has been on and offline randomly for quite some time mostly
due to it running on flaky/aging hardware running citrix xenserver
hypervisor. Storage hiccups and other oddities would halt the VM and it
was awkward to ssh into the hypervisor to restart/debug things
especially if I was on the road.

All that has changed. New home, new basement "datacenter", new
150mbit/sec business-SLA internet circuit and everything is running off
a sweet new micro-PC running Xen w/ local SSD disk and metadata and
backup storage pools running off remote iSCSI SAN luns. Should be far
more reliable now that everything is running CentOS 6.5 on brand new
networking, storage and hypervisor kit.

-dag



Re: [gridengine users] pbsdsh

2014-07-02 Thread Chris Dagdigian


HUMMEL Michel wrote:
> I wonder if there is, in OGS, an equivalent of the pbsdsh command from
torque.
> This command spawns a program on all nodes allocated to the PBS job.
The spawns take place concurrently - all execute at (about) the same time.
>

Not within OGS - most people independently install dsh or pdsh to
handle this
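
A sketch of the usual workaround inside an SGE parallel job: build a 
host list from $PE_HOSTFILE (one granted host per line; the first field 
is the hostname) and fan the command out with pdsh. A fabricated 
hostfile stands in for the real one here, and pdsh plus passwordless 
ssh between nodes are assumed:

```shell
# Fabricated stand-in for the $PE_HOSTFILE SGE writes for a parallel job
# (format: "host slots queue processor-range", one line per host):
cat > pe_hostfile.sample <<'EOF'
node01 16 all.q@node01 UNDEFINED
node02 16 all.q@node02 UNDEFINED
EOF

# First column -> unique hosts -> comma-separated list for pdsh -w
HOSTS=$(awk '{print $1}' pe_hostfile.sample | sort -u | paste -sd, -)
echo "$HOSTS"   # node01,node02

# Inside a real job you would read "$PE_HOSTFILE" instead and then:
#   pdsh -w "$HOSTS" my_command
```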

My $.02
-Chris


[gridengine users] [administrative] test of users mailing list after DNS provider swap

2014-06-03 Thread Chris Dagdigian

Sorry for the admin note; we've effectively been offline since our DNS
provider (zoneedit.com) had 2 nameservers go down for multi-day periods.
I've got 30+ domains managed with them and only gridengine.org had the
bad luck to have its primary and secondary nameservers assigned to the
two zoneedit hosts that have been unreachable.

With DNS down, most mail users would not get messages as a common
anti-spam measure is to refuse email when the envelope contains a FROM
value that does not resolve. I saw many of these refusals in the
mailserver logs.

This morning I pulled gridengine.org DNS zone record(s) from zoneedit
and moved it into the Amazon Route 53 managed DNS service. Data may take
a while to propagate and this is my first test to see if mail is going
though ...

Chris



Re: [gridengine users] Configurations during kickstart

2014-05-01 Thread Chris Dagdigian

Random ideas:

1. try disabling the log redirects to see if anything ends up in the
standard kickstart log?

2. SGE is unusually sensitive to hostname and DNS resolution. Is your
kickstart environment giving the node the same IP address during
provisioning as it has when running? Does your kickstart environment
have reverse DNS lookup working so that a lookup on the IP returns the
proper hostname?

3. qconf requires communication with the qmaster. It looks like you are
defining ENV vars that point only to the bin directory rather than
setting up the full SGE environment during the kickstart. Consider
sourcing the SGE settings script, or at least setting SGE_ROOT and
SGE_CELL, so that the SGE binaries can read
$SGE_ROOT/$SGE_CELL/common/act_qmaster and learn which host to
contact
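
A hedged sketch of point 3; the paths assume a default install under 
/opt/gridengine with a cell named "default", so adjust for your site:

```shell
# Set up the full SGE environment inside the kickstart %post section.
export SGE_ROOT=/opt/gridengine   # assumption: default install location
export SGE_CELL=default           # assumption: default cell name

# Preferred: source the settings file the installer generated; it also
# sets PATH and SGE_ARCH for you.
if [ -f "$SGE_ROOT/$SGE_CELL/common/settings.sh" ]; then
    . "$SGE_ROOT/$SGE_CELL/common/settings.sh"
fi

# qconf/qhost read this file to learn which host runs the qmaster:
if [ -f "$SGE_ROOT/$SGE_CELL/common/act_qmaster" ]; then
    cat "$SGE_ROOT/$SGE_CELL/common/act_qmaster"
else
    echo "act_qmaster not found -- SGE not installed here yet" >&2
fi
```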

Regards,
Chris


Michael Stauffer wrote:
> Hi,
> 
> I'm trying to get some resource configurations in place during
> kickstart. I have the following in my kickstart file
> "replace-partition.xml". The file is run during kickstart: I can see
> output to text files when I add debugging info.
> 
> This code runs correctly if I run it in a shell once the node is up.
> 
> The issue seems to be that qhost and qconf aren't outputting anything
> when they run. Is that to be expected? Here's what I have added:
> 
> 
> 
>   
> 
> # Here's the code as I'd like it to work:
> # This code gets reached. I can output these env vars and the
> #  values are correct.
> export SGEBIN=$SGE_ROOT/bin/$SGE_ARCH
> export NODE=$(/bin/hostname -s)
> export MEMFREE=`$SGEBIN/qhost -F mem_total -h $NODE|tail -n
> 1|cut -d: -f3 | cut -d= -f2`
> $SGEBIN/qconf -mattr exechost complex_values h_vmem=$MEMFREE
> $NODE 2>&1 > /root/qconf_complex_setup.log
> $SGEBIN/qconf -mattr exechost complex_values s_vmem=$MEMFREE
> $NODE 2>&1 >> /root/qconf_complex_setup.log
> 
> 
> 
> Thanks!
> 
> -M
> 


Re: [gridengine users] Gridengine on MAC

2014-04-06 Thread Chris Dagdigian

Run linux as a virtual machine on your mac. It will be easier all
around. SGE usually builds and compiles under OS X without too much
hassle, but dealing with all of the "mac stuff", like switching the
startup scripts over to the OS X launchd framework, is a pain in
the arse.

My $.02 of course!



Michael Ljungberg wrote:
> Hi
> 
> I am new to this email list but wonder if there is anyone that has a
> relatively simple description of how to install GridEngine on a MAC 10.9
> MacBook Pro computer. I am using Gridengine on a Linux cluster and want
> to have transparant scripts.
> 
> 
> Thank you in advance
> 
>


Re: [gridengine users] sge_qmaster uses too much memory and becomes unresponsive

2014-04-02 Thread Chris Dagdigian

Same symptoms seen at one of my clients just yesterday: programmatic
scripts that submit a small number of jobs, all using a threaded PE or
similar. Our cluster routinely runs much larger workloads all the time.

Our sge_qmaster ran the master node out of memory and was killed hard by
the OOM daemon.

We also disabled schedd_job_info and have not seen the issue, although
it's only been about 24 hours.  I'm actually a huge fan of this
parameter, as it's one of the few troubleshooting tools available to
regular non-admin users. The qalter '-w v' workaround is a great catch
though!

Regards,
Chris



Peskin, Eric wrote:
> All,
> 
> We are running SGE 2011.11 on a CentOS 6.3 cluster.
> Twice now we've had the following experience:
> -- Any new jobs submitted sit in the qw state, even though there are plenty 
> of nodes available that could satisfy the requirements of the jobs.
> -- top reveals that sge_qmaster is eating way too much memory:  > 50 GiB in 
> one case, > 128 GiB in another.
> -- We restarted the sgemaster.  That fixes it, but...
> -- Many (though not all) jobs were lost during the master restart.  :(
> 
> We have a suspect (but are not sure) about what jobs are triggering it, but 
> we do not know why or what to do about it.  Both times that this happened 
> someone was running a script that automatically generates and submits 
> multiple jobs.  But it wasn't submitting that many jobs -- only 18.  We have 
> users who do similar things with over 1,000 jobs without causing this.
> 
> The generated scripts themselves look like reasonable job scripts.  The only 
> twist is using our threaded parallel environment and asking for a range of 
> slots.  An example job is:
> 
> #!/bin/bash
> #$ -S /bin/bash
> #$ -N c-di-GMP-I.cm
> #$ -cwd
> 
> module load infernal/1.1
> cmcalibrate --cpu $((NSLOTS-1)) c-di-GMP-I.cm
> 
> 
> The scripts are submitted from a perl script with:
> system(qq!qsub -pe threaded 1-32 $dir/$filename.sh!);
> 
> Our threaded parallel environment is:
> pe_namethreaded
> slots  5000
> user_lists NONE
> xuser_listsNONE
> start_proc_args/bin/true
> stop_proc_args /bin/true
> allocation_rule$pe_slots
> control_slaves FALSE
> job_is_first_task  TRUE
> urgency_slots  min
> accounting_summary FALSE
> 
> Any ideas on the following would be appreciated:
> 1)  What is causing this?
> 2)  How can we prevent it?
> 3)  Is it normal that restarting the qmaster kills some jobs?
> 4)  Is there a safer way to get out of the bad state once we are in it?
> 5)  Is there a safe way to debug this problem, given that any given 
> experiment might put us back in the bad state?
> 
> Some background:
> Our cluster uses the Bright cluster management system.
> 
> We have 64 regular nodes, with 32 slots each.  (Each node has 16 real cores, 
> but hyper-threading is turned on.)
> 62 of the regular nodes are in one queue.  
> 2 of the regular nodes are in a special queue to which most users do not have 
> access.
> A high-memory node (with 64 slots) is in its own queue.
> 
> Each node including the head node (and the redundant head node) has 128 GiB 
> of RAM, except for one high memory node with 1 TiB of RAM.  We have memory 
> over-committing turned off:  vm.overcommit_memory = 2
> 
> [root@phoenix1 ~]# cat /etc/redhat-release 
> CentOS release 6.3 (Final)
> [root@phoenix1 ~]# uname -a
> Linux phoenix1 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 
> x86_64 x86_64 x86_64 GNU/Linux
> [root@phoenix1 ~]# rpm -qa sge\*
> sge-client-2011.11-323_cm6.0.x86_64
> sge-2011.11-323_cm6.0.x86_64
> 
> Any ideas would be greatly appreciated.
> 
> 
> 
> 
> 
> ___
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users


[gridengine users] "SAS Grid" and "SAS VA" integrated with gridengine?

2014-03-20 Thread Chris Dagdigian

Hi folks,

Looking for pointers, documentation URLs or even just personal anecdotes
regarding integrating a few SAS products within grid engine environments.

In this case I'm looking info/tips on two separate SAS products:

  SAS Grid
  SAS Visual Analytics

I have a few potential projects with the need for these tools, however
both may involve sensitive or patient-identifiable data so there is a
chance that they may get built into standalone IT silos and just use SAS
scheduling/dispatch tools. Would also be interested in feedback on the
pros and cons of 3rd party batch scheduler integration vs. the native
SAS stack as well.

Thanks!

Regards,
Chris



Re: [gridengine users] Please help with installation URGENT

2014-03-07 Thread Chris Dagdigian

Have you set a $JAVA_HOME environment variable? That is how I believe
the installer finds your Java environment.
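A quick way to sanity-check that before launching the installer (sketch; the JAVA_HOME mechanism is my reading of how the installer behaves, not documented fact):

```shell
#!/bin/bash
# Sketch: check whether the GUI installer is likely to find a JVM.
# Assumption: start_gui_installer locates Java via $JAVA_HOME.
if [ -n "${JAVA_HOME:-}" ] && [ -x "${JAVA_HOME:-}/bin/java" ]; then
    JAVA_STATUS=found
else
    JAVA_STATUS=missing
fi
echo "JAVA_STATUS=$JAVA_STATUS"
```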

The only other thing I see in your error logs is a lot of effort spent
generating SSL certificates and otherwise prepping for "Secure Mode",
which is a mode that almost nobody runs Grid Engine in. I'd recommend
disabling that unless it was intentional.

And finally, you don't really need the GUI for a single-machine install.

What happens when you skip the GUI and just do this?

# cd /home/abhinav/Downloads2/ge2011.11/
# mv default _default.old
# ./install_sgemaster
# ./install_execd

You may get much better diagnostic info using the command line
installation methods


-Chris



Abhinav Mittal wrote:
> Hi
> 
> I have been trying to install via gui on ubuntu.
> command: sudo ./start_gui_installer
> It is not able to detect JVM installed on my machine.
> My machine is acting both as qmaster and execution host.
> I am attaching 2 snapshots as well and error transcript.
> Please help.
> Abhinav Mittal
> IIIT Hyderabad India
> Center for Computational Natural Sciences and Bioinformatics


[gridengine users] OGS 2011.11 , Ubuntu 12.04 LTS and NFSv4

2013-10-10 Thread Chris Dagdigian


Hey folks,

Got a cluster running OGS 2011.11 (via the courtesy binaries from the
Dropbox download) that is having trouble when the NFSv4 share is getting
hammered by file access.

I'm 99% certain that this is an NFSv4/kernel/driver/Ubuntu 12.04 LTS
issue but wanted to check in to see if anyone has any awareness of
issues with OGS and Ubuntu 12.04 LTS or maybe any other oddities
regarding the use of NFSv4 over 10GbE.

We used to have more error messages but after upgrading the NIC driver
we only see this on the OS:

> xx-05: Oct  9 13:40:16 xx-05 kernel: [167190.710137] nfs4_reclaim_open_state: 
> unhandled error -13. Zeroing state


Primary symptom is nodes appearing to hang and lots of hung SGE ('t')
job states. I think this indicates that under the hood SGE is having
trouble logging state and spool info when the NFSv4 share runs into a
glitch, timeout, or error.

Like I said, this clearly feels like an NFSv4/OS/tuning issue, but I
wanted to check out of paranoia to see if anyone else has info or experience.

Next steps for us:

1. Move spooling to local disk
2. See if we can break the same way via NFSv3
3. Play with GlusterFS
4. Standard NFS tuning for OS/kernel
...
N. Maybe recompile or rebuild gridengine native on the OS


-dag




Re: [gridengine users] Cluster Data Management

2012-12-19 Thread Chris Dagdigian


My own random thoughts about the storage pod and random ideas.

(1) The backblaze pod we built (Rayson has my bioteam.net URL in his 
post below) cost roughly $12K for 100 terabytes of usable storage. Even 
with all the downsides and negatives to this particular hardware rig the 
"$12,000 for 100 usable terabytes" is still a disruptive price point and 
there is all sorts of room for innovation in what you might want to do 
with them.


(2) Caveat: This is not the only game in town. If you are willing/able 
to spend a bit more money for something more 'safe' in operational terms 
there are tons of chassis on the market now that would give you a large 
number of hot-swap drive bays. Go to siliconmechanics.com etc. and see 
their 'storage server' section for an idea of what you can get when you 
look at hardware that is a few levels above the super cheap pod concept


(3) The protocase people who did the hardware for this particular pod 
are working on a version 2 that addresses many of the hardware 
resiliency items I had blogged about. Their new chassis design will have 
redundant power and mirrored boot drives. So anyone looking at this type 
of unit should check with protocase to see what their plans are


Now to answer Rayson's question

-- I personally would like to combine 3 of these pods using object 
storage software like the stuff over at www.swiftstack.com. An 
individual pod is a dangerous and risky thing. Three or more pods with 
automatic replication and geographic separation is far more interesting. 
Hopefully I'll find someone to fund some of this work early in 2013 so I 
can kick the tires and write/talk about it.



So perhaps for the future of grid engine I'd like to see some sort of 
"understanding of object storage" or other storage systems that use 
RESTful HTTP for data operations. Maybe expand the file staging and 
"store my results in location X" to handle putting job output into an 
object. Heck the next extension after that would be support for the AWS 
S3 API so that we've got the IaaS cloud storage thing handled.


-Chris



Rayson Ho wrote:

Somewhat related to the email message below: I am going to play with a
StoragePod2.0  for the next project in 2013... The machine uses
off-the-shelf components like motherboard, CPU, 45 disks, memory,
power supply... and the real intellectual property that was released
by Backblaze is the 4U custom case (but it is still cool!).

http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/

When I first saw the machine, I thought that it was the Sun Fire x4500
painted in red!! To me, it is a whitebox, DIY version of the x4500,
but the StoragePod is much more affordable at less than US $10,000!

So my question is, besides the obvious functionalities like input file
staging & transfer, what are the useful features that can (and should)
be implemented in Grid Engine for data management? (Feel free to
contact me offline if your site has special requirements.)

BTW, while googling, found that Chris also has a StoragePod:
http://bioteam.net/2011/08/backblaze-storage-pod/

Rayson



[gridengine users] will changes to a hard limit in a queue config roll down into running jobs?

2012-11-15 Thread Chris Dagdigian


Quick question ...

I've got a job with a user running in a queue that has a 48 hour hard 
wallclock limit. The user is prepared to move into a long.q but his job 
is *almost* complete and will not go much past the 48h limit. Trying to 
see if I can preserve the job and not lose 48 hours of computation...


If I relax or remove the hard limit from the cluster queue config 
temporarily will the job be spared? Do changes to these limits get 
passed down dynamically to running jobs?  I can't remember if I've ever 
had this particular scenario come up before...
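For reference, the kind of change being contemplated might look like this (a sketch: `all.q` and the 72-hour value are placeholders, and whether a running job honors the new value is exactly the open question):

```shell
# Sketch with placeholder queue name and limit; requires SGE
# manager privileges. qconf -mattr modifies a single attribute
# of the named queue in place.
qconf -mattr queue h_rt 72:00:00 all.q
```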


Regards,
Chris




Re: [gridengine users] Fwd: Gridengine and Hadoop

2012-03-30 Thread Chris Dagdigian


I'm registering my interest here.

Reuti -- if you could pass my email along to Ralph I'd appreciate it.

I have several consulting customers using EMC Isilon storage on Grid 
Engine HPC clusters and we've been getting pinged from EMC/Greenplum 
sales reps pushing to show off the combination of native HDFS support in 
Isilon + the greenplum hadoop appliance integration.


Basically I have a few largish sites that could test & provide feedback 
if things work out. Some are commercial, some are .gov & all are 
interested in SGE + Hadoop enhancements.


-dag




Reuti wrote:

on behalf of Ralph Castain who you may know from the Open MPI mailing list I 
want to forward this eMail to your attention.

-- Reuti


>  I have a question for the Gridengine community, but thought I'd run it 
through you as I believe you work in that area?
>  
>  As you may know, I am now employed by Greenplum/EMC to work on resource management for Hadoop as well as MPI. The main concern frankly is that the current Hadoop RM (yarn) scales poorly in terms of launch and provides no support for MPI wireup, thus causing MPI jobs to exhibit quadratic scaling of startup times.
>  
>  The only reason for using yarn is that it has the HDFS interface required to determine file locality, thus allowing users to place processes network-near to the files they will use. I have initiated an effort here at GP to create a C-library for accessing HDFS to obtain that locality info, and expect to have it completed in the next few weeks.
>  
>  Armed with that capability, it would be possible to extend more capable RMs such as Gridengine so that users could obtain HDFS-based allocations for their MapReduce applications. This would allow Gridengine to support Hadoop operations, and make Hadoop clusters that used Gridengine as their RM be "multi-use".
>  
>  Would this be of interest to the community? I can contribute the C-lib code for their use under a BSD-like license structure, if that would help.
>  
>  Regards,

>  Ralph
>  
>  



Re: [gridengine users] Build on OS X

2012-02-07 Thread Chris Dagdigian

Hi Valerio,

To start with you'll have to hack up the "aimk" file so that it 
recognizes the arch and continues to try to compile. Start with any 
existing stanzas that cover compiling on darwin and customize as needed.
 

I used to build SGE on OS X but it's been ages and I don't have the 
source on my current laptop or else I could give better advice ...

-Chris



   	   
Valerio Luccio wrote on February 7, 2012:
Hello all.

We've been running an old version of SGE on Mac OS X servers from
10.2 through 10.6. I now decided to upgrade (actually I have to
upgrade) and downloaded the GE2011.11 tar ball. I compiled it on a
Linux box and then went to compile it on an OS X 10.6.8 server and
when I did the first "./aimk -no-java -no-jni -no-secure
-spool-classic -no-dump -only-depend" I got:

What am I missing ? What am I supposed to do to get it to compile ?

Thanks,



Chris Dagdigian
Principal Consultant, BioTeam Inc.
http://bioteam.net
My latest BioTeam blog post: 2012 Cloud Training Dates


Re: [gridengine users] installing qmaster and exec on Solaris 11

2012-01-30 Thread Chris Dagdigian


Geilow, John wrote:

Dumping bootstrapping information
Initializing spooling database
error: unknown object type for list attribute "SC_job_load_adjustments" in function ""

That (above) is a real error and generally would break an install for 
most people. You need a working spooling setup in the $SGE_CELL 
directory for the remainder of the install to proceed. 

What happens when you try "classic" spooling mode? 

Regards,
Chris






  
Chris Dagdigian
Principal Consultant, BioTeam Inc.
http://bioteam.net






[gridengine users] best way to instrument/troubleshoot a segfaulting sge_qmaster?

2012-01-24 Thread Chris Dagdigian


Hi folks,

I've got a fresh set of GE2011 binaries where the sge_qmaster segfaults 
almost instantly on startup.


Looking for quick tips on instrumenting or dialing up the debug data to 
the point where I can get useful error data. Is the best method still to 
try strace or are there other options I should be trying?


-Chris



Re: [gridengine users] deciding spool directory location

2012-01-13 Thread Chris Dagdigian


That's an awesome epilog script Reuti! I might modify it so that a user 
can trigger a request for the archive but it's disabled by default. That 
would be a pretty excellent debug tool...


Thanks again!

-dag
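A sketch of that opt-in variant (hypothetical; it assumes the user asks for the archive by submitting with `qsub -v SAVE_SPOOL=1`, and otherwise does nothing):

```shell
#!/bin/bash
# Hypothetical opt-in epilog: tar the job's spool directory next
# to the job's stdout file, but only when the user requested it
# by submitting with "qsub -v SAVE_SPOOL=1".
archive_spool() {
    [ "${SAVE_SPOOL:-0}" = "1" ] || return 0
    tar -C "${SGE_JOB_SPOOL_DIR%/*}" -czf \
        "${SGE_STDOUT_PATH%/*}/${SGE_JOB_SPOOL_DIR##*/}.tgz" \
        "${SGE_JOB_SPOOL_DIR##*/}"
}
archive_spool
```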


Reuti wrote:

Am 13.01.2012 um 17:33 schrieb Chris Dagdigian:


Whoa. If there is a tool out there that gives users access to debug and info 
from the spool area I'd love to hear about it and get it out into the 
community.  One of the downsides to spool locations is that they are usually 
only accessible to admins.


Because it is on a different machine like a node? The default permissions allow 
everyone to read it. As a small epilog:

  #!/bin/bash
  tar -C ${SGE_JOB_SPOOL_DIR%/*} -czf ${SGE_STDOUT_PATH%/*}/${SGE_JOB_SPOOL_DIR##*/}.tgz ${SGE_JOB_SPOOL_DIR##*/}

and you get an archive in the directory where stdout is written.

-- Reuti



One of my minor gripes about Grid Engine is the lack of debug/troubleshooting stuff that is 
available to non-admin users who don't have sudo or root access. One of the last good systems providing 
data to regular users about "why is my job not scheduled" is now losing ground since 
"schedd_job_info=false" started being deployed on high-volume clusters.

Even if there is a tool out there that can't be shared it would be great if 
someone could talk about the methods used -- maybe we can gin up an equiv 
utility for the community...

dag



Dave Love wrote:

Not just the administrator, actually.  There's stuff which isn't
accessible via qacct but can be useful for users to get post mortem
information about failures.  Mark Dixon has a tool which grovels it
(unpublished?, hint).






Re: [gridengine users] deciding spool directory location

2012-01-13 Thread Chris Dagdigian


Whoa. If there is a tool out there that gives users access to debug and 
info from the spool area I'd love to hear about it and get it out into 
the community.  One of the downsides to spool locations is that they are 
usually only accessible to admins.


One of my minor gripes about Grid Engine is the lack of 
debug/troubleshooting stuff that is available to non-admin users who 
don't have sudo or root access. One of the last good systems providing data 
to regular users about "why is my job not scheduled" is now losing 
ground since "schedd_job_info=false" started being deployed on 
high-volume clusters.


Even if there is a tool out there that can't be shared it would be great 
if someone could talk about the methods used -- maybe we can gin up an 
equiv utility for the community...


dag



Dave Love wrote:

Not just the administrator, actually.  There's stuff which isn't
accessible via qacct but can be useful for users to get post mortem
information about failures.  Mark Dixon has a tool which grovels it
(unpublished?, hint).



[gridengine users] My notes on building Open GridScheduler 2011.11 on RedHat/CentOS 6.x based systems

2012-01-12 Thread Chris Dagdigian


Tried to reverse engineer my crusty old build environment into something 
that I (or even others) can actually replicate or follow.


Going to try similar for 32bit binaries as well as document the process 
for RHEL/CentOS 5.x based systems in the near future...


Short link:
http://biote.am/6y

Long link:
http://bioteam.net/2012/01/building-open-grid-scheduler-on-centos-rhel-6-2/

Feedback welcome.

-dag





Re: [gridengine users] deciding spool directory location

2012-01-12 Thread Chris Dagdigian

Hi Dale,


We are trying to determine where the spool directory should reside based on 
performance versus ease of administration. Can somebody explain how ease of 
administration would be made easier?


Here is a short answer:

When the spool directory is shared it is far easier for an administrator 
to troubleshoot node-specific job issues, because you can see and access 
all of the spool data from one place without having to hop to a specific machine.


When spool is not shared your spool data and messages are on local disk 
on the compute nodes. This means that you have to connect to that node 
in order to read or examine the files.


More detail ...


The decision to do shared or not-shared generally revolves around the 
power of your NFS server, what else is talking on that same 
network/subnet/vlan/wire and probably more importantly how many jobs you 
might be running through your system during a day. The number of jobs 
entering and exiting the system is the real factor in how often and how 
hard your spool share is getting hit. Some of my pharma clusters run 
hours-long jobs and might only do a few hundred or thousand jobs per 
day. Another biotech cluster of similar size might be doing 150,000 jobs 
per day running short chemical simulations.


My gut answer is usually to do shared-spool first and only move away 
from that if performance demands it. Changing the spooling location 
post-install is not a huge deal.


I'm also a classic spooling zealot. I hate berkeleydb spooling and even 
on the 2000 core cluster that does 150,000 jobs per day we still use 
classic spooling on a NFS shared SGE Root and spool. We are, however, 
using Isilon scale-out NAS for the NFS and that means we have no real 
performance issues at all.


My $.02

-Chris





Re: [gridengine users] More Univa FUD???

2012-01-11 Thread Chris Dagdigian


Rayson Ho wrote:

And finally, thanks Chris for not selling gridengine.org. We started
telling people to subscribe to this list since late last year on the
Open Grid Scheduler homepage, and hopefully gridengine.org will not be
sold in the foreseeable future.


History time! -- I bought gridengine.org and gridengine.info many years 
ago (2005, whoa!) after discovering that domain squatters had already 
grabbed the .com and .net versions -- that made me mad and grabbing 
.info and .org seemed like a good defensive move.


My intent was to use the lesser .info for a "personal" site/blog that 
was unaffiliated with whatever employer or job I had at the time and I 
was thinking of giving/transferring the .org domain to the SGE-dev team 
at Sun. The http redirect I had for gridengine.org -> 
gridengine.sunsource.net was supposed to be temporary.


All things considered I'm glad the .org handover to Sun did not occur.

Anyway, consider this an explicit promise not to sell, "monetize", or 
place ads on gridengine.org -- it's not going anywhere.


It would have been awesome if there was one single open source fork that 
I could have given the domain name to but since that seems unlikely I 
think it would be best used as a central place to put "stuff" that is 
useful to all the forks and our community -- for right now that's mainly 
the mailing list and archives but in the future maybe a new wiki or even 
central buildbot-style courtesy binary generator or regression testsuite 
that services all the projects. The site is already running on an Amazon 
AWS instance so expanding/scaling it up is not super difficult.


-dag



Re: [gridengine users] Restoring SGE accounting file after re-build

2012-01-04 Thread Chris Dagdigian

Almost!

I'm not near an SGE install but there is one other file you need to 
worry about. It's a text file that contains a simple integer value for 
the "next" SGE job ID.


The file is called "jobseqnum" and it's found at spool/qmaster/jobseqnum.

You don't have to restore it from backup, just find the current file in 
your system and edit it appropriately while SGE is not running.
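A sketch of that edit (the counter value comes from the thread below, where the last job ID in the restored accounting file was 12924; the real location is your qmaster spool directory, and the qmaster must be stopped first):

```shell
#!/bin/bash
# Sketch: seed the next-job-ID counter. QMASTER_SPOOL would
# normally be $SGE_ROOT/$SGE_CELL/spool/qmaster; it falls back
# to a scratch directory here so the sketch runs anywhere.
QMASTER_SPOOL=${QMASTER_SPOOL:-$(mktemp -d)}
NEXT_ID=12925   # one past the last job ID in the restored accounting file
echo "$NEXT_ID" > "$QMASTER_SPOOL/jobseqnum"
cat "$QMASTER_SPOOL/jobseqnum"
```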





Gowtham wrote:


Before rebuilding one of our clusters
(Rocks 5.4.2), I happened to backup

/opt/gridengine/default/common/

folder that contains a file 'accounting'
(~ 3.2MB, last job ID being 12924).

After the rebuild, I don't yet see
'accounting' in the aforementioned location.

So, would it be safe to assume that if I place
the backed up file in that location with
appropriate ownership and permissions, SGE
will continue numbering newer jobs from 12925?

Best,
g

--
Gowtham
Information Technology Services
Michigan Technological University

(906) 487/3593
http://www.it.mtu.edu/




Re: [gridengine users] More Univa FUD???

2011-12-16 Thread Chris Dagdigian



> First they closed the
> source code, now they are taking over the 'Grid Engine' name. What will
> they do next??

my $.02 ...


Mark, if you wanna see a textbook example of infantile FUD in action, all 
you need to do is read your own blog at 
http://gridenginetruth.blogspot.com/


gridengine.com has been owned by a domain speculator or some other 
value-less operator since at least 2005 when I bought the 
gridengine.info and gridengine.org domains. Just like gridengine.net is 
being used right now in fact.


gridengine.com has *never* been used for anything GE related and a quick 
visit to archive.org will show that it's basically hosted a domain 
parking page for years.


And now ... it redirects ... to univa.com ... the company that hired a 
bunch of the GE development team ... The company that sells and supports 
a GE variant ... the company with a product called ... (drumroll) ... 
"Grid Engine".


FUD != "company purchasing a domain name from a speculator that 
accurately describes the product they are developing and selling"


Wait until someone does a google search for "open grid scheduler" and 
notices that Univa might have bought placement around those keywords. 
That's totally gonna blow some minds. Maybe I should get in front of 
that outrage train and label it UltraMegaUberFUD :)












Re: [gridengine users] SGE (univa 8.0.1) - anyone running SGE with Centrify active directory integration?

2011-11-23 Thread Chris Dagdigian

William Hay wrote:
> As others have pointed out community support for closed source
> versions is necessarily limited but nothing stops us from having a go.
>   As Univa and Oracle diverge from the open source versions this will
> become harder though.


Just wanted to mention on the list and in public that many smart people 
that I respect pointed this same observation out. Sorry for my "doh!" 
moment last night ...


-chris




Re: [gridengine users] SGE (univa 8.0.1) - anyone running SGE with Centrify active directory integration?

2011-11-22 Thread Chris Dagdigian

Hi Rayson,

Did not mean to imply that it was you who made those statements - I 
actually thought you were referring to or quoting someone else who had 
attempted in the past to dictate what the community list can be used 
for. All I wanted to say was that nobody can dictate how this list is 
used and it would be a shame if our mailing-list/support group started 
to fragment up.


That said, however, I 100% understand that support can't be a given when 
the source is not available. I'll keep my UD-product questions to Univa 
when I'm dealing with their versions.


-dag


Rayson Ho wrote:

Hi Chris,

I *DID NOT* say that all discussions related to Univa Grid Engine had
to be banned. As we don't have the Univa Grid Engine source code, we
just can't debug the problem. That's basically the same reason Bill
asked others to turn to Oracle for help with issues related to Oracle
Grid Engine back in March 2011:

http://gridengine.org/pipermail/users/2011-March/000288.html

Not only do we not have the source code, Univa Grid Engine is not
even available as a free download. I am not interested in registering
for Univa's trial license, and this means we can't reproduce the
problem.

When I said that "we also cannot help users using UGE which is also
not opensource", I meant that we, as people who have answered literally
thousands of questions on the Grid Engine mailing lists, have to
speculate whether it is a bug in UGE or really a limitation.
This is not productive, and you have to understand that our time, like
everyone's, is not free. We don't have 48 hours a day while Univa only
has 24. And besides, Univa knows how to solve its own customers'
issues.



Like many users of the Oracle branded SGE I was hopefully assuming faster
and more targeted support would be available ... Don't we have a history going
back (like forever?) of doing that?


We (at least Ron & myself, I can't speak for Reuti) helped lots of
new users install & set up Grid Engine when Sun was in charge. Back in
those days, Reuti, Ron, & I (and others) often responded in minutes,
and we enjoyed working with Sun because Sun contributed the Gridware
source code to the open source community, and the product was
completely open source.

However, responding to mailing list messages actually took time away
from us to do useful things. If you are using Univa Grid Engine, then
you are paying a Univa customer (since it is commercial only). Univa
has support engineers and they are the people who are hired to support
Univa customers.

Now Chris, tell me why my (as well as Reuti's) original response was
not a fair & accurate answer.

Rayson



On Tue, Nov 22, 2011 at 3:54 PM, Chris Dagdigian  wrote:

Like many users of the Oracle branded SGE I was hopefully assuming faster
and more targeted support would be available from the smart people who
inhabit the users- list. Don't we have a history going back (like forever?)
of doing that?

Univa support is going to be my 2nd stop mainly because I'd expect it to
take longer to open, troubleshoot and resolve via a formal ticketing system.
I think my issue has more to do with PAM and NIS than any deep SGE issue.

I really have not been paying much mind to the fireworks on this list
recently but if the end result is that people are going to shun Oracle and
Univa customers on this mailing list then let me be the first person to
complain how unfortunate and sad this state of affairs has become.

Will we shun ScalableLogic customers next?

Let me be perfectly blunt as nominal owner of the gridengine.org domain name
- there is nobody on earth who has the authority (informal or otherwise) to
declare that this mailing list can't be used to support or not-support any
particular variant of Grid Engine. We've never given anyone that authority,
nor should we.

I don't want to start an entirely new complaint-fest so let me withdraw my
question. I'll go straight to Univa on this one.



dag




Bill Bryce wrote:

Hi Chris,

I think the best way is to log this as an issue at Univa and we can go
from there.  Is this cluster for your personal use or are you configuring it
on behalf of a customer?  You can send an email to supp...@univa.com or
login to the support portal http://www.univa.com/support and we can help.

Regards,

Bill.

On 2011-11-22, at 3:32 PM, Reuti wrote:


Hi Chris,

Am 22.11.2011 um 21:05 schrieb Chris Dagdigian:


I'm hands-on with a shiny new cluster running Univa's 8.0.1 release and
am having some issues running jobs as a non-root user via an account that
lives in Active Directory.

isn't Univa offering "Full, Enterprise Class Support"? I thought this is
one of the advantages over the community support for the open source
version. So I would assume they have their own forum/list like Oracle does
for their version:

https://forums.oracle.com/forums/forum.jspa?for

Re: [gridengine users] SGE (univa 8.0.1) - anyone running SGE with Centrify active directory integration?

2011-11-22 Thread Chris Dagdigian


Like many users of the Oracle branded SGE I was hopefully assuming 
faster and more targeted support would be available from the smart 
people who inhabit the users- list. Don't we have a history going back 
(like forever?) of doing that?


Univa support is going to be my 2nd stop mainly because I'd expect it to 
take longer to open, troubleshoot and resolve via a formal ticketing 
system. I think my issue has more to do with PAM and NIS than any deep 
SGE issue.


I really have not been paying much mind to the fireworks on this list 
recently but if the end result is that people are going to shun Oracle 
and Univa customers on this mailing list then let me be the first person 
to complain how unfortunate and sad this state of affairs has become.


Will we shun ScalableLogic customers next?

Let me be perfectly blunt as nominal owner of the gridengine.org domain 
name - there is nobody on earth who has the authority (informal or 
otherwise) to declare that this mailing list can't be used to support or 
not-support any particular variant of Grid Engine. We've never given 
anyone that authority, nor should we.


I don't want to start an entirely new complaint-fest so let me withdraw 
my question. I'll go straight to Univa on this one.




dag




Bill Bryce wrote:

Hi Chris,

I think the best way is to log this as an issue at Univa and we can go from 
there.  Is this cluster for your personal use or are you configuring it on 
behalf of a customer?  You can send an email to supp...@univa.com or login to 
the support portal http://www.univa.com/support and we can help.

Regards,

Bill.

On 2011-11-22, at 3:32 PM, Reuti wrote:


Hi Chris,

Am 22.11.2011 um 21:05 schrieb Chris Dagdigian:


I'm hands-on with a shiny new cluster running Univa's 8.0.1 release and am 
having some issues running jobs as a non-root user via an account that lives in 
Active Directory.

isn't Univa offering "Full, Enterprise Class Support"? I thought this is one of 
the advantages over the community support for the open source version. So I would assume 
they have their own forum/list like Oracle does for their version:

https://forums.oracle.com/forums/forum.jspa?forumID=859

-- Reuti



The cluster is the standard sort of RHEL 5.7 based system but we are using 
Centrify and in particular the Centrify NIS-gateway-to-ActiveDirectory to 
service the cluster nodes without having to license centrify on all nodes in 
the cluster.

The user errors I see are familiar ones:

"can't get password entry for user "x". Either user does not exist or NIS 
error!"

The confusing thing is that I can SSH into compute nodes as the same user and 
both password logins and passwordless SSH work perfectly. It's only when 
running under SGE that the jobs fail.

If I had to guess I'd wonder first if SSHD was using Linux /etc/pam.d/ in a way that 
"works" while SGE is accessing PAM in some way that we have not configured 
properly yet. That's only a guess though.

Does anyone have examples of SGE running via NIS authentication or via 
Centrify? Any examples of PAM configuration that were needed to get NIS users 
recognized under SGE?

Thanks!

-Chris


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users




William Bryce | VP of Products
Univa Corporation - 1001 Warrenville Road, Suite 100 Lisle, Il, 65032 USA
Email bbr...@univa.com | Mobile: 512.751.8014 | Office: 416.519.2934


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] SGE (univa 8.0.1) - anyone running SGE with Centrify active directory integration?

2011-11-22 Thread Chris Dagdigian


Hi folks,

I'm hands-on with a shiny new cluster running Univa's 8.0.1 release and 
am having some issues running jobs as a non-root user via an account 
that lives in Active Directory.


The cluster is the standard sort of RHEL 5.7 based system but we are 
using Centrify and in particular the Centrify 
NIS-gateway-to-ActiveDirectory to service the cluster nodes without 
having to license centrify on all nodes in the cluster.


The user errors I see are familiar ones:

 "can't get password entry for user "x". Either user does not exist or 
NIS error!"


The confusing thing is that I can SSH into compute nodes as the same 
user and both password logins and passwordless SSH work perfectly. It's 
only when running under SGE that the jobs fail.


If I had to guess I'd wonder first if SSHD was using Linux /etc/pam.d/ 
in a way that "works" while SGE is accessing PAM in some way that we 
have not configured properly yet. That's only a guess though.


Does anyone have examples of SGE running via NIS authentication or via 
Centrify? Any examples of PAM configuration that were needed to get NIS 
users recognized under SGE?


Thanks!

-Chris


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] cannot run in PE ... because it only offers 0 slots

2011-11-18 Thread Chris Dagdigian


Check the value of "pe_list" in your queue configuration. The MPI PE you 
are trying to use is not listed in the pe_list parameter for the queue 
you are submitting to.  The queue you show only has "make" as a 
supported PE.
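A minimal fix sketch, using the queue and PE names from the thread 
(`long1`, `mpi_labo`); `qconf -aattr` appends to a list attribute 
without opening an editor (run as an SGE admin user):

```shell
# Append the mpi_labo PE to the long1 queue's pe_list.
qconf -aattr queue pe_list mpi_labo long1

# Verify; pe_list should now show both "make" and "mpi_labo".
qconf -sq long1 | grep pe_list
```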


-Chris


Gerard Henry wrote:

hello all,

I'm having trouble configuring a queue on SGE 6.2u5 (Linux).

I have two amd64 machines with this topology: SCCSCC, so the total
number of cores is 8.

first, i defined a group:
# qconf -shgrp @qlong
group_name @qlong
hostlist charybde scylla

then a queue:
# qconf -sq long1
qname long1
hostlist @qlong
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 4
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY

but when i try to submit a job, it fails with:
% qsub -w v ./script1.sh
Job 14431 cannot run in PE "mpi_labo" because it only offers 0 slots

the beginning of the script is:
...
#$ -q long1
#$ -pe mpi_labo 6


and the PE is defined by:
qconf -sp mpi_labo
pe_name mpi_labo
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE


If I try to submit with "-pe mpi_labo 4", it works. What am I missing?

I also tried to raise the value:
qconf -mq long1
slots 8
but in this case the program executes its 8 threads on the same host,
which is not what I want.

thanks in advance for help,

gerard


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users



Re: [gridengine users] Re: `cloud' nodes

2011-11-07 Thread Chris Dagdigian


I have Chef recipes for Grid Engine that will auto install SGE master 
and execd onto a single node. We also have recipes that built full-blown 
SGE clusters but largely stopped doing that in favor of StarCluster - we 
ended up adding modules to StarCluster to do the customizations we 
needed when running on AWS.


I can't clean up those recipes in time to share them but keep an eye on 
(or an RSS reader) bioteam.net as we eventually get to a point where we 
have our act together enough to publish that github repo of our useful 
Chef stuff.


And to be honest there is nothing about Chef + SGE that you can't learn 
on your own by looking at other Recipes and Cookbooks - that's how I 
learned Chef. Grab a cookbook or two that you know you'll need from 
http://community.opscode.com/cookbooks and start looking at the code.


If I can find the time to clean up my "build SGE onto a single node" 
Recipe I'll post it to this list and on bioteam.net. Since the cookbook 
creates an SGE autoinstall template file that is compatible with 
"./inst_sge -x ..." it should be pretty easy to extend it to install 
multiple compute nodes.


-Chris







Chi Chan wrote:

So no one  has Puppet or Chef Recipes for SGE?

--Chi




----- Original Message -----
From: Chi Chan

Anyone has Opscode Chef recipes for SGE Grid Engine? I want to setup a simple 
test cluster and try IT automation and see if it is really useful.


--Chi




----- Original Message -----
From: Rayson Ho
To: Jesse Becker
Cc: Chi Chan; Kristen Eisenberg;
"users@gridengine.org"
Date: 2011/10/18 (Tue) 5:08 PM
Subject: Re: [gridengine users] Re: `cloud' nodes

On Tue, Oct 11, 2011 at 4:39 PM, Rayson Ho  wrote:

On Tue, Oct 11, 2011 at 3:58 PM, Jesse Becker  wrote:

We're getting a bit off topic here, but CFEngine fits one of your
requirements (but probably not the other). It is written in C, is quite
fast, and has a much lower resource footprint than anything based on
Ruby.

Like many other things, some people are against configuration
management and some aren't...


Re-replying to Jesse's message...

I attended an IBM training on BladeCenter recently, and talked to an
IBMer during lunchtime.

He mentioned U of Toronto's Scinet, which is a TOP500 supercomputer
(highest ranking: #16 in Jun 09). Scinet uses IBM's xCAT, which is
developed for HPC clusters, for provisioning.

In fact some other people use xCAT as a replacement for Platform's
Scali Cluster Manager:

http://www.nodeofcrash.com/?p=353

xCAT is written in Perl, so it also is another package to install & maintain.

Rayson






It all comes down to, is it cheaper to manually manage a cluster by
hand, or should we use tools like Chef, Puppet, or Tivoli and hire 1
less person.

(But for some HPC sites, running tools in the background is not
possible or acceptable. For example, the original Catamount OS in
DoE's Red Storm could only run 1 single-threaded process at a time on
the compute PEs.)

I brought up the StarCluster Chef integration because doing things by
hand is not possible in the EC2 scale - in the pre-cloud days, you can
setup machines 1 at a time, but when you can launch hundreds of
machines in minutes, doing things by hand is way too slow & expensive.

BTW, some tutorial videos by Justin & Chris:

- StarCluster 0.91 Demo: http://www.youtube.com/watch?v=vC3lJcPq1FY

- Launching a Cluster on Amazon EC2 Spot Instances Using StarCluster:
http://www.youtube.com/watch?v=2Ym7epCYnSk

Rayson




However, it is probably not "simpler to learn" by a long shot.
Configuration management is deceptively complex once you get beyond the
"golden master" view of the world.




On Tue, Oct 11, 2011 at 12:45 PM, Rayson Ho  wrote:
2) And the BioTeam integrates StarCluster with Opscode Chef, so you
can automate many of the administrative tasks (create users, package
management, service setup, etc) of EC2 SGE clusters:

http://bioteam.net/2011/03/dude-you-got-some-chef-in-my-starcluster/

While I have more experience with IBM Tivoli & Puppet, I am really
impressed with the Chef EC2 module. And Chef is gaining quite a lot of
momentum lately. E.g. Dell recently open sourced Crowbar, which is an
OpenStack installer based on Chef.

I will wait for Puppet Enterprise 2.0, which is supposed to have new
EC2 & VMware provisioning & orchestration capabilities, and see how
Puppet compares with Chef before I decide if I am switching to Chef.
But configuration management is real and it can cut down a lot of IT
infrastructure maintenance.

Rayson



On Mon, Oct 10, 2011 at 7:01 PM, Kristen Eisenberg wrote:

Chris Dagdigian  writes:


By FAR the best way to run standalone Grid Engine clusters on the Amazon
Cloud today is to simply use MIT Starcluster :

http://web.mit.edu/stardev/cluster/index.html

I didn't mention it as I got the impression that that wasn't the OP's
case, but probably it shoul

Re: [gridengine users] OT: IBM to acquire Platform Computing!

2011-10-11 Thread Chris Dagdigian


On a related note I was talking to a former Platform person who I'm sure 
many of us know on this list and he mentioned that the stripped-down 
older variant of Platform LSF that Platform produced back in the day 
("lava") has a new open source home and developer group:


 http://openlava.net/

-Chris



Rayson Ho wrote:

http://www.platform.com/press-releases/2011/IBMtoAcquireSystemSoftwareCompanyPlatformComputingtoExtendReachofTechnicalComputing

Not sure what's going to happen to Loadleveler...

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] differentiating queues/hosts in a heterogenous system

2011-10-05 Thread Chris Dagdigian


That's the right way to do it, but you don't need to do it at the queue 
level if you don't want to.


You can assign attributes to the nodes themselves and then request them 
like...


qsub -hard -l resourceX=TRUE ./path-to-my-job.script

That will run on any queue and only on hosts where the boolean 
comparison matches.


The advantage of tagging hosts and requesting those host-specific tags 
is that you don't have to create and manage a pile of queues with 
different resources attached. The philosophy of SGE is "minimal queues 
with each user responsible for requesting the resources he/she needs in 
order to be successful"
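As a sketch of that host-tagging workflow (run as an SGE admin; the 
complex name "resourceX" and hostname "node01" are illustrative, not 
from the thread):

```shell
# 1. Add a requestable boolean complex: append one line to the complex
#    configuration and load it back.
qconf -sc > /tmp/complexes
echo "resourceX resourceX BOOL == YES NO FALSE 0" >> /tmp/complexes
qconf -Mc /tmp/complexes

# 2. Tag only the hosts that actually have the resource.
qconf -mattr exechost complex_values resourceX=TRUE node01

# 3. Jobs then opt in at submit time, regardless of which queue they land in.
qsub -hard -l resourceX=TRUE ./path-to-my-job.script
```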


-Chris



Rick Reynolds II wrote:

We're planning to use SGE to send jobs to a set of worker machines.  Those 
worker machines are each connected to specific and different pieces of 
hardware.  So a specific job would need to be mapped to a specific worker 
machine that was connected to the appropriate hardware.

The worker machines themselves all look pretty much the same (same kind of machines, same 
OS, etc.).  So I'm looking at differentiating the queues via the complex attributes.  E.g. 
giving each queue a Type via a string value that must be matched by 'qsub 
-l=' kinds of commands.

Is this the only option in SGE for differentiating queues/machines based on attributes 
that aren't part of the worker machine itself?  I.e. are the complex attributes the only 
way of "tagging" machines or queues?

Thanks,
Rick Reynolds

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] oracle's online course for gridengine, any good?

2011-09-29 Thread Chris Dagdigian


The biggest complaint about Sun's SGE training classes was that the 
instructors were professional trainers rather than people who had 
actually used Grid Engine. That matters a lot in technical training - 
the war stories & "mistakes I've made" anecdotes are pretty important.


Not sure about their online offering, never seen it. The training 
materials were pretty solid back in the day so if they just ported them 
online they might still be very useful.


I've also uploaded all of my Grid Engine training slides and materials 
to the bioteam blog over time. Both my SGE Admin and User materials are 
up there (but somewhat out of date ...)


http://www.bioteam.net/2011/03/grid-engine-for-users/

-Chris



Rick Reynolds II wrote:

Does anyone have info about Oracle's gridengine online course: 
http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getCourseDesc?dc=D63323GC10&p_org_id=47&lang=US
 ?

We're looking at deploying three or four small-ish (i.e. 30 nodes or so) 
clusters run by gridengine as a replacement for a home-grown queueing system 
that is showing its age.  I'd like to develop a little proficiency in the 
system before attempting to develop something using it.

So is the course worthwhile?  It's certainly cheap enough to tempt me to give 
it a shot ($300).  Or will I be in about the same place if I just keep reading 
every doc I can get my hands on from oracle's site?

Thanks,
Rick Reynolds

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Which Grid Engine?

2011-09-08 Thread Chris Dagdigian


I recommend Univa all the time to environments where local SGE expertise 
may be limited or if commercial support looks like it will be needed or 
desired.


I also maintain handbuilt binaries of the open source forks and I've 
used Dave L's 'son of gridengine' codebase on two different client 
clusters and so far they are working out just fine.


My long term goal is still to setup a buildbot that does nightly or 
weekly builds of the open codebases, possibly integrated with the nice 
Atlassian tools. Won't happen until at least October at this point 
though given my travel and work schedule.


My $.02

-Chris


William Hay wrote:

Does anyone want to sing the praises and explain the advantage of the
various other variants of Grid engine out there
(Univa's open core, Oracle Grid Engine, Open Grid Scheduler, the Love
child variant or even using some linux distro
which includes an SGE variant)?

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Chris Dagdigian

I think I learned this trick from Reuti:

 - Any legit job running under Grid Engine will be a child process of 
an sge_execd daemon.


A nice little trick is a cronjob that does a "kill -9" on any user 
process that is not a child of sge_execd -- that will quickly send a 
message to the people bypassing the resource scheduling layer.
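A hedged, dry-run sketch of that cron job (Linux procps `ps` assumed; 
the UID >= 1000 cutoff for "normal users" is an assumption, adjust for 
your site, and swap `echo` for `kill -9` only once you trust it):

```shell
#!/bin/bash
# Return 0 if the process's ancestor chain contains sge_execd.
is_sge_child() {
    local pid=$1 comm
    while [ -n "$pid" ] && [ "$pid" -gt 1 ] 2>/dev/null; do
        comm=$(ps -o comm= -p "$pid") || return 1
        [ "$comm" = "sge_execd" ] && return 0
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
    return 1
}

# Dry run: report user processes (UID >= 1000) with no sge_execd ancestor.
for pid in $(ps -eo pid= -o uid= | awk '$2 >= 1000 {print $1}'); do
    is_sge_child "$pid" || echo "would kill PID $pid"
done
```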


That said, however, I've been in this position in a number of 
environments and I can tell you that you will NEVER win the battle with 
users trying to game the system. The motivated user will always have 
more time and more incentive than an overworked cluster administrator.


While simple technical measures like that "kill -9" trick or Reuti's 
more sensible suggestion of blocking interactive SSH access to nodes 
outside of SGE should be pursued I'd suggest that you don't spend much 
more time than that developing technical countermeasures.


The real way this gets solved in a multi-user cluster environment is by 
treating acceptable cluster usage as a human resources policy. You'll 
never win a technical battle with a motivated power user.


Acceptable cluster use should be governed by a published policy and when 
the policy is avoided or gamed then the response should involve mentors, 
managers or the HR department, not technology or scripts.


In a corporate setting this comes down to:

1. First time you bypass SGE the admins send you a warning

2. Second time you get caught your manager gets notified

3. Third time? Account is disabled and you are reported to the HR 
department for violating company policy repeatedly


Sorry for being long-winded, but most long-time cluster admins might 
share my opinion that cluster use policies can't be treated as a 
technical war between admins and users -- it's far easier and better to 
treat this as a workplace behavior thing.


-Chris






Reuti wrote:

Hi,

Am 19.08.2011 um 18:30 schrieb Gowtham:


In some of the computing clusters across our campus, we have noticed many users 
running their jobs outside of the SGE queuing system. While we have plans to 
continue tutoring them about the benefits of using a queuing system, not 
everyone seems to be getting the message - as such, these
violating-users' jobs are hampering those who have been
using SGE.

On all our Rocks based clusters, we do keep the list of
cluster's uses in a flat text file, one user per line.

Is there a way by which I (as root) can kill all those
jobs submitted outside of SGE on compute nodes by these
normal users?

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] tight openmpi integration - how to alter hostnames for selected exechosts?

2011-08-17 Thread Chris Dagdigian


Thanks Joe & Reuti -


[cdagdigian@master ib-mpi-tests]$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)



[cdagdigian@master ib-mpi-tests]$ ompi_info | egrep '(rdma|openib)'
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.4.3)
 MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.4.3)
[cdagdigian@master ib-mpi-tests]$



Looks like the environment is correct, I'll take Joe's advice and 
explicitly turn off TCP support and see what happens.


Regards,
Chris




___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] tight openmpi integration - how to alter hostnames for selected exechosts?

2011-08-17 Thread Chris Dagdigian

Hi folks,

I'm sorta stymied by the magic of effortless openmpi tight integration 
with SGE and am wondering how best to proceed...


Here is my situation:

- Cluster has nodes named "node1 ... nodeN"
- Cluster also has IB NICs in each node
- Cluster hosts file declares the IB interfaces as "inode1 ... inodeN"

So my basic situation is that the hostname of the compute node is 
different if I want to explicitly invoke the infiniband interface and 
network. I need to use "inodeN" instead of "nodeN" for my MPI hosts.


In the bad old days of loose MPI integration I'd just intercept the 
temporary hostfile generated by the pe_starter method and just run a 
quick regex on it to change all mentions of "node" to "inode" and I'd be 
done - the mpirun command would be force fed a machines file that 
explicitly names the infiniband-associated hostnames.


However with the magic/automatic support that SGE has for OpenMPI there 
is no written MPI hosts file that I can find ($TMPDIR/hosts does not 
exist in the job context) -- the SGE scheduler just sends the selected 
host set directly to the OpenMPI starter process and in my case it seems 
clear that SGE is sending the "ethernet" hostnames instead of the IB 
hostnames and thus my shiny IB fabric is being ignored in favor of 
running MPI over the ethernet links.



So my basic question is "how to force tightly integrated openmpi to use 
a (slightly) different set of hostnames so that the IB fabric is 
actually used ..."


Right now I'm thinking of mirroring part of the loose integration method 
and writing a simple pe_starter method that will take $pe_hosts and 
translate it into a hostfile that has the 'nodeN' to 'inodeN' regex 
applied. Then I can modify my job scripts to force mpirun to accept a 
machinesfile or hostfile argument.
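The regex step itself is trivial; here is a runnable sketch with a 
faked `$pe_hostfile` (real PE hostfiles have the format 
`host slots queue processor-range`; only the first column matters to 
the mpirun hostfile):

```shell
# Fake the hostfile SGE would hand a start_proc_args/pe_starter script.
pe_hostfile=$(mktemp)
printf 'node1 4 all.q@node1 UNDEFINED\nnode2 4 all.q@node2 UNDEFINED\n' > "$pe_hostfile"

# Rewrite only the leading hostname column: nodeN -> inodeN.
sed 's/^node/inode/' "$pe_hostfile" > "${pe_hostfile}.ib"
cat "${pe_hostfile}.ib"
```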


Is there a better way ?

Also, is there a better way to "prove" what network/interface endpoints 
openmpi is using? So far for debugging I've been using the following 
options to sorta prove to myself that the non-IB network is being used:



$MPIRUN --display-devel-allocation --display-allocation --verbose 
--show-progress


and by running that command through SGE and then outside of SGE with a 
manual hostfile using the IB interface I see enough difference in output 
to be convinced that SGE is routing jobs through the ethernet network.



Thoughts, clues and tips appreciated!

-Chris





___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] 3rd party 'qtcsh' build failing on son-of-gridengine?

2011-08-11 Thread Chris Dagdigian


Hi folks,

One of the nicer outcomes of the post-Oracle world is how easy it's 
starting to become to actually build SGE source ...


I'm currently trying to build an x86_64 version of the latest 
son-of-gridengine and am running into an issue with 'qtcsh'


All of the normally flaky java, qmon, DRMAA etc. stuff builds fine but 
aimk currently bombs out. Ignoring the copious warning messages, this 
looks like the actual fatal error:


{ anyone else see this? Any workarounds? }




gcc -o tcsh  -DSGE_ARCH_STRING=\"lx-amd64\" -O3 -Wall -Wstrict-prototypes 
-DUSE_POLL -DLINUX -DLINUXAMD64 -DLINUXAMD64 -D_GNU_SOURCE -DGETHOSTBYNAME_R6 -DGETHOSTBYADDR_R8  
-DLOAD_OPENSSL -DTARGET_64BIT  -DSPOOLING_dynamic -DSECURE -I/usr/include -DCOMPILE_DC 
-D__SGE_COMPILE_WITH_GETTEXT__  -D__SGE_NO_USERMAPPING__ -Wno-error -DPROG_NAME='"qtcsh"' 
-DLINUXAMD64-I. -I.. sh.o sh.dir.o sh.dol.o sh.err.o sh.exec.o sh.char.o sh.exp.o 
sh.func.o sh.glob.o sh.hist.o sh.init.o sh.lex.o sh.misc.o sh.parse.o sh.print.o sh.proc.o sh.sem.o 
sh.set.o sh.time.o glob.o mi.termios.o ma.setp.o vms.termcap.o tw.help.o tw.init.o tw.parse.o 
tw.spell.o tw.comp.o tw.color.o ed.chared.o ed.refresh.o ed.screen.o ed.init.o ed.inputl.o 
ed.defns.o ed.xmap.o ed.term.o tc.alloc.o tc.bind.o tc.const.o tc.defs.o tc.disc.o tc.func.o 
tc.os.o tc.printf.o tc.prompt.o tc.sched.o tc.sig.o tc.str.o tc.vers.o tc.who.o  -lcrypt
  -L../../../LINUXAMD64 -L. -rdynamic -Wl,-rpath,\$ORIGIN/../../lib/lx-amd64 -lsge -lpthread -ldl

ed.screen.o: In function `SetAttributes':
ed.screen.c:(.text+0x7d3): undefined reference to `tputs'
ed.screen.c:(.text+0x879): undefined reference to `tputs'
ed.screen.c:(.text+0x8b8): undefined reference to `tputs'
ed.screen.c:(.text+0x90d): undefined reference to `tputs'
ed.screen.c:(.text+0x963): undefined reference to `tputs'
ed.screen.o:ed.screen.c:(.text+0x9a4): more undefined references to `tputs' 
follow
ed.screen.o: In function `Insert_write':
ed.screen.c:(.text+0xea7): undefined reference to `tgoto'
ed.screen.c:(.text+0xeb6): undefined reference to `tputs'
ed.screen.c:(.text+0xeed): undefined reference to `tputs'
ed.screen.o: In function `DeleteChars':
ed.screen.c:(.text+0xf78): undefined reference to `tputs'
ed.screen.c:(.text+0xfad): undefined reference to `tgoto'
ed.screen.c:(.text+0xfe6): undefined reference to `tputs'
ed.screen.o: In function `MoveToChar':
ed.screen.c:(.text+0x1185): undefined reference to `tgoto'
ed.screen.c:(.text+0x1194): undefined reference to `tputs'
ed.screen.c:(.text+0x119f): undefined reference to `tgoto'
ed.screen.c:(.text+0x11ae): undefined reference to `tputs'
ed.screen.o: In function `MoveToLine':
ed.screen.c:(.text+0x1299): undefined reference to `tgoto'
ed.screen.c:(.text+0x12a8): undefined reference to `tputs'
ed.screen.c:(.text+0x1304): undefined reference to `tputs'
ed.screen.c:(.text+0x1325): undefined reference to `tgoto'
ed.screen.c:(.text+0x1335): undefined reference to `tputs'
ed.screen.o: In function `EchoTC':
ed.screen.c:(.text+0x3174): undefined reference to `tgoto'
ed.screen.c:(.text+0x3184): undefined reference to `tputs'
ed.screen.c:(.text+0x31b6): undefined reference to `tgetflag'
ed.screen.c:(.text+0x3223): undefined reference to `tputs'
ed.screen.c:(.text+0x3238): undefined reference to `tgetstr'
ed.screen.c:(.text+0x3309): undefined reference to `tgoto'
ed.screen.c:(.text+0x331b): undefined reference to `tputs'
ed.screen.o: In function `GetTermCaps':
ed.screen.c:(.text+0x33ce): undefined reference to `tgetent'
ed.screen.c:(.text+0x33e0): undefined reference to `tgetflag'
ed.screen.c:(.text+0x33fa): undefined reference to `tgetflag'
ed.screen.c:(.text+0x3417): undefined reference to `tgetflag'
ed.screen.c:(.text+0x3427): undefined reference to `tgetflag'
ed.screen.c:(.text+0x3437): undefined reference to `tgetnum'
ed.screen.c:(.text+0x3447): undefined reference to `tgetnum'
ed.screen.c:(.text+0x3474): undefined reference to `tgetstr'
ed.screen.c:(.text+0x3820): undefined reference to `tgetflag'
ed.screen.c:(.text+0x3839): undefined reference to `tgetflag'
ed.screen.o: In function `ClearToBottom':
ed.screen.c:(.text+0x6d0): undefined reference to `tputs'
ed.screen.o: In function `SoundBeep':
ed.screen.c:(.text+0xb2c): undefined reference to `tputs'
ed.screen.c:(.text+0xb3f): undefined reference to `tputs'
ed.screen.o: In function `ClearScreen':
ed.screen.c:(.text+0xbb0): undefined reference to `tputs'
ed.screen.c:(.text+0xbdb): undefined reference to `tputs'
ed.screen.o:ed.screen.c:(.text+0xc38): more undefined references to `tputs' 
follow
collect2: ld returned 1 exit status
make: *** [tcsh] Error 1
not done





___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Queue and Parallell Environments

2011-08-08 Thread Chris Dagdigian


2 obvious problems

(1) Your queue instances are in error state E, which means no jobs will 
ever run. State E is a persistent error that must be manually cleared by 
an SGE admin.


(2) You have 0 set in the "slots" value for your matlab queue. This 
means that unless a slot count is inherited from the exec hosts 
themselves, your queue is going to offer 0 slots.


So quick advice ..

- Clear the error states from the matlab queue instances
- Set some manual # of slots in your matlab queue just to see if it 
makes a difference


-Chris


Eric Kaufmann wrote:

Here is some of the requested information.

qstat -f -q matlab
queuename                  qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------
matlab@compute-0-18.local  BIP   0/0/0          0.00     lx26-amd64 E
---------------------------------------------------------------------------
matlab@compute-0-19.local  BIP   0/0/0          0.00     lx26-amd64 E
---------------------------------------------------------------------------
matlab@compute-0-90.local  BIP   0/0/0          0.00     lx26-amd64 E

qconf -sql
Dcradle
all.q
cdt
check
clong
goodson
linda
long
matlab
sapt
schrod
std

qconf -sq matlab
qname matlab
hostlist  @matlab
seq_no0
load_thresholds   NONE
suspend_thresholdsNONE
nsuspend  1
suspend_interval  00:05:00
priority  0
min_cpu_interval  00:05:00
processorsUNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list   matlabPE
rerun FALSE
slots 0
tmpdir/tmp
shell /bin/bash
prologNONE
epilogNONE
shell_start_mode  posix_compliant
starter_methodNONE
suspend_methodNONE
resume_method NONE
terminate_method  NONE
notify00:00:60
owner_listNONE
user_listsNONE
xuser_lists   NONE
subordinate_list  NONE
complex_valuesNONE
projects  NONE
xprojects NONE
calendar  NONE
initial_state default
s_rt  INFINITY
h_rt  INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize   INFINITY
h_fsize   INFINITY
s_dataINFINITY
h_dataINFINITY
s_stack   INFINITY
h_stack   INFINITY
s_coreINFINITY
h_coreINFINITY
s_rss INFINITY
h_rss INFINITY
s_vmemINFINITY
h_vmemINFINITY



Here is the output of

On Mon, Aug 8, 2011 at 4:05 PM, Chris Dagdigian <d...@sonsorol.org> wrote:


If you post the output of 'qconf -sq ' we can
provide more targeted advice.

qstat -f -q  output might be useful as well just so we
can be sure your nodes are actually up and not in error state

It sounds as though you have a cluster queue set up without any
available hosts configured within it? Your hosts entry for the queue
should say "@allhosts" or "@matlabhosts" however you set it up. It
should not be "@/" - it has to name or reference an existing and
real SGE hostgroup. It may be possible you have a hostgroup created
without any actual hosts defined within it.

Showing the 'qconf -s ...' output for the PE, queue and hostgroups
would help


-Chris






Eric Kaufmann wrote:

I am running SGE 6.2 on a Rocks 5.2 cluster. I am trying to add
a new
parallel environment and queue for Matlab. I was able to add
both. The
queue for Matlab shows zero slots available. I did create a
matlab host
group. This shows up in the Hostgroup list. In the Attributes for
Host/Host Group @/ is listed but that is all.

I have other queues where machines are also listed in the
Attributes for
Host/Host Group.

What am I missing here? Is there also a way to set up a parallel
environment so only one queue can use it?

Thanks,

Eric


___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users



Re: [gridengine users] Queue and Parallell Environments

2011-08-08 Thread Chris Dagdigian

2 obvious problems


Eric Kaufmann wrote:

Here is some of the requested information.

qstat -f -q matlab
queuename  qtype resv/used/tot. load_avg arch
states
-
matlab@compute-0-18.local  BIP   0/0/0  0.00 lx26-amd64E
-
matlab@compute-0-19.local  BIP   0/0/0  0.00 lx26-amd64E
-
matlab@compute-0-90.local  BIP   0/0/0  0.00 lx26-amd64E

qconf -sql
Dcradle
all.q
cdt
check
clong
goodson
linda
long
matlab
sapt
schrod
std

qconf -sq matlab
qname matlab
hostlist  @matlab
seq_no0
load_thresholds   NONE
suspend_thresholdsNONE
nsuspend  1
suspend_interval  00:05:00
priority  0
min_cpu_interval  00:05:00
processorsUNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list   matlabPE
rerun FALSE
slots 0
tmpdir/tmp
shell /bin/bash
prologNONE
epilogNONE
shell_start_mode  posix_compliant
starter_methodNONE
suspend_methodNONE
resume_method NONE
terminate_method  NONE
notify00:00:60
owner_listNONE
user_listsNONE
xuser_lists   NONE
subordinate_list  NONE
complex_valuesNONE
projects  NONE
xprojects NONE
calendar  NONE
initial_state default
s_rt  INFINITY
h_rt  INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize   INFINITY
h_fsize   INFINITY
s_dataINFINITY
h_dataINFINITY
s_stack   INFINITY
h_stack   INFINITY
s_coreINFINITY
h_coreINFINITY
s_rss INFINITY
h_rss INFINITY
s_vmemINFINITY
h_vmemINFINITY



Here is the output of

On Mon, Aug 8, 2011 at 4:05 PM, Chris Dagdigian mailto:d...@sonsorol.org>> wrote:


If you post the output of 'qconf -sq ' we can
provide more targeted advice.

qstat -f -q <queue_name> output might be useful as well just so we
can be sure your nodes are actually up and not in error state

It sounds as though you have a cluster queue set up without any
available hosts configured within it. Your hostlist entry for the queue
should say "@allhosts" or "@matlabhosts" or however you set it up. It
should not be "@/" - it has to name or reference a real, existing
SGE hostgroup. It may also be that you have a hostgroup created
without any actual hosts defined within it.

Showing the 'qconf -s ...' output for the PE, queue and hostgroups
would help


-Chris






Eric Kaufmann wrote:

I am running SGE 6.2 on a Rocks 5.2 cluster. I am trying to add
a new
parallel environment and queue for Matlab. I was able to add
both. The
queue for Matlab shows zero slots available. I did create a
matlab host
group. This shows up in the Hostgroup list. In the Attributes for
Host/Host Group @/ is listed but that is all.

I have other queues where machines are also listed in the
Attributes for
Host/Host Group.

What am I missing here? Is there also a way to set up a parallel
environment so only one queue can use it?

Thanks,

Eric




Re: [gridengine users] Queue and Parallel Environments

2011-08-08 Thread Chris Dagdigian


If you post the output of 'qconf -sq <queue_name>' we can provide 
more targeted advice.


qstat -f -q <queue_name> output might be useful as well just so we can be 
sure your nodes are actually up and not in error state


It sounds as though you have a cluster queue set up without any 
available hosts configured within it. Your hostlist entry for the queue 
should say "@allhosts" or "@matlabhosts" or however you set it up. It 
should not be "@/" - it has to name or reference a real, existing 
SGE hostgroup. It may also be that you have a hostgroup created without 
any actual hosts defined within it.
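Both symptoms in this thread -- "slots 0" in the queue definition and a hostlist that resolves to no real hosts -- can be flagged mechanically from pasted `qconf -sq` output. A minimal sketch (the helper name and heuristics are mine, not an SGE tool):

```python
# Flag the two misconfigurations discussed in this thread from the plain-text
# output of `qconf -sq <queue_name>`. Heuristics are illustrative only.
def queue_problems(qconf_sq_text):
    cfg = {}
    for line in qconf_sq_text.splitlines():
        parts = line.split(None, 1)  # "key   value" pairs, one per line
        if len(parts) == 2:
            cfg[parts[0]] = parts[1].strip()
    problems = []
    if cfg.get("slots") == "0":
        problems.append("queue offers 0 slots")
    if cfg.get("hostlist", "NONE") in ("NONE", "@/"):
        problems.append("hostlist names no real hostgroup")
    return problems
```

Run against the matlab queue listing earlier in this thread, it would report the zero-slots problem; the fix is something like `qconf -mattr queue slots <n> matlab` plus actually populating the @matlab hostgroup.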


Showing the 'qconf -s ...' output for the PE, queue and hostgroups would 
help



-Chris





Eric Kaufmann wrote:

I am running SGE 6.2 on a Rocks 5.2 cluster. I am trying to add a new
parallel environment and queue for Matlab. I was able to add both. The
queue for Matlab shows zero slots available. I did create a matlab host
group. This shows up in the Hostgroup list. In the Attributes for
Host/Host Group @/ is listed but that is all.

I have other queues where machines are also listed in the Attributes for
Host/Host Group.

What am I missing here? Is there also a way to set up a parallel
environment so only one queue can use it?

Thanks,

Eric



Re: [gridengine users] any plans for a fall GE users/developers meeting?

2011-07-27 Thread Chris Dagdigian


If we can't get a standalone meeting going, it might be reasonable to 
try to get something together for SC11 in Seattle: 
http://sc11.supercomputing.org/


-Chris

Brooks Davis wrote:

In the past there have been meetings of GE users and developers many
Octobers.  I've usually found out about them too late to attend.  I'm
wondering if anything is in the works for this fall.  We've got a
significant commitment to GE with at least 5 current deployments and
would love to have a chance to discuss future plans with developers and
fellow users in person.




Re: [gridengine users] hedeby install howto?

2011-06-29 Thread Chris Dagdigian


My overall advice for people trying to run Grid Engine on the Amazon 
Cloud is this:


(1) If you just want to run Grid Engine in standalone mode on the Amazon 
Cloud then you should be using StarCluster 
http://web.mit.edu/stardev/cluster/ -- those folks made a fantastic and 
free system for elastic SGE clusters on the cloud with working shared 
filesystem, MPI configured etc. etc. It's freaking magical. They also 
track new AWS features closely and develop support for cool things that 
most people would not have time to implement themselves on a small 
project -- such as support for running under Spot Instances and within 
the odd networking sandbox of the new VPC environments. My company tends 
to be the one that runs into StarCluster limitations (such as 
inconsistent support for running inside a special VPC zone where a HTTP 
proxy was required for internet access) and the developers have been 
responsive, friendly and overall nice to interact with.


(2) If you *really* want to do the hybrid cloudbursting thing then I'd 
simply say go talk to the nice folks at Univa.com - they already have 
far more up-to-date methods, software and (most importantly) happy 
customers using their cloud bursting stuff. This would be the modern and 
sustainable route.


(3) If you really don't want to use Univa and you want to bridge a local 
SGE cluster into the Amazon cloud, then skip the Hedeby 
overhead/complexity and just use Amazon VPC to link your local and 
remote subnets together. After that it's just one big happy SGE cluster 
with a boatload of network/bandwidth limitations affecting some nodes 
more than others.



Hedeby was part of a massive project for "resource aware" stuff within 
Sun. SGE was just a tiny part. In my mind it has little use or utility 
in the context of open source software unless you are going all-in on 
all the other hedeby features. Using Hedeby today just to get SGE cloud 
adaptors is going to be a stressful exercise involving complex 
end-of-life software that is effectively a developmental dead end.


My $.02

-Chris




Dave Love wrote:

Allan Tran  writes:


I'm trying to install hedeby to integrate sge62u5 and the Amazon cloud but can't
seem to find any good step-by-step article. All the links seem to be broken
since Oracle killed the open source GE.


Maybe Chris D will chime in with useful guidance.  However, as well as
doc at http://wikis.sun.com/display/gridengine62u5/Home (currently
down), you should be able to get everything that was on sunsource via
https://arc.liv.ac.uk/trac/SGE.  The source is available via darcs, hg,
and git, or directly under http://arc.liv.ac.uk/repos/darcs/hedeby/&c
(which should work to display the original web pages).  Unlike the
actual gridengine source, it hasn't had any love since it was stashed.


I found this
http://wiki.gridengine.info/wiki/index.php/SGE-Hedeby-And-Amazon-EC2#Installing_the_Grid_Engine_Hedeby_Service_Adapter
but I'm stuck at the "Setup SDM Master" step (sdmadm, etc.). Where is this sdmadm from? I
know I'm missing a lot of packages, but is there a single place I can
download them all?
What I have so far is a fresh working SGE 62u5 installed with JVM
enabled.


Presumably you need the source from the hedeby, hedeby-ge-adapter, and
hedeby-cloud-adapter repos as above.  I can't remember how difficult it
is to build.  There are instructions somewhere in the www directory of
the repo if I recall correctly.




Re: [gridengine users] Web based forums

2011-06-07 Thread Chris Dagdigian


Hi Rich,

The most active people on the list who provide the most support almost 
unanimously hate web forums and having to use a web browser to 
communicate so I don't think forums are in the future ...


We don't expect users to download and search .gzip files; my preferred 
method is to use email list sites like http://markmail.org/ 
which have excellent archive and search features. Of course I don't 
think we are indexed in there just yet.


Longer term I'd expect you to find our list and archives showing up in 
google for general searches and at places like http://markmail.org/ for 
more specific stuff. I'm one of the people who does some of the 
housekeeping and infrastructure stuff and this has been on my to-do list 
forever. Places like markmail make it easy to import old list archives 
as well so we won't have any gaps in the record once we manage to get in.


-Chris



Maes, Richard wrote:

Just getting back into the grid engine community after missing all the
drama. I saw that the forums have been replaced with mailing list. I
have a couple questions about how to really use the mailing list because
I think I am missing something. I saw that in the users archive there
were 6 months worth of archives. I wanted to do a search on all of those
archives but I didn’t see a way to do it. I did see that there were
GZipped text files.

Is the expected user behavior to download those gzips and search them? I
am hoping that there is a better way, and I’m just missing it. I am
wondering if I should have done a
site:gridengine.org/pipermail/users/ search via google, and whether that would have
accomplished what I needed. Just thinking out loud here.

I also saw that the “what’s going on” page indicated that web forums
probably wouldn’t make an appearance again. If I get a vote, I would
like to see web forums again - mostly for the history searching, and because it
makes the information more mainstream, which I think is why the previous
forums were so successful. Just an opinion. Anyone else like web forums?

That said, if you need a volunteer to set something up, I will sign up.
I don’t like making suggestions to other people that they do more work
especially when things are as tough as they are. If you need more help,
let me know.

*Rich*


ciena logo



Re: [gridengine users] qmaster startup due to communication errors

2011-05-26 Thread Chris Dagdigian


Compare the contents of $SGE_ROOT/$SGE_CELL/act_qmaster to what you have 
in /etc/hosts -- the act_qmaster file contains the hostname for what SGE 
believes is the qmaster. That hostname needs to resolve perfectly in DNS 
or in your /etc/hosts file.


You can also experiment with the $SGE_ROOT/utilbin/gethostname and 
gethostbyname etc. commands to see how SGE resolves the local naming 
environment


And finally make sure that you have ports 6444 and 6445 open on your 
firewall!
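The /etc/hosts quoted below maps the qmaster's hostname to two different IPs, which is exactly the kind of ambiguity that breaks qmaster name resolution. A quick sketch (my own helper, not an SGE utility) for spotting conflicting mappings:

```python
# Flag hostnames that an /etc/hosts-style file maps to more than one IP.
from collections import defaultdict

def conflicting_hosts(hosts_text):
    seen = defaultdict(set)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            seen[name].add(ip)
    return {name: sorted(ips) for name, ips in seen.items() if len(ips) > 1}
```

Note that "proyecto.local" below resolves to both 192.168.56.101 and 10.0.2.15; pick one mapping, and make sure it matches the hostname in act_qmaster.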




Carlos Scaloni wrote:


Hi friends

I installed sge6_2u5, but when I try to start the qmaster I see this:

/etc/init.d/sgemaster.p6444
starting sge_qmaster

sge_qmaster start problem

sge_qmaster didn't start!


and in /tmp/sge_messages.txt :

05/26/2011 20:44:53|  main|proyecto|C|abort qmaster startup due to
communication errors

I don't know what the problem is!


I installed it with: sudo ./install_qmaster  The installation finished
without any error!
Options that I used: admin user is sgeadmin, sge_qmaster port 6444,
sge_execd port 6445, classic spooling, gid range 2-21000,
the rest of the options at their defaults!

I try to start it with: /etc/init.d/sgemaster.p6444

The file /tmp/sge_message contains this:
05/26/2011 20:42:39| main|proyecto|C|abort qmaster startup due to
communication errors
05/26/2011 20:44:53| main|proyecto|C|abort qmaster startup due to
communication errors

My hostname is:
hostname
proyecto.local

And my /etc/hosts looks like this:

cat /etc/hosts
127.0.0.1 localhost
::1 localhost
192.168.56.101 proyecto.local
10.0.2.15 proyecto.local


Can anyone help me, please??

thanks in advance




Re: [gridengine users] write my own accounting log parser..

2011-05-05 Thread Chris Dagdigian


I've written perl scripts to scrape the accounting log and throw the 
entries into a mysql database - mainly so we could write our own simple 
queries and text based reporting tools.  Never wrote the web app, though
at one time, when I was entranced with ruby-on-rails, I thought it would
be a cool use case to play with.


Suggestions:

 - if you go simple and just slurp in the accounting file you CAN'T use 
the SGE job ID as a primary key or unique identifier. The ID will repeat 
itself in the log file for parallel and restarted jobs and will also 
eventually wrap around.


- the value of dbwriter is in what it does with the reporting file, the 
accounting file AND the new values it derives from those data sets. 
There is a lot to be learned from dbwriter.
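A sketch of the "slurp the accounting file" approach with a composite key, per the warning above about re-used job IDs. The leading field order follows the accounting(5) layout but should be verified against your SGE version; array-task numbers live in a later, version-dependent field and are not handled here:

```python
# Parse colon-delimited SGE accounting records into dicts keyed on a
# composite identifier instead of the (non-unique) job number.
LEADING_FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]

def parse_accounting(text):
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the comment header lines
        rec = dict(zip(LEADING_FIELDS, line.split(":")))
        # job_number repeats (parallel jobs, restarts, ID wraparound);
        # key on job number + submission time + host instead.
        rec["key"] = (rec["job_number"], rec["submission_time"], rec["hostname"])
        records.append(rec)
    return records
```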


My secret plan was always to keep dbwriter, the schema and the code 
that loads it into a database. I just wanted to toss the web console and 
replace it with something simpler and more useful. Never got far.


-chris





William Deegan wrote:

Greetings,

I'm pondering writing a python based dbwriter replacement which would
just parse the accounting file and stuff it in a db, and then have some
python web app framework for reporting.

Has anyone already done this?

Any suggestions on how to "eat" the accounting log file?
(consume it so it never gets big? Do I rotate it out and parse and
discard?)

-Bill


Re: [gridengine users] `cloud' nodes

2011-04-29 Thread Chris Dagdigian


I'm buried in work and biz travel so apologies if this quick reply is 
not on topic ...


By FAR the best way to run standalone Grid Engine clusters on the Amazon 
Cloud today is to simply use MIT Starcluster :


http://web.mit.edu/stardev/cluster/index.html

The people behind starcluster basically did everything I did many years 
ago as part of my cheesy demos showing SGE on the "cloud" at various 
meetings and talks.  The main difference is they did it correctly with 
smart people who can write good code, heh. And they also extended and 
added all the usual features that you would want and request. It really 
is a nice solution. Much of the code is python.


My company is a big user and we are helping push it into more enterprise 
environments by (for instance) making sure that it plays nicely with VPC 
VPNs and HTTP proxies and other odd network/use-cases that you don't 
often see in academic environments.


Bridging cloud SGE nodes to local SGE clusters ("hybrid clusters", 
"cloud bursting") is a different story. This is something I pretty 
much refuse to do, mainly because in my field data movement is the 
constraint: moving job data into and out of the cloud is slow enough to 
be a deal breaker. The networking is also a hassle.


SDM can do this, you can do this yourself on Amazon via linking your 
local and cloud subnets via Amazon VPC.


Univa can also do this with their own products that contain grid engine 
+ other cool stuff.


-Chris




Dave Love wrote:

LaoTsao  writes:


Could SDM help in this case?


Maybe (e.g. 
http://wiki.gridengine.info/wiki/index.php/SGE-Hedeby-And-Amazon-EC2),
and there are other options like OpenNebula (e.g. the links at
http://wiki.gridengine.info/wiki/index.php/Virtualization_and_Grid_Engine).

If anyone wants it, the SDM source is available under
http://arc.liv.ac.uk/downloads/SGE/ as hedeby-*.tar.gz, roughly in the
state it was left at sunsource.net.  I haven't tried to build it against
post-6.2u5 SGE, and I don't remember how the build works to know whether
to expect any problems.



Re: [gridengine users] Green Computing (power control)

2011-04-29 Thread Chris Dagdigian
I have absolutely seen this done with very real results. The most 
important thing is to have the system generate emails to senior management 
saying things like "... I saved $12,000 in electricity last quarter ..." 
-- I can't overstate the importance of making sure that you have 
the PR stuff covered in addition to the nice tech stuff under the hood.


The best system I've ever seen was doing this on a 4,000 core cluster 
built from blades. The blade system provided the necessary hooks by 
which some simple scripts could use simple passwordless SSH commands to 
start and stop nodes when needed. This was easier to script, manage and 
operate as opposed to having to do straight IPMI or other less 
scriptable/automatable methods.


The main points:

- track state in a simple database

- monitor the length of the pending ("qw" state jobs) to see when new 
nodes need to be powered up


- script things so that when nodes are powered up they come up disabled 
(state "d") by default so that they don't take on jobs right away


- each node that boots up needs to run a series of sanity tests designed 
to protect against common startup failures (missing NFS mounts, etc) 
that could kill jobs. Running the sanity check script remotely via a 
passwordless SSH command seems to work and lets you report state/status 
back into your tracking database


- only after the powered on node passes its sanity check do you switch 
the node away from disabled state "d" so it can start taking on work


- Before you shut down a node, put it into state "d" so that you avoid a 
race condition between a job landing on the node and your shutdown 
command hitting it


- Track your up/down actions in enough detail so you can create reports 
showing how much power you have saved. Senior managers love this stuff
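The workflow above boils down to a small decision step run on a timer; here is a hedged sketch of that policy (thresholds and names are invented, and the actual power commands, `qmod -d`/`qmod -e` calls and SSH sanity checks are left to the caller):

```python
# Decide which nodes to power on or shut down, per the workflow above.
def power_decision(pending_jobs, idle_nodes, off_nodes, min_spare=2):
    """Return (nodes_to_power_on, nodes_to_shut_down)."""
    if pending_jobs and off_nodes:
        # at most one node per pending job; nodes boot disabled ("d")
        # and are enabled only after passing their sanity checks
        return off_nodes[:pending_jobs], []
    if not pending_jobs and len(idle_nodes) > min_spare:
        # keep a warm spare pool; disable ("d") before shutdown to avoid
        # the race with a job landing on the node
        return [], idle_nodes[min_spare:]
    return [], []
```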



I tried to get the people who wrote this system to turn it into a 
product and they were cool with it. The big company they worked for was 
also cool with it but we never went all that far because the effort of 
doing the legal stuff required to allow this code to leave the big 
company was basically "too much work" at the time.



-dag




Stuart Barkley wrote:

Like the recent question on "cloud", we are looking to "green" our
systems somewhat.

E.g. we would like to power down unneeded nodes and power them back on
when they can be useful for the workload.

I've done a limited extent of this manually, powering down unused
racks of nodes until I notice a need for the additional nodes.

I can create a host group containing "green" nodes subject to power
control and have these be the least restrictive access rules which
should allow all jobs to use the nodes.

We don't need fine grained power saving.  I'm thinking to power down
unused nodes every 15 minutes and looking to see if nodes need to be
powered on every 5 minutes.  I would also leave a few nodes powered on
and actually spin up new nodes as those are used (fill first being
important).

Has anyone successfully done something like this?

SDM/hedeby claims some support for this, but it looks like a horrid
cancer growth out the side of Grid Engine.  Can this be lopped off?

A while back I looked at SPIRIT
and it looked like it might function as a useful starting point.

Are there any other similar things I should look at?

It shouldn't be too hard to shutdown unused nodes.  There are a couple
interlocks which would need to occur to ensure GE doesn't try to start
a job just as the node is being shutdown (disable queues on the host
first and then checking again to ensure nothing got started).

Figuring out when to bring a node back online looks much harder.

Crudely, I can check if any jobs are waiting to run and just power up
a few nodes and hope they fulfill the need.  Repeat until the job starts
running or all nodes are powered on.  I believe that this is what
SPIRIT does.

This needs to take into account jobs that might need other resources
beyond just compute nodes (software licenses, special hardware).
This isn't a current need for our systems.

This also needs to account for a job which needs more nodes than are
available even if all the nodes were powered on.  This is
probably not currently an issue with our SGE cluster, which mostly runs
lots of array jobs.

It also needs to deal with a user who might have requested a specific
(non-green) node or node group for some reason.  It only helps to
power on a new node if it would actually be used.

My biggest concern is doing something simple and having it hit
pathological edge cases which negate the entire effort.  Having broken
"green" capability can tick some check boxes.  Having working "green"
capability can actually save power, money and help the environment.

This is actually something that the job scheduler should be able to
help with.  Perhaps there are some hooks in SGE for SDM that could be
used without going down that whole SDM/hedeby/cloud computing route?

Any thought

Re: [gridengine users] Berkeley DB (was building RHEL5)

2011-04-08 Thread Chris Dagdigian


I was lucky enough to have a Panasas PAS12 ("fastest HPC storage system 
in the world!") chassis in my home office for a few weeks.


Suffice to say I don't think it will have any troubles handling classic 
spooling at all, heh.


-Chris


Mark Suhovecky wrote:

Our current installation uses  AFS and Panasas filesystems, and
we might see a million jobs in a month. Grid Engine performance
has not been an issue. So perhaps I'm better served sticking with classic
spooling.



Re: [gridengine users] Berkeley DB (was building RHEL5)

2011-04-08 Thread Chris Dagdigian



Rayson answered the ARCO question - spooling does not matter, since the 
only files ARCO scrapes are the accounting and 
reporting files


classic vs. berkeley is always an interesting question.

I also am firmly in the classic spooling camp but we sometimes use 
berkeley spooling. There seem to be two main things driving the choice:


- NFS performance. If your NFS server is poor and you have a large 
client count, then at some point spooling may become a bottleneck. 
However, on the flip side, if you have a great NFS server you can use 
classic spooling at large scale. One trivial example -- a 4,000-core 
cluster easily using classic spooling even with more than ~500,000 jobs 
per day, because the NFS service comes from a small Isilon scale-out 
NAS system running wire-speed across a dozen GbE NICs


- Job submission rate and job "churn". I think DanT said this in a blog 
post years ago but if you expect to need 200+ qsubs per second then you 
are going to need berkeley spooling. Same goes for clusters that 
experience huge amounts of job flows or state changes. I have less 
experience here but in these sorts of systems I think binary spooling 
makes a real difference


My $.02 of course!

-chris




Mark Suhovecky wrote:


OK, I got SGE6.2u5p1 to build with version 4.4.20 of Berkeley DB,
and proceeded to try and install Grid Engine on the master host
via inst_sge.

  At some point it tells me that I should install Berkeley DB
on the master host  first, so I do "inst_sge -db", which hangs when it tries
to start the DB for the first time. Then, because some
days I'm not terribly bright, I decide to see if the DB will start
at machine reboot. Well, now it hangs when sgedb start
runs from init. Still gotta fix that.

So let me back up for a minute and ask about Berkeley DB...

We currently run sge_6.2u1 on 1250 or so hosts, with "classic"
flat-file spooling, and it's pretty stable.
When we move to SGE6.2u5p1, we'd like
to use the Arco reporting package, and I'm blithely assuming
that I need a DB with an SQL interface to accommodate this.

Is that true? Can we use Arco w/o DB spooling?




Re: [gridengine users] Contribs and GridGraph

2011-04-07 Thread Chris Dagdigian


Just to reiterate what Dave said ...

gridengine.org is running the Wordpress blog engine on a small Amazon 
EC2 server backed by amazon cloudfront as a content distribution network 
(not because we need a CDN but mainly because I wanted to kick the tires 
...) so even large files won't cause any troubles


Happy to give out wordpress accounts to anyone with something useful to 
do. The blog can easily host static pages and files for things that 
don't fit cleanly into the date-based article/post format


-chris



Dave Love wrote:

"gary_sm...@vrtx.com"  writes:


Hi all,

Do we have any place yet for SGE related tools contributions?  I have a queue
visualization app I wrote (http://www.gracklewolf.com/gridgraph) that might be
useful to others, but I have no idea where to post it.


I don't know what the policy will be about add-ons in the Univa repo --
something I would have asked in a meeting I missed -- but it will depend
at least on signing something (straightforward) relating to copyright, as
I understand it.

https://arc.liv.ac.uk/trac/SGE is available for community contributions
of things that can't go into the Univa repo, or ones waiting to (already
in the build framework, for instance).  It needs more merging, but I'm
expecting it to be a superset of Univa's, and people could maintain
things there if appropriate.  Some of mine and others' bits have gone in
like that already.

Otherwise, I assume we could put files or links on gridengine.org or
gridengine.info.  (If people have suggestions, let us know -- at least I
and Chris Dagdigian can edit it, and possibly others.)

Thanks for posting.  For what it's worth, it looks as if the code would
need parametrizing for general use, and I guess it could fall foul of
bad XML (with different bugs in different SGE versions), like
<https://arc.liv.ac.uk/trac/SGE/ticket/314>, which bit me trying to use
XML output for the same sort of thing.


Re: [gridengine users] building RHEL5

2011-04-07 Thread Chris Dagdigian


I could be wrong but ...

Even though Univa (and others?) expect to deprecate the use of RPC-based 
spooling to a remote Berkeley DB server, the current SGE codebase 
and aimk build scripts still expect to see a Berkeley installation that 
has a ready-to-go "rpc_server" binary or whatever ...


So the main reason to use an older version of Berkeley DB is that you 
can build with "--rpc-server" enabled


Anyway, on my OS X and Linux build machines I use the older 4.x version, 
I think I tracked down the last 4.x version that still shipped with 
rpc-server support.


-Chris



Mark Suhovecky wrote:

Thanks to those who replied to my last post. I was able to build the source 
successfully.
I did encounter an error installing the binaries that I wanted to ask about.

I'm trying out the berkeleydb, and I downloaded and built version 5.1
The installation seems to expect version 4.4.  The SGE build instructions
say version 4.2 or later should work.  Is this just a configuration step I've
missed, or do I need to build the older DB version?




[gridengine users] great blog post w/ deep dive into SGE priority calculations

2011-04-06 Thread Chris Dagdigian


Jiri forwarded me the URL to his post and I found it fascinating:

"Calculating GE Job Priorities"
http://olwynion.blogspot.com/2011/04/calculating-ge-job-priorities.html

I've always felt that one of the strengths of GE (an unlimited number of 
knobs that you can alter) is also one of its biggest problems (an infinite 
number of potential configurations and no huge corpus of well-tested 
values ...) and this post reinforces a lot of those thoughts.


What do others think? I gave up years ago trying to understand the 
policy mechanism at any deep level. I have a few good config recipes 
that I stick with. Whenever I have to deviate from those, I often end up 
making best-guess changes to odd SGE values/weights and then I have to 
watch the pending/active job list to see if the resource allocation mix 
is doing what I hoped. More clarity and "predictive-ness" would be welcome.
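For reference, the pending-job priority GE finally sorts on is a weighted sum of three normalized contributions (per the sge_priority(5) man page). The weights shown below are, as far as I recall, the shipped scheduler defaults -- verify yours with `qconf -ssconf`:

```python
# prio = weight_priority * npprio + weight_urgency * nurg + weight_ticket * ntckts
# where npprio/nurg/ntckts are the POSIX-priority, urgency and ticket
# contributions, each normalized into [0, 1].
def job_priority(npprio, nurg, ntckts,
                 weight_priority=1.0, weight_urgency=0.1, weight_ticket=0.01):
    return (weight_priority * npprio
            + weight_urgency * nurg
            + weight_ticket * ntckts)
```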


-dag



[gridengine users] building jgdi on Mac OS X 10.6

2011-03-30 Thread Chris Dagdigian


Hi folks,

After very smooth Linux builds I figured I'd test my luck with OS X ...

Hitting an error now in the "./aimk -only-core" step, due to an arch 
mismatch


 errors: "libjvm.dylib, file was built for i386 which is not the 
architecture being linked (x86_64)"


It looks like this may be caused by a symlink on 10.6 that points to the 
i386 version instead of the native 64-bit version. I can probably fix 
this by hacking around with the symlinks, but some googling revealed that 
people "should not" be directly linking to libjvm.dylib and instead 
should be using the JavaVM framework, which supposedly 
knows how to do this all automatically.

The Apple material suggests the use of "gcc -framework JavaVM ..." will 
solve this linking problem.


I figured I'd ask the list before going down this road any further -- 
anyone have any tips & tricks for building on Apple OS X 10.6?


-Chris


Error message below:




[java] aimk:
 [java]  [exec] Building in directory: 
/opt/github-gridengine/gridengine/source
 [java]  [exec] making in DARWIN_X64/ for DARWIN_X64 at host ani.local
 [java]  [exec] _C_O_R_E__S_Y_S_T_E_M_
 [java]  [exec] gcc -o jgdi_test -L/opt/berkeley-db/lib/ -L. -arch 
x86_64 -L/opt/openssl/usr/local/ssl//lib 
-L/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../Libraries 
 jgdi_test.o -ljvm libuti.a libcommlists.a -lm -lpthread
 [java]  [exec] ld: warning: in 
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Libraries/libjvm.dylib,
 file was built for i386 which is not the architecture being linked (x86_64)
 [java]  [exec] Undefined symbols:
 [java]  [exec]   "_JNI_CreateJavaVM_Impl", referenced from:
 [java]  [exec]   _create_vmnot done
 [java]  [exec]  in jgdi_test.o
 [java]  [exec]   _main in jgdi_test.o
 [java]  [exec] ld: symbol(s) not found
 [java]  [exec] collect2: ld returned 1 exit status
 [java]  [exec] make: *** [jgdi_test] Error 1
 [java]
 [java] BUILD FAILED
 [java] /opt/github-gridengine/gridengine/source/libs/jgdi/build.xml:76: 
The following error occurred while executing this line:
 [java] /opt/github-gridengine/gridengine/source/libs/jgdi/build.xml:67: 
exec returned: 1
 [java]
 [java] Total time: 12 seconds

BUILD FAILED
/opt/github-gridengine/gridengine/source/build.xml:82: The following error 
occurred while executing this line:
/opt/github-gridengine/gridengine/source/build.xml:27: Java returned: 1





Re: [gridengine users] [gridengine dev] building on centos 5.5 64 bit?

2011-03-27 Thread Chris Dagdigian


I built Son-of-GridEngine and the Github Grid Engine over the weekend, 
I'll post a blog entry about it on Monday.


Links to 32 and 64 bit Linux binaries for both versions are at this 
short link:


 http://biote.am/sge

If anyone feels like testing these and giving feedback I'd appreciate 
it. Have only run it on the systems where I built the code so there is a 
chance that the binaries will complain about missing dependencies or 
some other issue on more pristine Linux systems.


Still needs testing but I'm getting back into being comfortable with the 
build process. Next step is Mac OS X


The Son of Grid Engine herd issues are mainly because its build is 
commented out in the build.xml file


The Son of Grid Engine GUI issues are (I think) caused by missing 
logo.png files -- the build process bombs out complaining that it can't 
find a .png file that is needed to build one of the visualization panes.


I'm going to do some testing of those binaries this week and also work 
on OS X builds. The goal on our end (bioteam) is to have courtesy 
binaries that we don't mind sharing with others.


-Chris



Dave Love wrote:

Chris Dagdigian  writes:


Everything built from source without too much hassle

... including the java stuff classes, the GUI installer and the hadoop
herd classes. Never got the hadoop stuff to build before so this was a first


I finally figured out the problems with herd and the GUI installer too,
though I've only built my version with some non-fundamental build
changes.  One thing that was fooling me was confusion between the IzPack
distribution and 3rdparty/IzPack, partly because the build requires writing
to the installed IzPack distribution; I wonder if that's actually
necessary.

Can anyone say what the exact Java dependencies are?  If not, I'll try
to build in a minimal VM sometime to check (modulo versions).


I had compiled the gridengine codebase against an older version of
BerkeleyDB (one that still supports the RPC server) that I had built and
stored at /opt/berkeley-db/lib/libdb-4.7.so


For what it's worth, I made changes to use the system libraries (with an
aimk flag), but I don't remember if there was much to it.  I should try
to do some more work on rpm building.



Re: [gridengine users] [gridengine dev] building on centos 5.5 64 bit?

2011-03-24 Thread Chris Dagdigian


I just did a 'git clone' of the current source over at Github and was 
able to build (I think) 100% of the code. This is a 64 bit system 
running CentOS 5.5


Everything built from source without too much hassle

... including the java stuff classes, the GUI installer and the hadoop 
herd classes. Never got the hadoop stuff to build before so this was a first


Looks like Univa is going to call this "Grid Engine 8.0.0 alpha" at 
least according to the GUI and client versions that show up.


So far a test installation has gone fine except for one small problem ...

I had compiled the gridengine codebase against an older version of 
BerkeleyDB (one that still supports the RPC server) that I had built and 
stored at /opt/berkeley-db/lib/libdb-4.7.so



The problem is that when I package up my build and go to install Grid 
Engine, the install fails: when it tries to set up the initial 
Berkeley spooling database, the local SGE lib "libspoolb.so" complains 
about not being able to find "libdb-4.7.so"


Easy fix on my test system via ld.so.conf or whatever but it makes me 
think that I've built something wrong ... I thought I was doing a static 
build where all the external libraries would come along for the ride ...


Any aimk or aimk.site people have any tips for making sure the proper 
BDB libraries/objects accompany the courtesy binaries?
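One way to catch this before shipping binaries is to check whether the spooling module resolves all of its dynamic dependencies. A minimal sketch, assuming the library lives under a typical lx-amd64 arch directory (the /opt/berkeley-db path is from the post above; the exact libspoolb.so location is an assumption about the build layout):

```shell
# Check that libspoolb.so can resolve all of its shared-library deps.
SO="${SGE_ROOT:-/opt/sge}/lib/lx-amd64/libspoolb.so"

if [ -e "$SO" ]; then
    # Any "not found" lines reveal unresolved libraries (e.g. libdb-4.7.so)
    ldd "$SO" | grep 'not found' || echo "all dependencies resolved"
    # Workaround on a target host: register the custom library directory, e.g.
    #   echo '/opt/berkeley-db/lib' > /etc/ld.so.conf.d/berkeley-db.conf && ldconfig
else
    echo "libspoolb.so not present on this host (run on the build machine)"
fi
```

If ldd reports "not found" for libdb-4.7.so, the build linked BerkeleyDB dynamically rather than statically, which matches the behavior described above.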


dag




[gridengine users] posted some user-centric training materials online

2011-03-10 Thread Chris Dagdigian


FYI,

Following up on the 2009 posting of some Admin-centric training 
materials I threw some PDFs on the bioteam blog that represent a quick 
and dirty introduction to Grid Engine usage and workflows -- the 
materials are simple and aimed at an audience of users rather than admins.


Shortened link:
http://biote.am/4g

Direct link:
http://blog.bioteam.net/2011/03/grid-engine-for-users/

-Chris





Re: [gridengine users] does anyone have workshop proceedings archived?

2011-02-28 Thread Chris Dagdigian


The 2007 workshop proceedings are here:

http://gridengine.org/assets/static/ws2007/

And I hacked through the index HTML page; I think I have all the links 
working off of this index/contents page:


http://gridengine.org/assets/static/ws2007/SGEWorkshop2007.htm


-Chris




[gridengine users] 6.2u5 courtesy binaries for OSX/Darwin ?

2011-02-22 Thread Chris Dagdigian


... anyone have a copied/saved version of the OS X binaries for the last 
open Sun/Oracle SGE release?


Regards,
Chris



[gridengine users] SGE and Matlab Distributed Computing Server integration?

2011-02-22 Thread Chris Dagdigian


Has anyone done any real world integration with MDCS and modern versions 
of grid engine?


A quick google search pulls up this old URL:
http://www.mathworks.com/support/solutions/en/data/1-2MC1RY/?solution=1-2MC1RY

.. and from that the method looks pretty straightforward.

Any real world anecdotes or "gotchas" ?

-Chris



Re: [gridengine users] SGE Benchmark Tools

2011-02-16 Thread Chris Dagdigian


What exactly are you trying to benchmark? Job types and workflows are 
far too variable to produce a usable generic reference.


The real benchmark is "does it do what I need?" and there are many 
people on this list who can help you zero in on answering that question.


SGE is used on everything from single-node servers to the 60,000+ CPU 
cores of the Ranger cluster over at TACC.


The devil is in the details of what you are trying to do of course!
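That said, a crude scheduler-throughput probe is easy to improvise: time how long the scheduler takes to dispatch and finish a batch of trivial jobs. A sketch, not a benchmark suite; it assumes a working default queue on an SGE submit host, and the numbers only mean something relative to earlier runs on the same cluster:

```shell
# Rough scheduler-throughput probe: N trivial array tasks, timed end to end.
N=500
if command -v qsub >/dev/null 2>&1; then
    # -b y: run /bin/true as a binary job; -sync y: block until all tasks finish
    time qsub -sync y -b y -cwd -t 1-"$N" /bin/true
else
    echo "qsub not found; run this on an SGE submit host"
fi
```

Comparing wall-clock times for the same N before and after a configuration change gives a relative feel for scheduler overhead, which is usually what "benchmarking SGE" boils down to in practice.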

-Chris



Eric Kaufmann wrote:

I am fairly new to SGE. I am interested in getting some benchmark
information from SGE.

Are there any tools for this etc?

Thanks,

Eric


