Hi Marcus,

There are several things you need to do:

1) You need to make sure that $SGE_ROOT is shared between the master and all 
potential shadow_master machines (the master needs to run a shadow_master as 
well). This is how a new qmaster can take over if things go wrong on the real 
qmaster: the shadow_master starts a new qmaster, which simply reads the 
configuration, opens the spooling and starts rebuilding the state of the system. 
(Most smaller clusters just put $SGE_ROOT on a shared filesystem like NFS; 
larger clusters use an appliance like NetApp or Isilon, or something similar - 
see the example just below.)
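
As a rough illustration of the NFS case (the server name, export path and mount 
options here are only placeholders - use whatever matches your environment):

    # on the NFS server, export the SGE root, e.g. in /etc/exports:
    /opt/sge    *(rw,sync,no_root_squash)

    # on the qmaster and on every shadow master candidate, mount it at the
    # same path, e.g. via /etc/fstab:
    filer:/opt/sge    /opt/sge    nfs    rw,hard,intr    0 0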

2) You need to run the shadow_master installation on the qmaster and on all 
hosts that will be shadow masters. There is an installation script, 
'$SGE_ROOT/inst_sge -sm', which does all sorts of things, including installing 
the shadow master (see the sketch below). (Note: if you google this you will 
probably find instructions saying to create a shadow_masters file in 
$SGE_ROOT/default and put your host names in there. Yes, that does work, but it 
doesn't check a few more things to make sure the shadow_master will actually 
work.)

3) The shadow master daemon needs to be running on the qmaster machine, and all 
machines that could become masters (all potential backup machines) need to be 
running a shadow_master too. If you ran the 'inst_sge -sm' above then it also 
installed and started the shadow daemon for you. Make sure you run 'inst_sge 
-sm' on all shadow master candidate machines.
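
To verify, something along these lines on each candidate host (the binary path 
depends on your architecture string, e.g. lx-amd64):

    # check that the shadow daemon is up
    ps -ef | grep sge_shadowd

    # if it is not running, it can be started by hand from the
    # architecture-specific bin directory, e.g.:
    $SGE_ROOT/bin/lx-amd64/sge_shadowd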

You make your changes using 'qconf'. Really try to avoid making changes to the 
system by editing the files directly; while that is possible, it is much better 
to do it with 'qconf'.
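
For example, these are the sort of qconf calls you would use rather than 
touching the spool files:

    qconf -sconf        # show the global configuration
    qconf -mconf        # modify the global configuration in an editor
    qconf -sq all.q     # show a queue definition
    qconf -mq all.q     # modify a queue definition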

You don't need to create queues on the shadow_masters. It is a pretty simple 
system. The shadow_master daemons are really just heartbeat daemons. They talk 
to each other, and the one on the master host pings the qmaster to see if it is 
alive. If it is dead, they 'wait a bit' and then start the process of migrating 
the master. The shadow_master daemon on the master candidate machine launches a 
new qmaster, which promptly takes over, writes its name into the act_qmaster 
file and becomes the master of the cluster.
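
If you ever want to see which host currently owns the cluster, the act_qmaster 
file is the place to look (path shown for the default cell):

    cat $SGE_ROOT/default/common/act_qmaster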

So you see the shadow_master doesn't run anything itself - it starts a new 
qmaster, which reads the configuration, opens the spooling, rebuilds state, and 
the cluster runs as usual.

Regards,

Bill.



> On Jun 2, 2016, at 7:35 PM, Coleman, Marcus [JRDUS Non-J&J] 
> <[email protected]> wrote:
> 
> Another Question I have about shadow master is do I need to configure/create 
> the same queues that I have on the setup on the master?
> 
> How will the shadow master run the jobs without queues?
> 
> 
> 
> From: Coleman, Marcus [JRDUS Non-J&J]
> Sent: Thursday, June 02, 2016 4:19 PM
> To: '[email protected]'
> Subject: Shadow master
> 
> Hi All
> 
> My main question is do we make all the configuration changes on the Slave or 
> Master?
> 
> 
> Does the “shadow_masters” file need to be on the slave or master? I have the 
> file on both
> Does the shadowd needs to be running on the slave or master? I have the 
> daemon running on the slave
> 
> 
> I have the $SGE_ROOT/$SGE_CELL/common    and    /spool     directory shared 
> via NFS in FSTAB…
> 
> 
> 
> Is this correctly configured or am I missing somethings…
> 
> 
> 
> 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

William Bryce | VP Products
Univa Corporation, Toronto
E: [email protected] | D: 647-9742841 | Toll-Free (800) 370-5320
W: Univa.com | FB: facebook.com/univa.corporation | T: twitter.com/Grid_Engine


