Hi,

I posted recently looking for help with migrating from OGS 2011.11p1 to Son
of Grid Engine 8.1.8 (SoGE). This was part of a Rocks cluster upgrade from
6.1 to 6.2 (CentOS 6.3 to 6.6). Thanks to all those who helped me out! I
got it all worked out finally, and below are my notes in the hope that the
next person who has to do this will have a much easier time.

Because SoGE doesn't exist as a Rocks roll, you have to do the install
manually.

==== Preparation ====

** Back up your OGS/SGE config

Run $SGE_ROOT/util/upgrade_modules/save_sge_config.sh

Make a manual copy of $SGE_ROOT for sanity's sake and reference to things
that don't get saved in the above dump.

(Another option is to use '$SGE_ROOT/inst_sge -bup'. This appears to use a
different backup mechanism. I ran this one as well in case it restored
better than from the above option. However I had weird permissions trouble
with the restore using '$SGE_ROOT/inst_sge -rst', so I didn't end up using
it.)
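
The two backup steps above could be wrapped up roughly like this. It is only
a sketch: the function name and the destination layout are made up for
illustration, and it assumes save_sge_config.sh takes the output directory as
its first argument (check the script's usage message on your version).

```shell
# Sketch: dump the SGE configuration and archive the whole $SGE_ROOT tree.
# Assumption: save_sge_config.sh takes the output directory as its first
# argument. backup_sge and the layout under $dest are hypothetical.
backup_sge() {
    sge_root=$1   # e.g. /opt/sge
    dest=$2       # backup location, ideally off the system disk

    mkdir -p "$dest" || return 1

    # 1. Configuration dump, for load_sge_config.sh to read back later.
    "$sge_root/util/upgrade_modules/save_sge_config.sh" "$dest/config" || return 1

    # 2. Full copy of $SGE_ROOT for anything the dump does not cover.
    tar -czf "$dest/sge_root.tar.gz" \
        -C "$(dirname "$sge_root")" "$(basename "$sge_root")"
}
```

Usage would be something like: backup_sge "$SGE_ROOT" /export/backup/ogs
(the destination path is just an example).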

==== Uninstall ====

If you're not doing a fresh OS install, you probably want to uninstall OGS,
or at least stop sgemaster and sge_execd and move $SGE_ROOT to a backup
location, on the FE and nodes. There might be other uninstall steps; I
don't know, since I did a fresh OS install.
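
For the "move $SGE_ROOT to a backup location" part, something like the
following might do. I did not run this myself (fresh install), so treat it as
an untested sketch; the function name and the backup suffix are made up.

```shell
# Sketch: move an existing $SGE_ROOT aside instead of deleting it.
# Stop the qmaster/execd services first. retire_sge_root and the
# ".ogs-<date>" suffix are hypothetical, not part of OGS or SoGE.
retire_sge_root() {
    sge_root=$1
    backup="${sge_root}.ogs-$(date +%Y%m%d)"

    if [ ! -d "$sge_root" ]; then
        echo "no such directory: $sge_root" >&2
        return 1
    fi
    if [ -e "$backup" ]; then
        echo "backup already exists: $backup" >&2
        return 1
    fi

    # Print the new location so the caller can record it.
    mv "$sge_root" "$backup" && echo "$backup"
}
```

On each host you would stop the daemons and then call
retire_sge_root /opt/sge (assuming /opt/sge is your $SGE_ROOT).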

==== Install SoGE ====

Get SoGE RPMs from here: https://arc.liv.ac.uk/trac/SGE

** Do a full install and get things running before restoring your previous
config.
Below are notes on issues that I encountered.
Detailed instructions are here, in multiple pages:
http://www.softpanorama.org/HPC/Grid_engine/

   - Master host:
      - Installation of Son of Grid Engine 8.1.8 RPMs for Master Host
        <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_master_host.shtml>
      - Installation of Grid Engine Master Host
        <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_master_host.shtml>
   - Execution host:
      - Installation of the Son of Grid Engine 8.1.8 RPMs for Execution Host
        <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_execution_host.shtml>
      - Installation of the Grid Engine Execution Host
        <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_execution_host.shtml>
      - Using the command line installer
        <http://www.softpanorama.org/HPC/Grid_engine/Installation/using_the_command_line_installer.shtml>

** Share $SGE_ROOT with compute nodes (execution hosts)?

You'll see discussion about this in the above docs. I decided to share the
complete $SGE_ROOT from the front end to the nodes via NFS. This is simpler
and shouldn't cause a problem on my smallish cluster (21 nodes, 332 cores)
which has 10Gb local switching.

** install_initd error in ./install_qmaster

Running $SGE_ROOT/install_qmaster gave me an error at the step where the
sgemaster.<cluster-name> init script is installed.
While debugging, note that install_initd doesn't output an error; it just
returns 1 instead of 0.
I traced this to the dependency lines


 # Required-Start: $network $remote_fs
 # Required-Stop: $network $remote_fs

in /etc/init.d/sgemaster.<cluster-name>. For whatever reason it doesn't
like the $remote_fs dependency.

I changed $SGE_ROOT/util/rctemplates/sgemaster_template to this instead:

 # Required-Start: $network $local_fs
 # Required-Stop: $network $local_fs

And then reran install_qmaster. I believe this should be fine since SoGE
doesn't rely on remote filesystems on my front end. Only the compute nodes
do, to mount /opt/sge, and for their init script it doesn't complain about
the $remote_fs dependency. Go figure. So far things are working fine.
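
If you would rather script the template change than edit it by hand, a
sketch along these lines should work. The function name is made up, and it
only touches the Required-Start and Required-Stop lines, keeping a .orig copy
of the template.

```shell
# Sketch: swap $remote_fs for $local_fs on the LSB Required-Start and
# Required-Stop lines of an init script template, keeping a .orig copy.
# patch_lsb_deps is a hypothetical helper, not part of SoGE.
patch_lsb_deps() {
    template=$1

    # Keep the untouched template next to the patched one.
    cp -p "$template" "$template.orig" || return 1

    # Single quotes keep the shell from expanding $remote_fs / $local_fs.
    sed -e '/Required-Start:/s/\$remote_fs/$local_fs/' \
        -e '/Required-Stop:/s/\$remote_fs/$local_fs/' \
        "$template.orig" > "$template"
}
```

For example: patch_lsb_deps "$SGE_ROOT/util/rctemplates/sgemaster_template"
and then rerun install_qmaster.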

==== Restore SGE configurations ====

** Run $SGE_ROOT/util/upgrade_modules/load_sge_config.sh

I had an issue with the hostname on my front end. This script was picking
up the FQDN and it was conflicting with the local hostname that the script
wanted to see for qmaster host. I temporarily set the hostname to the local
one and the script was happy.
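
The hostname workaround could be scripted roughly as below. This is a sketch
under assumptions: it takes "local hostname" to mean the short, unqualified
name (on your cluster it may be something else), both function names are made
up, and it has to run as root.

```shell
# Strip the domain part from a fully qualified name.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Sketch: run a command with the system hostname temporarily set to its
# short form, then restore the original name whether or not the command
# succeeded. Needs root. Both helpers here are hypothetical.
with_short_hostname() {
    orig=$(hostname) || return 1
    hostname "$(short_name "$orig")" || return 1

    "$@"
    status=$?

    hostname "$orig"
    return "$status"
}
```

Usage would be something like:
with_short_hostname "$SGE_ROOT/util/upgrade_modules/load_sge_config.sh" <backup-dir>
(assuming the script takes the backup directory as its argument).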

==== Add nodes / execution hosts ====

Here are the steps to add an exec host after it's booted up (I have the
RPMs added to the Rocks distro per the above install instructions, but I'm
not sure if that's needed with the fully-shared /opt/sge dir). I still need
to add this to an init script of some sort so it can run automatically as
part of the Rocks distro for the exec hosts.

# Set the sgeadmin UID and GID
usermod -u 399 sgeadmin
groupmod -g 399 sgeadmin
# Mount the shared /opt/sge from the front end
echo "#manually added" >> /etc/fstab
echo "<front-end>:/opt/sge /opt/sge nfs defaults,noatime 0 0" >> /etc/fstab
mount /opt/sge
# Load the SGE environment (or whatever you call this file for sge env setup)
. /etc/profile.d/cfn-sge-env.sh
# Install and start the execd init script
cp $SGE_ROOT/default/common/sgeexecd /etc/init.d/sgeexecd.<cluster-name>
/usr/lib/lsb/install_initd /etc/init.d/sgeexecd.<cluster-name>
service sgeexecd.<cluster-name> start
# Check that sge_execd is running
ps -ef | grep sge
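
If these steps end up in a script that runs on every boot, the fstab append
needs to be safe to repeat. A possible sketch, with a made-up function name,
that only adds the entry when it is not already there:

```shell
# Sketch: add the NFS entry for the shared SGE directory to fstab only if
# it is missing, so the step can be rerun safely.
# ensure_nfs_mount is a hypothetical helper.
ensure_nfs_mount() {
    fstab=$1    # normally /etc/fstab
    server=$2   # front-end host name
    dir=$3      # e.g. /opt/sge

    # Already present? Compare the first fstab field exactly.
    if awk -v src="$server:$dir" \
        '$1 == src { found = 1 } END { exit !found }' "$fstab"; then
        return 0
    fi

    printf '%s\n' "#manually added" \
        "$server:$dir $dir nfs defaults,noatime 0 0" >> "$fstab"
}
```

For example: ensure_nfs_mount /etc/fstab <front-end> /opt/sge, followed by
mount /opt/sge as before.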


HTH!

-M