Hi,

I posted recently looking for help with migrating from OGS 2011.11p1 to Son of Grid Engine 8.1.8 (SoGE). This was part of a Rocks cluster upgrade from 6.1 to 6.2 (CentOS 6.3 to 6.6). Thanks to all those who helped me out! I finally got it all worked out, and below are my notes, in the hope that the next person who has to do this will have a much easier time.
Because SoGE doesn't exist as a Rocks roll, you have to do the install manually.

==== Preparation ====

** Back up your OGS/SGE config

Run $SGE_ROOT/util/upgrade_modules/save_sge_config.sh

Make a manual copy of $SGE_ROOT for sanity's sake and for reference to things that don't get saved in the above dump.

(Another option is to use '$SGE_ROOT/inst_sge -bup'. This appears to use a different backup mechanism. I ran this one as well in case it restored better than the above option. However, I had weird permissions trouble with the restore using '$SGE_ROOT/inst_sge -rst', so I didn't end up using it.)

==== Uninstall ====

If you're not doing a fresh OS install, you probably want to uninstall OGS, or at least stop sgemaster and sge_execd and move $SGE_ROOT to a backup location, on the FE and nodes. There might be other uninstall steps; I don't know, since I did a fresh OS install.

==== Install SoGE ====

Get the SoGE RPMs from here: https://arc.liv.ac.uk/trac/SGE

** Do a full install and get things running before restoring your previous config. Below are notes on issues that I encountered.
Detailed instructions are here, in multiple pages:
http://www.softpanorama.org/HPC/Grid_engine/

- Master host:
  - Installation of Son of Grid Engine 8.1.8 RPMs for Master Host
    <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_master_host.shtml>
  - Installation of Grid Engine Master Host
    <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_master_host.shtml>
- Execution host:
  - Installation of the Son of Grid Engine 8.1.8 RPMs for Execution Host
    <http://www.softpanorama.org/HPC/Grid_engine/Implementations/Son_of_grid_engine/installation_of_soge818_rpms_for_execution_host.shtml>
  - Installation of the Grid Engine Execution Host
    <http://www.softpanorama.org/HPC/Grid_engine/Installation/installation_of_execution_host.shtml>
- Using the command line installer
  <http://www.softpanorama.org/HPC/Grid_engine/Installation/using_the_command_line_installer.shtml>

** Share $SGE_ROOT with compute nodes (execution hosts)?

You'll see discussion about this in the above docs. I decided to share the complete $SGE_ROOT from the front end to the nodes via NFS. This is simpler and shouldn't cause a problem on my smallish cluster (21 nodes, 332 cores), which has 10Gb local switching.

** install_initd error in ./install_qmaster

Running $SGE_ROOT/install_qmaster gave me an error at the step where the sgemaster.<cluster name> init script is installed. While debugging, note that install_initd doesn't output an error; it just returns 1 instead of 0. I traced this to the dependency lines

    # Required-Start: $network $remote_fs
    # Required-Stop: $network $remote_fs

in /etc/init.d/sgemaster.<cluster-name>. For whatever reason it doesn't like the $remote_fs dependency. I changed $SGE_ROOT/util/rctemplates/sgemaster_template to this instead:

    # Required-Start: $network $local_fs
    # Required-Stop: $network $local_fs

And then I reran install_qmaster.
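For anyone who would rather script the template change than edit it by hand, here is a minimal sketch. It assumes the template contains the two Required-* lines exactly as quoted above; patch_required_lines is a hypothetical helper name, not something shipped with SoGE. The demo at the bottom runs against a scratch file, so nothing under $SGE_ROOT is touched.

```shell
#!/bin/sh
# Sketch only: swap $remote_fs for $local_fs on the LSB Required-Start/Stop
# lines of an init-script template. patch_required_lines is a hypothetical
# helper name, not part of SoGE.
patch_required_lines() {
    tmpl=$1
    # Keep an untouched copy next to the template before editing.
    cp -p "$tmpl" "$tmpl.orig" || return 1
    # Only the two Required-* header lines are rewritten; every other line
    # is copied through unchanged.
    sed -e '/^# Required-Start:/ s/\$remote_fs/$local_fs/' \
        -e '/^# Required-Stop:/ s/\$remote_fs/$local_fs/' \
        "$tmpl.orig" > "$tmpl"
}

# Demo on a scratch file (assumed stand-in for the real template).
demo=$(mktemp)
printf '%s\n' \
    '# Required-Start: $network $remote_fs' \
    '# Required-Stop: $network $remote_fs' > "$demo"
patch_required_lines "$demo"
cat "$demo"
```

To apply it for real, the call would presumably be patch_required_lines "$SGE_ROOT/util/rctemplates/sgemaster_template", followed by rerunning install_qmaster as described above.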
I believe this should be fine since SoGE doesn't rely on remote filesystems on my system (just the compute nodes do, to mount /opt/sge, but in their init config it doesn't complain about the $remote_fs dependency. Go figure.) So far things are working fine.

==== Restore SGE configurations ====

** Run $SGE_ROOT/util/upgrade_modules/load_sge_config.sh

I had an issue with the hostname on my front end. This script was picking up the FQDN, which conflicted with the local hostname that the script wanted to see for the qmaster host. I temporarily set the hostname to the local one and the script was happy.

==== Add nodes / execution hosts ====

Here are the steps to add an exec host after it's booted up (I have the RPMs added to the Rocks distro per the above install instructions, but I'm not sure if that's needed with the fully shared /opt/sge dir). I still need to add this to an init script of some sort so it can run automatically as part of the Rocks distro for the exec hosts.

    usermod -u 399 sgeadmin
    groupmod -g 399 sgeadmin
    echo "#manually added" >> /etc/fstab
    echo "<front-end>:/opt/sge /opt/sge nfs defaults,noatime 0 0" >> /etc/fstab
    mount /opt/sge
    . /etc/profile.d/cfn-sge-env.sh   #or whatever you call this file for sge env setup
    cp $SGE_ROOT/default/common/sgeexecd /etc/init.d/sgeexecd.<cluster-name>
    /usr/lib/lsb/install_initd /etc/init.d/sgeexecd.<cluster-name>
    service sgeexecd.<cluster-name> start
    ps -ef | grep sge

HTH!
-M
