On 05/01/2014 02:47 PM, Michael Stauffer wrote:

    Random ideas:

    1. try disabling the log redirects to see if anything ends up in the
    standard kickstart log?


OK I'll try this. Have to wait for a host to free up to try a reinstall again.

    2. SGE is unusually sensitive to hostname and DNS resolution. Is your
    kickstart environment giving the node the same IP address during
    provisioning as it has when running? Does your kickstart environment
    have reverse DNS lookup working so that a lookup on the IP returns the
    proper hostname?


I'll dump tests in the kickstart file and check.
Don't know how to check the last bit - you mean a lookup on the IP by the execute host as it's booting?
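
One way to check that last bit is to run a forward lookup on the hostname and then a reverse lookup on the resulting IP, from the node itself during the install. This is a minimal sketch, assuming getent and awk are available in the install environment; the function name check_dns is made up for illustration.

```shell
# Hypothetical helper: forward-resolve a name, then reverse-resolve the IP
# and compare. Uses the system resolver (nsswitch), so /etc/hosts counts too.
check_dns() {
    name=$1
    # Forward lookup: first field of the getent output is the address.
    ip=$(getent hosts "$name" | awk '{print $1; exit}')
    if [ -z "$ip" ]; then
        echo "FAIL: no forward lookup for $name"
        return 1
    fi
    # Reverse lookup: second field is the canonical name for that address.
    back=$(getent hosts "$ip" | awk '{print $2; exit}')
    if [ "$back" = "$name" ]; then
        echo "OK: $name -> $ip -> $back"
    else
        # A short name vs. FQDN difference also lands here; worth a look.
        echo "MISMATCH: $name -> $ip -> ${back:-<nothing>}"
        return 1
    fi
}
```

Calling check_dns "$(hostname)" in the post-install script and again on the running node, then comparing the two outputs, should show whether the name or address differs between provisioning and normal operation.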

Here's a tip for troubleshooting kickstart installs:

Depending on where you want to do your debugging (before or after the installation), add something like "sleep 1000" to your pre- or post-install script. Then from the console, use ALT+F1, ALT+F2, etc., to get a root prompt and run some commands from the command line. You can also cd to /tmp and look at the logs there, as well as the kickstart file that the install is working from. This is much easier and quicker than repeating the cycle of changing the kickstart file, rebooting, and testing.
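
A variation on the plain "sleep 1000" is a pause that can be ended early from the console. This is only a sketch: the function name debug_pause and the flag file path are made up, and it assumes touch, sleep and rm exist in the install environment.

```shell
# Hypothetical helper for a kickstart pre- or post-install script:
# wait until the flag file is removed, or until the timeout expires.
debug_pause() {
    flag=${1:-/tmp/ks-debug-pause}   # assumed path, pick anything writable
    max=${2:-1000}                   # seconds, same as "sleep 1000"
    touch "$flag"
    waited=0
    while [ -e "$flag" ] && [ "$waited" -lt "$max" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    rm -f "$flag"                    # clean up if we timed out
}
```

With debug_pause in the script, removing the flag file from the root prompt lets the install continue as soon as you are done looking around.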

    3. qconf requires communication with the qmaster. It looks like you are
    defining ENV vars that point only to the bin directory rather than
    setting up the full SGE environment during the kickstart. Consider
    sourcing the SGE init scripts or at least setting SGE_ROOT and SGE_CELL
    values so that the SGE binaries can navigate to
    $SGE_ROOT/$SGE_CELL/common/act_qmaster and know what host to be
    communicating with.


I source /etc/profile.d/sge-binaries.sh at the beginning of my code. Should I need anything other than that? In any case, I'm dumping the relevant env vars in the kickstart now to check them.
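
For dumping those values, something along these lines could go in the post-install script. It is a sketch under assumptions: the function name sge_env_report is made up, and it expects the act_qmaster file under the cell's common directory, which is where a standard SGE install keeps it.

```shell
# Hypothetical helper: report the SGE settings a client like qconf relies on.
# Usage: sge_env_report "$SGE_ROOT" "$SGE_CELL"
sge_env_report() {
    root=$1
    cell=$2
    echo "SGE_ROOT=$root"
    echo "SGE_CELL=$cell"
    # act_qmaster names the host the SGE binaries will try to contact.
    qm_file="$root/$cell/common/act_qmaster"
    if [ -r "$qm_file" ]; then
        echo "act_qmaster=$(cat "$qm_file")"
    else
        echo "act_qmaster file not readable: $qm_file"
        return 1
    fi
}
```

If SGE_ROOT or SGE_CELL come out empty, or the file is not readable during the install (for example because the shared SGE directory is not mounted yet), qconf has no way to find the qmaster.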

Just for the record, I tried doing this a few years ago with SGE 6.2u5, and for whatever reason, I couldn't get the inst_sge script to ever work correctly in the post-install environment. After a few days of fighting with it, I configured everything BUT sge and then used Cluster SSH to run ./inst_sge on all 64 hosts simultaneously, in auto-mode with no interaction, obviously.

Potentially dumb question: are you running inst_sge first, to make sure the host is configured and 'installed' properly before running those qconf commands?

Prentice





