To the list,

I've been reading various web pages about the differences between "loose" and "tight" integration of Parallel Environments (PEs) in Grid Engine (GE). One page that explains this really well is Dan Templeton's "PE Tight Integration" (https://blogs.oracle.com/templedf/entry/pe_tight_integration). I would like to confirm my understanding of "loose" and "tight" integration, as well as the role of the "rsh" wrapper in the process.

1. As best I can tell, regardless of whether an application is set up to use "loose" or "tight" integration, the GE execution daemon "sge_execd" starts the "Master" task of the parallel job. An example would be an MPI (e.g. LAM, Intel, Platform, Open MPI) application. So I'm assuming that "sge_execd" forks off an "sge_shepherd" process, which in turn starts something like "mpirun" or some script. Is this correct?

2. The difference between "loose" and "tight" integration lies in how the parallel job's "Slave" tasks are handled. With "loose" integration, the slave tasks/processes are neither started nor managed by GE. The application starts them itself via something like "rsh" or "ssh". An example is mpirun starting the slave processes on the nodes listed in the "$pe_hostfile" provided by GE. With "tight" integration, the slave tasks/processes are started and managed by GE, through the use of "qrsh". Is this correct?

3. One of the things I read in the document discussing "loose" and "tight" integration with LAM MPI was the difference in how the two handle "accounting", and how the processes of a parallel job are handled when the job is deleted with qdel. By "accounting", does this mean that GE is better able to keep track of where each slave task runs and how many resources the slave tasks use? And does this mean that "tight" integration is preferable to "loose" integration, since it lets GE keep better track of the resources used by the slave tasks and lets a job be deleted in a "cleaner" manner?

4. Continuing with "tight" integration.
Does this also mean that if a parallel MPI application uses "rsh" or "ssh" for communication between the Master and Slave tasks/processes, "qrsh" essentially intercepts or replaces the communication performed by "rsh" or "ssh"? And is this why the "rsh" wrapper script is used to facilitate "tight" integration? Is that correct?

5. I was reading some postings in the GE archive from someone named "Reuti" regarding the "rsh" wrapper script. If I understood him correctly, it doesn't matter whether the parallel MPI application uses "rsh" or "ssh"; the "rsh" wrapper script provided by GE just forces the application to use GE's qrsh. Am I stating this correctly? Another way to put it is that "rsh" is just a name. The name could be anything, as long as the MPI application is configured to use that name for its remote communication command; the basic contents of the wrapper script would not change apart from the name "rsh" and the locations of the scripts it references. Again, am I stating this correctly?

6. Regarding the various MPI implementations from different vendors: what exactly does it mean for an MPI implementation to be GE aware? I tend to think it means that parallel applications built with a GE-aware MPI implementation know where to find the "$pe_hostfile" that GE generates based on the resources the parallel application needs. Is that all there is to an MPI implementation being GE aware? I know that with Intel MPI or Open MPI, the PEs I've created don't really require any special scripts for the "start_proc_args" and "stop_proc_args" parameters. However, from what little I have seen, the LAM and Platform MPI implementations appear to require scripts along the lines of "startmpi.sh" and "stopmpi.sh" to set up a properly formatted $pe_hostfile for use by these MPI implementations.
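
To make that conversion concrete, here is a minimal Python sketch of how I picture it. It is my own illustration, not the actual "startmpi.sh". It assumes each $pe_hostfile line starts with a hostname followed by a slot count, and that the MPI implementation wants one hostname per slot; the function name and the sample hosts are made up.

```python
def pe_hostfile_to_machines(pe_hostfile_text):
    """Expand pe_hostfile-style lines into one hostname per granted slot.

    Assumed input format per line: <hostname> <slots> <queue> <processor range>
    Only the first two fields are used here.
    """
    machines = []
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        # Repeat the hostname once for every slot granted on that host.
        machines.extend([host] * slots)
    return machines


if __name__ == "__main__":
    # Hypothetical hostfile contents, for illustration only.
    sample = (
        "node01 2 all.q@node01 UNDEFINED\n"
        "node02 1 all.q@node02 UNDEFINED\n"
    )
    print("\n".join(pe_hostfile_to_machines(sample)))
```
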
Is my understanding of this correct?

7. I was looking at the following options in the output of "qconf -sconf" (global configuration) from GE:

    qlogin_command   builtin
    qlogin_daemon    builtin
    rlogin_command   builtin
    rlogin_daemon    builtin
    rsh_command      builtin
    rsh_daemon       builtin

I am trying to fully understand how these parameters relate to the execution of parallel application jobs in GE. What I'm wondering is: if the parallel application I want GE to manage uses "ssh" by default for communication between Master and Slave tasks, do these parameters need to be configured to use "slogin", "ssh", "sshd", etc.?

Apologies for all the questions. I just want to make sure I understand PEs a bit better.

Kind Regards,

-------
Wayne Lee