[gridengine users] Q: Understanding of Loose and Tight Integration of PEs.

Lee, Wayne Wed, 18 Nov 2015 13:02:02 -0800

To list,

I've been reading various web pages about the differences between "loose" and
"tight" integration of Parallel Environments (PEs) within Grid Engine (GE). One
page that explains this really well is Dan Templeton's "PE Tight Integration"
(https://blogs.oracle.com/templedf/entry/pe_tight_integration). I would like to
confirm my understanding of "loose" and "tight" integration, as well as the
role the "rsh" wrapper plays in the process.



1.       As best I can tell, regardless of whether an application is set up to
use "loose" or "tight" integration, the GE "sge_execd" execution daemon starts
the "Master" task of the parallel job. An example of such an application would
be an MPI (e.g. LAM, Intel, Platform, Open, etc.) application. So I'm assuming
the "sge_execd" daemon forks off an "sge_shepherd" process, which in turn
starts something like "mpirun" or some script. Is this correct?


2.       The difference between "loose" and "tight" integration is in how the
parallel job's "Slave" tasks are handled. With "loose" integration, the slave
tasks/processes are neither started nor managed by GE; the application starts
them itself via something like "rsh" or "ssh". An example of this is mpirun
starting the slave processes on the nodes listed in the "$pe_hostfile" provided
by GE. With "tight" integration, the slave tasks/processes are started and
managed by GE, through the use of "qrsh". Is this correct?



3.       The document discussing "loose" and "tight" integration with LAM MPI
mentions differences in "accounting" and in how the processes of a parallel job
are handled when the job is deleted with qdel. By "accounting", does this mean
that GE is better able to keep track of where each slave task is running and
how many resources each one uses? And does this mean that "tight" integration
is preferable to "loose" integration, since it lets GE better track the
resources used by the slave tasks and delete the job in a "cleaner" manner?


4.       Continuing with "tight" integration: if a parallel MPI application
uses "rsh" or "ssh" for communication between the Master and Slave
tasks/processes, does "qrsh" essentially intercept or replace the
communications performed by "rsh" or "ssh"? And is that why the "rsh" wrapper
script is used to facilitate "tight" integration?



5.       I was reading some postings in the GE archive from someone named
"Reuti" regarding the "rsh" wrapper script. If I understood him correctly, it
doesn't matter whether the parallel MPI application uses "rsh" or "ssh"; the
"rsh" wrapper script provided by GE is just there to force the application to
use GE's qrsh. Am I stating this correctly? Put another way, "rsh" is just a
name. The name could be anything, as long as the MPI application is configured
to call that same name for its remote communication. The basic contents of the
wrapper script won't change, apart from the name "rsh" and the locations of the
scripts it references. Again, am I stating this correctly?
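
To make points 4 and 5 concrete, here is a minimal Python sketch of what I
believe the wrapper's logic amounts to. It is not the script shipped with GE:
the function name build_qrsh_argv and the option handling are my own
assumptions, and the only GE piece it relies on is "qrsh -inherit".

```python
def build_qrsh_argv(rsh_argv, qrsh_path="qrsh"):
    """Translate an rsh-style argument list into a `qrsh -inherit` one.

    Illustrative only. rsh_argv holds the arguments the MPI launcher would
    pass to "rsh", e.g. ["-n", "-l", "wayne", "node02", "hostname"].
    """
    args = list(rsh_argv)
    # Skip leading rsh options; this sketch only knows -n and -l <user>.
    while args and args[0].startswith("-"):
        opt = args.pop(0)
        if opt == "-l":
            if not args:
                raise ValueError("-l requires a user name")
            args.pop(0)  # the user name is simply ignored in this sketch
        elif opt != "-n":
            raise ValueError(f"unsupported option: {opt}")
    if len(args) < 2:
        raise ValueError("expected: [options] host command [args...]")
    host, command = args[0], args[1:]
    # Hand the remote start over to Grid Engine instead of rsh/ssh.
    return [qrsh_path, "-inherit", host] + command
```

For example, build_qrsh_argv(["-n", "node02", "hostname"]) returns
["qrsh", "-inherit", "node02", "hostname"]. A wrapper installed under the name
the MPI launcher calls ("rsh", "ssh", or anything else) would then execute
that argument list, which is why I think the name itself is arbitrary.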



6.       With regard to the various MPI implementations from different
vendors: what exactly does it mean that certain MPI implementations are GE
aware? I tend to think it means that parallel applications built with GE-aware
MPI implementations know where to find the "$pe_hostfile" that GE generates
based on the resources the parallel application needs. Is that all there is to
an MPI implementation being GE aware? I know that with Intel MPI or Open MPI,
the PEs I've created don't really require any special scripts for the
"start_proc_args" and "stop_proc_args" parameters in the PE. However, based on
what little I have seen, LAM and Platform MPI appear to require scripts modeled
on "startmpi.sh" and "stopmpi.sh" in order to set up a properly formatted
$pe_hostfile for use by these MPI implementations. Is my understanding correct?
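
As an illustration of the kind of reformatting I mean in point 6, here is a
small Python sketch. It assumes, as I understand the format, that each line of
the $pe_hostfile starts with a host name followed by the number of slots
granted on that host. The function name, the sample host and queue names, and
the "one host name per slot" output layout are my own assumptions, not
something taken from "startmpi.sh".

```python
def pe_hostfile_to_machines(text):
    """Expand pe_hostfile text into one host name per granted slot."""
    machines = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 2:
            raise ValueError(f"malformed pe_hostfile line: {line!r}")
        # First field: host name; second field: slots granted on that host.
        host, slots = fields[0], int(fields[1])
        machines.extend([host] * slots)
    return machines


# Hypothetical pe_hostfile contents (host and queue names are made up).
SAMPLE = (
    "node01 2 all.q@node01 UNDEFINED\n"
    "node02 1 all.q@node02 UNDEFINED\n"
)
```

With the made-up SAMPLE above, pe_hostfile_to_machines(SAMPLE) returns
["node01", "node01", "node02"]. A start script for an MPI implementation that
is not GE aware would write a list like this out in whatever machine-file
layout that implementation expects.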



7.       I was looking at the following options in the output of "qconf -sconf"
(global configuration) from GE:

qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

I am trying to fully understand how the above parameters relate to the
execution of parallel application jobs in GE. What I'm wondering is this: if
the parallel application I want GE to manage uses "ssh" by default for
communication between the Master and Slave tasks, does that mean the above
parameters need to be configured to use "slogin", "ssh", "sshd", etc.?

Apologies for all the questions. I just want to make sure I understand PEs a
bit better.

Kind Regards,

-------
Wayne Lee

