Hi,

> Am 01.12.2015 um 22:33 schrieb Lee, Wayne <[email protected]>:
> 
> Hi Reuti,
> 
> I just wanted to thank you again for providing answers and comments regarding 
> my inquiries about MPI and GE integration.   I was able to get our 3rd party 
> Platform MPI application to be managed by GE.   I did this by creating a GE 
> script setting the application's environment variables in the script along 
> with setting MPI_TMPDIR=$TMPDIR in the script.   At this point, I've 
> demonstrated to the application vendor that this works.   It is now up to 
> them to get their wrapper scripts to build the GE submission scripts.

Great!

> 
> I wanted to also ask you if there are standardized ways to handle OpenMP 
> applications with GE.   Is it similar to the ways we've been discussing with 
> MPI based applications?   I'm also working on getting another 3rd party 
> application to be managed by GE.   The vendor has stated that they use OpenMP 
> for multi-threading (in a single process) and that they phased out the use of 
> OpenMPI for multi-tasking (running multiple processes on different nodes 
> across the network).  

The integration of a shared memory or thread parallel job is much easier. 
Although one could fool SGE by submitting a serial job and just using more cores 
than granted, it's of course best to request the proper PE and number of cores. 
This way SGE can schedule the jobs correctly and the job script can be written 
generically with:

export OMP_NUM_THREADS=$NSLOTS

or a similar setting, where $NSLOTS will be set during the execution of the job 
to the granted number of cores (this is especially useful in case you requested 
a range of slots in the PE and can't know the final number in advance). 
Worth noting, though, is that some thread-based parallel libraries are greedy 
and will take all available cores by default. This is fine if you have 
exclusive access to a node, but will interfere with other jobs otherwise 
(unless the threads are bound to some set of CPUs anyway).
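
As a minimal sketch of such a job script (the PE name "smp" and the binary name 
are just placeholders, not taken from your setup):

#!/bin/sh
#$ -pe smp 4-8            # hypothetical SMP-type PE, requesting a range of slots
#$ -cwd
#$ -j y
# use exactly the granted number of cores for the OpenMP runtime
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_app           # placeholder for the actual application binary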


> From what I have seen when running this application interactively, it looks 
> like the application allows one to specify the total number of tasks in a 
> job.   The application also allows one to specify a path to a "machinefile", 
> the number of CPUs to be used for computations on each node and the maximum 
> amount of memory the application is allowed to allocate.    I'm guessing that 
> from a GE perspective, the total number of tasks could be thought of as the 
> total number of nodes to use where each node would use a specific number of 
> CPUs.

The original $PE_HOSTFILE of SGE has several columns and specifies the granted 
nodes and number of slots. The MPICH(1)* format instead uses a bare line per 
task to be started, which is what the default $SGE_ROOT/mpi/startmpi.sh 
generates, but one could assemble any other format there as well.
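
For illustration only (host names and slot counts are made up), a $pe_hostfile 
like:

node01 4 all.q@node01 UNDEFINED
node02 4 all.q@node02 UNDEFINED

could be converted to such a one-line-per-task machine file, e.g. in the 
start_proc_args script or in the job script itself, with something along the 
lines of:

awk '{ for (i = 0; i < $2; i++) print $1 }' "$PE_HOSTFILE" > "$TMPDIR/machines"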

Is the OpenMP application using several nodes too? Then a task could refer to a 
node, on which in addition OpenMP is used; this depends on the application. As 
long as all nodes get the same number of cores on each machine (which could be 
enforced with a PE where a fixed number of cores is specified as the 
allocation_rule) it could be sufficient to define:

((OMP_NUM_THREADS=$NSLOTS/$NHOSTS)); export OMP_NUM_THREADS

Otherwise a special starter_method for each task would be necessary to set the 
proper value, or some kind of built-in support like the one you mention to be 
available.
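
Just as a sketch of such a starter_method (assuming the plain host name appears 
in the first column of $PE_HOSTFILE and the granted slots in the second; the 
matching might need adjustment for fully qualified names):

#!/bin/sh
# hypothetical starter_method: use only the slots granted on this very host
OMP_NUM_THREADS=$(awk -v h="$(hostname)" '$1 == h { print $2 }' "$PE_HOSTFILE")
export OMP_NUM_THREADS
exec "$@"                 # hand over to the actual job script resp. command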

-- Reuti

*) IIRC MPICH(1) could be compiled in a special way to use forks instead of a 
local `qrsh -inherit ...` to start additional local processes, and in this case 
it was also necessary to have one "<hostname>:<cores>" entry per line in the 
machine file, otherwise the default startup by `qrsh -inherit ...` kicked in.
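
Such a machine file would then simply look like (made-up host names):

node01:4
node02:4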


> Any thoughs/comments would be appreciated.
> 
> Regards,
> 
> -----
> Wayne Lee
> 
> 
> -----Original Message-----
> From: Reuti [mailto:[email protected]] 
> Sent: Saturday, November 21, 2015 8:26 AM
> To: Lee, Wayne <[email protected]>
> Cc: [email protected] Group <[email protected]>
> Subject: Re: [gridengine users] Q: Understanding of Loose and Tight 
> Integration of PEs.
> 
> Hi,
> 
> Am 21.11.2015 um 06:14 schrieb Lee, Wayne:
> 
>> Hi Reuti,
>> 
>> First of all.   Thanks (Danke) for your very quick and prompt reply.   I was 
>> amazed that you were able to reply so quickly given the time I received your 
>> reply, it must have been late at night for you.
>> 
>> Anyway, thank you for clarifying some things about the details regarding 
>> PEs.  I'm attempting to see how to use GE to manage an application using 
>> Platform MPI version 9.1.2.   I did see some postings from back in 2014 
>> where you provided some assistance to an individual attempting to use GE to 
>> manage a Platform MPI application.    As far as I know, Platform MPI doesn't 
>> have built in support for GE.
> 
> Correct. But they provide a MPICH(1) behavior to accept a plain list of nodes 
> for each slot with the -hostfile option.
> 
> 
>>  I would expect this given that Platform has LSF as their job scheduler and 
>> I wouldn't expect Platform to support a competing job scheduler like GE.
> 
> Well, it's owned by IBM now and they even have a free license of Platform 
> MPI: http://www.ibm.com/developerworks/downloads/im/mpi/ - and before it was 
> at HP.
> 
> 
>> After looking over the postings on the GE forum, I was able to submit the 
>> Platform MPI applications to GE using the following "qsub" command.   Based 
>> on what I did, I believe I've used "tight integration"
> 
> According to the `ps -e f` output you provided, it looks like a proper tight 
> integration.
> 
> 
>> since the slave processes were started up by "qrsh".   The job submitted 
>> used a total of 16 CPU cores, one for each MPI process (rank) that was 
>> requested.   The PE used was configured to distribute the processes in a 
>> round-robin fashion.  
>> 
>> qsub -l myapp_program=1 -l mr_p=8 -h -l AHC_JobType=myapp -l 
>> AHC_JobClass=prod -v 
>> PATH=/apps/myapp/tools/linux_x86_64/platform/9.1.2/bin:$PATH -v 
>> MPI_REMSH=ssh -v  \ MPI_TMPDIR=/tmp/ge -V -t 1-1 -N TEST_DATA -q 
>> prod_myapp_rr_pmpi.low.q -V -cwd -j y -b y -o TEST_DATA.OUT -pe 
>> prod_myapp_rr_pmpi 16  \ 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpirun -hostfile 
>> /tmp/ge/pmpi_machines -np 16 -prot -aff=automatic:bandwidth \ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe 
>> CEIBA-V3-LO12-4PV_E100
> 
> Can more than one job be run on a node at the same time? I fear 
> /tmp/ge/pmpi_machines might get overwritten then, like any scratch data in 
> MPI_TMPDIR.
> 
>> 
>> Comments about the above qsub command:
>> ==================================
>> 
>> 1. The above qsub is not what our users would execute to run the Platform 
>> MPI application.   The application vendor provides a Python wrapper script 
>> which essentially builds a qsub script and submits the script on behalf of 
>> the user.   The Python wrapper script accepts various command-line arguments 
>> which provides the end-user a choice of which job scheduler, application, 
>> and MPI versions to use.   I have found that the vendor's wrapper script 
>> does have some minor bugs in it as far as how it supports GE.   Hence, I 
>> decided to bypass the script in order to ensure that the application could 
>> be tested with GE.
>> 
>> 2. Most of the qsub command-line arguments you're familiar with already.   
>> However, I wanted to focus on the "-v" environment variables and the "-V" 
>> options.
>> 
>> - After seeing the postings from 2014, it was suggested to pass the PATH 
>> environment variable to include the path to the Platform MPI binaries, 
>> MPI_REMSH which specifies the protocol over which Master and Slave processes 
>> in Platform MPI communicate.   In the case of the vendor's application, 
>> they chose "ssh".    I'm not quite sure exactly what MPI_TMPDIR is used for 
>> but I guess it is the location on all Slave and Master host for temporary 
>> files to be written to by Platform MPI.   I just know that I set it to 
>> "/tmp/ge" which is a directory on all the nodes I've tested the Platform MPI 
>> application on.    The "/tmp/ge" location is also where I configured 
>> versions of the "startmpi.sh", "stopmpi.sh", and "rsh" scripts write out the 
>> $pe_hostfile to as the job was executed by GE.    From the past postings, it 
>> was suggested that one set the MPI_TMPDIR variable to $TMPDIR which is what 
>> GE sets.   However, when I attempted to set this in the above qsub command, 
>> MPI_TMPDIR wouldn't get set.   So I am wondering how do I set "MPI_TMPDIR" so 
>> that it uses the GE's value for $TMPDIR?
> 
> If you set it on the command line, it will be expanded at submission time and 
> at that time it's not known. Escaping it might put a plain $TMPDIR in the 
> variable - also not the desired effect. Would it be possible to have a small 
> wrapper script, where you could also use other SGE features like putting fixed 
> options to `qsub` therein and use some environment variables set by SGE:
> 
> #!/bin/sh
> #$ -cwd
> #$ -j y
> export PATH=/apps/myapp/tools/linux_x86_64/platform/9.1.2/bin:$PATH
> export MPI_TMPDIR=$TMPDIR
> export HOSTFILE=$MPI_TMPDIR/machines
> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpirun -hostfile $HOSTFILE 
> -np $NSLOTS ...
> 
> 
>> - I should point out that I included the use of the "-V" option since I wanted 
>> the qsub I used to inherit the shell environment I was using when I ran 
>> qsub.    The shell environment that the application vendor recommends all of 
>> our users to use is C-Shell (i.e. /bin/csh).
> 
> But the shell used in the application script is independent of the shell 
> used on the command line. Do they suggest using csh for their scripts, or 
> also on the command line? Just the setting of the queue might need to be 
> adjusted: "shell_start_mode" (see `man queue_conf` - I prefer 
> "unix_behavior") and additional "shell" in some cases, depending on what 
> behavior you prefer.
> 
> 
>>    One thing I am wondering is should I set up this environment along with 
>> the "PATH" to the Platform MPI binaries and the MPI_REMSH and MPI_TMPDIR 
>> variables in a C-Shell script and get this script executed as part of the 
>> "starter_method" for the queue I am using for these Platform MPI jobs?  Or 
>> is there some other preferred methods for passing the login environment and 
>> other environment variables so that the job will execute properly?
> 
> See above. Whether Platform's `mpirun` is executed in a csh or bash script 
> should make no difference. I wonder about the intention of the vendor's 
> statement and whether it's mandatory for their software or just a suggestion, 
> like a personal preference.
> 
> 
>>  I hope I'm being clear here.   Also, if the application login shell is 
>> C-shell, does that mean that if I define a script for the "starter_method", 
>> that this script should be a C-shell based script?
> 
> No. You could even switch to the shell requested by -S by checking 
> SGE_STARTER_SHELL_PATH (see `man queue_conf`), or hard-code any shell 
> therein: `exec` the desired shell in the "starter_method" with the main 
> script as argument.
> 
> -- Reuti
> 
> 
>> Additional information:
>> ==================
>> 
>> Parallel Environment Used:
>> ---------------------------------
>> pe_name              prod_myapp_rr_pmpi
>> slots                        9999
>> used_slots           0
>> bound_slots          0
>> user_lists           NONE
>> xuser_lists          NONE
>> start_proc_args      /nfs/njs/ge/mpi/pmpi/startpmpi.sh -catch_rsh 
>> $pe_hostfile
>> stop_proc_args               /nfs/njs/ge/mpi/pmpi/stoppmpi.sh
>> allocation_rule              $round_robin
>> control_slaves               TRUE
>> job_is_first_task    TRUE
>> urgency_slots                min
>> accounting_summary   TRUE
>> daemon_forks_slaves  FALSE
>> master_forks_slaves  FALSE
>> 
>> 
>> Startpmpi.sh script used (This is a copy of the template startmpi.sh script 
>> provided by GE.   I only show the parts I modified/added.   The rest 
>> remained the same.)
>> ----------------------------------------------------------------------
>> ----------------------------------------------------------------------
>> --------------------------------------------------------
>> 
>> # trace machines file
>> cat $machines
>> 
>> # make copy of $machines to /tmp/ge on each node.
>> cp -p $machines /tmp/ge/pmpi_machines
>> .
>> .
>> .
>> if [ $catch_rsh = 1 ]; then
>>  rsh_wrapper=$SGE_ROOT/mpi/pmpi/rsh      ### Changed location of rsh script
>>  if [ ! -x $rsh_wrapper ]; then
>>     echo "$me: can't execute $rsh_wrapper" >&2
>>     echo "     maybe it resides at a file system not available at this 
>> machine" >&2
>>     exit 1
>>  fi
>> 
>> #   rshcmd=rsh
>>  rshcmd=ssh          ### Since Platform MPI application wants to use ssh 
>> instead of rsh.
>>  case "$ARC" in
>>     hp*) rshcmd=remsh ;;
>>     *) ;;
>>  esac
>>  # note: This could also be done using rcp, ftp or s.th.
>>  #       else. We use a symbolic link since it is the
>>  #       cheapest in case of a shared filesystem
>>  #
>>  ln -s $rsh_wrapper $TMPDIR/$rshcmd                  ### Hence an "ssh" link 
>> to $SGE_ROOT/mpi/pmpi/rsh is created.
>> fi
>> .
>> .
>> .
>> if [ $catch_hostname = 1 ]; then
>>  hostname_wrapper=$SGE_ROOT/mpi/pmpi/hostname        ### Changed location of 
>> hostname script
>> .
>> .
>> .
>> exit 0
>> 
>> Stoppmpi.sh script used (This is a copy of the template stopmpi.sh script 
>> provided by GE.   I only show the parts I modified/added.   The rest 
>> remained the same.)
>> ----------------------------------------------------------------------
>> ----------------------------------------------------------------------
>> --------------------------------------------------------
>> 
>> #
>> # Just remove machine-file that was written by startpmpi.sh # #rm 
>> $TMPDIR/machines
>> rm /tmp/ge/pmpi_machines             ### Remove list of hosts from each node.
>> 
>> #rshcmd=rsh
>> rshcmd=ssh                   ### Changed to ssh for Platform MPI application.
>> case "$ARC" in
>>  hp*) rshcmd=remsh ;;
>>  *) ;;
>> esac
>> rm $TMPDIR/$rshcmd
>> 
>> exit 0
>> 
>> rsh script used (This is a copy of the template rsh script provided by GE.   
>> I only show the parts I modified/added.   The rest remained the same.)
>> ----------------------------------------------------------------------
>> ----------------------------------------------------------------------
>> --------------------------------------------------------
>> .
>> .
>> .
>> if [ x$just_wrap = x ]; then
>>  if [ $minus_n -eq 1 ]; then
>>     echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd            
>>                 ### -V option added in order to pass login environment to 
>> qrsh.
>>     exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd            
>>         ### -V option added in order to pass login environment to qrsh.
>>  else
>>     echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd                     
>>         ### -V option added in order to pass login environment to qrsh.
>>     exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd                     
>>         ### -V option added in order to pass login environment to qrsh.
>>  fi
>> else
>> .
>> .
>> .
>> 
>> ps -ef f output form job submitted by qsub (Only partial listing shown 
>> for Master and Slave nodes.)
>> ----------------------------------------------------------------------
>> --------------------------------------------------
>> 
>> Node n100
>> ==================
>> sgeadmin 14768     1  0 Oct14 ?        Sl   127:51 
>> /nfs/njs/ge/bin/lx-amd64/sge_execd
>> sgeadmin 32339 14768  0 21:13 ?        S      0:00  \_ sge_shepherd-1304 -bg
>> csh_test 32402 32339  0 21:13 ?        Ss     0:00      \_ -csh -c 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpirun -hostfile 
>> /tmp/ge/pmpi_machines -np 16 -prot -aff=automatic:bandwidth 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA 
>> csh_test 32501 32402  0 21:13 ?        S      0:00          \_ 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpirun -hostfile 
>> /tmp/ge/pmpi_machines -np 16 -prot -aff=automatic:bandwidth 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 32504 32501  0 21:13 ?        S      0:00              \_ 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 0 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test 32630 32504 20 21:13 ?        Rl     0:11              |   \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 32631 32504 97 21:13 ?        R      0:52              |   \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 32632 32504 97 21:13 ?        R      0:52              |   \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 32633 32504 97 21:13 ?        R      0:52              |   \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 32505 32501  0 21:13 ?        S      0:00              \_ cat
>> csh_test 32506 32501  0 21:13 ?        Sl     0:00              \_ 
>> /nfs/njs/ge/bin/lx-amd64/qrsh -V -inherit -nostdin 10.231.82.13 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 1 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test 32507 32501  0 21:13 ?        Sl     0:00              \_ 
>> /nfs/njs/ge/bin/lx-amd64/qrsh -V -inherit -nostdin 10.231.82.215 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 2 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test 32508 32501  0 21:13 ?        Sl     0:00              \_ 
>> /nfs/njs/ge/bin/lx-amd64/qrsh -V -inherit -nostdin 10.231.83.42 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 3 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> 
>> Node n101
>> ==================
>> sgeadmin  7608     1  0 Aug05 ?        Sl   351:24 
>> /nfs/njs/ge/bin/lx-amd64/sge_execd
>> sgeadmin 25337  7608  0 21:13 ?        Sl     0:00  \_ sge_shepherd-1304 -bg
>> csh_test 25338 25337  0 21:13 ?        Ss     0:00      \_ 
>> /nfs/njs/ge/utilbin/lx-amd64/qrsh_starter 
>> /tmp/ge/n101/active_jobs/1304.1/1.n101
>> csh_test 25345 25338  0 21:13 ?        S      0:00          \_ csh -c 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 1 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test 25443 25345  0 21:13 ?        S      0:00              \_ 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 1 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test 25538 25443 98 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 25539 25443 98 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 25540 25443 98 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 25541 25443 98 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> 
>> Node n102
>> ==================
>> sgeadmin  3647     1  0 Aug05 ?        Sl   346:57 
>> /nfs/njs/ge/bin/lx-amd64/sge_execd
>> sgeadmin 24051  3647  0 21:13 ?        Sl     0:00  \_ sge_shepherd-1304 -bg
>> csh_test 24052 24051  0 21:13 ?        Ss     0:00      \_ 
>> /nfs/njs/ge/utilbin/lx-amd64/qrsh_starter 
>> /tmp/ge/n102/active_jobs/1304.1/1.n102
>> csh_test 24059 24052  0 21:13 ?        S      0:00          \_ csh -c 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 2 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test 24157 24059  0 21:13 ?        S      0:00              \_ 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 2 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test 24252 24157 97 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 24253 24157 97 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 24254 24157 97 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test 24255 24157 97 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> 
>> Node n103
>> ==================
>> sgeadmin  2412     1  0 Sep03 ?        Sl   250:56 
>> /nfs/njs/ge/bin/lx-amd64/sge_execd
>> sgeadmin  5569  2412  0 21:13 ?        Sl     0:00  \_ sge_shepherd-1304 -bg
>> csh_test  5570  5569  0 21:13 ?        Ss     0:00      \_ 
>> /nfs/njs/ge/utilbin/lx-amd64/qrsh_starter 
>> /tmp/ge/n103/active_jobs/1304.1/1.n103
>> csh_test  5577  5570  0 21:13 ?        S      0:00          \_ csh -c 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 3 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test  5675  5577  0 21:13 ?        S      0:00              \_ 
>> /apps/myapp/tools/linux_x86_64/platform/9.1.2/bin/mpid 3 0 151060992 
>> 10.231.82.15 40823 32501 /apps/myapp/tools/linux_x86_64/platform/9.1.2
>> csh_test  5770  5675 99 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test  5771  5675 99 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test  5772  5675 99 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> csh_test  5773  5675 99 21:13 ?        R      0:52                  \_ 
>> /apps/myapp/2014.1/bin/linux_x86_64/myapp_program_plmpi.exe TEST_DATA
>> 
>> 
>> Kind Regards,
>> 
>> -----
>> Wayne Lee
>> 
>> 
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Wednesday, November 18, 2015 4:26 PM
>> To: Lee, Wayne
>> Cc: [email protected] Group
>> Subject: Re: [gridengine users] Q: Understanding of Loose and Tight 
>> Integration of PEs.
>> 
>> Ups - fatal typo - it's late:
>> 
>> Am 18.11.2015 um 23:09 schrieb Reuti:
>> 
>>> Hi,
>>> 
>>> Am 18.11.2015 um 22:00 schrieb Lee, Wayne:
>>> 
>>>> To list,
>>>> 
>>>> I've been reading some of the information from various web links regarding 
>>>> the differences between "loose" and "tight" integration associated with 
>>>> Parallel Environments (PEs) within Grid Engine (GE).   One of the weblinks 
>>>> I found which provides a really good explanation of this is "Dan 
>>>> Templeton's PE Tight Integration 
>>>> (https://blogs.oracle.com/templedf/entry/pe_tight_integration).  I would 
>>>> like to just confirm my understanding of "loose"/"tight" integration as 
>>>> well as what the role of the "rsh" wrapper is in the process. 
>>>> 
>>>> 1.       Essentially, as best as I can tell an application, regardless if 
>>>> it is setup to use either "loose" or "tight" integration have the GE 
>>>> "sge_execd" execution daemon start up the "Master" task that is part of a 
>>>> parallel job application.   An example of this would be an MPI (eg. LAM, 
>>>> Intel, Platform, Open, etc.) application.   So I'm assuming the "sge_execd" 
>>>> daemon would fork off a "sge_shepherd" process which in turn starts 
>>>> up something like "mpirun" or some script.  Is this correct?
>>> 
>>> Yes.
>>> 
>>> But to be complete: in addition we first have to distinguish whether the 
>>> MPI slave tasks can be started by an `ssh`/`rsh` (resp. `qrsh -inherit ...` 
>>> for a tight integration) on its own, or whether they need some running 
>>> daemons beforehand. Creating a tight integration for a daemon based setup 
>>> is more convoluted by far, and my Howtos for PVM, LAM/MPI and early 
>>> versions of MPICH2 are still available, but I wouldn't recommend to use it 
>>> - unless you have some legacy applications which depend on this and you 
>>> can't recompile them.
>>> 
>>> Recent versions of Intel MPI, Open MPI, MPICH2 and Platform MPI can achieve 
>>> a tight integration with minimal effort. Let me know if you need more 
>>> information about a specific one.
>>> 
>>> 
>>>> 2.       The differences between the "loose" and "tight" integration is 
>>>> how the parallel job application's "Slave" tasks are handled.   With 
>>>> "loose" integration the slave tasks/processes are not managed and started 
>>>> by GE.   The application would start up the slave tasks via something like 
>>>> "rsh" or "ssh".    An example of this is mpirun starting the various slave 
>>>> processes to the various nodes listed in the "$pe_hostlist" provided by 
>>>> GE.  With "tight" integration, the slave tasks/processes are managed and 
>>>> started by GE but through the use of "qrsh".  Is this correct?
>>> 
>>> Yes.
>>> 
>>> 
>>>> 3.       One of the things I was reading from the document discussing 
>>>> "loose" and "tight" integration using LAM MPI was the differences in the 
>>>> way they handle "accounting" and how the processes associated with a 
>>>> parallel job are handled if deleted using qdel.    By "accounting", does 
>>>> this mean that the GE is able to better keep track of where each of the 
>>>> slave tasks are and how much resources are being used by the slave tasks?  
>>>>   So does this mean that "tight" integration is preferable over "loose" 
>>>> integration since one allows GE to better keep track of the resources used 
>>>> by the slave tasks and one is able to better delete a "tight" integration 
>>>> job in a "cleaner" manner?
>>> 
>>> Yes - absolutely.
>>> 
>>> 
>>>> 4.       Continuing with "tight" integration.   Does this also mean that 
>>>> if a parallel MPI application uses either "rsh" or "ssh" to facilitate the 
>>>> communications between the Master and Slave tasks/processes, that 
>>>> essentially, "qrsh", intercepts or replaces the communications performed 
>>>> by "rsh" or "ssh"?     Hence this is why the "rsh" wrapper script is used 
>>>> to facilitate the "tight" integration.   Is that correct?
>>> 
>>> The wrapper solution is only necessary in case the actual MPI library 
>>> has now builtin support for SGE. In case of Open MPI (./configure 
>>> --with-sge ...) and
>> 
>> Should read: [...] actual MPI library has no builtin support [...]
>> 
>> -- Reuti
>> 
>> 
>>> MPICH2 the support is built in and you can find hints to set it up on their 
>>> websites - no wrapper necessary and the start_/stop-_proc_args can be set 
>>> to NONE (i.e.: they call `qrsh` directly, in case they discover that they 
>>> are executed under SGE [by certain set environment variables]). The 
>>> start_proc_args in the PE was/is used to set up the links to the wrapper(s) 
>>> and reformat the $pe_hostfile, in case the parallel library understands 
>>> only a different format*. This is necessary e.g. for MPICH(1).
>>> 
>>> *) In case you heard of the application Gaussian: I also create the 
>>> "%lindaworkers=..." list of nodes for the input file line in the 
>>> start_proc_args.
>>> 
>>> 
>>>> 5.       I was reading from some of the postings in the GE archive from 
>>>> someone named "Reuti" regarding the "rsh" wrapper script.   If I 
>>>> understood what he wrote correctly, it doesn't matter if the Parallel MPI 
>>>> application is using either "rsh" or "ssh", the "rsh" wrapper script 
>> provided by GE is just to force the application to use GE's qrsh?    Am I 
>>>> stating this correctly?    Another way to state this is that "rsh" is just 
>>>> a name.   The name could be anything as long as your MPI application is 
>>>> configured to use whatever name of the communications protocol is used by 
>>>> the application, essentially the basic contents of the wrapper script 
>>>> won't change aside from the name "rsh" and locations of scripts referenced 
>>>> by the wrapper script.   Again, am I stating this correctly?
>>> 
>>> Yes to all.
>>> 
>>> 
>>>> 6.       With regards to the various types and vendor's MPI 
>>>> implementation.   What does it exactly mean that certain MPI 
>>>> implementations are GE aware?   I tend to think that this means that 
>>>> parallel applications built with GE aware MPI implementations know where 
>>>> to find the "$pe_hostfile" that GE generates based on what resources the 
>>>> parallel application needs.   Is that all there is to it for the MPI 
>>>> implementation to be GE aware?    I know that with Intel or Open MPI, the PE 
>>>> environments 
>>>> to be GE aware?    I know that with Intel or Open MPI, the PE environments 
>>>> that I've created don't really require any special scripts for the 
>>>> "start_proc_args" and "stop_proc_args" parameters in the PE.    However, 
>>>> based on what little I have seen, LAM and Platform MPI implementations 
>>>> appear to require one to use scripts based on ones like "startmpi.sh" and 
>>>> "stopmpi.sh" in order to setup the proper formatted $pe_hostfile to be 
>>>> used by these MPI implementations.   Is my understanding of this correct?
>>> 
>>> Yes. While LAM/MPI is daemon based, Platform MPI uses a plain call to the 
>>> slave nodes and can be tightly integrated by the wrapper and setting 
>>> `export MPI_REMSH=rsh`.
>>> 
>>> For a builtin tight integration the MPI library needs to a) discover under 
>>> what queuing system it is running (via set environment variables; can be SGE, 
>>> SLURM, LSF, PBS, ...), b) find and honor the $pe_hostfile automatically 
>>> (resp. other files for other queuing systems), c) start `qrsh -inherit ...` 
>>> to start something on the granted nodes (some implementations need -V here 
>>> too, to forward some variables to the slaves - you can check the source of 
>>> Open MPI for example). This is *not* `qsub -V ...`, which I try to avoid, as a 
>>> random adjustment to the user's shell might lead to a crash of the job when 
>>> it finally starts, and this can be really hard to investigate, as a new 
>>> submission with a fresh shell might work again.
>>> 
>>> 
>>>> 7.       I was looking at the following options for the "qconf -sconf" 
>>>> (global configuration) from GE.   
>>>> 
>>>> qlogin_command             builtin
>>>> qlogin_daemon                builtin
>>>> rlogin_command              builtin
>>>> rlogin_daemon                 builtin
>>>> rsh_command                   builtin
>>>> rsh_daemon                      builtin
>>>> 
>>>> I was attempting to fully understand how the above parameters are related 
>>>> to the execution of Parallel application jobs in GE.   What I'm wondering 
>>>> here is if the parallel application job I would want GE to manage requires 
>>>> and uses "ssh" by default for communications between Master and Slave 
>>>> tasks, does this mean, that the above parameters would need to be 
>>>> configured to use "slogin", "ssh", "sshd", etc.?
>>> 
>>> No. These are two different things. With all of the above settings (before 
>>> this question) you first configure SGE to intercept the `rsh` resp. `ssh` 
>>> call (hence the application should never use an absolute path to start 
>>> them). This will lead the to effect that `qrsh - inherit ...` will finally 
>>> call the communication method which is configured by "rsh_command" and 
>>> "rsh_daemon". If possible they should stay as "builtin". Then SGE will use 
>>> its own internal communication to start the slave tasks, hence the cluster 
>>> needs no `ssh` or `rsh` at all. In my clusters this is even disabled for 
>>> normal users, and only admins can `ssh` to the nodes (if a user needs X11 
>>> forwarding to a node, this would be special of course). To let users check 
>>> a node they have to run an interactive job in a special queue, which grants 
>>> only 10 seconds CPU time (while the wallclock time can be almost infinity).
>>> 
>>> Other settings for these parameters are covered in this document - also 
>>> different kinds of communication can be set up for different nodes and 
>>> direction of the calls:
>>> 
>>> https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
>>> 
>>> Let me know in case you need further details.
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> 
>>>> Apologies for all the questions.   I just want to ensure I understand the 
>>>> PEs a bit more.
>>>> 
>>>> Kind Regards,
>>>> 
>>>> -------
>>>> Wayne Lee
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>> 
>> 
>> 
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
