Yes sir. mpirun and mpiexec.hydra are both from the Intel Cluster Studio suite.
To make sure of this, I ran a quick batch job with

  which mpirun
  which mpiexec.hydra

and it returned 

  /share/apps/intel/2013.0.028/impi/4.1.0.024/intel64/bin/mpirun
  /share/apps/intel/2013.0.028/impi/4.1.0.024/intel64/bin/mpiexec.hydra
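
For reference, the check was wrapped in a minimal Grid Engine submission script
along these lines (the job name and slot count below are illustrative, not the
exact script):

  #!/bin/bash
  #$ -N which_mpi            # illustrative job name
  #$ -cwd                    # run from the submission directory
  #$ -pe mpich_unstaged 16   # same PE used for the multi-node runs

  # print the full path of each launcher as seen inside the batch environment
  which mpirun
  which mpiexec.hydra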

Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

P: (906) 487-3593
F: (906) 487-2787
http://it.mtu.edu
http://hpc.mtu.edu


On Thu, 17 Dec 2015, Reuti wrote:

| 
| > On 17.12.2015 at 13:41, Gowtham <[email protected]> wrote:
| > 
| > 
| > I tried replacing the call to mpirun with mpiexec.hydra, and it seems to
work as it did before. Please find below the contents of the *.sh.o#### file
corresponding to the Hello, World! run spanning more than one compute node:
| 
| Are both `mpirun` and `mpiexec.hydra` from the same Intel MPI library? One
can't use, for example, Open MPI's `mpiexec` to start an Intel MPI application
in parallel (if it runs at all, it will only start several serial copies of the
application).
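| 
| For illustration, one quick way to confirm the pairing (the binary name
| ./hello below is only a placeholder) is to compare the launcher paths with
| the MPI library the application is dynamically linked against:
| 
|   which mpirun mpiexec.hydra
|   ldd ./hello | grep -i mpi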
| 
| -- Reuti
| 
| 
| >    Parallel version of 'Go Huskies!' with 16 processors
| >  -----------------------------------------------------------------  
| >    Rank  Hostname                       Local Date & Time
| >  -----------------------------------------------------------------  
| >    0     compute-0-4.local              Thu Dec 17 07:29:54 2015
| >    1     compute-0-4.local              Thu Dec 17 07:29:58 2015
| >    2     compute-0-4.local              Thu Dec 17 07:29:59 2015
| >    3     compute-0-4.local              Thu Dec 17 07:30:00 2015
| >    4     compute-0-4.local              Thu Dec 17 07:30:01 2015
| >    5     compute-0-4.local              Thu Dec 17 07:30:02 2015
| >    6     compute-0-4.local              Thu Dec 17 07:30:03 2015
| >    7     compute-0-4.local              Thu Dec 17 07:30:04 2015
| >    8     compute-0-4.local              Thu Dec 17 07:30:05 2015
| >    9     compute-0-4.local              Thu Dec 17 07:30:06 2015
| >    10    compute-0-4.local              Thu Dec 17 07:30:07 2015
| >    11    compute-0-4.local              Thu Dec 17 07:30:08 2015
| >    12    compute-0-2.local              Thu Dec 17 07:30:09 2015
| >    13    compute-0-2.local              Thu Dec 17 07:30:10 2015
| >    14    compute-0-2.local              Thu Dec 17 07:30:11 2015
| >    15    compute-0-2.local              Thu Dec 17 07:30:12 2015
| >  -----------------------------------------------------------------
| > 
| > Any insight into why this made a difference would be greatly appreciated.
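| > 
| > For context, the change amounted to swapping the launcher on a single line
| > of the job script; the line below is only a sketch (the binary name and
| > option spelling are illustrative, not copied from the actual script):
| > 
| >   # before: multi-node runs timed out after 60 seconds
| >   # mpirun -np $NSLOTS ./go_huskies
| >   # after: runs span compute nodes successfully
| >   mpiexec.hydra -np $NSLOTS ./go_huskies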
| > 
| > Best regards,
| > g
| > 
| > --
| > Gowtham, PhD
| > Director of Research Computing, IT
| > Adj. Asst. Professor, Physics/ECE
| > Michigan Technological University
| > 
| > P: (906) 487-3593
| > F: (906) 487-2787
| > http://it.mtu.edu
| > http://hpc.mtu.edu
| > 
| > 
| > On Thu, 17 Dec 2015, Gowtham wrote:
| > 
| > | 
| > | Here you go, Sir.
| > | 
| > | These two PEs were created by me (not shipped with Rocks) to help our
researchers pick one depending on the nature of their job. If a software suite
requires that all processors/cores belong to the same physical compute node
(e.g., MATLAB with the Parallel Computing Toolbox), they use mpich_staged. If a
software suite can spread the job across processors from multiple compute nodes
(e.g., Hello World, VASP, LAMMPS, etc.), they use mpich_unstaged.
| > | 
| > | I am including their definitions below.
| > | 
| > | 
| > | pe_name            mpich_unstaged
| > | slots              9999
| > | user_lists         NONE
| > | xuser_lists        NONE
| > | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
| > | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
| > | allocation_rule    $fill_up
| > | control_slaves     TRUE
| > | job_is_first_task  FALSE
| > | urgency_slots      min
| > | accounting_summary TRUE
| > | 
| > | pe_name            mpich_staged
| > | slots              9999
| > | user_lists         NONE
| > | xuser_lists        NONE
| > | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
| > | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
| > | allocation_rule    $pe_slots
| > | control_slaves     TRUE
| > | job_is_first_task  FALSE
| > | urgency_slots      min
| > | accounting_summary TRUE
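| > | 
| > | As an illustration of how these are requested at submission time (the
| > | script names and slot counts here are hypothetical), a job that must stay
| > | on a single node asks for mpich_staged, while a multi-node job asks for
| > | mpich_unstaged:
| > | 
| > |   # all 12 slots on one node (allocation_rule $pe_slots)
| > |   qsub -pe mpich_staged 12 single_node_job.sh
| > | 
| > |   # 24 slots filled up across nodes (allocation_rule $fill_up)
| > |   qsub -pe mpich_unstaged 24 multi_node_job.sh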
| > | 
| > | 
| > | Best regards,
| > | g
| > | 
| > | --
| > | Gowtham, PhD
| > | Director of Research Computing, IT
| > | Adj. Asst. Professor, Physics/ECE
| > | Michigan Technological University
| > | 
| > | P: (906) 487-3593
| > | F: (906) 487-2787
| > | http://it.mtu.edu
| > | http://hpc.mtu.edu
| > | 
| > | 
| > | On Thu, 17 Dec 2015, Reuti wrote:
| > | 
| > | | 
| > | | > On 16.12.2015 at 21:32, Gowtham <[email protected]> wrote:
| > | | > 
| > | | > 
| > | | > Hi Reuti,
| > | | > 
| > | | > The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024,
and I do not need mpdboot. The PE used for this purpose is called mpich_unstaged
(basically, a copy of the original mpich with the '$fill_up' rule). The only
other PE in this system is called mpich_staged (a copy of the original mpich
with the '$pe_slots' rule).
| > | | 
| > | | I have no clue what mpich_(un)staged refers to; I assume it's some
setting from ROCKS. Can you please post the particular PE settings you want to
use during submission?
| > | | 
| > | | -- Reuti
| > | | 
| > | | 
| > | | > The same Go Huskies! program, compiled with the same Intel Cluster
Studio, works successfully on a different cluster running the same Rocks 6.1
and Grid Engine 2011.11p1 combination and using the same mpich_unstaged PE.
| > | | > 
| > | | > Best regards,
| > | | > g
| > | | > 
| > | | > --
| > | | > Gowtham, PhD
| > | | > Director of Research Computing, IT
| > | | > Adj. Asst. Professor, Physics/ECE
| > | | > Michigan Technological University
| > | | > 
| > | | > P: (906) 487-3593
| > | | > F: (906) 487-2787
| > | | > http://it.mtu.edu
| > | | > http://hpc.mtu.edu
| > | | > 
| > | | > 
| > | | > On Wed, 16 Dec 2015, Reuti wrote:
| > | | > 
| > | | > | Hi,
| > | | > | 
| > | | > | On 16.12.2015 at 19:53, Gowtham wrote:
| > | | > | 
| > | | > | > 
| > | | > | > Dear fellow Grid Engine users,
| > | | > | > 
| > | | > | > Over the past few days, I have had to re-install the compute
nodes (12 cores each) in an existing cluster running Rocks 6.1 and Grid Engine
2011.11p1. I used the xmllint command to verify that the extend-*.xml files had
no errors in them before rebuilding the distribution. All six compute nodes
installed successfully, and several test "Hello, World!" runs of up to 72 cores
completed successfully as well. I can SSH into any one of these nodes, and
between any two compute nodes, just fine.
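| > | | > | > 
| > | | > | > For example, a check along these lines (the exact path to the
| > | | > | > site profile is illustrative) reports any XML errors before the
| > | | > | > distribution is rebuilt:
| > | | > | > 
| > | | > | >   xmllint --noout /export/rocks/install/site-profiles/6.1/nodes/extend-compute.xml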
| > | | > | > 
| > | | > | > As of this morning, all submitted jobs that require more than 12
cores (i.e., jobs spanning more than one compute node) fail about a minute
after starting successfully. However, all jobs with 12 or fewer cores within a
given compute node run just fine. The error message for a failed job is as
follows:
| > | | > | > 
| > | | > | >  error: got no connection within 60 seconds. "Timeout occured 
while waiting for connection"
| > | | > | >  Ctrl-C caught... cleaning up processes
| > | | > | > 
| > | | > | > "Hello, World!" and one other program, both compiled with Intel 
Cluster Studio 2013.0.028, display the same behavior. The line corresponding to 
the failed job from /opt/gridengine/default/spool/qmaster/messages is as 
follows:
| > | | > | > 
| > | | > | >  12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel 
task 6129.1 task 1.compute-0-1 failed - killing job
| > | | > | > 
| > | | > | > I'd appreciate any insight or help to resolve this issue. If you 
need additional information from my end, please let me know.
| > | | > | 
| > | | > | Which plain version of Intel MPI does Cluster Studio 2013.0.028
correspond to? Less than 4.1? IIRC, tight integration was not supported before
that version, as there was no automatically set-up call to `qrsh`; you had to
start certain daemons beforehand.
| > | | > | 
| > | | > | Does your version still need mpdboot?
| > | | > | 
| > | | > | Do you request a properly set up PE in your job submission?
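| > | | > | 
| > | | > | (As a general illustration only, not a statement about 2013.0.028
| > | | > | specifically: with newer Intel MPI versions that use the Hydra
| > | | > | process manager, tight integration under SGE is typically arranged
| > | | > | by letting Hydra start its remote ranks through qrsh, e.g. in the
| > | | > | job script:)
| > | | > | 
| > | | > |   # ask Hydra to launch remote ranks via SGE for tight integration
| > | | > |   export I_MPI_HYDRA_BOOTSTRAP=sge
| > | | > |   mpiexec.hydra -np $NSLOTS ./a.out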
| > | | > | 
| > | | > | -- Reuti
| > | | > | 
| > | | > | > 
| > | | > | > Thank you for your time and help.
| > | | > | > 
| > | | > | > Best regards,
| > | | > | > g
| > | | > | > 
| > | | > | > --
| > | | > | > Gowtham, PhD
| > | | > | > Director of Research Computing, IT
| > | | > | > Adj. Asst. Professor, Physics/ECE
| > | | > | > Michigan Technological University
| > | | > | > 
| > | | > | > P: (906) 487-3593
| > | | > | > F: (906) 487-2787
| > | | > | > http://it.mtu.edu
| > | | > | > http://hpc.mtu.edu
| > | | > | > 
| > | | > | 
| > | | > | 
| > | | 
| > | | 
| > | 
| 
| 
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
