Hi,

mpirun (Intel) is just a wrapper around mpdboot + mpiexec + mpdallexit, while mpiexec.hydra is the newer Intel MPI process spawner, which has been tightly integrated with Grid Engine since version 4.3.1.
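For reference, a minimal sketch of a job script that launches through mpiexec.hydra under a tight-integration PE could look like the following (the PE name mpich_unstaged is taken from the thread below; the slot count and the binary name ./hello_world are placeholders, not from the original posts):

  #!/bin/bash
  #$ -N hello_world
  #$ -cwd
  #$ -pe mpich_unstaged 16
  # NSLOTS is filled in by Grid Engine with the number of granted slots.
  # With a tight integration, Hydra should reach the other hosts via qrsh;
  # if it does not, the bootstrap mechanism can usually be forced, e.g. with
  # I_MPI_HYDRA_BOOTSTRAP=sge (check the Intel MPI reference for your version).
  mpiexec.hydra -np $NSLOTS ./hello_world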
Regards,

On Thu, Dec 17, 2015 at 4:19 PM, Reuti <[email protected]> wrote:
> Maybe `mpirun` doesn't support/use Hydra. Although not required, the MPI standard specifies `mpiexec` as a portable startup mechanism. Doesn't Intel MPI also have an `mpiexec`, which would match the `mpirun` behavior (and doesn't use Hydra)?
>
> -- Reuti
>
> Am 17.12.2015 um 15:06 schrieb Gowtham <[email protected]>:
> >
> > Yes sir. mpirun and mpiexec.hydra are both from the Intel Cluster Studio suite. To make sure of this, I ran a quick batch job with
> >
> >   which mpirun
> >   which mpiexec.hydra
> >
> > and it returned
> >
> >   /share/apps/intel/2013.0.028/impi/4.1.0.024/intel64/bin/mpirun
> >   /share/apps/intel/2013.0.028/impi/4.1.0.024/intel64/bin/mpiexec.hydra
> >
> > Best regards,
> > g
> >
> > --
> > Gowtham, PhD
> > Director of Research Computing, IT
> > Adj. Asst. Professor, Physics/ECE
> > Michigan Technological University
> >
> > P: (906) 487-3593
> > F: (906) 487-2787
> > http://it.mtu.edu
> > http://hpc.mtu.edu
> >
> > On Thu, 17 Dec 2015, Reuti wrote:
> >
> > | > Am 17.12.2015 um 13:41 schrieb Gowtham <[email protected]>:
> > | >
> > | > I tried replacing the call to mpirun with mpiexec.hydra, and it seems to work successfully as before. Please find below the contents of the *.sh.o#### file corresponding to the Hello, World! run spanning more than one compute node:
> > |
> > | Are both `mpirun` and `mpiexec.hydra` from the same Intel MPI library? One can't use, for example, Open MPI's `mpiexec` to start an Intel MPI application in parallel (if it runs at all, it will only run the program several times in serial).
> > |
> > | -- Reuti
> > |
> > | > Parallel version of 'Go Huskies!' with 16 processors
> > | > -----------------------------------------------------------------
> > | >  Rank   Hostname            Local Date & Time
> > | > -----------------------------------------------------------------
> > | >     0   compute-0-4.local   Thu Dec 17 07:29:54 2015
> > | >     1   compute-0-4.local   Thu Dec 17 07:29:58 2015
> > | >     2   compute-0-4.local   Thu Dec 17 07:29:59 2015
> > | >     3   compute-0-4.local   Thu Dec 17 07:30:00 2015
> > | >     4   compute-0-4.local   Thu Dec 17 07:30:01 2015
> > | >     5   compute-0-4.local   Thu Dec 17 07:30:02 2015
> > | >     6   compute-0-4.local   Thu Dec 17 07:30:03 2015
> > | >     7   compute-0-4.local   Thu Dec 17 07:30:04 2015
> > | >     8   compute-0-4.local   Thu Dec 17 07:30:05 2015
> > | >     9   compute-0-4.local   Thu Dec 17 07:30:06 2015
> > | >    10   compute-0-4.local   Thu Dec 17 07:30:07 2015
> > | >    11   compute-0-4.local   Thu Dec 17 07:30:08 2015
> > | >    12   compute-0-2.local   Thu Dec 17 07:30:09 2015
> > | >    13   compute-0-2.local   Thu Dec 17 07:30:10 2015
> > | >    14   compute-0-2.local   Thu Dec 17 07:30:11 2015
> > | >    15   compute-0-2.local   Thu Dec 17 07:30:12 2015
> > | > -----------------------------------------------------------------
> > | >
> > | > Any insight into why this made a difference would be greatly appreciated.
> > | >
> > | > Best regards,
> > | > g
> > | >
> > | > --
> > | > Gowtham, PhD
> > | > Director of Research Computing, IT
> > | > Adj. Asst. Professor, Physics/ECE
> > | > Michigan Technological University
> > | >
> > | > P: (906) 487-3593
> > | > F: (906) 487-2787
> > | > http://it.mtu.edu
> > | > http://hpc.mtu.edu
> > | >
> > | > On Thu, 17 Dec 2015, Gowtham wrote:
> > | >
> > | > | Here you go, Sir.
> > | > |
> > | > | These two PEs were created by me (not from Rocks) to help our researchers pick one depending on the nature of their job. If a software suite requires that all processors/cores belong to the same physical compute node (e.g., MATLAB with the Parallel Computing Toolbox), then they use mpich_staged. If a software suite can spread the job amongst processors from multiple compute nodes (e.g., Hello World, VASP, LAMMPS, etc.), then they use mpich_unstaged.
> > | > |
> > | > | I am including their definitions below.
> > | > |
> > | > | pe_name            mpich_unstaged
> > | > | slots              9999
> > | > | user_lists         NONE
> > | > | xuser_lists        NONE
> > | > | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > | > | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> > | > | allocation_rule    $fill_up
> > | > | control_slaves     TRUE
> > | > | job_is_first_task  FALSE
> > | > | urgency_slots      min
> > | > | accounting_summary TRUE
> > | > |
> > | > | pe_name            mpich_staged
> > | > | slots              9999
> > | > | user_lists         NONE
> > | > | xuser_lists        NONE
> > | > | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > | > | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> > | > | allocation_rule    $pe_slots
> > | > | control_slaves     TRUE
> > | > | job_is_first_task  FALSE
> > | > | urgency_slots      min
> > | > | accounting_summary TRUE
> > | > |
> > | > | Best regards,
> > | > | g
> > | > |
> > | > | --
> > | > | Gowtham, PhD
> > | > | Director of Research Computing, IT
> > | > | Adj. Asst. Professor, Physics/ECE
> > | > | Michigan Technological University
> > | > |
> > | > | P: (906) 487-3593
> > | > | F: (906) 487-2787
> > | > | http://it.mtu.edu
> > | > | http://hpc.mtu.edu
> > | > |
> > | > | On Thu, 17 Dec 2015, Reuti wrote:
> > | > |
> > | > | | > Am 16.12.2015 um 21:32 schrieb Gowtham <[email protected]>:
> > | > | | >
> > | > | | > Hi Reuti,
> > | > | | >
> > | > | | > The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and I do not need mpdboot. The PE used for this purpose is called mpich_unstaged (basically, a copy of the original mpich PE with the '$fill_up' rule). The only other PE on this system is called mpich_staged (a copy of the original mpich PE with the '$pe_slots' rule).
> > | > | |
> > | > | | I have no clue what mpich_(un)staged refers to; I assumed it was some setting from ROCKS. Can you please post the particular PE settings you want to use during submission?
> > | > | |
> > | > | | -- Reuti
> > | > | |
> > | > | | > The same Go Huskies! program, compiled with the same Intel Cluster Studio on a different cluster running the same Rocks 6.1 and Grid Engine 2011.11p1 combination and using the same mpich_unstaged PE, works successfully.
> > | > | | >
> > | > | | > Best regards,
> > | > | | > g
> > | > | | >
> > | > | | > --
> > | > | | > Gowtham, PhD
> > | > | | > Director of Research Computing, IT
> > | > | | > Adj. Asst. Professor, Physics/ECE
> > | > | | > Michigan Technological University
> > | > | | >
> > | > | | > P: (906) 487-3593
> > | > | | > F: (906) 487-2787
> > | > | | > http://it.mtu.edu
> > | > | | > http://hpc.mtu.edu
> > | > | | >
> > | > | | > On Wed, 16 Dec 2015, Reuti wrote:
> > | > | | >
> > | > | | > | Hi,
> > | > | | > |
> > | > | | > | Am 16.12.2015 um 19:53 schrieb Gowtham:
> > | > | | > |
> > | > | | > | > Dear fellow Grid Engine users,
> > | > | | > | >
> > | > | | > | > Over the past few days, I have had to re-install compute nodes (12 cores each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I ensured the extend-*.xml files had no errors in them using the xmllint command before rebuilding the distribution. All six compute nodes installed successfully, and so did several test "Hello, World!" runs up to 72 cores. I can SSH into any one of these nodes, and SSH between any two compute nodes just fine.
> > | > | | > | >
> > | > | | > | > As of this morning, all submitted jobs that require more than 12 cores (i.e., spanning more than one compute node) fail about a minute after starting successfully. However, all jobs with 12 or fewer cores within a given compute node run just fine. The error message for a failed job is as follows:
> > | > | | > | >
> > | > | | > | >   error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> > | > | | > | >   Ctrl-C caught... cleaning up processes
> > | > | | > | >
> > | > | | > | > "Hello, World!" and one other program, both compiled with Intel Cluster Studio 2013.0.028, display the same behavior. The line corresponding to the failed job from /opt/gridengine/default/spool/qmaster/messages is as follows:
> > | > | | > | >
> > | > | | > | >   12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 task 1.compute-0-1 failed - killing job
> > | > | | > | >
> > | > | | > | > I'd appreciate any insight or help to resolve this issue. If you need additional information from my end, please let me know.
> > | > | | > |
> > | > | | > | What plain version of Intel MPI is in Cluster Studio 2013.0.028? Less than 4.1? IIRC, a tight integration was not supported before that version, as there was no call to `qrsh` set up automatically; you would need to start certain daemons beforehand.
> > | > | | > |
> > | > | | > | Does your version still need mpdboot?
> > | > | | > |
> > | > | | > | Do you request a properly set up PE in your job submission?
> > | > | | > |
> > | > | | > | -- Reuti
> > | > | | > |
> > | > | | > | > Thank you for your time and help.
> > | > | | > | >
> > | > | | > | > Best regards,
> > | > | | > | > g
> > | > | | > | >
> > | > | | > | > --
> > | > | | > | > Gowtham, PhD
> > | > | | > | > Director of Research Computing, IT
> > | > | | > | > Adj. Asst. Professor, Physics/ECE
> > | > | | > | > Michigan Technological University
> > | > | | > | >
> > | > | | > | > P: (906) 487-3593
> > | > | | > | > F: (906) 487-2787
> > | > | | > | > http://it.mtu.edu
> > | > | | > | > http://hpc.mtu.edu
> > | > | | > | >
> > | > | | > | > _______________________________________________
> > | > | | > | > users mailing list
> > | > | | > | > [email protected]
> > | > | | > | > https://gridengine.org/mailman/listinfo/users
> > | >
> > | > _______________________________________________
> > | > users mailing list
> > | > [email protected]
> > | > https://gridengine.org/mailman/listinfo/users
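For completeness, the two PEs quoted above differ only in their allocation_rule, which is what decides whether a job may span hosts. A rough sketch of how they would be requested at submission time (the script names and slot counts are placeholders, not from the original posts):

  # $pe_slots: all granted slots must come from a single execution host
  qsub -pe mpich_staged 12 matlab_pct_job.sh

  # $fill_up: slots may be spread across hosts, filling each host in turn
  qsub -pe mpich_unstaged 16 hello_world_job.sh

  # The live definition of either PE can be checked with:
  qconf -sp mpich_unstaged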
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
