Re: [gridengine users] GE 2011.11p1: got no connection within 60 seconds

Reuti Thu, 17 Dec 2015 05:03:57 -0800

> On 17.12.2015 at 13:41, Gowtham <[email protected]> wrote:
> 
> 
> I tried replacing the call to mpirun with mpiexec.hydra and it seems to work 
> successfully as before. Please find below the contents of *.sh.o#### file 
> corresponding to the Hello, World! run spanning more than one compute node:


Are both `mpirun` and `mpiexec.hydra` from the same Intel MPI library? One 
can't use, for example, Open MPI's `mpiexec` to start an Intel MPI application 
in parallel (if it runs at all, it will just run several independent serial 
copies).
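
One quick way to check is to compare where the two launchers resolve on the 
PATH inside the job's environment. The following is only a minimal sketch in 
Python: the helper name `same_install` is made up for illustration, and 
sharing a directory is a heuristic, not proof that both commands belong to the 
same MPI library.

```python
import os
import shutil


def same_install(path_a, path_b):
    """Return True if both resolved launcher paths live in the same directory.

    Either argument may be None (command not found), which yields False.
    """
    if path_a is None or path_b is None:
        return False
    dir_a = os.path.dirname(os.path.realpath(path_a))
    dir_b = os.path.dirname(os.path.realpath(path_b))
    return dir_a == dir_b


if __name__ == "__main__":
    # Resolve both launchers the same way the shell would, via PATH.
    mpirun = shutil.which("mpirun")
    hydra = shutil.which("mpiexec.hydra")
    print("mpirun        ->", mpirun)
    print("mpiexec.hydra ->", hydra)
    print("same directory:", same_install(mpirun, hydra))
```

Running this from within the job script (not just on the head node) matters, 
since the PATH on the compute nodes may differ.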

-- Reuti


>    Parallel version of 'Go Huskies!' with 16 processors
>  -----------------------------------------------------------------  
>    Rank  Hostname                       Local Date & Time
>  -----------------------------------------------------------------  
>    0     compute-0-4.local              Thu Dec 17 07:29:54 2015
> 
>    1     compute-0-4.local              Thu Dec 17 07:29:58 2015
>    2     compute-0-4.local              Thu Dec 17 07:29:59 2015
>    3     compute-0-4.local              Thu Dec 17 07:30:00 2015
>    4     compute-0-4.local              Thu Dec 17 07:30:01 2015
>    5     compute-0-4.local              Thu Dec 17 07:30:02 2015
>    6     compute-0-4.local              Thu Dec 17 07:30:03 2015
>    7     compute-0-4.local              Thu Dec 17 07:30:04 2015
>    8     compute-0-4.local              Thu Dec 17 07:30:05 2015
>    9     compute-0-4.local              Thu Dec 17 07:30:06 2015
>    10    compute-0-4.local              Thu Dec 17 07:30:07 2015
>    11    compute-0-4.local              Thu Dec 17 07:30:08 2015
>    12    compute-0-2.local              Thu Dec 17 07:30:09 2015
>    13    compute-0-2.local              Thu Dec 17 07:30:10 2015
>    14    compute-0-2.local              Thu Dec 17 07:30:11 2015
>    15    compute-0-2.local              Thu Dec 17 07:30:12 2015
>  -----------------------------------------------------------------
> 
> Any insight into why this made a difference would be greatly appreciated.
> 
> Best regards,
> g
> 
> --
> Gowtham, PhD
> Director of Research Computing, IT
> Adj. Asst. Professor, Physics/ECE
> Michigan Technological University
> 
> P: (906) 487-3593
> F: (906) 487-2787
> http://it.mtu.edu
> http://hpc.mtu.edu
> 
> 
> On Thu, 17 Dec 2015, Gowtham wrote:
> 
> | 
> | Here you go, Sir.
> | 
> | These two PEs were created by me (they are not from Rocks) to help our
> | researchers pick one depending on the nature of their job. If a software
> | suite requires that all processors/cores belong to the same physical
> | compute node (e.g., MATLAB with Parallel Computing Toolbox), they would use
> | mpich_staged. If a software suite can spread the job across processors from
> | multiple compute nodes (e.g., Hello World, VASP, LAMMPS, etc.), they would
> | use mpich_unstaged.
> | 
> | I am including their definitions below.
> | 
> | 
> | pe_name            mpich_unstaged
> | slots              9999
> | user_lists         NONE
> | xuser_lists        NONE
> | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> | allocation_rule    $fill_up
> | control_slaves     TRUE
> | job_is_first_task  FALSE
> | urgency_slots      min
> | accounting_summary TRUE
> | 
> | pe_name            mpich_staged
> | slots              9999
> | user_lists         NONE
> | xuser_lists        NONE
> | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> | allocation_rule    $pe_slots
> | control_slaves     TRUE
> | job_is_first_task  FALSE
> | urgency_slots      min
> | accounting_summary TRUE
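
The two definitions above differ only in pe_name and allocation_rule. A small 
Python sketch that makes this visible follows; `parse_pe` and `diff_pe` are 
illustrative helper names, and the text is pasted from the listings above 
rather than read from a live `qconf -sp` call.

```python
def parse_pe(text):
    """Parse 'key value' lines (as in the PE listings) into a dict."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def diff_pe(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


UNSTAGED = """
pe_name            mpich_unstaged
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
"""

STAGED = """
pe_name            mpich_staged
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
"""

if __name__ == "__main__":
    differences = diff_pe(parse_pe(UNSTAGED), parse_pe(STAGED))
    for key, (unstaged, staged) in differences.items():
        print(f"{key}: {unstaged} vs {staged}")
```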
> | 
> | 
> | Best regards,
> | g
> | 
> | 
> | 
> | On Thu, 17 Dec 2015, Reuti wrote:
> | 
> | | 
> | | > On 16.12.2015 at 21:32, Gowtham <[email protected]> wrote:
> | | > 
> | | > 
> | | > Hi Reuti,
> | | > 
> | | > The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024,
> | | > and I do not need mpdboot. The PE used for this purpose is called
> | | > mpich_unstaged (basically, a copy of the original mpich with '$fill_up'
> | | > rule). The only other PE in this system is called mpich_staged (a copy
> | | > of the original mpich with '$pe_slots' rule).
> | | 
> | | I have no clue what mpich_(un)staged refers to; I assume it's some
> | | setting from ROCKS. Can you please post the particular PE settings you
> | | want to use during submission?
> | | 
> | | -- Reuti
> | | 
> | | 
> | | > The same Go Huskies! program, compiled with the same Intel Cluster
> | | > Studio on a different cluster running the same Rocks 6.1 and Grid Engine
> | | > 2011.11p1 combination and using the same mpich_unstaged PE, works
> | | > successfully.
> | | > 
> | | > Best regards,
> | | > g
> | | > 
> | | > 
> | | > 
> | | > On Wed, 16 Dec 2015, Reuti wrote:
> | | > 
> | | > | Hi,
> | | > | 
> | | > | On 16.12.2015 at 19:53, Gowtham wrote:
> | | > | 
> | | > | > 
> | | > | > Dear fellow Grid Engine users,
> | | > | > 
> | | > | > Over the past few days, I have had to re-install compute nodes (12
> | | > | > cores each) in an existing cluster running Rocks 6.1 and Grid Engine
> | | > | > 2011.11p1. I ensured the extend-*.xml files had no errors in them
> | | > | > using the xmllint command before rebuilding the distribution. All
> | | > | > six compute nodes installed successfully, and several test
> | | > | > "Hello, World!" cases ran successfully on up to 72 cores. I can SSH
> | | > | > into any one of these nodes, and SSH between any two compute nodes
> | | > | > just fine.
> | | > | > 
> | | > | > As of this morning, all submitted jobs that require more than 12
> | | > | > cores (i.e., spanning more than one compute node) fail about a
> | | > | > minute after starting successfully. However, all jobs with 12 or
> | | > | > fewer cores within a given compute node run just fine. The error
> | | > | > message for a failed job is as follows:
> | | > | > 
> | | > | >  error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> | | > | >  Ctrl-C caught... cleaning up processes
> | | > | > 
> | | > | > "Hello, World!" and one other program, both compiled with Intel
> | | > | > Cluster Studio 2013.0.028, display the same behavior. The line
> | | > | > corresponding to the failed job from
> | | > | > /opt/gridengine/default/spool/qmaster/messages is as follows:
> | | > | > 
> | | > | >  12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 task 1.compute-0-1 failed - killing job
> | | > | > 
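
When many such lines need to be checked, the pipe-separated fields of the 
qmaster messages entry above can be split out with a few lines of Python. This 
is a sketch only: the function name `parse_message_line` is hypothetical, and 
the reading of "6129.1" as job ID plus array task and of "1.compute-0-1" as 
the PE task ID is an assumption based on this single line.

```python
import re


def parse_message_line(line):
    """Split one 'timestamp|thread|host|level|text' line into a dict.

    Returns None if the line does not have five pipe-separated fields.
    """
    fields = line.strip().split("|", 4)
    if len(fields) != 5:
        return None
    timestamp, thread, host, level, text = fields
    record = {
        "timestamp": timestamp,
        "thread": thread,
        "host": host,
        "level": level,
        "text": text,
    }
    # Assumed layout: "parallel task <job>.<array task> task <pe task> failed"
    match = re.search(r"parallel task (\d+)\.(\d+) task (\S+) failed", text)
    if match:
        record["job_id"] = int(match.group(1))
        record["array_task"] = int(match.group(2))
        record["pe_task"] = match.group(3)
    return record


if __name__ == "__main__":
    line = (
        "12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel "
        "task 6129.1 task 1.compute-0-1 failed - killing job"
    )
    print(parse_message_line(line))
```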
> | | > | > I'd appreciate any insight or help to resolve this issue. If you
> | | > | > need additional information from my end, please let me know.
> | | > | 
> | | > | Which version of Intel MPI ships with Cluster Studio 2013.0.028? Is
> | | > | it lower than 4.1? IIRC, tight integration was not supported before
> | | > | that version: there was no automatic call to `qrsh`, because certain
> | | > | daemons had to be started beforehand.
> | | > | 
> | | > | Does your version still need mpdboot?
> | | > | 
> | | > | Do you request a properly set up PE in your job submission?
> | | > | 
> | | > | -- Reuti
> | | > | 
> | | > | > 
> | | > | > Thank you for your time and help.
> | | > | > 
> | | > | > Best regards,
> | | > | > g
> | | > | > 
> | | > | > 
> | | > | > _______________________________________________
> | | > | > users mailing list
> | | > | > [email protected]
> | | > | > https://gridengine.org/mailman/listinfo/users
> | | > | 
> | | > | 
> | | 
> | | 
> | 


