Hi,

Within Grid Engine, try using mpiexec.hydra instead of mpirun.

Check whether your mpiexec.hydra was built with SGE integration:

  strings mpiexec.hydra | grep sge

Regards,
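For example, a job script along the lines of the sketch below should let Hydra start its remote ranks under SGE's control (the PE name mpich_unstaged is taken from the thread below; the slot count, the binary name, and the I_MPI_HYDRA_BOOTSTRAP setting are assumptions to adjust for the actual setup):

  #!/bin/bash
  #$ -S /bin/bash
  #$ -cwd
  #$ -N hydra_test
  # Request more than 12 slots so the job spans two of the 12-core nodes.
  #$ -pe mpich_unstaged 24

  # Ask Hydra to launch remote ranks through SGE (qrsh) instead of ssh,
  # keeping the slave tasks under Grid Engine's control.
  export I_MPI_HYDRA_BOOTSTRAP=sge

  # NSLOTS is filled in by Grid Engine from the -pe request above.
  mpiexec.hydra -np $NSLOTS ./hello_world

If the strings check above prints nothing, the binary most likely has no SGE bootstrap support and Hydra will fall back to ssh. Sketches of a matching PE definition and a quick qrsh check are appended after the quoted thread at the end of this message.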
On Wed, Dec 16, 2015 at 9:32 PM, Gowtham <[email protected]> wrote:
>
> Hi Reuti,
>
> The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and
> I do not need mpdboot. The PE used for this purpose is called
> mpich_unstaged (basically, a copy of the original mpich with the '$fill_up'
> rule). The only other PE on this system is called mpich_staged (a copy of
> the original mpich with the '$pe_slots' rule).
>
> The same Go Huskies! program, compiled with the same Intel Cluster Studio
> on a different cluster running the same Rocks 6.1 and Grid Engine 2011.11p1
> combination and using the same mpich_unstaged PE, works successfully.
>
> Best regards,
> g
>
> --
> Gowtham, PhD
> Director of Research Computing, IT
> Adj. Asst. Professor, Physics/ECE
> Michigan Technological University
>
> P: (906) 487-3593
> F: (906) 487-2787
> http://it.mtu.edu
> http://hpc.mtu.edu
>
>
> On Wed, 16 Dec 2015, Reuti wrote:
>
> | Hi,
> |
> | Am 16.12.2015 um 19:53 schrieb Gowtham:
> |
> | >
> | > Dear fellow Grid Engine users,
> | >
> | > Over the past few days, I have had to re-install compute nodes (12
> | > cores each) in an existing cluster running Rocks 6.1 and Grid Engine
> | > 2011.11p1. I ensured the extend-*.xml files had no errors in them using
> | > the xmllint command before rebuilding the distribution. All six compute
> | > nodes installed successfully, and so did running several test
> | > "Hello, World!" cases up to 72 cores. I can SSH into any one of these
> | > nodes, and SSH between any two compute nodes, just fine.
> | >
> | > As of this morning, all submitted jobs that require more than 12 cores
> | > (i.e., spanning more than one compute node) fail about a minute after
> | > starting successfully. However, all jobs with 12 or fewer cores within
> | > a given compute node run just fine. The error message for a failed job
> | > is as follows:
> | >
> | > error: got no connection within 60 seconds. "Timeout occured while
> | > waiting for connection"
> | > Ctrl-C caught... cleaning up processes
> | >
> | > "Hello, World!" and one other program, both compiled with Intel
> | > Cluster Studio 2013.0.028, display the same behavior. The line
> | > corresponding to the failed job in
> | > /opt/gridengine/default/spool/qmaster/messages is as follows:
> | >
> | > 12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task
> | > 6129.1 task 1.compute-0-1 failed - killing job
> | >
> | > I'd appreciate any insight or help to resolve this issue. If you need
> | > additional information from my end, please let me know.
> |
> | What plain version of Intel MPI is Cluster Studio 2013.0.028? Less than
> | 4.1? IIRC, tight integration was not supported before that version, as
> | there was no call to `qrsh` set up automatically; you would need to start
> | certain daemons beforehand.
> |
> | Does your version still need mpdboot?
> |
> | Do you request a properly set up PE in your job submission?
> |
> | -- Reuti
> |
> | >
> | > Thank you for your time and help.
> | >
> | > Best regards,
> | > g
> | >
> | > --
> | > Gowtham, PhD
> | > Director of Research Computing, IT
> | > Adj. Asst. Professor, Physics/ECE
> | > Michigan Technological University
> | >
> | > P: (906) 487-3593
> | > F: (906) 487-2787
> | > http://it.mtu.edu
> | > http://hpc.mtu.edu
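For reference on Reuti's last question: a PE suited to a tight Hydra integration usually looks something like the sketch below (the PE name mpich_unstaged comes from the thread; the slot count and the NONE start/stop procedures are assumptions, not a dump from the affected cluster):

  # Inspect the PE the failing jobs request; the commented lines show the
  # attribute values a tight Hydra/SGE integration typically relies on.
  qconf -sp mpich_unstaged
  # pe_name            mpich_unstaged
  # slots              999
  # start_proc_args    NONE
  # stop_proc_args     NONE
  # allocation_rule    $fill_up       (fill one node, then spill to the next)
  # control_slaves     TRUE           (slave tasks start via qrsh -inherit)
  # job_is_first_task  FALSE

With control_slaves TRUE, the remote ranks are expected to come in through qrsh -inherit, and the "tightly integrated parallel task ... failed" line in the qmaster messages quoted above points at a failure in exactly that slave-task startup path.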
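And since the jobs only die once they span a second node, it may also be worth confirming that qrsh itself still works against the reinstalled nodes, as this is the same path the tight integration uses. A minimal check, assuming the node name compute-0-1 from the log excerpt above (adjust the hostname form, e.g. compute-0-1.local, to how the execution hosts are actually named):

  # Run a trivial command on a reinstalled node through Grid Engine's own
  # remote-startup mechanism (the same one qrsh -inherit relies on).
  qrsh -l hostname=compute-0-1 hostname

  # Show which remote-startup method the cluster is configured with;
  # on SGE 6.2+/2011.11 'builtin' is the usual setting.
  qconf -sconf | egrep 'rsh_command|rsh_daemon|qlogin'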
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
