> On 16.12.2015 at 21:32, Gowtham <[email protected]> wrote:
> 
> 
> Hi Reuti,
> 
> The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and I 
> do not need mpdboot. The PE used for this purpose is called mpich_unstaged 
> (basically, a copy of the original mpich PE with the '$fill_up' allocation 
> rule). The only other PE in this system is called mpich_staged (a copy of 
> the original mpich PE with the '$pe_slots' allocation rule).
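> 
> For reference, its configuration looks roughly like this (a sketch from 
> memory rather than an exact `qconf -sp mpich_unstaged` dump; the slot count 
> and the start/stop scripts may differ on this cluster):
> 
>   pe_name            mpich_unstaged
>   slots              9999
>   user_lists         NONE
>   xuser_lists        NONE
>   start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
>   stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
>   allocation_rule    $fill_up
>   control_slaves     TRUE
>   job_is_first_task  FALSE
>   urgency_slots      min
>   accounting_summary FALSE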

I have no clue what mpich_(un)staged refers to; I assume it's some setting from 
ROCKS. Can you please post the particular PE settings you want to use during 
submission?
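
For example, the output of

  qconf -sp mpich_unstaged

and the actual request line from your job script (something like 
"#$ -pe mpich_unstaged 24"; the slot count here is only a placeholder).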

-- Reuti


> The same Go Huskies! program, compiled with the same Intel Cluster Studio, 
> runs successfully on a different cluster with the same Rocks 6.1 and Grid 
> Engine 2011.11p1 combination, using the same mpich_unstaged PE.
> 
> Best regards,
> g
> 
> --
> Gowtham, PhD
> Director of Research Computing, IT
> Adj. Asst. Professor, Physics/ECE
> Michigan Technological University
> 
> P: (906) 487-3593
> F: (906) 487-2787
> http://it.mtu.edu
> http://hpc.mtu.edu
> 
> 
> On Wed, 16 Dec 2015, Reuti wrote:
> 
> | Hi,
> | 
> | On 16.12.2015 at 19:53, Gowtham wrote:
> | 
> | > 
> | > Dear fellow Grid Engine users,
> | > 
> | > Over the past few days, I have had to re-install compute nodes (12 cores 
> | > each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. 
> | > I ensured the extend-*.xml files had no errors in them using the xmllint 
> | > command before rebuilding the distribution. All six compute nodes 
> | > installed successfully, and so did several test "Hello, World!" runs on 
> | > up to 72 cores. I can SSH into any one of these nodes, and SSH between 
> | > any two compute nodes just fine.
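> | > (Concretely, something along the lines of `xmllint --noout 
> | > extend-compute.xml` for each modified file; the exact file names are 
> | > just examples here.)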
> | > 
> | > As of this morning, all submitted jobs that require more than 12 cores 
> | > (i.e., spanning more than one compute node) fail about a minute after 
> | > starting successfully. However, all jobs with 12 or fewer cores within a 
> | > given compute node run just fine. The error message for a failed job is 
> | > as follows:
> | > 
> | >  error: got no connection within 60 seconds. "Timeout occured while 
> | >  waiting for connection"
> | >  Ctrl-C caught... cleaning up processes
> | > 
> | > "Hello, World!" and one other program, both compiled with Intel Cluster 
> Studio 2013.0.028, display the same behavior. The line corresponding to the 
> failed job from /opt/gridengine/default/spool/qmaster/messages is as follows:
> | > 
> | >  12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 
> | >  6129.1 task 1.compute-0-1 failed - killing job
> | > 
> | > I'd appreciate any insight or help to resolve this issue. If you need 
> | > additional information from my end, please let me know.
> | 
> | What plain version of Intel MPI does Cluster Studio 2013.0.028 include? 
> | Less than 4.1? IIRC, tight integration was not supported before that 
> | version: there was no automatic call to `qrsh`, and you had to start 
> | certain daemons beforehand yourself.
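> | 
> | (For reference: with a Hydra-based Intel MPI >= 4.1, the tight integration 
> | usually amounts to something like
> | 
> |   export I_MPI_HYDRA_BOOTSTRAP=sge
> |   mpirun -np $NSLOTS ./a.out
> | 
> | inside the job script, so that mpirun starts its remote helper daemons via 
> | `qrsh -inherit` instead of ssh. A sketch, assuming the Hydra process 
> | manager; ./a.out stands in for your binary.)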
> | 
> | Does your version still need mpdboot?
> | 
> | Do you request a properly set-up PE in your job submission?
> | 
> | -- Reuti
> | 
> | > 
> | > Thank you for your time and help.
> | > 
> | > Best regards,
> | > g
> | > 
> | > --
> | > Gowtham, PhD
> | > Director of Research Computing, IT
> | > Adj. Asst. Professor, Physics/ECE
> | > Michigan Technological University
> | > 
> | > P: (906) 487-3593
> | > F: (906) 487-2787
> | > http://it.mtu.edu
> | > http://hpc.mtu.edu
> | > 
> | 
> | 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
