Hi Reuti,

The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and I do not need mpdboot. The PE used for this purpose is called mpich_unstaged (basically, a copy of the original mpich PE with the '$fill_up' allocation rule). The only other PE in this system is called mpich_staged (a copy of the original mpich PE with the '$pe_slots' allocation rule).
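For reference, a $fill_up PE along these lines would look roughly like the following `qconf -sp` listing. This is an illustrative sketch rather than an exact dump of this cluster; the slot count and the start/stop scripts in particular are assumptions:

  pe_name            mpich_unstaged
  slots              9999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
  stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
  allocation_rule    $fill_up
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE

mpich_staged is identical except that its allocation_rule is $pe_slots. control_slaves TRUE is what puts the remote MPI tasks under tight integration, which matches the "tightly integrated parallel task ... failed" wording in the qmaster message quoted below.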
The same Go Huskies! program, compiled with the same Intel Cluster Studio on a different cluster running the same Rocks 6.1 and Grid Engine 2011.11p1 combination and using the same mpich_unstaged PE, works successfully.

Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

P: (906) 487-3593
F: (906) 487-2787
http://it.mtu.edu
http://hpc.mtu.edu


On Wed, 16 Dec 2015, Reuti wrote:

| Hi,
|
| Am 16.12.2015 um 19:53 schrieb Gowtham:
|
| >
| > Dear fellow Grid Engine users,
| >
| > Over the past few days, I have had to re-install compute nodes (12 cores each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I ensured the extend-*.xml files had no errors in them using the xmllint command before rebuilding the distribution. All six compute nodes installed successfully, and so did running several test "Hello, World!" cases up to 72 cores. I can SSH into any one of these nodes, and SSH between any two compute nodes, just fine.
| >
| > As of this morning, all submitted jobs that require more than 12 cores (i.e., spanning more than one compute node) fail about a minute after starting successfully. However, all jobs with 12 or fewer cores within a given compute node run just fine. The error message for a failed job is as follows:
| >
| > error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
| > Ctrl-C caught... cleaning up processes
| >
| > "Hello, World!" and one other program, both compiled with Intel Cluster Studio 2013.0.028, display the same behavior. The line corresponding to the failed job from /opt/gridengine/default/spool/qmaster/messages is as follows:
| >
| > 12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 task 1.compute-0-1 failed - killing job
| >
| > I'd appreciate any insight or help to resolve this issue. If you need additional information from my end, please let me know.
|
| Which plain Intel MPI version does Cluster Studio 2013.0.028 correspond to? Less than 4.1? IIRC, tight integration was not supported before that version, as there was no automatic call to `qrsh` set up; you would need to start certain daemons beforehand.
|
| Does your version still need mpdboot?
|
| Do you request a properly set up PE in your job submission?
|
| -- Reuti
|
| >
| > Thank you for your time and help.
| >
| > Best regards,
| > g
| >
| > --
| > Gowtham, PhD
| > Director of Research Computing, IT
| > Adj. Asst. Professor, Physics/ECE
| > Michigan Technological University
| >
| > P: (906) 487-3593
| > F: (906) 487-2787
| > http://it.mtu.edu
| > http://hpc.mtu.edu
| >
| > _______________________________________________
| > users mailing list
| > [email protected]
| > https://gridengine.org/mailman/listinfo/users
|
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
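As a point of reference for the submission side, a minimal job script for this kind of tightly integrated Intel MPI run would look roughly like the sketch below. The job name, binary, core count, and the hydra bootstrap setting are illustrative assumptions, not a copy of the actual failing script:

  #!/bin/bash
  #$ -N go_huskies
  #$ -cwd
  #$ -V
  #$ -pe mpich_unstaged 24

  # Ask Intel MPI's hydra launcher to start remote ranks through SGE's
  # qrsh, so the slave tasks stay under tight integration (this assumes
  # Intel MPI >= 4.1, which understands the sge bootstrap).
  export I_MPI_HYDRA_BOOTSTRAP=sge

  # NSLOTS is filled in by Grid Engine from the -pe request above.
  mpirun -np $NSLOTS ./go_huskies.x

With the $fill_up rule, a 24-slot request like this spans two of the 12-core nodes, which is exactly the multi-node case that fails here.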
