[gridengine users] GE 2011.11p1: got no connection within 60 seconds

Gowtham Wed, 16 Dec 2015 10:55:28 -0800


Dear fellow Grid Engine users,


Over the past few days, I have had to re-install compute nodes (12 cores each) in an 
existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. Before rebuilding the 
distribution, I used the xmllint command to confirm that the extend-*.xml files had no 
errors. All six compute nodes installed successfully, and several "Hello, World!" test 
cases using up to 72 cores ran successfully as well. I can SSH into any one of these 
nodes, and SSH between any two compute nodes works just fine.
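
For anyone who wants to repeat the node-to-node SSH check described above, here is a 
minimal sketch in Python. It is an illustration, not the exact procedure I ran. It 
assumes passwordless (key-based) SSH from the head node, and the node names 
compute-0-0 through compute-0-5 are hypothetical (only compute-0-1 appears in the 
log below). The helper names are made up for this example.

```python
import itertools
import subprocess

# Hypothetical node names: six compute nodes, assumed to follow the
# compute-0-N pattern. Adjust to match the actual cluster.
NODES = [f"compute-0-{i}" for i in range(6)]

# Standard OpenSSH client options: never prompt for a password, and give up
# on the TCP connection after 10 seconds.
SSH_OPTS = ["-o", "BatchMode=yes", "-o", "ConnectTimeout=10"]


def ordered_pairs(nodes):
    """Return every (source, destination) pair of distinct nodes."""
    return list(itertools.permutations(nodes, 2))


def ssh_pair_command(src, dst):
    """Build a command that logs in to src and, from there, runs hostname on dst."""
    inner = " ".join(["ssh", *SSH_OPTS, dst, "hostname"])
    return ["ssh", *SSH_OPTS, src, inner]


def check_all(nodes, timeout=30):
    """Try every ordered pair and return a list of (src, dst, reason) failures."""
    failures = []
    for src, dst in ordered_pairs(nodes):
        try:
            result = subprocess.run(
                ssh_pair_command(src, dst),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures.append((src, dst, "timed out"))
            continue
        if result.returncode != 0:
            failures.append((src, dst, result.stderr.strip()))
    return failures
```

Calling check_all(NODES) from the head node returns an empty list when every pair 
connects. Note that this only exercises plain SSH, which may or may not be the 
mechanism Grid Engine uses to start remote tasks on a given cluster.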

As of this morning, all submitted jobs that require more than 12 cores (i.e., 
spanning more than one compute node) fail about a minute after starting 
successfully. However, all jobs that use 12 or fewer cores within a single 
compute node run just fine. The error message for a failed job is as follows:

  error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
  Ctrl-C caught... cleaning up processes

"Hello, World!" and one other program, both compiled with Intel Cluster Studio 
2013.0.028, display the same behavior. The line corresponding to the failed job from 
/opt/gridengine/default/spool/qmaster/messages is as follows:

  12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 task 1.compute-0-1 failed - killing job
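
In case it helps with digging through the same file, here is a minimal Python 
sketch that splits such a line into its pipe-separated fields and keeps only 
the error entries. The field names (timestamp, thread, host, level, text) are 
my labels, inferred from the single line above, and the helper names are 
hypothetical.

```python
from typing import NamedTuple


class QmasterMessage(NamedTuple):
    # Field names are assumptions based on the one sample line in this post.
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


# The line quoted above, joined back onto a single line.
SAMPLE = (
    "12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task "
    "6129.1 task 1.compute-0-1 failed - killing job"
)


def parse_message(line):
    """Split one messages line into five fields; the text may itself contain '|'."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected line format: {line!r}")
    return QmasterMessage(*parts)


def error_messages(lines):
    """Return the parsed entries whose level field is 'E', skipping malformed lines."""
    found = []
    for line in lines:
        try:
            msg = parse_message(line)
        except ValueError:
            continue
        if msg.level == "E":
            found.append(msg)
    return found
```

For example, error_messages(open(path)) with path set to the messages file 
named above would list every error entry, which makes it easier to see whether 
the failures always involve the same node.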

I'd appreciate any insight or help to resolve this issue. If you need 
additional information from my end, please let me know.

Thank you for your time and help.

Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

P: (906) 487-3593
F: (906) 487-2787
http://it.mtu.edu
http://hpc.mtu.edu
