This looks and feels like an MPI job-launching failure, especially since it fails exactly when it tries to cross the threshold from a single chassis to multiple boxes.

The #1 piece of debugging advice in this scenario is this:

 -- Can you definitively run on more than 12 cores OUTSIDE of Grid Engine?

My experience with failures like this is that you first need to determine whether the problem is with the app or with Grid Engine. Testing whether your "Hello, World!" example works beyond 12 cores WITHOUT Grid Engine will be a valuable data point and troubleshooting step. When MPI is involved, this is doubly true.
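A minimal way to run that test, assuming an Open MPI-style launcher and machinefile syntax (Intel MPI uses `host:count` lines instead) -- the node names, slot counts, and `./hello_world` binary below are placeholders for your own:

```shell
# Build a machinefile by hand so Grid Engine is not involved at all.
# Node names and slot counts are examples -- edit for your cluster.
cat > hosts.txt <<'EOF'
compute-0-0 slots=12
compute-0-1 slots=12
EOF

# 24 ranks forces the run across two chassis, reproducing the
# failing >12-core case without Grid Engine in the loop.
# Guarded so the sketch is a no-op where no MPI launcher is installed.
if command -v mpirun >/dev/null; then
    mpirun -np 24 -machinefile hosts.txt ./hello_world
fi
```

If this run also hangs or times out, the problem is in the MPI launch path (rsh/ssh setup, firewall, fabric); if it succeeds, suspicion shifts to the Grid Engine tight integration.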

-Chris

Gowtham <[email protected]>
December 16, 2015 at 1:53 PM

Dear fellow Grid Engine users,

Over the past few days, I have had to re-install compute nodes (12 cores each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I ensured the extend-*.xml files had no error in them using the xmllint command before rebuilding the distribution. All six compute nodes installed successfully, and so did running several test "Hello, World!" cases up to 72 cores. I can SSH into any one of these nodes, and SSH between any two compute nodes just fine.

As of this morning, all submitted jobs that require more than 12 cores (i.e., spanning more than one compute node) fail about a minute after starting successfully. However, all jobs with 12 or fewer cores within a given compute node run just fine. The error message for a failed job is as follows:

error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
  Ctrl-C caught... cleaning up processes

"Hello, World!" and one other program, both compiled with Intel Cluster Studio 2013.0.028, display the same behavior. The line corresponding to the failed job from /opt/gridengine/default/spool/qmaster/messages is as follows:

12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 task 1.compute-0-1 failed - killing job

I'd appreciate any insight or help to resolve this issue. If you need additional information from my end, please let me know.

Thank you for your time and help.

Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

P: (906) 487-3593
F: (906) 487-2787
http://it.mtu.edu
http://hpc.mtu.edu

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

