Thank you, Sir. I made a 'machines' file with a round-robin list of the compute 
node names (the six-name cycle repeated 12 times, for a total of 72 entries):

compute-0-0
compute-0-1
compute-0-2
compute-0-3
compute-0-4
compute-0-5
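
For reference, one way to generate a file like this is a quick shell loop 
(just a sketch):

  # Write the six node names, cycled 12 times (72 lines in total), to ~/machines
  for i in $(seq 1 12); do
      for n in $(seq 0 5); do echo "compute-0-$n"; done
  done > ~/machines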

I then ran the 'Hello, World!' program (renamed 'Go Huskies!' in honor of my 
university's mascot) outside of Grid Engine, and the output is below.


mpirun -n 16 -machine ~/machines /share/apps/go_huskies/bin/go_huskies.x

    Parallel version of 'Go Huskies!' with 16 processors
  -----------------------------------------------------------------  
    Rank  Hostname                       Local Date & Time
  -----------------------------------------------------------------  
    0     compute-0-0.local              Wed Dec 16 14:26:25 2015
    1     compute-0-1.local              Wed Dec 16 14:26:29 2015
    2     compute-0-2.local              Wed Dec 16 14:26:30 2015
    3     compute-0-3.local              Wed Dec 16 14:26:31 2015
    4     compute-0-4.local              Wed Dec 16 14:26:32 2015
    5     compute-0-5.local              Wed Dec 16 14:26:33 2015
    6     compute-0-0.local              Wed Dec 16 14:26:34 2015
    7     compute-0-1.local              Wed Dec 16 14:26:35 2015
    8     compute-0-2.local              Wed Dec 16 14:26:36 2015
    9     compute-0-3.local              Wed Dec 16 14:26:37 2015
    10    compute-0-4.local              Wed Dec 16 14:26:38 2015
    11    compute-0-5.local              Wed Dec 16 14:26:39 2015
    12    compute-0-0.local              Wed Dec 16 14:26:40 2015
    13    compute-0-1.local              Wed Dec 16 14:26:41 2015
    14    compute-0-2.local              Wed Dec 16 14:26:42 2015
    15    compute-0-3.local              Wed Dec 16 14:26:43 2015
  -----------------------------------------------------------------
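
For comparison, the jobs that fail are submitted through Grid Engine with a job 
script along these lines (only a sketch; 'mpich' stands in for whatever parallel 
environment is actually configured on this cluster, and with tight integration 
mpirun is expected to pick up the scheduler-granted hosts on its own):

  #!/bin/bash
  #$ -N go_huskies
  #$ -cwd
  #$ -pe mpich 16     # placeholder PE name; substitute the PE configured here
  # NSLOTS is set by Grid Engine to the number of granted slots
  mpirun -n $NSLOTS /share/apps/go_huskies/bin/go_huskies.x

Submitted with qsub, any request above 12 slots spans more than one node and 
hits the 60-second connection timeout described below.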

Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

P: (906) 487-3593
F: (906) 487-2787
http://it.mtu.edu
http://hpc.mtu.edu


On Wed, 16 Dec 2015, Chris Dagdigian wrote:

| 
| This looks and feels like an MPI job launching failure
| 
| Especially as it fails exactly when it tries to cross the threshold from
| single chassis to multiple boxes
| 
| The #1 debugging advice in this scenario is this:
| 
|  -- Can you definitively run on more than 12 cores OUTSIDE of grid engine?
| 
| My experience with failures similar to this is that you first need to see if
| the problem is with the app or if the problem is with Grid Engine. Testing to
| see if your "hello world" example works beyond 12 cores WITHOUT grid engine
| will be a valuable datapoint and troubleshooting step. When MPI is involved
| this is doubly true.
| 
| -Chris
| 
| > Gowtham <mailto:[email protected]>
| > December 16, 2015 at 1:53 PM
| > 
| > Dear fellow Grid Engine users,
| > 
| > Over the past few days, I have had to re-install compute nodes (12 cores
| > each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I
| > ensured the extend-*.xml files had no errors in them using the xmllint
| > command before rebuilding the distribution. All six compute nodes installed
| > successfully, as did several test "Hello, World!" runs with up to 72 cores.
| > I can SSH into any one of these nodes, and SSH between any two compute
| > nodes just fine.
| > 
| > As of this morning, all submitted jobs that require more than 12 cores
| > (i.e., span more than one compute node) fail about a minute after starting
| > successfully. However, all jobs with 12 or fewer cores within a given
| > compute node run just fine. The error message for a failed job is as follows:
| > 
| >   error: got no connection within 60 seconds. "Timeout occured while waiting
| > for connection"
| >   Ctrl-C caught... cleaning up processes
| > 
| > "Hello, World!" and one other program, both compiled with Intel Cluster
| > Studio 2013.0.028, display the same behavior. The line corresponding to the
| > failed job from /opt/gridengine/default/spool/qmaster/messages is as
| > follows:
| > 
| >   12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task
| > 6129.1 task 1.compute-0-1 failed - killing job
| > 
| > I'd appreciate any insight or help to resolve this issue. If you need
| > additional information from my end, please let me know.
| > 
| > Thank you for your time and help.
| > 
| > Best regards,
| > g
| > 
| > -- 
| > Gowtham, PhD
| > Director of Research Computing, IT
| > Adj. Asst. Professor, Physics/ECE
| > Michigan Technological University
| > 
| > P: (906) 487-3593
| > F: (906) 487-2787
| > http://it.mtu.edu
| > http://hpc.mtu.edu
| > 
| 
| 
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
