Hi,

On 16.12.2015 at 19:53, Gowtham wrote:

> 
> Dear fellow Grid Engine users,
> 
> Over the past few days, I have had to re-install compute nodes (12 cores 
> each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I 
> ensured the extend-*.xml files had no errors using the xmllint command 
> before rebuilding the distribution. All six compute nodes installed 
> successfully, and several "Hello, World!" test cases ran successfully on up 
> to 72 cores. I can SSH into any one of these nodes, and SSH between any two 
> compute nodes just fine.
> 
> As of this morning, all submitted jobs that require more than 12 cores (i.e., 
> spanning more than one compute node) fail about a minute after starting 
> successfully. However, all jobs with 12 or fewer cores within a given 
> compute node run just fine. The error message for a failed job is as follows:
> 
>  error: got no connection within 60 seconds. "Timeout occured while waiting 
> for connection"
>  Ctrl-C caught... cleaning up processes
> 
> "Hello, World!" and one other program, both compiled with Intel Cluster 
> Studio 2013.0.028, display the same behavior. The line corresponding to the 
> failed job from /opt/gridengine/default/spool/qmaster/messages is as follows:
> 
>  12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 
> task 1.compute-0-1 failed - killing job
> 
> I'd appreciate any insight or help to resolve this issue. If you need 
> additional information from my end, please let me know.

Which plain version of Intel MPI ships with Cluster Studio 2013.0.028? Is it 
older than 4.1? IIRC a tight integration was not supported before 4.1: there 
was no automatic call to `qrsh` set up, because you had to start certain 
daemons beforehand.

Does your version still need mpdboot?

Do you request a properly set up PE (parallel environment) in your job 
submission?
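
For a tight integration, the PE needs `control_slaves TRUE`; you can display 
the definition with `qconf -sp <pe_name>`. A minimal sketch in Python of that 
check, assuming the `qconf -sp` output is available as plain text. The helper 
names and the sample PE (`mpi_sample`) are hypothetical and only illustrate 
the usual output layout, they are not taken from your cluster:

```python
def parse_pe(text):
    """Turn the text printed by `qconf -sp <pe_name>` into a dict.

    Each non-empty line is expected to be "<attribute> <value>",
    separated by whitespace.
    """
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def allows_tight_integration(settings):
    """A tight integration needs control_slaves set to TRUE."""
    return settings.get("control_slaves", "FALSE").upper() == "TRUE"


# Hypothetical sample, not the PE of the cluster discussed here.
SAMPLE_PE = """\
pe_name            mpi_sample
slots              72
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
"""

if __name__ == "__main__":
    pe = parse_pe(SAMPLE_PE)
    print(pe["pe_name"], "control_slaves ok:", allows_tight_integration(pe))
```

If `control_slaves` is FALSE, `qrsh -inherit` calls to the other nodes are 
refused, so only the part of the job on the first node can start.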

-- Reuti

> 
> Thank you for your time and help.
> 
> Best regards,
> g
> 
> --
> Gowtham, PhD
> Director of Research Computing, IT
> Adj. Asst. Professor, Physics/ECE
> Michigan Technological University
> 
> P: (906) 487-3593
> F: (906) 487-2787
> http://it.mtu.edu
> http://hpc.mtu.edu
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

