Hi,

Within Grid Engine, try using mpiexec.hydra instead of mpirun.

Check whether your mpiexec.hydra was built with SGE integration:

  strings mpiexec.hydra | grep sge

Regards,
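For example, a job script along the lines of the sketch below should let Hydra start its remote ranks under SGE's control (the PE name mpich_unstaged is taken from the thread below; the slot count, the binary name, and the I_MPI_HYDRA_BOOTSTRAP setting are assumptions to adjust for the actual setup):

  #!/bin/bash
  #$ -S /bin/bash
  #$ -cwd
  #$ -N hydra_test
  # Request more than 12 slots so the job spans two of the 12-core nodes.
  #$ -pe mpich_unstaged 24

  # Ask Hydra to launch remote ranks through SGE (qrsh) instead of ssh,
  # keeping the slave tasks under Grid Engine's control.
  export I_MPI_HYDRA_BOOTSTRAP=sge

  # NSLOTS is filled in by Grid Engine from the -pe request above.
  mpiexec.hydra -np $NSLOTS ./hello_world

If the strings check above prints nothing, the binary most likely has no SGE bootstrap support and Hydra will fall back to ssh. Sketches of a matching PE definition and a quick qrsh check are appended after the quoted thread at the end of this message.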
On Wed, Dec 16, 2015 at 9:32 PM, Gowtham <[email protected]> wrote:
>
> Hi Reuti,
>
> The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and
> I do not need mpdboot. The PE used for this purpose is called
> mpich_unstaged (basically, a copy of the original mpich with the '$fill_up'
> rule). The only other PE on this system is called mpich_staged (a copy of
> the original mpich with the '$pe_slots' rule).
>
> The same Go Huskies! program, compiled with the same Intel Cluster Studio
> on a different cluster running the same Rocks 6.1 and Grid Engine 2011.11p1
> combination and using the same mpich_unstaged PE, works successfully.
>
> Best regards,
> g
>
> --
> Gowtham, PhD
> Director of Research Computing, IT
> Adj. Asst. Professor, Physics/ECE
> Michigan Technological University
>
> P: (906) 487-3593
> F: (906) 487-2787
> http://it.mtu.edu
> http://hpc.mtu.edu
>
>
> On Wed, 16 Dec 2015, Reuti wrote:
>
> | Hi,
> |
> | Am 16.12.2015 um 19:53 schrieb Gowtham:
> |
> | >
> | > Dear fellow Grid Engine users,
> | >
> | > Over the past few days, I have had to re-install compute nodes (12
> | > cores each) in an existing cluster running Rocks 6.1 and Grid Engine
> | > 2011.11p1. I ensured the extend-*.xml files had no errors in them using
> | > the xmllint command before rebuilding the distribution. All six compute
> | > nodes installed successfully, and so did running several test
> | > "Hello, World!" cases up to 72 cores. I can SSH into any one of these
> | > nodes, and SSH between any two compute nodes, just fine.
> | >
> | > As of this morning, all submitted jobs that require more than 12 cores
> | > (i.e., spanning more than one compute node) fail about a minute after
> | > starting successfully. However, all jobs with 12 or fewer cores within
> | > a given compute node run just fine. The error message for a failed job
> | > is as follows:
> | >
> | > error: got no connection within 60 seconds. "Timeout occured while
> | > waiting for connection"
> | > Ctrl-C caught... cleaning up processes
> | >
> | > "Hello, World!" and one other program, both compiled with Intel
> | > Cluster Studio 2013.0.028, display the same behavior. The line
> | > corresponding to the failed job in
> | > /opt/gridengine/default/spool/qmaster/messages is as follows:
> | >
> | > 12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task
> | > 6129.1 task 1.compute-0-1 failed - killing job
> | >
> | > I'd appreciate any insight or help to resolve this issue. If you need
> | > additional information from my end, please let me know.
> |
> | What plain version of Intel MPI is Cluster Studio 2013.0.028? Less than
> | 4.1? IIRC, tight integration was not supported before that version, as
> | there was no call to `qrsh` set up automatically; you would need to start
> | certain daemons beforehand.
> |
> | Does your version still need mpdboot?
> |
> | Do you request a properly set up PE in your job submission?
> |
> | -- Reuti
> |
> | >
> | > Thank you for your time and help.
> | >
> | > Best regards,
> | > g
> | >
> | > --
> | > Gowtham, PhD
> | > Director of Research Computing, IT
> | > Adj. Asst. Professor, Physics/ECE
> | > Michigan Technological University
> | >
> | > P: (906) 487-3593
> | > F: (906) 487-2787
> | > http://it.mtu.edu
> | > http://hpc.mtu.edu
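For reference on Reuti's last question: a PE suited to a tight Hydra integration usually looks something like the sketch below (the PE name mpich_unstaged comes from the thread; the slot count and the NONE start/stop procedures are assumptions, not a dump from the affected cluster):

  # Inspect the PE the failing jobs request; the commented lines show the
  # attribute values a tight Hydra/SGE integration typically relies on.
  qconf -sp mpich_unstaged
  # pe_name            mpich_unstaged
  # slots              999
  # start_proc_args    NONE
  # stop_proc_args     NONE
  # allocation_rule    $fill_up       (fill one node, then spill to the next)
  # control_slaves     TRUE           (slave tasks start via qrsh -inherit)
  # job_is_first_task  FALSE

With control_slaves TRUE, the remote ranks are expected to come in through qrsh -inherit, and the "tightly integrated parallel task ... failed" line in the qmaster messages quoted above points at a failure in exactly that slave-task startup path.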
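And since the jobs only die once they span a second node, it may also be worth confirming that qrsh itself still works against the reinstalled nodes, as this is the same path the tight integration uses. A minimal check, assuming the node name compute-0-1 from the log excerpt above (adjust the hostname form, e.g. compute-0-1.local, to how the execution hosts are actually named):

  # Run a trivial command on a reinstalled node through Grid Engine's own
  # remote-startup mechanism (the same one qrsh -inherit relies on).
  qrsh -l hostname=compute-0-1 hostname

  # Show which remote-startup method the cluster is configured with;
  # on SGE 6.2+/2011.11 'builtin' is the usual setting.
  qconf -sconf | egrep 'rsh_command|rsh_daemon|qlogin'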
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
