Hi,

mpirun (Intel) is just a wrapper around mpdboot + mpiexec + mpdallexit, while mpiexec.hydra is the newer Intel MPI process spawner, which has been tightly integrated with Grid Engine since version 4.3.1.
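For reference, a minimal sketch of a job script that launches through mpiexec.hydra under a tight-integration PE could look like the following (the PE name mpich_unstaged is taken from the thread below; the slot count and the binary name ./hello_world are placeholders, not from the original posts):

  #!/bin/bash
  #$ -N hello_world
  #$ -cwd
  #$ -pe mpich_unstaged 16
  # NSLOTS is filled in by Grid Engine with the number of granted slots.
  # With a tight integration, Hydra should reach the other hosts via qrsh;
  # if it does not, the bootstrap mechanism can usually be forced, e.g. with
  # I_MPI_HYDRA_BOOTSTRAP=sge (check the Intel MPI reference for your version).
  mpiexec.hydra -np $NSLOTS ./hello_world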
Regards,

On Thu, Dec 17, 2015 at 4:19 PM, Reuti <[email protected]> wrote:
> Maybe `mpirun` doesn't support/use Hydra. Although not required, the MPI standard specifies `mpiexec` as a portable startup mechanism. Doesn't Intel MPI also have an `mpiexec`, which would match the `mpirun` behavior (and doesn't use Hydra)?
>
> -- Reuti
>
> Am 17.12.2015 um 15:06 schrieb Gowtham <[email protected]>:
> >
> > Yes sir. mpirun and mpiexec.hydra are both from the Intel Cluster Studio suite. To make sure of this, I ran a quick batch job with
> >
> >   which mpirun
> >   which mpiexec.hydra
> >
> > and it returned
> >
> >   /share/apps/intel/2013.0.028/impi/4.1.0.024/intel64/bin/mpirun
> >   /share/apps/intel/2013.0.028/impi/4.1.0.024/intel64/bin/mpiexec.hydra
> >
> > Best regards,
> > g
> >
> > --
> > Gowtham, PhD
> > Director of Research Computing, IT
> > Adj. Asst. Professor, Physics/ECE
> > Michigan Technological University
> >
> > P: (906) 487-3593
> > F: (906) 487-2787
> > http://it.mtu.edu
> > http://hpc.mtu.edu
> >
> > On Thu, 17 Dec 2015, Reuti wrote:
> >
> > | > Am 17.12.2015 um 13:41 schrieb Gowtham <[email protected]>:
> > | >
> > | > I tried replacing the call to mpirun with mpiexec.hydra, and it seems to work successfully as before. Please find below the contents of the *.sh.o#### file corresponding to the Hello, World! run spanning more than one compute node:
> > |
> > | Are both `mpirun` and `mpiexec.hydra` from the same Intel MPI library? One can't use, for example, Open MPI's `mpiexec` to start an Intel MPI application in parallel (if it runs at all, it will only run the program several times in serial).
> > |
> > | -- Reuti
> > |
> > | > Parallel version of 'Go Huskies!' with 16 processors
> > | > -----------------------------------------------------------------
> > | >  Rank   Hostname            Local Date & Time
> > | > -----------------------------------------------------------------
> > | >     0   compute-0-4.local   Thu Dec 17 07:29:54 2015
> > | >     1   compute-0-4.local   Thu Dec 17 07:29:58 2015
> > | >     2   compute-0-4.local   Thu Dec 17 07:29:59 2015
> > | >     3   compute-0-4.local   Thu Dec 17 07:30:00 2015
> > | >     4   compute-0-4.local   Thu Dec 17 07:30:01 2015
> > | >     5   compute-0-4.local   Thu Dec 17 07:30:02 2015
> > | >     6   compute-0-4.local   Thu Dec 17 07:30:03 2015
> > | >     7   compute-0-4.local   Thu Dec 17 07:30:04 2015
> > | >     8   compute-0-4.local   Thu Dec 17 07:30:05 2015
> > | >     9   compute-0-4.local   Thu Dec 17 07:30:06 2015
> > | >    10   compute-0-4.local   Thu Dec 17 07:30:07 2015
> > | >    11   compute-0-4.local   Thu Dec 17 07:30:08 2015
> > | >    12   compute-0-2.local   Thu Dec 17 07:30:09 2015
> > | >    13   compute-0-2.local   Thu Dec 17 07:30:10 2015
> > | >    14   compute-0-2.local   Thu Dec 17 07:30:11 2015
> > | >    15   compute-0-2.local   Thu Dec 17 07:30:12 2015
> > | > -----------------------------------------------------------------
> > | >
> > | > Any insight into why this made a difference would be greatly appreciated.
> > | >
> > | > Best regards,
> > | > g
> > | >
> > | > --
> > | > Gowtham, PhD
> > | > Director of Research Computing, IT
> > | > Adj. Asst. Professor, Physics/ECE
> > | > Michigan Technological University
> > | >
> > | > P: (906) 487-3593
> > | > F: (906) 487-2787
> > | > http://it.mtu.edu
> > | > http://hpc.mtu.edu
> > | >
> > | > On Thu, 17 Dec 2015, Gowtham wrote:
> > | >
> > | > | Here you go, Sir.
> > | > |
> > | > | These two PEs were created by me (not from Rocks) to help our researchers pick one depending on the nature of their job. If a software suite requires that all processors/cores belong to the same physical compute node (e.g., MATLAB with the Parallel Computing Toolbox), then they use mpich_staged. If a software suite can spread the job amongst processors from multiple compute nodes (e.g., Hello World, VASP, LAMMPS, etc.), then they use mpich_unstaged.
> > | > |
> > | > | I am including their definitions below.
> > | > |
> > | > | pe_name            mpich_unstaged
> > | > | slots              9999
> > | > | user_lists         NONE
> > | > | xuser_lists        NONE
> > | > | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > | > | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> > | > | allocation_rule    $fill_up
> > | > | control_slaves     TRUE
> > | > | job_is_first_task  FALSE
> > | > | urgency_slots      min
> > | > | accounting_summary TRUE
> > | > |
> > | > | pe_name            mpich_staged
> > | > | slots              9999
> > | > | user_lists         NONE
> > | > | xuser_lists        NONE
> > | > | start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> > | > | stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> > | > | allocation_rule    $pe_slots
> > | > | control_slaves     TRUE
> > | > | job_is_first_task  FALSE
> > | > | urgency_slots      min
> > | > | accounting_summary TRUE
> > | > |
> > | > | Best regards,
> > | > | g
> > | > |
> > | > | --
> > | > | Gowtham, PhD
> > | > | Director of Research Computing, IT
> > | > | Adj. Asst. Professor, Physics/ECE
> > | > | Michigan Technological University
> > | > |
> > | > | P: (906) 487-3593
> > | > | F: (906) 487-2787
> > | > | http://it.mtu.edu
> > | > | http://hpc.mtu.edu
> > | > |
> > | > | On Thu, 17 Dec 2015, Reuti wrote:
> > | > |
> > | > | | > Am 16.12.2015 um 21:32 schrieb Gowtham <[email protected]>:
> > | > | | >
> > | > | | > Hi Reuti,
> > | > | | >
> > | > | | > The MPI associated with Intel Cluster Studio 2013.0.028 is 4.1.0.024, and I do not need mpdboot. The PE used for this purpose is called mpich_unstaged (basically, a copy of the original mpich PE with the '$fill_up' rule). The only other PE on this system is called mpich_staged (a copy of the original mpich PE with the '$pe_slots' rule).
> > | > | |
> > | > | | I have no clue what mpich_(un)staged refers to; I assumed it was some setting from ROCKS. Can you please post the particular PE settings you want to use during submission?
> > | > | |
> > | > | | -- Reuti
> > | > | |
> > | > | | > The same Go Huskies! program, compiled with the same Intel Cluster Studio on a different cluster running the same Rocks 6.1 and Grid Engine 2011.11p1 combination and using the same mpich_unstaged PE, works successfully.
> > | > | | >
> > | > | | > Best regards,
> > | > | | > g
> > | > | | >
> > | > | | > --
> > | > | | > Gowtham, PhD
> > | > | | > Director of Research Computing, IT
> > | > | | > Adj. Asst. Professor, Physics/ECE
> > | > | | > Michigan Technological University
> > | > | | >
> > | > | | > P: (906) 487-3593
> > | > | | > F: (906) 487-2787
> > | > | | > http://it.mtu.edu
> > | > | | > http://hpc.mtu.edu
> > | > | | >
> > | > | | > On Wed, 16 Dec 2015, Reuti wrote:
> > | > | | >
> > | > | | > | Hi,
> > | > | | > |
> > | > | | > | Am 16.12.2015 um 19:53 schrieb Gowtham:
> > | > | | > |
> > | > | | > | > Dear fellow Grid Engine users,
> > | > | | > | >
> > | > | | > | > Over the past few days, I have had to re-install compute nodes (12 cores each) in an existing cluster running Rocks 6.1 and Grid Engine 2011.11p1. I ensured the extend-*.xml files had no errors in them using the xmllint command before rebuilding the distribution. All six compute nodes installed successfully, and so did several test "Hello, World!" runs up to 72 cores. I can SSH into any one of these nodes, and SSH between any two compute nodes just fine.
> > | > | | > | >
> > | > | | > | > As of this morning, all submitted jobs that require more than 12 cores (i.e., spanning more than one compute node) fail about a minute after starting successfully. However, all jobs with 12 or fewer cores within a given compute node run just fine. The error message for a failed job is as follows:
> > | > | | > | >
> > | > | | > | >   error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
> > | > | | > | >   Ctrl-C caught... cleaning up processes
> > | > | | > | >
> > | > | | > | > "Hello, World!" and one other program, both compiled with Intel Cluster Studio 2013.0.028, display the same behavior. The line corresponding to the failed job from /opt/gridengine/default/spool/qmaster/messages is as follows:
> > | > | | > | >
> > | > | | > | >   12/16/2015 11:15:36|worker|athena|E|tightly integrated parallel task 6129.1 task 1.compute-0-1 failed - killing job
> > | > | | > | >
> > | > | | > | > I'd appreciate any insight or help to resolve this issue. If you need additional information from my end, please let me know.
> > | > | | > |
> > | > | | > | What plain version of Intel MPI is in Cluster Studio 2013.0.028? Less than 4.1? IIRC, a tight integration was not supported before that version, as there was no call to `qrsh` set up automatically; you would need to start certain daemons beforehand.
> > | > | | > |
> > | > | | > | Does your version still need mpdboot?
> > | > | | > |
> > | > | | > | Do you request a properly set up PE in your job submission?
> > | > | | > |
> > | > | | > | -- Reuti
> > | > | | > |
> > | > | | > | > Thank you for your time and help.
> > | > | | > | >
> > | > | | > | > Best regards,
> > | > | | > | > g
> > | > | | > | >
> > | > | | > | > --
> > | > | | > | > Gowtham, PhD
> > | > | | > | > Director of Research Computing, IT
> > | > | | > | > Adj. Asst. Professor, Physics/ECE
> > | > | | > | > Michigan Technological University
> > | > | | > | >
> > | > | | > | > P: (906) 487-3593
> > | > | | > | > F: (906) 487-2787
> > | > | | > | > http://it.mtu.edu
> > | > | | > | > http://hpc.mtu.edu
> > | > | | > | >
> > | > | | > | > _______________________________________________
> > | > | | > | > users mailing list
> > | > | | > | > [email protected]
> > | > | | > | > https://gridengine.org/mailman/listinfo/users
> > | >
> > | > _______________________________________________
> > | > users mailing list
> > | > [email protected]
> > | > https://gridengine.org/mailman/listinfo/users
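For completeness, the two PEs quoted above differ only in their allocation_rule, which is what decides whether a job may span hosts. A rough sketch of how they would be requested at submission time (the script names and slot counts are placeholders, not from the original posts):

  # $pe_slots: all granted slots must come from a single execution host
  qsub -pe mpich_staged 12 matlab_pct_job.sh

  # $fill_up: slots may be spread across hosts, filling each host in turn
  qsub -pe mpich_unstaged 16 hello_world_job.sh

  # The live definition of either PE can be checked with:
  qconf -sp mpich_unstaged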
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
