On 11/29/12 5:52 PM, Duke Nguyen wrote:
> On 11/28/12 1:56 AM, Gus Correa wrote:
>> On 11/27/2012 01:52 PM, Gus Correa wrote:
>>> On 11/27/2012 02:14 AM, Duke Nguyen wrote:
>>>> On 11/27/12 1:44 PM, Christopher Samuel wrote:
>>>>> On 27/11/12 15:51, Duke Nguyen wrote:
>>>>>
>>>>>> Thanks! Yes, I am trying to get the system to work with
>>>>>> Torque/Maui/OpenMPI now.
>>>>>
>>>>> Make sure you build Open-MPI with support for Torque's TM interface;
>>>>> that will save you a lot of hassle, as it means mpiexec/mpirun will
>>>>> find out directly from Torque which nodes and processors have been
>>>>> allocated for the job.
>>>>
>>>> Christopher, how would I check that? I got Torque/Maui/OpenMPI up
>>>> and working with root (not with normal users yet :( !!!), tried
>>>> mpirun, and it worked fine:
>>
>> PS - Do 'qsub myjob' as a regular user, not as root.
>>
>>>> # /usr/lib64/openmpi/bin/mpirun -pernode --hostfile
>>>> /home/mpiwulf/.openmpihostfile /home/mpiwulf/test/mpihello
>>>> Hello world! I am process number: 3 on host node0118
>>>> Hello world! I am process number: 1 on host node0104
>>>> Hello world! I am process number: 0 on host node0103
>>>> Hello world! I am process number: 2 on host node0117
>>>>
>>>> Thanks,
>>>>
>>>> D.
>>>
>>> D.
>>>
>>> Try to omit the hostfile from your mpirun command line,
>>> put it inside a Torque/PBS script, and submit it with qsub,
>>> like this:
>>>
>>> *********************************
>>> myPBSScript.tcsh
>>> *********************************
>>> #! /bin/tcsh
>>> #PBS -l nodes=2:ppn=8 [Assuming your Torque 'nodes' file has np=8]
>>> #PBS -q [email protected]
>>> #PBS -N hello
>>> @ NP = `cat $PBS_NODEFILE | wc -l`
>>> mpirun -np ${NP} ./mpihello
>>> *********************************
>>>
>>> $ qsub myPBSScript.tcsh
>>>
>>> If OpenMPI was built with Torque support,
>>> the job will run on the nodes/processors allocated by Torque.
>>> [The nodes/processors are listed in $PBS_NODEFILE,
>>> but you don't need to refer to it in the mpirun line if
>>> OpenMPI was built with Torque support. If OpenMPI lacks
>>> Torque support, then you can use $PBS_NODEFILE as your hostfile:
>>> mpirun -hostfile $PBS_NODEFILE.]
>>>
>>> If Torque was installed in a standard place, say under /usr,
>>> then OpenMPI's configure will pick it up automatically.
>>> If it is not in a standard location, then add
>>> --with-tm=/torque/directory
>>> to the OpenMPI configure line.
>>> [./configure --help is your friend!]
>>>
>>> Another check:
>>>
>>> $ ompi_info [tons of output that you can grep for "tm" to see
>>> if Torque was picked up.]
>
> OK, after a huge headache of torque/maui things, I finally found out
> that my master node's system was a mess :D. Multiple versions of
> torque (via yum, via src, etc.) were installed, which caused confusion
> for different users logging in (root or normal users) - mainly because
> I had followed different guides on the net. Then I decided to delete
> everything related to pbs (torque, maui, openmpi) and start from
> scratch. So I built torque rpms for the master/nodes and installed
> them, then built and installed a maui rpm with support for torque,
> then built an openmpi rpm with support for torque too.
> This time I think I got almost everything:
>
> [mpiwulf@biobos:~]$ ompi_info | grep tm
>     MCA ras: tm (MCA v2.0, API v2.0, Component v1.6.3)
>     MCA plm: tm (MCA v2.0, API v2.0, Component v1.6.3)
>     MCA ess: tm (MCA v2.0, API v2.0, Component v1.6.3)
>
> openmpi now works with infiniband:
>
> [mpiwulf@biobos:~]$ /usr/local/bin/mpirun -mca btl ^tcp -pernode
> --hostfile /home/mpiwulf/.openmpihostfile /home/mpiwulf/test/mpihello
> Hello world! I am process number: 3 on host node0118
> Hello world! I am process number: 1 on host node0104
> Hello world! I am process number: 2 on host node0117
> Hello world! I am process number: 0 on host node0103
>
> openmpi also works with torque:
>
> ----------------
> [mpiwulf@biobos:~]$ cat test/KCBATCH
> #!/bin/bash
> #
> #PBS -l nodes=6:ppn=1
> #PBS -N kcTEST
> #PBS -m be
> #PBS -e qsub.er.log
> #PBS -o qsub.ou.log
> #
> { time {
> /usr/local/bin/mpirun /home/mpiwulf/test/mpihello
> } } &>output.log
>
> [mpiwulf@biobos:~]$ qsub test/KCBATCH
> 21.biobos
>
> [mpiwulf@biobos:~]$ cat output.log
> --------------------------------------------------------------------------
> The OpenFabrics (openib) BTL failed to initialize while trying to
> allocate some locked memory. This typically can indicate that the
> memlock limits are set too low. For most HPC installations, the
> memlock limits should be set to "unlimited". The failure occured
> here:
>
>   Local host:    node0103
>   OMPI source:   btl_openib_component.c:1200
>   Function:      ompi_free_list_init_ex_new()
>   Device:        mthca0
>   Memlock limit: 65536
>
> You may need to consult with your system administrator to get this
> problem fixed. This FAQ entry on the Open MPI web site may also be
> helpful:
>
>   http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
>   Local host:   node0103
>   Local device: mthca0
> --------------------------------------------------------------------------
> Hello world! I am process number: 5 on host node0103
> Hello world! I am process number: 0 on host node0104
> Hello world! I am process number: 2 on host node0110
> Hello world! I am process number: 4 on host node0118
> Hello world! I am process number: 1 on host node0109
> Hello world! I am process number: 3 on host node0117
> [node0104:02221] 5 more processes have sent help message
> help-mpi-btl-openib.txt / init-fail-no-mem
> [node0104:02221] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> [node0104:02221] 5 more processes have sent help message
> help-mpi-btl-openib.txt / error in device init
>
> real    0m0.291s
> user    0m0.034s
> sys     0m0.043s
> ----------------
>
> Unfortunately I still get the "error registering openib memory"
> problem with non-interactive jobs. Any experience with this would be
> great.
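The FAQ entry named in that error message suggests raising the
locked-memory limit for all users in /etc/security/limits.conf,
roughly like this (a sketch; on some distros the same two lines go in
a file under /etc/security/limits.d/ instead):

    * soft memlock unlimited
    * hard memlock unlimited

The catch is that pam_limits only applies these at login time, so they
cover interactive SSH sessions but never reach pbs_mom, which is
started at boot. That would explain why interactive mpirun runs were
fine while the same job launched through Torque still saw the 65536
default.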
Got it now, though I *do not* really like the solution. I had to edit
the pbs_mom init script:

# vi /etc/rc.d/init.d/pbs_mom

and make sure it contains:

ulimit -l unlimited
#ulimit -n 32768

and now openib works fine :).

D.
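For anyone hitting the same problem: after restarting pbs_mom on every
node (e.g. service pbs_mom restart), a quick sanity check is a
throwaway job that just prints the limit. A minimal sketch, assuming
qsub's default naming for jobs submitted on stdin:

    echo 'ulimit -l' | qsub -l nodes=1
    # the job's output file (STDIN.o<jobid>) should now
    # contain "unlimited" instead of 65536

If it still shows the old value, the MOM on that node is probably
still running with its old limits.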
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf