Re: [OMPI users] How to check OMPI is using IB or not?
- "Sangamesh B" <forum@gmail.com> wrote:
> Hi all,
>
> If an infiniband network is configured successfully, how to confirm
> that Open MPI is using infiniband, not other ethernet network
> available?
>

At a low level simplistic way, how about:

[root@tango003 ~]# lsof | grep /dev/infiniband
namd2 7271 weimin mem CHR 231,192 8306 /dev/infiniband/uverbs0
namd2 7271 weimin 13u CHR 231,192 8306 /dev/infiniband/uverbs0
...

Here i can see that the namd that I compiled with openmpi is using IB.

cheers,

/ Brett

--
Brett Pemberton - VPAC HPC Team Leader
http://www.vpac.org/ - (03) 9925 4899
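The lsof check in the message above can be scripted for use across many nodes. What follows is a minimal sketch in Python, assuming lsof's default column layout (command name first, PID second, file name last) and command names without spaces. The function names `ib_users` and `scan_host` are made up for illustration and are not part of lsof or Open MPI. An open /dev/infiniband/uverbs* handle suggests, but does not prove, that the job's traffic goes over IB.

```python
import subprocess


def ib_users(lsof_output):
    """Map PID -> command name for processes holding a file under /dev/infiniband/.

    Assumes lsof's default layout: COMMAND first, PID second, NAME last.
    """
    users = {}
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[-1].startswith("/dev/infiniband/"):
            continue
        if not fields[1].isdigit():
            continue  # header line or an unexpected layout
        users[int(fields[1])] = fields[0]
    return users


def scan_host():
    """Run lsof on this host (needs lsof on the PATH; run as root to see all users)."""
    result = subprocess.run(["lsof"], capture_output=True, text=True, check=False)
    return ib_users(result.stdout)
```

Calling `scan_host()` on a compute node while a job is running returns a dictionary such as PID to `namd2`; an empty dictionary means no process on that node has the verbs device open.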
Re: [OMPI users] local config files / recursive includes
Jeff,

Didn't even think of that!

Yes, when I unload the pointless ib kernel drivers on the non-ib nodes,
it quiets the warnings.

thanks for that,

/ Brett

- "Jeff Squyres" <jsquy...@cisco.com> wrote:
> Starting in the v1.3 series, we put in slightly better checks about
> when to issue this warning or not. It *shouldn't* display the
> warnings if the OpenFabrics drivers are not loaded.
>
> Can you verify if your OpenFabrics drivers are loaded on your new,
> non-IB nodes?
>
> On May 8, 2009, at 4:40 AM, Brett Pemberton wrote:
>
> > Hey,
> >
> > We have a cluster with infiniband, and openmpi working happily.
> > We've just added some new nodes, with no ib. The scheduler has been
> > told to only schedule jobs onto those nodes, which don't span
> > nodes. Easy.
> >
> > Except that openmpi warns the user that no openib was found, and
> > it's dropping back to another transport (possibly at a penalty).
> > This is no problem to me, but it worries our users for no reason.
> >
> > My plan was to put some local openmpi mca config files on those
> > nodes that only allow sm,self,tcp which (I'd hope) would eliminate
> > the warning that it can't use openib. However our openmpi installs
> > are to a global fs.
> >
> > Is it possible to put a line in the global
> > $SYSCONFDIR/etc/openmpi-mca-params.conf to tell it to also include
> > a subsequent /etc/openmpi-mca-params.conf
> >
> > Any better ways of handling this would also be appreciated.
> >
> > cheers,
> >
> > / Brett
>
> --
> Jeff Squyres
> Cisco Systems

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
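Jeff's question above (are the OpenFabrics drivers loaded on the non-IB nodes?) can be answered by reading /proc/modules on each node. The sketch below is one way to do that in Python. The set of module names is an assumption covering a few common OpenFabrics modules and should be adjusted for the hardware in use; the function names are hypothetical.

```python
from pathlib import Path

# Assumption: a few common OpenFabrics kernel module names, not an exhaustive list.
OPENFABRICS_MODULES = {"ib_core", "ib_uverbs", "ib_mthca", "mlx4_ib", "rdma_cm"}


def loaded_openfabrics_modules(proc_modules_text):
    """Return the sorted OpenFabrics module names found in /proc/modules-style text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(loaded & OPENFABRICS_MODULES)


def check_this_node(path="/proc/modules"):
    """Read the live module list on a Linux node."""
    return loaded_openfabrics_modules(Path(path).read_text())
```

An empty list from `check_this_node()` on a non-IB node would match the state in which, per the message above, the v1.3 series should no longer print the warning.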
[OMPI users] local config files / recursive includes
Hey,

We have a cluster with infiniband, and openmpi working happily.
We've just added some new nodes, with no ib. The scheduler has been
told to only schedule jobs onto those nodes, which don't span nodes.
Easy.

Except that openmpi warns the user that no openib was found, and it's
dropping back to another transport (possibly at a penalty). This is no
problem to me, but it worries our users for no reason.

My plan was to put some local openmpi mca config files on those nodes
that only allow sm,self,tcp which (I'd hope) would eliminate the
warning that it can't use openib. However our openmpi installs are to
a global fs.

Is it possible to put a line in the global
$SYSCONFDIR/etc/openmpi-mca-params.conf to tell it to also include a
subsequent /etc/openmpi-mca-params.conf

Any better ways of handling this would also be appreciated.

cheers,

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
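One node-local alternative to a second config file is the environment: Open MPI also reads MCA parameters from variables named OMPI_MCA_<parameter>, so a node without IB could set OMPI_MCA_btl to the sm,self,tcp list from the message above. The following is a rough sketch in Python of how a node-local login script might decide this. It assumes that an empty or missing /sys/class/infiniband means the node has no usable IB device, and the function names are hypothetical.

```python
import os


def btl_for_node(ib_sysfs_dir="/sys/class/infiniband"):
    """Return a BTL list for nodes without IB devices, or None to keep the defaults."""
    try:
        has_ib = bool(os.listdir(ib_sysfs_dir))
    except OSError:
        has_ib = False  # directory missing: IB core driver not loaded
    return None if has_ib else "sm,self,tcp"


def shell_export(btl):
    """Render the setting as a shell line that a node-local profile script could eval."""
    return "" if btl is None else "export OMPI_MCA_btl=" + btl
```

Printing `shell_export(btl_for_node())` from a script under the node's own /etc (a hypothetical location, outside the globally mounted install) keeps the decision per node without touching the shared openmpi-mca-params.conf.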
Re: [OMPI users] openib RETRY EXCEEDED ERROR
Matt Hughes wrote:
> 2009/2/26 Brett Pemberton <br...@vpac.org>:
> > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
>
> What OS are you using?

Centos 5

> I've seen this error and many other Infiniband related errors on
> RedHat enterprise linux 4 update 4, with ConnectX cards and various
> versions of OFED, up to version 1.3. Depending on the MCA parameters,
> I also see hangs often enough to make native Infiniband unusable on
> this OS.

I'd appreciate some advice on if I'm using OFED correctly.

I'm running OFED 1.4, however not the kernel modules, just userland.
Is this a bad idea?

Basically, I recompile the ofed src.rpms for:

dapl, libibcm, libibcommon, libibmad, libibumad, libibverbs, libmthca,
librdmacm, libsdp, mstflint

And install onto CentOS, upgrading the in-distro versions.

Should I also be compiling ofa_kernel? Could this be causing problems?

As explained off-list, I'm running the most recent firmware for my
cards, although the release is quite old:

hca_id: mthca0
        fw_ver:             1.2.0
        node_guid:          0002:c902:0024:3c6c
        sys_image_guid:     0002:c902:0024:3c6f
        vendor_id:          0x02c9
        vendor_part_id:     25204
        hw_ver:             0xA0
        board_id:           MT_03B0140001
        phys_port_cnt:      1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       34
                        port_lmc:       0x00

cheers,

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
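When chasing an error like this between two hosts, it can help to compare the device listing above on both ends. The sketch below parses that kind of "key: value" listing (the layout matches what ibv_devinfo prints) into a dictionary so firmware version and port state can be compared by script. It assumes a single HCA with a single port, since later duplicate keys overwrite earlier ones; the function names are hypothetical. It will not diagnose RETRY EXCEEDED by itself, it only rules out an inactive port or mismatched firmware.

```python
def parse_devinfo(text):
    """Flatten 'key: value' lines into a dict; later duplicates overwrite earlier ones."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so GUIDs such as 0002:c902:... stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def port_is_active(info):
    """True when the parsed listing reports the port as PORT_ACTIVE."""
    return info.get("state", "").startswith("PORT_ACTIVE")
```

Running `parse_devinfo` on the listing from each node and comparing the `fw_ver` entries is a quick way to confirm that both ends of a failing connection run the same firmware.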
Re: [OMPI users] undefined symbol: tm_init
Ralph Castain wrote:
> On Feb 9, 2009, at 6:41 PM, Brett Pemberton wrote:
>
> > Hey,
> >
> > I've just installed OpenMPI 1.3 on our cluster, and am getting this
> > issue on jobs > 1 node.
> >
> > mpiexec: symbol lookup error: /usr/local/openmpi/1.3-pgi/lib/openmpi/mca_plm_tm.so: undefined symbol: tm_init
> >
> > As reported before, I saw someone saying that they solved this with:
> >
> > --enable-mca-static=plm:tm
> >
> > A new install using this configure option does work for me, but
> > only for code recompiled with this new mpicc. Existing code doesn't
> > spawn properly.
>
> No, it won't since the static libraries for tm plm component weren't
> linked directly into the code.

Ahh, didn't think of that.

> > As such, I'd much rather get the existing install working again.
> >
> > It was suggested that I need the torque libraries on the compute
> > nodes, which they are. However adding them to ld.so.conf has not
> > solved this, so I'm not sure what more needs to be done to solve
> > this without recompiling openmpi.
>
> I'm not sure what you mean by adding them to ld.so.conf. What you need
> to do is install the torque libraries on the compute node in the same
> absolute path where they reside on the node where OMPI was built. OMPI
> points the executable to look for that location.
>
> Other than that, there shouldn't be a problem.

This is what confuses me.

We export /usr/local from the mgt node to all compute nodes.
Both torque and openmpi are installed to /usr/local.

So why are we hitting this issue?

cheers,

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
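One way to narrow down the question above is to run ldd against mca_plm_tm.so on a compute node and see what it reports for the Torque library. The sketch below, in Python, parses ldd output and classifies the result. The soname libtorque.so.2 used in the usage note is only an illustrative guess, and the function names are hypothetical. If no Torque library is listed at all, one possible reading is that the plugin was built without a dynamic dependency on it, in which case neither ld.so.conf nor matching paths would supply tm_init; treat that as a hypothesis to verify, not a diagnosis.

```python
def parse_ldd(ldd_output):
    """Map each library name in ldd output to its resolved path, or None if not found."""
    libs = {}
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if not arrow:
            continue  # e.g. the dynamic loader line, which has no arrow
        target = target.strip()
        if target.startswith("not found"):
            libs[name] = None
        elif target.startswith("("):
            continue  # virtual object such as linux-vdso, no file on disk
        else:
            libs[name] = target.split(" (")[0]
    return libs


def torque_status(libs):
    """Classify how the Torque library shows up: 'not linked', 'missing' or 'resolved'."""
    torque = {name: path for name, path in libs.items() if "torque" in name}
    if not torque:
        return "not linked"
    if any(path is None for path in torque.values()):
        return "missing"
    return "resolved"
```

Feeding it the output of ldd on the plugin path from the error message gives "missing" when a line such as `libtorque.so.2 => not found` appears, "resolved" when the library is found under the exported /usr/local, and "not linked" when Torque does not appear in the list at all.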
[OMPI users] undefined symbol: tm_init
Hey,

I've just installed OpenMPI 1.3 on our cluster, and am getting this
issue on jobs > 1 node.

mpiexec: symbol lookup error: /usr/local/openmpi/1.3-pgi/lib/openmpi/mca_plm_tm.so: undefined symbol: tm_init

As reported before, I saw someone saying that they solved this with:

--enable-mca-static=plm:tm

A new install using this configure option does work for me, but only
for code recompiled with this new mpicc. Existing code doesn't spawn
properly.

As such, I'd much rather get the existing install working again.

It was suggested that I need the torque libraries on the compute
nodes, which they are. However adding them to ld.so.conf has not
solved this, so I'm not sure what more needs to be done to solve this
without recompiling openmpi.

Thanks in advance for any help.

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899