Re: [OMPI users] How to check OMPI is using IB or not?

2010-01-27 Thread Brett Pemberton

- "Sangamesh B" <forum@gmail.com> wrote:

> Hi all,
> 
> If an infiniband network is configured successfully, how to confirm
> that Open MPI is using infiniband, not other ethernet network
> available?
> 

In a simplistic, low-level way, how about:

[root@tango003 ~]# lsof | grep /dev/infiniband
namd2  7271  weimin  mem   CHR  231,192        8306 /dev/infiniband/uverbs0
namd2  7271  weimin   13u  CHR  231,192        8306 /dev/infiniband/uverbs0
...

Here I can see that the namd2 binary I compiled with Open MPI is using IB.

cheers,

 / Brett

-- 
Brett Pemberton - VPAC HPC Team Leader
http://www.vpac.org/ - (03) 9925 4899
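
The lsof check above can also be scripted. What follows is a minimal sketch, assuming a Linux /proc filesystem and enough privileges to read other users' fd directories; the function name and its parameters are illustrative and not part of Open MPI. It only sees open file descriptors (the "13u" style entries), not memory mappings (the "mem" entries), and an open device file shows that a process initialised the verbs device, not that every message travels over IB.

```python
import os


def pids_using_infiniband(proc_root="/proc", dev_prefix="/dev/infiniband/"):
    """Map each PID to the InfiniBand device files it currently has open.

    proc_root and dev_prefix are parameters only so the function can be
    exercised against a fake directory tree.
    """
    found = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no /proc here (for example, not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        devices = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if target.startswith(dev_prefix):
                devices.add(target)
        if devices:
            found[int(entry)] = sorted(devices)
    return found
```

Calling pids_using_infiniband() as root on a compute node while a job is running should list the same processes that the lsof pipeline reports.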


Re: [OMPI users] local config files / recursive includes

2009-05-09 Thread Brett Pemberton

Jeff,

Didn't even think of that!

Yes, when I unload the pointless ib kernel drivers on the non-ib nodes, it 
quiets the warnings.

thanks for that,

 / Brett


- "Jeff Squyres" <jsquy...@cisco.com> wrote:

> Starting in the v1.3 series, we put in slightly better checks about  
> when to issue this warning or not.  It *shouldn't* display the  
> warnings if the OpenFabrics drivers are not loaded.
> 
> Can you verify if your OpenFabrics drivers are loaded on your new,
> non-IB nodes?
> 
> 
> On May 8, 2009, at 4:40 AM, Brett Pemberton wrote:
> 
> > Hey,
> >
> > We have a cluster with infiniband, and openmpi working happily.
> > We've just added some new nodes, with no ib.  The scheduler has been
> > told to only schedule jobs onto those nodes, which don't span
> > nodes.  Easy.
> >
> > Except that openmpi warns the user that no openib was found, and  
> > it's dropping back to another transport (possibly at a penalty).
> > This is no problem to me, but it worries our users for no reason.
> >
> > My plan was to put some local openmpi mca config files on those
> > nodes that only allow sm,self,tcp which (I'd hope) would eliminate
> > the warning that it can't use openib.  However our openmpi installs
> > are to a global fs.
> >
> > Is it possible to put a line in the global
> > $SYSCONFDIR/etc/openmpi-mca-params.conf to tell it to also include
> > a subsequent /etc/openmpi-mca-params.conf
> >
> > Any better ways of handling this would also be appreciated.
> >
> > cheers,
> >
> >  / Brett
> >
> > --
> > Brett Pemberton - VPAC Senior Systems Administrator
> > http://www.vpac.org/ - (03) 9925 4899
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems

-- 
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
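
Jeff's question (are the OpenFabrics drivers loaded on the non-IB nodes?) can be checked from the kernel's module list. This is a minimal sketch, assuming the text comes from /proc/modules on Linux; the set of module names is only an illustrative guess at what a node might load, so adjust it to the drivers actually in use.

```python
# Illustrative module names only; extend for the HCA drivers on your nodes.
IB_MODULES = {"ib_core", "ib_uverbs", "ib_mthca", "mlx4_ib"}


def loaded_ib_modules(modules_text, candidates=IB_MODULES):
    """Return the candidate modules that appear in /proc/modules-style text.

    Each line of /proc/modules starts with the module name, so only the
    first whitespace-separated field of every line is inspected.
    """
    loaded = set()
    for line in modules_text.splitlines():
        fields = line.split()
        if fields:
            loaded.add(fields[0])
    return sorted(loaded & set(candidates))
```

On a node, pass it the contents of /proc/modules, for example loaded_ib_modules(open("/proc/modules").read()). An empty list on a non-IB node would match the state in which, per Jeff's note, v1.3 should not print the warning.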


[OMPI users] local config files / recursive includes

2009-05-08 Thread Brett Pemberton

Hey,

We have a cluster with infiniband, and openmpi working happily.
We've just added some new nodes, with no ib.  The scheduler has been told to 
place only jobs that don't span nodes onto those nodes.  Easy.

Except that openmpi warns the user that no openib was found, and it's dropping 
back to another transport (possibly at a penalty).
This is no problem to me, but it worries our users for no reason.

My plan was to put some local openmpi mca config files on those nodes that only 
allow sm,self,tcp which (I'd hope) would eliminate the warning that it can't 
use openib.  However our openmpi installs are to a global fs.

Is it possible to put a line in the global 
$SYSCONFDIR/etc/openmpi-mca-params.conf to tell it to also include a subsequent 
/etc/openmpi-mca-params.conf?

Any better ways of handling this would also be appreciated.

cheers,

 / Brett

-- 
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
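
One alternative to a node-local config file is to set the MCA parameter through the environment, since Open MPI reads parameters from variables named OMPI_MCA_<param>. The sketch below assumes a job wrapper that runs on the node where mpirun starts, and assumes that /sys/class/infiniband lists an entry per HCA; both function names are hypothetical. Whether restricting the BTL list to sm,self,tcp actually silences the warning is the poster's hope, not something this sketch verifies.

```python
import os


def node_has_ib(sysfs_dir="/sys/class/infiniband"):
    """Guess whether this node has an IB device by looking for sysfs entries.

    Assumption: the directory holds one entry per HCA and is empty or
    absent on nodes without one.
    """
    try:
        return bool(os.listdir(sysfs_dir))
    except OSError:
        return False


def mca_env_for_node(base_env, has_ib):
    """Return a copy of base_env with the BTL list restricted on non-IB nodes."""
    env = dict(base_env)  # never modify the caller's environment in place
    if not has_ib:
        # Same transport list as proposed in the message above.
        env["OMPI_MCA_btl"] = "sm,self,tcp"
    return env
```

A wrapper could then launch the job with something like subprocess.call(cmd, env=mca_env_for_node(os.environ, node_has_ib())). This only suits jobs that stay on one node, which is the case described here.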


Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-03-01 Thread Brett Pemberton

Matt Hughes wrote:
> 2009/2/26 Brett Pemberton <br...@vpac.org>:
> > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> > number 12 for wr_id 38996224 opcode 0 qp_idx 0
>
> What OS are you using?

Centos 5

> I've seen this error and many other Infiniband
> related errors on RedHat enterprise linux 4 update 4, with ConnectX
> cards and various versions of OFED, up to version 1.3.  Depending on
> the MCA parameters, I also see hangs often enough to make native
> Infiniband unusable on this OS.



I'd appreciate some advice on whether I'm using OFED correctly.

I'm running OFED 1.4, but only the userland, not the kernel modules.
Is this a bad idea?

Basically, I recompile the ofed src.rpms for:

dapl, libibcm, libibcommon, libibmad, libibumad, libibverbs, libmthca, 
librdmacm, libsdp, mstflint


and install them onto CentOS, upgrading the in-distro versions.
Should I also be compiling ofa_kernel?
Could this be causing problems?

As explained off-list, I'm running the most recent firmware for my 
cards, although the release is quite old:


hca_id: mthca0
        fw_ver:                 1.2.0
        node_guid:              0002:c902:0024:3c6c
        sys_image_guid:         0002:c902:0024:3c6f
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0xA0
        board_id:               MT_03B0140001
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       34
                        port_lmc:       0x00

cheers,

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
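
When these errors show up across many jobs, it can help to tally which node pairs are involved. This is a minimal sketch that pulls the fields out of an error line of the form quoted above; the regular expression is written against that one sample, so other Open MPI versions may word the message differently, and the function name is illustrative.

```python
import re

# Written against the single sample line in this thread; treat as an assumption.
OPENIB_ERROR = re.compile(
    r"\[(?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\] "
    r"from (?P<src>\S+) to: (?P<dst>\S+) "
    r"error polling (?P<cq>\w+) CQ with status (?P<status>.+?) "
    r"status number (?P<number>\d+)"
)


def parse_openib_error(text):
    """Extract source host, peer host and status from an openib error message.

    The message may be wrapped over several lines, so whitespace is
    collapsed before matching. Returns None if the text does not match.
    """
    flat = " ".join(text.split())
    match = OPENIB_ERROR.search(flat)
    if match is None:
        return None
    return {
        "src": match.group("src"),
        "dst": match.group("dst"),
        "status": match.group("status"),
        "number": int(match.group("number")),
    }
```

Feeding each job's stderr through parse_openib_error and counting the (src, dst) pairs with collections.Counter would show whether the failures cluster on particular hosts, which would point at a cable or port and away from the software stack.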





Re: [OMPI users] undefined symbol: tm_init

2009-02-11 Thread Brett Pemberton

Ralph Castain wrote:
> On Feb 9, 2009, at 6:41 PM, Brett Pemberton wrote:
>
> > Hey,
> >
> > I've just installed OpenMPI 1.3 on our cluster, and am getting this
> > issue on jobs > 1 node.
> >
> > mpiexec: symbol lookup error:
> > /usr/local/openmpi/1.3-pgi/lib/openmpi/mca_plm_tm.so: undefined
> > symbol: tm_init
> >
> > As reported before, I saw someone saying that they solved this with:
> > --enable-mca-static=plm:tm
> >
> > A new install using this configure option does work for me, but only
> > for code recompiled with this new mpicc.  Existing code doesn't spawn
> > properly.
>
> No, it won't since the static libraries for tm plm component weren't
> linked directly into the code.

Ahh, didn't think of that.

> > As such, I'd much rather get the existing install working again.
> >
> > It was suggested that I need the torque libraries on the compute
> > nodes, which they are.  However adding them to ld.so.conf has not
> > solved this, so I'm not sure what more needs to be done to solve this
> > without recompiling openmpi.
>
> I'm not sure what you mean by adding them to ld.so.conf. What you need
> to do is install the torque libraries on the compute node in the same
> absolute path where they reside on the node where OMPI was built. OMPI
> points the executable to look for that location.
>
> Other than that, there shouldn't be a problem.

This is what confuses me.
We export /usr/local from the mgt node to all compute nodes.
Both torque and openmpi are installed to /usr/local.

So why are we hitting this issue?

cheers,

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
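
One way to narrow this down is to look at what the dynamic loader resolves for mca_plm_tm.so on the node where mpiexec runs. The sketch below parses ldd output and reports whether a Torque library is listed at all; it assumes the TM symbols live in a shared library whose name starts with "libtorque", and the function names are illustrative.

```python
import subprocess


def parse_ldd(ldd_output):
    """Split ldd output into (resolved, missing).

    resolved maps library name to the path ldd printed (empty for entries
    such as the vdso that have no path); missing lists names reported as
    "not found". Lines without "=>" (the loader itself) are ignored.
    """
    resolved, missing = {}, []
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue
        name, _, rest = line.strip().partition("=>")
        name, rest = name.strip(), rest.strip()
        if rest.startswith("not found"):
            missing.append(name)
        elif rest.startswith("("):
            resolved[name] = ""  # only a load address, no file path
        else:
            resolved[name] = rest.split(" (")[0]
    return resolved, missing


def torque_linkage(ldd_output, prefix="libtorque"):
    """Classify how the Torque library shows up: resolved, missing or not linked."""
    resolved, missing = parse_ldd(ldd_output)
    if any(name.startswith(prefix) for name in missing):
        return "missing"
    if any(name.startswith(prefix) for name in resolved):
        return "resolved"
    return "not linked"


def ldd_output_for(path):
    """Run ldd on a shared object and return its standard output as text."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return result.stdout
```

For example, torque_linkage(ldd_output_for("/usr/local/openmpi/1.3-pgi/lib/openmpi/mca_plm_tm.so")) on the node where mpiexec runs. A result of "not linked" would be consistent with an undefined tm_init even though the Torque libraries are installed, because the component then carries no dependency for the loader to follow; "missing" would point back at the library path.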





[OMPI users] undefined symbol: tm_init

2009-02-09 Thread Brett Pemberton

Hey,

I've just installed OpenMPI 1.3 on our cluster, and am getting this 
issue on jobs > 1 node.


mpiexec: symbol lookup error: 
/usr/local/openmpi/1.3-pgi/lib/openmpi/mca_plm_tm.so: undefined symbol: 
tm_init


As reported before, I saw someone saying that they solved this with: 
--enable-mca-static=plm:tm


A new install using this configure option does work for me, but only for 
code recompiled with this new mpicc.  Existing code doesn't spawn properly.


As such, I'd much rather get the existing install working again.

It was suggested that I need the torque libraries on the compute nodes, 
which they are.  However adding them to ld.so.conf has not solved this, 
so I'm not sure what more needs to be done to solve this without 
recompiling openmpi.


Thanks in advance for any help.

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899


