On Thu, 11 Oct 2012 13:56:46 -0700
Josh England <jjen...@gmail.com> wrote:

> I'm upgrading a cluster to CentOS-6.2 running an OFED-1.5.4.1 IB
> stack.  Every time a node tries to join the fabric, opensmd comes back
> with this:
> 
> Oct 11 12:09:42 777493 [41F7700] 0x01 -> state_mgr_light_sweep_start:
> ERR 3315: Unknown remote side for node 0x0008f10500108bfa (Voltaire
> 4036 # p3r17i1) port 15. Adding to light sweep sampling list
> Oct 11 12:09:42 777532 [41F7700] 0x01 -> Directed Path Dump of 3 hop
> path: Path = 0,1,23,5
> Oct 11 12:09:43 578014 [37F6700] 0x01 -> log_send_error: ERR 5411: DR
> SMP Send completed with error (IB_TIMEOUT) -- dropping
>                         Method 0x1, Attr 0x15, TID 0x14a2
> Oct 11 12:09:43 578050 [37F6700] 0x01 -> Received SMP on a 4 hop path:
> Initial path = 0,1,23,5,15, Return path  = 0,0,0,0,0
> Oct 11 12:09:43 578065 [37F6700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
> 3113: MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod
> 0x0, TID 0x14a2

First off, do the errors continue, or does OpenSM pick the nodes up on the
next sweep?

What does iblinkinfo -D 0,1,23,5 return?

Also, does smpquery portinfo -D 0,1,23,5,15 1 fail?  (This assumes the HCA is
connected on its port 1.)

If it does, perhaps try adding "-t 1000" to the smpquery command to give the
node more time, to see whether it is a timeout issue.
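
To make the relationship between those three checks explicit, here is a
minimal sketch in Python.  It only builds the command lines from the directed
path in the log and does not run anything.  The function name, its parameters,
and the default port are hypothetical; the command forms are the ones given
above.

```python
def diag_commands(dr_path, hca_port=1, timeout_ms=1000):
    """Build the three checks for a directed-route path such as "0,1,23,5,15".

    hca_port and timeout_ms are illustrative defaults, not values taken
    from the cluster.
    """
    hops = dr_path.split(",")
    if len(hops) < 2 or not all(h.isdigit() for h in hops):
        raise ValueError("expected a comma-separated list of port numbers")

    # Dropping the last hop gives the path to the switch one hop before the node.
    switch_path = ",".join(hops[:-1])
    port = str(hca_port)

    return [
        # Link state as seen from the last switch on the path.
        ["iblinkinfo", "-D", switch_path],
        # PortInfo query sent along the full directed path.
        ["smpquery", "portinfo", "-D", dr_path, port],
        # The same query with an explicit timeout.
        ["smpquery", "-t", str(timeout_ms), "portinfo", "-D", dr_path, port],
    ]


for cmd in diag_commands("0,1,23,5,15"):
    print(" ".join(cmd))
```

If the first query times out and the one with "-t" succeeds, that points at a
slow response rather than an unreachable port.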

Ira

> 
> These nodes work just fine on an older stack (CentOS-5.5,
> OFED-1.5.3.1), and I've been running the same stack that I'm trying to
> upgrade to (CentOS-6.2, OFED-1.5.4.1 with opensm 3.1.3.14) in
> production for months on other clusters.  I've tried multiple versions
> of opensm already (both old and new).  This cluster has slightly
> different hardware (including the HCAs), but why isn't the SM able to
> reach these nodes?
> 
> ibv_devinfo (on the old stack) shows:
> hca_id:    mlx4_0
>     transport:            InfiniBand (0)
>     fw_ver:                2.7.9294
>     node_guid:            78e7:d103:0021:6984
>     sys_image_guid:            78e7:d103:0021:6987
>     vendor_id:            0x02c9
>     vendor_part_id:            26438
>     hw_ver:                0xB0
>     board_id:            HP_0200000003
>     phys_port_cnt:            2
>         port:    1
>             state:            PORT_ACTIVE (4)
>             max_mtu:        2048 (4)
>             active_mtu:        2048 (4)
>             sm_lid:            10
>             port_lid:        306
>             port_lmc:        0x00
>             link_layer:        IB
> 
>         port:    2
>             state:            PORT_DOWN (1)
>             max_mtu:        2048 (4)
>             active_mtu:        256 (1)
>             sm_lid:            0
>             port_lid:        0
>             port_lmc:        0x00
>             link_layer:        Ethernet
> 
> Let me know and I can provide any information necessary to help debug.
> 
> -JE
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov
