I'm upgrading a cluster to CentOS-6.2 running an OFED-1.5.4.1 IB
stack.  Every time a node tries to join the fabric, opensmd comes back
with this:

Oct 11 12:09:42 777493 [41F7700] 0x01 -> state_mgr_light_sweep_start:
ERR 3315: Unknown remote side for node 0x0008f10500108bfa (Voltaire
4036 # p3r17i1) port 15. Adding to light sweep sampling list
Oct 11 12:09:42 777532 [41F7700] 0x01 -> Directed Path Dump of 3 hop
path: Path = 0,1,23,5
Oct 11 12:09:43 578014 [37F6700] 0x01 ->
log_send_error: ERR 5411: DR SMP Send completed with error
(IB_TIMEOUT) -- dropping
                        Method 0x1, Attr 0x15, TID 0x14a2
Oct 11 12:09:43 578050 [37F6700] 0x01 -> Received SMP on a 4 hop path:
Initial path = 0,1,23,5,15, Return path  = 0,0,0,0,0
Oct 11 12:09:43 578065 [37F6700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod
0x0, TID 0x14a2
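
In case it helps anyone reading along with a longer log, here is a rough
Python sketch that rejoins the mail-wrapped lines above into one string per
log entry and pulls out the ERR codes and directed-route paths.  It is not
part of OFED or opensm; the names (split_entries, summarize, SAMPLE) are my
own, and the entry-prefix pattern is only assumed from the excerpt above.

```python
import re

# Entry prefix as seen in the excerpt, e.g.
# "Oct 11 12:09:42 777493 [41F7700] 0x01 -> " (assumed format, not a spec).
ENTRY_START = re.compile(
    r"[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d+ \[[0-9A-Fa-f]+\] 0x[0-9A-Fa-f]+ -> "
)
ERR_CODE = re.compile(r"ERR ([0-9A-Fa-f]{4}):")
PATH = re.compile(r"[Pp]ath\s*=\s*(\d+(?:,\d+)*)")

# The log excerpt from this message, wrapped exactly as it was pasted.
SAMPLE = """Oct 11 12:09:42 777493 [41F7700] 0x01 -> state_mgr_light_sweep_start:
ERR 3315: Unknown remote side for node 0x0008f10500108bfa (Voltaire
4036 # p3r17i1) port 15. Adding to light sweep sampling list
Oct 11 12:09:42 777532 [41F7700] 0x01 -> Directed Path Dump of 3 hop
path: Path = 0,1,23,5Oct 11 12:09:43 578014 [37F6700] 0x01 ->
log_send_error: ERR 5411: DR SMP Send completed with error
(IB_TIMEOUT) -- dropping
                        Method 0x1, Attr 0x15, TID 0x14a2
Oct 11 12:09:43 578050 [37F6700] 0x01 -> Received SMP on a 4 hop path:
Initial path = 0,1,23,5,15, Return path  = 0,0,0,0,0
Oct 11 12:09:43 578065 [37F6700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod
0x0, TID 0x14a2
"""


def split_entries(text):
    """Collapse whitespace, then cut the text at every entry prefix."""
    flat = " ".join(text.split())
    starts = [m.start() for m in ENTRY_START.finditer(flat)]
    ends = starts[1:] + [len(flat)]
    return [flat[a:b].strip() for a, b in zip(starts, ends)]


def summarize(text):
    """Return one dict per entry with its ERR codes and any paths it mentions."""
    return [
        {
            "entry": entry,
            "errors": ERR_CODE.findall(entry),
            "paths": PATH.findall(entry),
        }
        for entry in split_entries(text)
    ]


if __name__ == "__main__":
    for item in summarize(SAMPLE):
        print(item["errors"], item["paths"])
```

Because the prefix pattern is not anchored to the start of a line, it also
separates entries that got fused together when the log was pasted.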

These nodes work just fine on an older stack (CentOS-5.5,
OFED-1.5.3.1), and I've been running the same stack that I'm trying to
upgrade to (CentOS-6.2, OFED-1.5.4.1 with opensm 3.1.3.14) in
production for months on other clusters.  I've tried multiple versions
of opensm already (both old and new).  This cluster has slightly
different hardware (including the HCAs).  Why isn't the SM able to
reach these nodes?

ibv_devinfo (on the old stack) shows:
hca_id:    mlx4_0
    transport:            InfiniBand (0)
    fw_ver:                2.7.9294
    node_guid:            78e7:d103:0021:6984
    sys_image_guid:            78e7:d103:0021:6987
    vendor_id:            0x02c9
    vendor_part_id:            26438
    hw_ver:                0xB0
    board_id:            HP_0200000003
    phys_port_cnt:            2
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        2048 (4)
            active_mtu:        2048 (4)
            sm_lid:            10
            port_lid:        306
            port_lmc:        0x00
            link_layer:        IB

        port:    2
            state:            PORT_DOWN (1)
            max_mtu:        2048 (4)
            active_mtu:        256 (1)
            sm_lid:            0
            port_lid:        0
            port_lmc:        0x00
            link_layer:        Ethernet

Let me know and I can provide any information necessary to help debug.

-JE
