On Thu, 11 Oct 2012 13:56:46 -0700 Josh England <jjen...@gmail.com> wrote:
> I'm upgrading a cluster to CentOS-6.2 running an OFED-1.5.4.1 IB
> stack. Every time a node tries to join the fabric, opensmd comes back
> with this:
>
> Oct 11 12:09:42 777493 [41F7700] 0x01 -> state_mgr_light_sweep_start:
> ERR 3315: Unknown remote side for node 0x0008f10500108bfa (Voltaire
> 4036 # p3r17i1) port 15. Adding to light sweep sampling list
> Oct 11 12:09:42 777532 [41F7700] 0x01 -> Directed Path Dump of 3 hop
> path: Path = 0,1,23,5
> Oct 11 12:09:43 578014 [37F6700] 0x01 ->
> log_send_error: ERR 5411: DR SMP Send completed with error
> (IB_TIMEOUT) -- dropping
> Method 0x1, Attr 0x15, TID 0x14a2
> Oct 11 12:09:43 578050 [37F6700] 0x01 -> Received SMP on a 4 hop path:
> Initial path = 0,1,23,5,15, Return path = 0,0,0,0,0
> Oct 11 12:09:43 578065 [37F6700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR
> 3113: MAD completed in error (IB_TIMEOUT): SubnGet(PortInfo), attr_mod
> 0x0, TID 0x14a2

First off do the errors continue? Or does OpenSM pick the nodes up on the
next sweep?

What does

    iblinkinfo -D 0,1,23,5

return?

Also does

    smpquery portinfo -D 0,1,23,5,15 1

fail? (Assuming that HCA is connected on port 1)

If so, perhaps try "-t 1000" to the smpquery command to give the node more
time to see if it is a timeout issue?

Ira

>
> These nodes work just fine on an older stack (CentOS-5.5,
> OFED-1.5.3.1), and I've been running the same stack that I'm trying to
> upgrade to (CentOS-6.2, OFED-1.5.4.1 with opensm 3.1.3.14) in
> production for months on other clusters. I've tried multiple versions
> of opensm already (both old and new). This cluster has slightly
> different hardware (including the HCAs), but why isn't the SM able to
> reach these nodes?
>
> ibv_devinfo (on the old stack) shows:
> hca_id: mlx4_0
>     transport:        InfiniBand (0)
>     fw_ver:           2.7.9294
>     node_guid:        78e7:d103:0021:6984
>     sys_image_guid:   78e7:d103:0021:6987
>     vendor_id:        0x02c9
>     vendor_part_id:   26438
>     hw_ver:           0xB0
>     board_id:         HP_0200000003
>     phys_port_cnt:    2
>         port: 1
>             state:        PORT_ACTIVE (4)
>             max_mtu:      2048 (4)
>             active_mtu:   2048 (4)
>             sm_lid:       10
>             port_lid:     306
>             port_lmc:     0x00
>             link_layer:   IB
>
>         port: 2
>             state:        PORT_DOWN (1)
>             max_mtu:      2048 (4)
>             active_mtu:   256 (1)
>             sm_lid:       0
>             port_lid:     0
>             port_lmc:     0x00
>             link_layer:   Ethernet
>
> Let me know and I can provide any information necessary to help debug.
>
> -JE
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov