Greetings,

We have recently expanded our InfiniBand tree and are running into problems when all hosts are booted. Details are below. Please let me know if there is a more appropriate forum for this issue. Thanks!
With less than 600 hosts, everything seems to be working fine. With more than 650 or so, we start seeing the following symptoms:

  # ibdiagnet -o . -lw 4x -pc
  -I- Discovering ... 721 nodes (68 Switches & 653 CA-s) discovered.
  ...
  -I---------------------------------------------------
  -I- PM Counters Info
  -I---------------------------------------------------
  -E- Could not get PM info: "pmGetPortCounters 0x0139 1" failed 4 consecutive times.
  -E- Could not get PM info: "pmGetPortCounters 0x0139 4" failed 4 consecutive times.

There are 29 of those "Could not get PM info" errors.

Basic IB communication still works at this point, but after restarting the subnet manager, ping via IPoIB stops working between some of the switches, and a LOT of messages like the following show up in osm.log:

  Jul 16 22:32:13 795167 [41E02940] 0x01 -> __osm_pr_rcv_get_path_parms: ERR 1F07: Dead end on path to LID 0x9 from switch for GUID 0x000002c900000023
  Jul 16 22:36:04 895497 [45007940] 0x01 -> __osm_pr_rcv_get_path_parms: ERR 1F07: Dead end on path to LID 0x5D7 from switch for GUID 0x000002c900000052

I have tried modifying "opensm.conf" to include:

  LMC=0 (was 2)
  TIMEOUT=500 (was 200)

but that did not seem to help.

Subnet manager host is running
  CentOS-5.1, kernel 2.6.18-53.1.21.el5, OFED-1.3.1, OpenSM 3.1.11

Hosts are running either
  RHEL-4.4, kernel 2.6.20.20, OFED-1.2.5.1
  CentOS-5.1, kernel 2.6.22.19, OFED-1.3.1
  Storage vendor OS based on CentOS, kernel 2.6.9-42.0.10.ELsmp, OFED-1.2.5.1

Can anyone suggest a fix or other diagnostics we can run to help narrow down the problem?

-Nathan

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
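
One diagnostic that might help narrow this down is to tally the "ERR 1F07" entries in osm.log by destination LID and by switch GUID, to see whether the dead ends cluster around a few switches or are spread across the fabric. The following is only a rough sketch: the regular expression is modeled on the two log lines quoted above and may not match other OpenSM versions, and the names `tally_dead_ends`, `ERR_1F07` and `SAMPLE` are made up for illustration. The log file path is passed on the command line, since its location depends on the installation.

```python
import re
import sys
from collections import Counter

# Pattern modeled on the two osm.log lines quoted in this post (assumption:
# other OpenSM versions may word this message differently).
ERR_1F07 = re.compile(
    r"ERR 1F07: Dead end on path to LID (0x[0-9A-Fa-f]+) "
    r"from switch for GUID (0x[0-9A-Fa-f]+)"
)

# The two sample lines from the post, used when no log file is given.
SAMPLE = [
    "Jul 16 22:32:13 795167 [41E02940] 0x01 -> __osm_pr_rcv_get_path_parms: "
    "ERR 1F07: Dead end on path to LID 0x9 from switch for GUID 0x000002c900000023",
    "Jul 16 22:36:04 895497 [45007940] 0x01 -> __osm_pr_rcv_get_path_parms: "
    "ERR 1F07: Dead end on path to LID 0x5D7 from switch for GUID 0x000002c900000052",
]


def tally_dead_ends(lines):
    """Count ERR 1F07 entries per destination LID and per switch GUID."""
    by_lid = Counter()
    by_guid = Counter()
    for line in lines:
        match = ERR_1F07.search(line)
        if match:
            # Normalize hex case so 0x5D7 and 0x5d7 are counted together.
            by_lid[match.group(1).lower()] += 1
            by_guid[match.group(2).lower()] += 1
    return by_lid, by_guid


if __name__ == "__main__":
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log_file:
            lids, guids = tally_dead_ends(log_file)
    else:
        lids, guids = tally_dead_ends(SAMPLE)

    print("Dead ends by switch GUID:")
    for guid, count in guids.most_common(20):
        print(f"  {guid}  {count}")
    print("Dead ends by destination LID:")
    for lid, count in lids.most_common(20):
        print(f"  {lid}  {count}")
```

If most of the entries name the same handful of switch GUIDs, those switches and their uplinks would be the first place to look. An even spread would point more toward the subnet manager or the sweep as a whole.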