On Thursday 20 March 2008 15:41:40 Hal Rosenstock wrote:
> On Thu, 2008-03-20 at 15:33 +0100, Bernd Schubert wrote:
> > On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote:
> > > On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote:
> > > > On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote:
> > > > > On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
> > > > > > On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
> > > > > > > On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > on one of our systems we get a rather huge numbers of
> > > > > > > > RcvSwRelayErrors. All I find about RcvSwRelayErrors is
> > > > > > > >
> > > > > > > > "This counter can increase due to a valid network event"
> > > > > > > >
> > > > > > > > But what might cause?
> > > > > >
> > > > > > Ooops. This should have been "But what might cause it?"
> > > > > >
> > > > > > > Are you running IB multicast (e.g. IPoIB) ? That's the most
> > > > > > > common cause.
> > > > > >
> > > > > > IPoIB is up, but so far only used initially by lustre for initial
> > > > > > lnet o2ib setup, but then AFAIK not any more. I think some MPI
> > > > > > stacks/applications also do their intial connection using IPoIB.
> > > > > >
> > > > > > But in general, once these connections are established, IPoIB is
> > > > > > not much used anymore.
> > > > >
> > > > > The causes are:
> > > > > 1. DLID mapping
> > > > > 2. VL mapping
> > > > > 3. looping (out port = in port)
> > > > >
> > > > > Is your subnet unstable in some way ? Are you using QoS ?
> > > >
> > > > We have seen some odd problems with opensm (from ofef-1.2.5) in the
> > > > past and once only rebooting the switches did help.
> > >
> > > You might want to update OpenSM to OFED 1.3 version.
> >
> > I won't manage to build new debian packages today, but I will do over
> > Easter. Hope to also find the time to clean the debian rules a bit, to
> > have it officially included in Debian.
> >
> > But will a new opensm help for these errors?
>
> Perhaps; but not knowing more about the cause it's hard to say. It might
> be interesting to see if there are any errors in your OpenSM log.

Well, these opensm logs are a big mystery to me. I don't have the slightest
idea what it wants to tell me with this:

osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41

Is there documentation somewhere, or does only reading the source help?

Here are the logs from the last day:

Mar 19 17:20:53 463683 [44007960] -> SUBNET UP
Mar 20 10:22:27 864281 [44007960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x002E TID:0x000000000000001f
Mar 20 10:22:27 864533 [44007960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x002E GID:0xfe80000000000000,0x000b8cffff002b50
Mar 20 10:22:28 153211 [42003960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 10:22:28 153231 [42003960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c902002587c6 LID range [0xF7,0xF7] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 10:22:28 192987 [42003960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 10:22:28 270978 [44007960] -> SUBNET UP
Mar 20 10:25:50 333350 [44007960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x002E TID:0x0000000000000020
Mar 20 10:25:50 333579 [44007960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x002E GID:0xfe80000000000000,0x000b8cffff002b50
Mar 20 10:25:50 644817 [42003960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 10:25:50 644840 [42003960] -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x0002c902002587c6 LID range [0xF7,0xF7] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 10:25:50 679661 [42003960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 10:25:50 755437 [41001960] -> SUBNET UP
Mar 20 14:24:04 611501 [42003960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0020 TID:0x0000000000000051
Mar 20 14:24:04 611713 [42003960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
Mar 20 14:24:04 913422 [44808960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 14:24:04 913444 [44808960] -> __osm_drop_mgr_remove_port: Removed port with GUID:0x0002c9020025871d LID range [0x8,0x8] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 14:24:04 952959 [44808960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 14:24:05 027280 [41802960] -> SUBNET UP
Mar 20 14:26:49 795337 [41802960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0020 TID:0x0000000000000052
Mar 20 14:26:49 795578 [41802960] -> osm_report_notice: Reporting Generic Notice type:1 num:128 from LID:0x0020 GID:0xfe80000000000000,0x000b8cffff002b41
Mar 20 14:26:50 096861 [42804960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x001A GID:0xfe80000000000000,0x0002c9020025ae35
Mar 20 14:26:50 096874 [42804960] -> __osm_state_mgr_report_new_ports: Discovered new port with GUID:0x0002c9020025871d LID range [0x8,0x8] of node:MT25408 ConnectX Mellanox Technologies
Mar 20 14:26:50 131620 [42804960] -> osm_ucast_mgr_process: null (min-hop) tables configured on all switches
Mar 20 14:26:50 207641 [43806960] -> SUBNET UP
Mar 20 14:28:06 751962 [43806960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0008 TID:0x0000000000000000

Thanks again for your help,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
