Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2
> On May 11, 2016, at 12:32 PM, Stephan Budach wrote:
>
> I will try to get one node free of all services running on it, as I will have
> to reboot the system, since I will have to change the ixgbe.conf, won't I?
> This is an RSF-1 host, so this will likely be done over the weekend.

You can use dladm on a live system:

dladm set-linkprop -p flowctrl=no ixgbeN

where ixgbeN is each of your ixgbe interfaces (probably ixgbe0 and ixgbe1).

/dale

___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
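For anyone following along, Dale's suggestion could be applied to both links roughly as below. This is only a sketch: the function name `disable_flowctrl` and the `DLADM` override variable are made up for illustration (the override exists so the commands can be printed with `DLADM=echo` instead of executed), and the link names are the ones guessed at in the thread.

```shell
#!/bin/sh
# Sketch: disable Ethernet flow control on live links, then show the result.
# DLADM is a hypothetical override, e.g. DLADM=echo for a dry run.
DLADM=${DLADM:-dladm}

disable_flowctrl() {
    for link in "$@"; do
        # Same command as in the message above, once per link.
        $DLADM set-linkprop -p flowctrl=no "$link" || return 1
        # Print the property afterwards so the change can be confirmed.
        $DLADM show-linkprop -p flowctrl "$link"
    done
}

# Example (run as root on the host): disable_flowctrl ixgbe0 ixgbe1
```

As far as I know, a change made without `-t` is kept across reboots, so it should not be necessary to edit ixgbe.conf as well; check dladm's man page on your release before relying on that.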
Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2
On 11.05.16 at 16:48, Dale Ghent wrote:
>> On May 11, 2016, at 7:36 AM, Stephan Budach wrote:
>>
>> On 09.05.16 at 20:43, Dale Ghent wrote:
>>>> On May 9, 2016, at 2:04 PM, Stephan Budach wrote:
>>>>
>>>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>>>> On May 9, 2016, at 8:24 AM, Stephan Budach wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a strange behaviour where OmniOS omnios-r151018-ae3141d will break the LACP aggr-link on different boxes, when Intel X540-T2s are involved. It first starts with a couple of link downs/ups on one port and finally the link on that port negotiates to 1GbE instead of 10GbE, which then breaks the LACP channel on my Cisco Nexus for this connection.
>>>>>>
>>>>>> I have tried swapping and interchanging cables and thus switchports, but to no avail.
>>>>>>
>>>>>> Anyone else noticed this and even better… knows a solution to this?
>>>>> Was this an issue noticed only with r151018 and not with previous versions, or have you only tried this with 018?
>>>>>
>>>>> By your description, I presume that the two ixgbe physical links will stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>>>
>>>>> /dale
>>>> I have noticed that on prior versions of OmniOS as well, but we only recently started deploying 10GbE LACP bonds, when we introduced our Nexus gear to our network. I will have to check if both links stay at 10GbE, when not being configured as a LACP bond. Let me check that tomorrow and report back. As we're heading for a stretched DC, we are mainly configuring 2-way LACP bonds over our Nexus gear, so we don't actually have any single 10GbE connection, as they will all have to be connected to both DCs. This is achieved by using VPCs on our Nexus switches.
>>> Provide as much detail as you can - if you're using hw flow control, whether both links act this way at the same time or independently, and so on. Problems like this often boil down to a very small and seemingly insignificant detail.
>>>
>>> I currently have ixgbe on the operating table for adding X550 support, so I can take a look at this; however I don't have your type of switches available to me so LACP-specific testing is something I can't do for you.
>>>
>>> /dale
>> I checked the ixgbe.conf files on each host and they all are still at the standard setting, which includes flow_control = 3;
> Ah, so you are using ethernet flow control. Could you try disabling that on both sides (on the ixgbe host and on the switch) and see if that corrects the link stability issues? There's an outstanding issue with hw flow control on ixgbe that you *might* be running into regarding pause frame timing, which could manifest in the way you describe.
>
> /dale

I will try to get one node free of all services running on it, as I will have to reboot the system, since I will have to change the ixgbe.conf, won't I? This is an RSF-1 host, so this will likely be done over the weekend.

Thanks,
Stephan
Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2
> On May 11, 2016, at 7:36 AM, Stephan Budach wrote:
>
> On 09.05.16 at 20:43, Dale Ghent wrote:
>>> On May 9, 2016, at 2:04 PM, Stephan Budach wrote:
>>>
>>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>>> On May 9, 2016, at 8:24 AM, Stephan Budach wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a strange behaviour where OmniOS omnios-r151018-ae3141d will break the LACP aggr-link on different boxes, when Intel X540-T2s are involved. It first starts with a couple of link downs/ups on one port and finally the link on that port negotiates to 1GbE instead of 10GbE, which then breaks the LACP channel on my Cisco Nexus for this connection.
>>>>>
>>>>> I have tried swapping and interchanging cables and thus switchports, but to no avail.
>>>>>
>>>>> Anyone else noticed this and even better… knows a solution to this?
>>>> Was this an issue noticed only with r151018 and not with previous versions, or have you only tried this with 018?
>>>>
>>>> By your description, I presume that the two ixgbe physical links will stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>>
>>>> /dale
>>> I have noticed that on prior versions of OmniOS as well, but we only recently started deploying 10GbE LACP bonds, when we introduced our Nexus gear to our network. I will have to check if both links stay at 10GbE, when not being configured as a LACP bond. Let me check that tomorrow and report back. As we're heading for a stretched DC, we are mainly configuring 2-way LACP bonds over our Nexus gear, so we don't actually have any single 10GbE connection, as they will all have to be connected to both DCs. This is achieved by using VPCs on our Nexus switches.
>> Provide as much detail as you can - if you're using hw flow control, whether both links act this way at the same time or independently, and so on. Problems like this often boil down to a very small and seemingly insignificant detail.
>>
>> I currently have ixgbe on the operating table for adding X550 support, so I can take a look at this; however I don't have your type of switches available to me so LACP-specific testing is something I can't do for you.
>>
>> /dale
> I checked the ixgbe.conf files on each host and they all are still at the standard setting, which includes flow_control = 3;

Ah, so you are using ethernet flow control. Could you try disabling that on both sides (on the ixgbe host and on the switch) and see if that corrects the link stability issues? There's an outstanding issue with hw flow control on ixgbe that you *might* be running into regarding pause frame timing, which could manifest in the way you describe.

/dale
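Before changing anything, it may help to record what each host's ixgbe.conf currently says. The helper below is a rough sketch, not anything from the driver: the name `flow_control_mode` is invented, the default path `/kernel/drv/ixgbe.conf` is assumed to be where the file lives on your install, and the meanings printed for 0 to 3 are my reading of the comments shipped in that file, so compare them with your own copy.

```shell
#!/bin/sh
# Sketch: report the uncommented flow_control setting from an ixgbe.conf file.
# The value-to-meaning mapping is an assumption; verify it against the
# comments in your own ixgbe.conf.
flow_control_mode() {
    conf=${1:-/kernel/drv/ixgbe.conf}
    # Keep only "flow_control = N;" lines that are not commented out,
    # and use the last one if there are several.
    val=$(sed -n 's/^[[:space:]]*flow_control[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)[[:space:]]*;.*$/\1/p' "$conf" | tail -1)
    case "$val" in
        0)  echo "0 (disabled)" ;;
        1)  echo "1 (receive only)" ;;
        2)  echo "2 (transmit only)" ;;
        3)  echo "3 (receive and transmit)" ;;
        "") echo "not set (driver default applies)" ;;
        *)  echo "$val (see the comments in $conf)" ;;
    esac
}

# Example: flow_control_mode /kernel/drv/ixgbe.conf
```

With the setting quoted above (flow_control = 3;) this would report "3 (receive and transmit)", which matches Dale's reading that flow control is on.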
Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2
On 11.05.16 at 14:50, Stephan Budach wrote:
> On 11.05.16 at 13:36, Stephan Budach wrote:
>> On 09.05.16 at 20:43, Dale Ghent wrote:
>>>> On May 9, 2016, at 2:04 PM, Stephan Budach wrote:
>>>>
>>>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>>>> On May 9, 2016, at 8:24 AM, Stephan Budach wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a strange behaviour where OmniOS omnios-r151018-ae3141d will break the LACP aggr-link on different boxes, when Intel X540-T2s are involved. It first starts with a couple of link downs/ups on one port and finally the link on that port negotiates to 1GbE instead of 10GbE, which then breaks the LACP channel on my Cisco Nexus for this connection.
>>>>>>
>>>>>> I have tried swapping and interchanging cables and thus switchports, but to no avail.
>>>>>>
>>>>>> Anyone else noticed this and even better… knows a solution to this?
>>>>> Was this an issue noticed only with r151018 and not with previous versions, or have you only tried this with 018?
>>>>>
>>>>> By your description, I presume that the two ixgbe physical links will stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>>>
>>>>> /dale
>>>> I have noticed that on prior versions of OmniOS as well, but we only recently started deploying 10GbE LACP bonds, when we introduced our Nexus gear to our network. I will have to check if both links stay at 10GbE, when not being configured as a LACP bond. Let me check that tomorrow and report back. As we're heading for a stretched DC, we are mainly configuring 2-way LACP bonds over our Nexus gear, so we don't actually have any single 10GbE connection, as they will all have to be connected to both DCs. This is achieved by using VPCs on our Nexus switches.
>>> Provide as much detail as you can - if you're using hw flow control, whether both links act this way at the same time or independently, and so on. Problems like this often boil down to a very small and seemingly insignificant detail.
>>>
>>> I currently have ixgbe on the operating table for adding X550 support, so I can take a look at this; however I don't have your type of switches available to me so LACP-specific testing is something I can't do for you.
>>>
>>> /dale
>> I checked the ixgbe.conf files on each host and they all are still at the standard setting, which includes flow_control = 3;
>>
>> So they all have flow control enabled. As for the Nexus config, all of those ports are still on standard ethernet ports and modifications have only been made globally to the switch.
>>
>> I will now have to yank the one port on one of the hosts from the aggr and configure it as a standalone port. Then we will see if it still receives the disconnects/reconnects and finally the negotiation to 1GbE instead of 10GbE. As this only seems to happen to the same port I never experienced other ports of the affected aggrs acting up. I also thought I noticed that those were always the "same" physical ports, that is the first port on the card (ixgbe0), but that might of course be a coincidence.
>>
>> Thanks,
>> Stephan
> Ok, so we can likely rule out LACP as a generic reason for this issue… After removing ixgbe0 from the aggr1, I plugged it into an unused port of my Nexus FEX and lo and behold, here we go:
>
> root@tr1206902:/root# tail -f /var/adm/messages
> May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1000 Mbps, full duplex
> May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
> May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex
> May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
> May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex
>
> So, after less than an hour, we had the first link-cycle on ixgbe0, albeit on another port, which has no LACP config whatsoever. I will monitor this for a while and see if we will get more of those.
>
> Thanks,
> Stephan

Ehh… and sorry, I almost forgot to paste the log from the Cisco Nexus switch:

2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-SPEED: Interface Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface Ethernet141/1/9, operational duplex mode changed to Full
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet141/1/9, operational Receive Flow Control state changed to off
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet141/1/9, operational Transmit Flow Control state changed to on
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_UP: Interface Ethernet141/1/9 is up in mode access
2016 May 11 14:07:29 gh79-nx-01 %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet141/1/9 is down (Link failure)
2016 May 11 14:07:45 gh79-nx-01 last message repeated 1 time
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-SPEED: Interface Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface
Re: [OmniOS-discuss] sudden reboot
You had a kernel panic. Can you share that vmdump.0 file?

Dan

Sent from my iPhone (typos, autocorrect, and all)

> On May 11, 2016, at 4:24 AM, Martijn Fennis wrote:
>
> Hi,
>
> I’m experiencing an unexpected reboot.
>
> System is Supermicro with ECC mem, QLogic FC and LSI SAS.
>
> Temperatures and voltages look OK.
>
> The message I find is about the express bus… but how to find the cause?
> Should I set something like IRQ-steering or so in the BIOS?
>
> May 11 10:01:09 ZFS01 savecore: [ID 570001 auth.error] reboot after panic: pcieb-0: PCI(-X) Express Fatal Error. (0x101)
> May 11 10:01:09 ZFS01 savecore: [ID 365739 auth.error] Saving compressed system crash dump in /var/crash/unknown/vmdump.0
>
> Thanks,
>
> Martijn
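If you want a first look at the dump yourself before sharing it, something along these lines is a common starting point. It is a sketch under assumptions: the function name `inspect_panic` and the `RUN` variable are invented for illustration (`RUN=echo` prints the commands instead of running them), and the directory is the one named in the savecore message above. It will not by itself identify the faulty device.

```shell
#!/bin/sh
# Sketch: expand a compressed crash dump and print basic panic information.
# RUN is a hypothetical hook: set RUN=echo to print commands without running them.
RUN=${RUN:-}

inspect_panic() {
    dumpdir=$1        # e.g. /var/crash/unknown
    n=${2:-0}         # dump number: 0 for vmdump.0

    # Expand vmdump.N into unix.N and vmcore.N in the same directory.
    $RUN savecore -vf "$dumpdir/vmdump.$n" "$dumpdir" || return 1

    # Panic summary, the panicking thread's stack, and the kernel message buffer.
    printf '::status\n::stack\n::msgbuf\n' |
        $RUN mdb "$dumpdir/unix.$n" "$dumpdir/vmcore.$n"

    # Fault management error reports, which may include PCIe error details.
    $RUN fmdump -eV
}

# Example (as root): inspect_panic /var/crash/unknown
```

The expanded files can be large, so make sure the crash directory has room before running it for real.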
Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2
On 11.05.16 at 13:36, Stephan Budach wrote:
> On 09.05.16 at 20:43, Dale Ghent wrote:
>>> On May 9, 2016, at 2:04 PM, Stephan Budach wrote:
>>>
>>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>>> On May 9, 2016, at 8:24 AM, Stephan Budach wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a strange behaviour where OmniOS omnios-r151018-ae3141d will break the LACP aggr-link on different boxes, when Intel X540-T2s are involved. It first starts with a couple of link downs/ups on one port and finally the link on that port negotiates to 1GbE instead of 10GbE, which then breaks the LACP channel on my Cisco Nexus for this connection.
>>>>>
>>>>> I have tried swapping and interchanging cables and thus switchports, but to no avail.
>>>>>
>>>>> Anyone else noticed this and even better… knows a solution to this?
>>>> Was this an issue noticed only with r151018 and not with previous versions, or have you only tried this with 018?
>>>>
>>>> By your description, I presume that the two ixgbe physical links will stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>>
>>>> /dale
>>> I have noticed that on prior versions of OmniOS as well, but we only recently started deploying 10GbE LACP bonds, when we introduced our Nexus gear to our network. I will have to check if both links stay at 10GbE, when not being configured as a LACP bond. Let me check that tomorrow and report back. As we're heading for a stretched DC, we are mainly configuring 2-way LACP bonds over our Nexus gear, so we don't actually have any single 10GbE connection, as they will all have to be connected to both DCs. This is achieved by using VPCs on our Nexus switches.
>> Provide as much detail as you can - if you're using hw flow control, whether both links act this way at the same time or independently, and so on. Problems like this often boil down to a very small and seemingly insignificant detail.
>>
>> I currently have ixgbe on the operating table for adding X550 support, so I can take a look at this; however I don't have your type of switches available to me so LACP-specific testing is something I can't do for you.
>>
>> /dale
> I checked the ixgbe.conf files on each host and they all are still at the standard setting, which includes flow_control = 3;
>
> So they all have flow control enabled. As for the Nexus config, all of those ports are still on standard ethernet ports and modifications have only been made globally to the switch.
>
> I will now have to yank the one port on one of the hosts from the aggr and configure it as a standalone port. Then we will see if it still receives the disconnects/reconnects and finally the negotiation to 1GbE instead of 10GbE. As this only seems to happen to the same port I never experienced other ports of the affected aggrs acting up. I also thought I noticed that those were always the "same" physical ports, that is the first port on the card (ixgbe0), but that might of course be a coincidence.
>
> Thanks,
> Stephan

Ok, so we can likely rule out LACP as a generic reason for this issue… After removing ixgbe0 from the aggr1, I plugged it into an unused port of my Nexus FEX and lo and behold, here we go:

root@tr1206902:/root# tail -f /var/adm/messages
May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1000 Mbps, full duplex
May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex
May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex

So, after less than an hour, we had the first link-cycle on ixgbe0, albeit on another port, which has no LACP config whatsoever. I will monitor this for a while and see if we will get more of those.

Thanks,
Stephan
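To keep an eye on link cycles like the ones above over a longer period, the mac NOTICE lines can be summarised per interface. The function below is a minimal sketch: `link_events` is a made-up name, it assumes the exact field layout of the log lines pasted in this message, and it needs a POSIX-compatible awk. It reports speeds exactly as logged and does not interpret them.

```shell
#!/bin/sh
# Sketch: count link up/down notices per interface and list the logged speeds.
# Assumes lines shaped like:
#   May 11 14:38:35 host mac: [ID n kern.info] NOTICE: ixgbe0 link down
link_events() {
    awk '
        $5 == "mac:" && $11 == "link" && $12 == "down" {
            seen[$10] = 1; down[$10]++
        }
        $5 == "mac:" && $11 == "link" && $12 == "up," {
            seen[$10] = 1; up[$10]++
            # Append the logged speed (field 13) to a comma-separated list.
            speeds[$10] = (speeds[$10] == "" ? $13 : speeds[$10] "," $13)
        }
        END {
            for (l in seen)
                printf "%s up=%d down=%d speeds=%s\n", l, up[l], down[l], speeds[l]
        }
    ' "$@"
}

# Example: link_events /var/adm/messages
```

Run against the five lines pasted above, it prints one summary line for ixgbe0 with three link-up and two link-down events.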
Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2
On 09.05.16 at 20:43, Dale Ghent wrote:
>> On May 9, 2016, at 2:04 PM, Stephan Budach wrote:
>>
>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>> On May 9, 2016, at 8:24 AM, Stephan Budach wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have a strange behaviour where OmniOS omnios-r151018-ae3141d will break the LACP aggr-link on different boxes, when Intel X540-T2s are involved. It first starts with a couple of link downs/ups on one port and finally the link on that port negotiates to 1GbE instead of 10GbE, which then breaks the LACP channel on my Cisco Nexus for this connection.
>>>>
>>>> I have tried swapping and interchanging cables and thus switchports, but to no avail.
>>>>
>>>> Anyone else noticed this and even better… knows a solution to this?
>>> Was this an issue noticed only with r151018 and not with previous versions, or have you only tried this with 018?
>>>
>>> By your description, I presume that the two ixgbe physical links will stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>
>>> /dale
>> I have noticed that on prior versions of OmniOS as well, but we only recently started deploying 10GbE LACP bonds, when we introduced our Nexus gear to our network. I will have to check if both links stay at 10GbE, when not being configured as a LACP bond. Let me check that tomorrow and report back. As we're heading for a stretched DC, we are mainly configuring 2-way LACP bonds over our Nexus gear, so we don't actually have any single 10GbE connection, as they will all have to be connected to both DCs. This is achieved by using VPCs on our Nexus switches.
> Provide as much detail as you can - if you're using hw flow control, whether both links act this way at the same time or independently, and so on. Problems like this often boil down to a very small and seemingly insignificant detail.
>
> I currently have ixgbe on the operating table for adding X550 support, so I can take a look at this; however I don't have your type of switches available to me so LACP-specific testing is something I can't do for you.
>
> /dale

I checked the ixgbe.conf files on each host and they all are still at the standard setting, which includes flow_control = 3;

So they all have flow control enabled. As for the Nexus config, all of those ports are still on standard ethernet ports and modifications have only been made globally to the switch.

I will now have to yank the one port on one of the hosts from the aggr and configure it as a standalone port. Then we will see if it still receives the disconnects/reconnects and finally the negotiation to 1GbE instead of 10GbE. As this only seems to happen to the same port I never experienced other ports of the affected aggrs acting up. I also thought I noticed that those were always the "same" physical ports, that is the first port on the card (ixgbe0), but that might of course be a coincidence.

Thanks,
Stephan
[OmniOS-discuss] sudden reboot
Hi,

I’m experiencing an unexpected reboot.

System is Supermicro with ECC mem, QLogic FC and LSI SAS.

Temperatures and voltages look OK.

The message I find is about the express bus… but how to find the cause? Should I set something like IRQ-steering or so in the BIOS?

May 11 10:01:09 ZFS01 savecore: [ID 570001 auth.error] reboot after panic: pcieb-0: PCI(-X) Express Fatal Error. (0x101)
May 11 10:01:09 ZFS01 savecore: [ID 365739 auth.error] Saving compressed system crash dump in /var/crash/unknown/vmdump.0

Thanks,

Martijn