Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On Wed, Dec 20, 2017 at 9:08 AM, Ben Nemec wrote:
> That's what Brian suggested too, but running it with instances on the node
> will cause an outage because it cleans up everything, including in-use
> ports. The reason a reboot works is basically that it causes this unit to
> run when the node comes back up, because it's a dep of the other services.
> So it's possible we could use this to skip the complete reboot, but that's
> not the time-consuming part of the process. It's waiting for all the
> instances to cycle off so we don't cause spurious failures when we wipe
> the OVS ports. Actually rebooting the nodes takes about five minutes (and
> it's only that long because of an old TripleO bug).

Ack. There are options you can pass to the cleanup so it doesn't nuke
everything. I wonder if a combination of ovs-cleanup plus restarting the
ovs-agent would do it? Anyway, it doesn't seem like that big of a problem
then.
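Roughly what I had in mind - untested, and assuming the cleanup unit from my
earlier mail is what's shipped on the computes under the usual names:

    systemctl start neutron-ovs-cleanup
    systemctl restart neutron-openvswitch-agent

Though per your point above, that still takes out in-use ports unless the
node has already been drained of instances.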
/me gets off his uptime soapbox

Joe
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On 12/19/2017 05:34 PM, Joe Talerico wrote:
> There are actually unit files for the cleanup tools (netns|ovs|lb),
> specifically for ovs-cleanup. Maybe this can be run to mitigate the need
> for a reboot?

That's what Brian suggested too, but running it with instances on the node
will cause an outage because it cleans up everything, including in-use
ports. The reason a reboot works is basically that it causes this unit to
run when the node comes back up, because it's a dep of the other services.
So it's possible we could use this to skip the complete reboot, but that's
not the time-consuming part of the process. It's waiting for all the
instances to cycle off so we don't cause spurious failures when we wipe the
OVS ports. Actually rebooting the nodes takes about five minutes (and it's
only that long because of an old TripleO bug).
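(For the curious, "waiting for the instances to cycle off" just means
watching each compute drain before touching it, roughly:

    nova list --all-tenants --host <compute-node>

run until it comes back empty. That's Mitaka-era novaclient syntax, so
adjust to taste on anything newer.)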
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins wrote:
> IIRC neutron-ovs-cleanup was being run following the reboot as part of an
> ExecStartPre= on one of the neutron services; this is essentially what
> removed the ports for us.

There are actually unit files for the cleanup tools (netns|ovs|lb),
specifically for ovs-cleanup [1]. Maybe this can be run to mitigate the
need for a reboot?
[1]

[Unit]
Description=OpenStack Neutron Open vSwitch Cleanup Utility
After=syslog.target network.target openvswitch.service
Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service neutron-l3-agent.service openstack-nova-compute.service

[Service]
Type=oneshot
User=neutron
ExecStart=/usr/bin/neutron-ovs-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup --log-file /var/log/neutron/ovs-cleanup.log
ExecStop=/usr/bin/neutron-ovs-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup --log-file /var/log/neutron/ovs-cleanup.log
PrivateTmp=true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
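The whole family is easy to spot on a node, assuming the standard RDO
packaging:

    systemctl list-unit-files | grep 'neutron-.*-cleanup'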
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On 19 December 2017 at 22:23, Brian Haley wrote:
> Crap, I thought there was an option to just clean up these dead devices.
> I should have read the code; it's either neutron ports (the default) or
> all ports. Maybe that should be an option.

IIRC neutron-ovs-cleanup was being run following the reboot as part of an
ExecStartPre= on one of the neutron services; this is essentially what
removed the ports for us.
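Should be easy enough to confirm on one of the computes whether it's wired
in as an ExecStartPre= on the agent or as a separate cleanup unit, e.g.
(assuming the usual unit names):

    systemctl cat neutron-openvswitch-agent.service
    systemctl cat neutron-ovs-cleanup.service

Either way the effect on boot is the same: the cleanup runs before the
agent comes back up.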
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On 12/19/2017 04:00 PM, Ben Nemec wrote:
> As in neutron-ovs-cleanup? Doesn't that wipe out everything, including
> any ports that are still in use? Or is there a different tool I'm not
> aware of that can do more targeted cleanup?

Crap, I thought there was an option to just clean up these dead devices. I
should have read the code; it's either neutron ports (the default) or all
ports. Maybe that should be an option.

-Brian
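P.S. From memory the existing knob is the ovs_all_ports boolean, so the two
modes are roughly (double-check the option name before relying on it):

    # default: remove only the ports neutron created on its bridges
    neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf
    # remove every port on every OVS bridge
    neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf --ovs_all_ports

Neither mode targets just the dead tag-4095 leftovers, hence "maybe that
should be an option".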
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On 12/19/2017 02:43 PM, Brian Haley wrote:
> Tag 4095 is for "dead" OVS ports; it's an unused VLAN tag in the agent.
>
> The 'qvo' here shows it's part of the veth pair that os-vif created when
> it plugged in the VM (the other half is 'qvb'), and they're created so
> that iptables rules can be applied by neutron.
>
> I wonder if you can run the ovs_cleanup utility to clean some of these
> up?

As in neutron-ovs-cleanup? Doesn't that wipe out everything, including any
ports that are still in use? Or is there a different tool I'm not aware of
that can do more targeted cleanup?

Oh, also worth noting that I don't think we have os-vif in this cloud
because it's so old. There's no os-vif package installed anyway.
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On 12/19/2017 11:53 AM, Ben Nemec wrote:
> Sometimes we see a huge number of ports in general, other times we see a
> lot of ports that look like this:
>
>     Port "qvod2cade14-7c"
>         tag: 4095
>         Interface "qvod2cade14-7c"
>
> Notably they all have a tag of 4095, which seems suspicious to me. I
> don't know whether it's actually an issue though.

Tag 4095 is for "dead" OVS ports; it's an unused VLAN tag in the agent.

The 'qvo' here shows it's part of the veth pair that os-vif created when it
plugged in the VM (the other half is 'qvb'), and they're created so that
iptables rules can be applied by neutron. It's part of the "old" way of
doing security groups with the OVSHybridIptablesFirewallDriver, and can
eventually go away once the OVSFirewallDriver can be used everywhere (which
requires a newer OVS and agent).

I wonder if you can run the ovs_cleanup utility to clean some of these up?

-Brian
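P.S. If you want to eyeball how bad it is on a given compute, something
along these lines should do it (from memory, so the exact ovs-vsctl syntax
may need tweaking, and it assumes the standard br-int integration bridge):

    # total ports hanging off the integration bridge
    ovs-vsctl list-ports br-int | wc -l
    # ports stuck with the dead VLAN tag
    ovs-vsctl --columns=name find Port tag=4095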
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
On Tue, Dec 19, 2017 at 9:53 AM, Ben Nemec wrote:
> The reboot is done (mostly...see below).
>
> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>> Ben - Can you provide some links to the ovs port exhaustion issue for
>> some background?
>
> I don't know if we ever had a bug opened, but there's some discussion of
> it in
> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
> I've also copied Derek since I believe he was the one who found it
> originally.
>
> The gist is that after about 3 months of tripleo-ci running in this cloud
> we start to hit errors creating instances because of problems creating
> OVS ports on the compute nodes. Sometimes we see a huge number of ports
> in general, other times we see a lot of ports that look like this:
>
>     Port "qvod2cade14-7c"
>         tag: 4095
>         Interface "qvod2cade14-7c"
>
> Notably they all have a tag of 4095, which seems suspicious to me. I
> don't know whether it's actually an issue though.
>
> I've had some offline discussions about getting someone on this cloud to
> debug the problem. Originally we decided not to pursue it, since it's not
> hard to work around and we didn't want to disrupt the environment by
> trying to move to later OpenStack code (we're still back on Mitaka), but
> it was pointed out to me this time around that from a downstream
> perspective we have users on older code as well, and it may be worth
> debugging to make sure they don't hit similar problems.
>
> To that end, I've left one compute node un-rebooted for debugging
> purposes. The downstream discussion is ongoing, but I'll update here if
> we find anything.

I just so happened to wander across the bug from last time:
https://bugs.launchpad.net/tripleo/+bug/1719334
Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot
Ben - Can you provide some links to the ovs port exhaustion issue for some
background?

Thanks,
Joe

On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec wrote:
> It's that magical time again. You know the one, when we reboot rh1 to
> avoid OVS port exhaustion. :-)
[openstack-dev] [TripleO] Tis the season...for a cloud reboot
Hi,

It's that magical time again. You know the one, when we reboot rh1 to avoid
OVS port exhaustion. :-)

If all goes well you won't even notice that this is happening, but there is
the possibility that a few jobs will fail while the te-broker host is
rebooted, so I wanted to let everyone know. If you notice anything else
hosted in rh1 is down (tripleo.org, zuul-status, etc.), let me know. I have
been known to forget to restart services after the reboot.

I'll send a followup when I'm done.

-Ben