On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins <der...@redhat.com> wrote:
>
> On 19 December 2017 at 22:23, Brian Haley <haleyb....@gmail.com> wrote:
>>
>> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>>>
>>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>>>
>>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>>>
>>>>> The reboot is done (mostly...see below).
>>>>>
>>>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>>>>>
>>>>>> Ben - Can you provide some links to the ovs port exhaustion issue for
>>>>>> some background?
>>>>>
>>>>> I don't know if we ever had a bug opened, but there's some discussion
>>>>> of it in
>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>> I've also copied Derek since I believe he was the one who found it
>>>>> originally.
>>>>>
>>>>> The gist is that after about 3 months of tripleo-ci running in this
>>>>> cloud we start to hit errors creating instances because of problems
>>>>> creating OVS ports on the compute nodes.  Sometimes we see a huge number
>>>>> of ports in general, other times we see a lot of ports that look like this:
>>>>>
>>>>> Port "qvod2cade14-7c"
>>>>>     tag: 4095
>>>>>     Interface "qvod2cade14-7c"
>>>>>
>>>>> Notably they all have a tag of 4095, which seems suspicious to me.  I
>>>>> don't know whether it's actually an issue though.
>>>>
>>>> Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.
>>>>
>>>> The 'qvo' here shows it's part of the VETH pair that os-vif created when
>>>> it plugged in the VM (the other half is 'qvb'), and they're created so that
>>>> iptables rules can be applied by neutron.  It's part of the "old" way to do
>>>> security groups with the OVSHybridIptablesFirewallDriver, and can eventually
>>>> go away once the OVSFirewallDriver can be used everywhere (requires newer
>>>> OVS and agent).
>>>>
>>>> I wonder if you can run the ovs_cleanup utility to clean some of these
>>>> up?
>>>
>>> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
>>> any ports that are still in use?  Or is there a different tool I'm not
>>> aware of that can do more targeted cleanup?
>>
>> Crap, I thought there was an option to just cleanup these dead devices, I
>> should have read the code, it's either neutron ports (default) or all ports.
>> Maybe that should be an option.
>
> iirc neutron-ovs-cleanup was being run following the reboot as part of a
> ExecStartPre= on one of the neutron services this is what essentially
> removed the ports for us.
>
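Since they all end up with that 4095 tag, the dead ports should be easy to count (or even remove selectively) straight from ovs-vsctl, without wiping the in-use ports the way neutron-ovs-cleanup does. Something along these lines should work on a compute node - untested on the rh1 boxes, and it assumes the integration bridge is the default br-int, so treat it as a sketch:

    # count ports stuck on the "dead" VLAN (tag 4095)
    ovs-vsctl --bare --columns=name find Port tag=4095 | sed '/^$/d' | wc -l

    # or drop only those ports from br-int, leaving in-use ports alone
    for p in $(ovs-vsctl --bare --columns=name find Port tag=4095); do
        ovs-vsctl --if-exists del-port br-int "$p"
    done

That only removes the OVS side of the qvo/qvb veth pairs, and I'd double-check that every 4095-tagged port really is stale before deleting anything, but it might be enough to stave off the port exhaustion between reboots.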
There are actually unit files for cleanup (netns|ovs|lb), specifically for ovs-cleanup [1].  Maybe this can be run to mitigate the need for a reboot?

[1]
[Unit]
Description=OpenStack Neutron Open vSwitch Cleanup Utility
After=syslog.target network.target openvswitch.service
Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service neutron-l3-agent.service openstack-nova-compute.service

[Service]
Type=oneshot
User=neutron
ExecStart=/usr/bin/neutron-ovs-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup --log-file /var/log/neutron/ovs-cleanup.log
ExecStop=/usr/bin/neutron-ovs-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup --log-file /var/log/neutron/ovs-cleanup.log
PrivateTmp=true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
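If that unit (or something close to it) is already shipped on the compute nodes, enabling it should just be the usual systemd steps below. I haven't tried this on the rh1 Mitaka computes, so again only a sketch:

    # pick up the unit and have the cleanup run before the agents on the next boot
    systemctl daemon-reload
    systemctl enable neutron-ovs-cleanup.service

    # or run it right away; it's Type=oneshot with RemainAfterExit=yes,
    # so "active (exited)" is the expected state afterwards
    systemctl start neutron-ovs-cleanup.service
    systemctl status neutron-ovs-cleanup.service
    tail /var/log/neutron/ovs-cleanup.log

Given the point above that it cleans up the in-use neutron ports too, not just the dead ones, it probably only helps as part of a reboot or an agent restart rather than on a live compute node.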
>>
>> -Brian
>>
>>> Oh, also worth noting that I don't think we have os-vif in this cloud
>>> because it's so old.  There's no os-vif package installed anyway.
>>>
>>>> -Brian
>>>>
>>>>> I've had some offline discussions about getting someone on this cloud
>>>>> to debug the problem.  Originally we decided not to pursue it since it's
>>>>> not hard to work around and we didn't want to disrupt the environment by
>>>>> trying to move to later OpenStack code (we're still back on Mitaka), but
>>>>> it was pointed out to me this time around that from a downstream
>>>>> perspective we have users on older code as well and it may be worth
>>>>> debugging to make sure they don't hit similar problems.
>>>>>
>>>>> To that end, I've left one compute node un-rebooted for debugging
>>>>> purposes.  The downstream discussion is ongoing, but I'll update here if
>>>>> we find anything.
>>>>>
>>>>>> Thanks,
>>>>>> Joe
>>>>>>
>>>>>> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec <openst...@nemebean.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It's that magical time again.  You know the one, when we reboot rh1
>>>>>>> to avoid OVS port exhaustion. :-)
>>>>>>>
>>>>>>> If all goes well you won't even notice that this is happening, but
>>>>>>> there is the possibility that a few jobs will fail while the te-broker
>>>>>>> host is rebooted so I wanted to let everyone know.  If you notice
>>>>>>> anything else hosted in rh1 is down (tripleo.org, zuul-status, etc.)
>>>>>>> let me know.  I have been known to forget to restart services after
>>>>>>> the reboot.
>>>>>>>
>>>>>>> I'll send a followup when I'm done.
>>>>>>>
>>>>>>> -Ben

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev