On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins <der...@redhat.com> wrote:
>
> On 19 December 2017 at 22:23, Brian Haley <haleyb....@gmail.com> wrote:
>>
>> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>>>
>>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>>>
>>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>>>
>>>>> The reboot is done (mostly...see below).
>>>>>
>>>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>>>>>
>>>>>> Ben - Can you provide some links to the ovs port exhaustion issue for
>>>>>> some background?
>>>>>
>>>>> I don't know if we ever had a bug opened, but there's some discussion
>>>>> of it in
>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>> I've also copied Derek since I believe he was the one who found it
>>>>> originally.
>>>>>
>>>>> The gist is that after about 3 months of tripleo-ci running in this
>>>>> cloud we start to hit errors creating instances because of problems
>>>>> creating OVS ports on the compute nodes.  Sometimes we see a huge number
>>>>> of ports in general, other times we see a lot of ports that look like this:
>>>>>
>>>>> Port "qvod2cade14-7c"
>>>>>     tag: 4095
>>>>>     Interface "qvod2cade14-7c"
>>>>>
>>>>> Notably they all have a tag of 4095, which seems suspicious to me.  I
>>>>> don't know whether it's actually an issue though.
>>>>
>>>> Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.
>>>>
>>>> The 'qvo' here shows it's part of the VETH pair that os-vif created when
>>>> it plugged in the VM (the other half is 'qvb'), and they're created so that
>>>> iptables rules can be applied by neutron.  It's part of the "old" way to do
>>>> security groups with the OVSHybridIptablesFirewallDriver, and can eventually
>>>> go away once the OVSFirewallDriver can be used everywhere (requires newer
>>>> OVS and agent).
>>>>
>>>> I wonder if you can run the ovs_cleanup utility to clean some of these
>>>> up?
>>>
>>> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
>>> any ports that are still in use?  Or is there a different tool I'm not
>>> aware of that can do more targeted cleanup?
>>
>> Crap, I thought there was an option to just cleanup these dead devices, I
>> should have read the code, it's either neutron ports (default) or all ports.
>> Maybe that should be an option.
>
> iirc neutron-ovs-cleanup was being run following the reboot as part of a
> ExecStartPre= on one of the neutron services this is what essentially
> removed the ports for us.
>
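Since they all end up with that 4095 tag, the dead ports should be easy to count (or even remove selectively) straight from ovs-vsctl, without wiping the in-use ports the way neutron-ovs-cleanup does. Something along these lines should work on a compute node - untested on the rh1 boxes, and it assumes the integration bridge is the default br-int, so treat it as a sketch:

    # count ports stuck on the "dead" VLAN (tag 4095)
    ovs-vsctl --bare --columns=name find Port tag=4095 | sed '/^$/d' | wc -l

    # or drop only those ports from br-int, leaving in-use ports alone
    for p in $(ovs-vsctl --bare --columns=name find Port tag=4095); do
        ovs-vsctl --if-exists del-port br-int "$p"
    done

That only removes the OVS side of the qvo/qvb veth pairs, and I'd double-check that every 4095-tagged port really is stale before deleting anything, but it might be enough to stave off the port exhaustion between reboots.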
There are actually unit files for cleanup (netns|ovs|lb), specifically for ovs-cleanup [1].  Maybe this can be run to mitigate the need for a reboot?

[1]
[Unit]
Description=OpenStack Neutron Open vSwitch Cleanup Utility
After=syslog.target network.target openvswitch.service
Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service neutron-l3-agent.service openstack-nova-compute.service

[Service]
Type=oneshot
User=neutron
ExecStart=/usr/bin/neutron-ovs-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup --log-file /var/log/neutron/ovs-cleanup.log
ExecStop=/usr/bin/neutron-ovs-cleanup --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-ovs-cleanup --log-file /var/log/neutron/ovs-cleanup.log
PrivateTmp=true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
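If that unit (or something close to it) is already shipped on the compute nodes, enabling it should just be the usual systemd steps below. I haven't tried this on the rh1 Mitaka computes, so again only a sketch:

    # pick up the unit and have the cleanup run before the agents on the next boot
    systemctl daemon-reload
    systemctl enable neutron-ovs-cleanup.service

    # or run it right away; it's Type=oneshot with RemainAfterExit=yes,
    # so "active (exited)" is the expected state afterwards
    systemctl start neutron-ovs-cleanup.service
    systemctl status neutron-ovs-cleanup.service
    tail /var/log/neutron/ovs-cleanup.log

Given the point above that it cleans up the in-use neutron ports too, not just the dead ones, it probably only helps as part of a reboot or an agent restart rather than on a live compute node.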
>>
>> -Brian
>>
>>> Oh, also worth noting that I don't think we have os-vif in this cloud
>>> because it's so old.  There's no os-vif package installed anyway.
>>>
>>>> -Brian
>>>>
>>>>> I've had some offline discussions about getting someone on this cloud
>>>>> to debug the problem.  Originally we decided not to pursue it since it's
>>>>> not hard to work around and we didn't want to disrupt the environment by
>>>>> trying to move to later OpenStack code (we're still back on Mitaka), but
>>>>> it was pointed out to me this time around that from a downstream
>>>>> perspective we have users on older code as well and it may be worth
>>>>> debugging to make sure they don't hit similar problems.
>>>>>
>>>>> To that end, I've left one compute node un-rebooted for debugging
>>>>> purposes.  The downstream discussion is ongoing, but I'll update here if
>>>>> we find anything.
>>>>>
>>>>>> Thanks,
>>>>>> Joe
>>>>>>
>>>>>> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec <openst...@nemebean.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It's that magical time again.  You know the one, when we reboot rh1
>>>>>>> to avoid OVS port exhaustion. :-)
>>>>>>>
>>>>>>> If all goes well you won't even notice that this is happening, but
>>>>>>> there is the possibility that a few jobs will fail while the te-broker
>>>>>>> host is rebooted so I wanted to let everyone know.  If you notice
>>>>>>> anything else hosted in rh1 is down (tripleo.org, zuul-status, etc.)
>>>>>>> let me know.  I have been known to forget to restart services after
>>>>>>> the reboot.
>>>>>>>
>>>>>>> I'll send a followup when I'm done.
>>>>>>>
>>>>>>> -Ben

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev