Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-20 Thread Joe Talerico
On Wed, Dec 20, 2017 at 9:08 AM, Ben Nemec  wrote:
>
>
> On 12/19/2017 05:34 PM, Joe Talerico wrote:
>>
>> On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins  wrote:
>>>
>>>
>>>
>>> On 19 December 2017 at 22:23, Brian Haley  wrote:


 On 12/19/2017 04:00 PM, Ben Nemec wrote:
>
>
>
>
> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>
>>
>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>
>>>
>>> The reboot is done (mostly...see below).
>>>
>>> On 12/18/2017 05:11 PM, Joe Talerico wrote:


 Ben - Can you provide some links to the ovs port exhaustion issue
 for
 some background?
>>>
>>>
>>>
>>> I don't know if we ever had a bug opened, but there's some discussion
>>> of it in
>>>
>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>> I've also copied Derek since I believe he was the one who found it
>>> originally.
>>>
>>> The gist is that after about 3 months of tripleo-ci running in this
>>> cloud we start to hit errors creating instances because of problems
>>> creating
>>> OVS ports on the compute nodes.  Sometimes we see a huge number of
>>> ports in
>>> general, other times we see a lot of ports that look like this:
>>>
>>> Port "qvod2cade14-7c"
>>>   tag: 4095
>>>   Interface "qvod2cade14-7c"
>>>
>>> Notably they all have a tag of 4095, which seems suspicious to me.  I
>>> don't know whether it's actually an issue though.
>>
>>
>>
>> Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the
>> agent.
>>
>> The 'qvo' here shows it's part of the VETH pair that os-vif created
>> when
>> it plugged in the VM (the other half is 'qvb'), and they're created so
>> that
>> iptables rules can be applied by neutron.  It's part of the "old" way
>> to do
>> security groups with the OVSHybridIptablesFirewallDriver, and can
>> eventually
>> go away once the OVSFirewallDriver can be used everywhere (requires
>> newer
>> OVS and agent).
>>
>> I wonder if you can run the ovs_cleanup utility to clean some of these
>> up?
>
>
>
> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
> any ports that are still in use?  Or is there a different tool I'm not
> aware
> of that can do more targeted cleanup?



 Crap, I thought there was an option to just cleanup these dead devices,
 I
 should have read the code, it's either neutron ports (default) or all
 ports.
 Maybe that should be an option.
>>>
>>>
>>>
>>> IIRC neutron-ovs-cleanup was being run following the reboot as part of an
>>> ExecStartPre= on one of the neutron services; this is what essentially
>>> removed the ports for us.
>>>
>>>
>>
>> There are actually unit files for cleanup (netns|ovs|lb), specifically
>> for ovs-cleanup [1].
>>
>> Maybe this can be run to mitigate the need for a reboot?
>
>
> That's what Brian suggested too, but running it with instances on the node
> will cause an outage because it cleans up everything, including in-use
> ports.  The reason a reboot works is basically that it causes this unit to
> run when the node comes back up because it's a dep of the other services.
> So it's possible we could use this to skip the complete reboot, but that's
> not the time-consuming part of the process.  It's waiting for all the
> instances to cycle off so we don't cause spurious failures when we wipe the
> ovs ports.  Actually rebooting the nodes takes about five minutes (and it's
> only that long because of an old TripleO bug).

Ack. There are options you can pass to the cleanup so it doesn't nuke everything.

I wonder if it is a combination of ovs-cleanup + restarting the
ovs-agent? Anyway, doesn't seem that big of a problem then. /me gets
off his uptime soapbox
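
For reference, roughly the two modes I believe the utility exposes on these
Mitaka-era nodes (flag name assumed from the cleanup script's config options,
so double-check on the node; config paths as in the unit file quoted below):

  # Default: remove only the ports Neutron created on the integration and
  # external bridges -- note this still includes in-use VM ports.
  neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf

  # With ovs_all_ports it removes every port on every OVS bridge.
  neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf --ovs_all_ports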

Joe

>
>
>>
>> [1]
>> [Unit]
>> Description=OpenStack Neutron Open vSwitch Cleanup Utility
>> After=syslog.target network.target openvswitch.service
>> Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service
>> neutron-l3-agent.service openstack-nova-compute.service
>>
>> [Service]
>> Type=oneshot
>> User=neutron
>> ExecStart=/usr/bin/neutron-ovs-cleanup --config-file
>> /usr/share/neutron/neutron-dist.conf --config-file
>> /etc/neutron/neutron.conf --config-file
>> /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
>> /etc/neutron/conf.d/common --config-dir
>> /etc/neutron/conf.d/neutron-ovs-cleanup --log-file
>> /var/log/neutron/ovs-cleanup.log
>> ExecStop=/usr/bin/neutron-ovs-cleanup --config-file
>> /usr/share/neutron/neutron-dist.conf --config-file
>> /etc/neutron/neutron.conf --config-file
>> /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
>> /etc/neutron/conf.d/common --config-dir
>> /etc/neutron/conf.d/neutron-ovs-cleanup --log-file
>> /var/log/neutron/ovs-cleanup.log

Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-20 Thread Ben Nemec



On 12/19/2017 05:34 PM, Joe Talerico wrote:

On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins  wrote:



On 19 December 2017 at 22:23, Brian Haley  wrote:


On 12/19/2017 04:00 PM, Ben Nemec wrote:




On 12/19/2017 02:43 PM, Brian Haley wrote:


On 12/19/2017 11:53 AM, Ben Nemec wrote:


The reboot is done (mostly...see below).

On 12/18/2017 05:11 PM, Joe Talerico wrote:


Ben - Can you provide some links to the ovs port exhaustion issue for
some background?



I don't know if we ever had a bug opened, but there's some discussion
of it in
http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
I've also copied Derek since I believe he was the one who found it
originally.

The gist is that after about 3 months of tripleo-ci running in this
cloud we start to hit errors creating instances because of problems creating
OVS ports on the compute nodes.  Sometimes we see a huge number of ports in
general, other times we see a lot of ports that look like this:

Port "qvod2cade14-7c"
  tag: 4095
  Interface "qvod2cade14-7c"

Notably they all have a tag of 4095, which seems suspicious to me.  I
don't know whether it's actually an issue though.



Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.

The 'qvo' here shows it's part of the VETH pair that os-vif created when
it plugged in the VM (the other half is 'qvb'), and they're created so that
iptables rules can be applied by neutron.  It's part of the "old" way to do
security groups with the OVSHybridIptablesFirewallDriver, and can eventually
go away once the OVSFirewallDriver can be used everywhere (requires newer
OVS and agent).

I wonder if you can run the ovs_cleanup utility to clean some of these
up?



As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
any ports that are still in use?  Or is there a different tool I'm not aware
of that can do more targeted cleanup?



Crap, I thought there was an option to just cleanup these dead devices, I
should have read the code, it's either neutron ports (default) or all ports.
Maybe that should be an option.



IIRC neutron-ovs-cleanup was being run following the reboot as part of an
ExecStartPre= on one of the neutron services; this is what essentially
removed the ports for us.




There are actually unit files for cleanup (netns|ovs|lb), specifically
for ovs-cleanup [1].

Maybe this can be run to mitigate the need for a reboot?


That's what Brian suggested too, but running it with instances on the 
node will cause an outage because it cleans up everything, including 
in-use ports.  The reason a reboot works is basically that it causes 
this unit to run when the node comes back up, because it's a dep of the 
other services.  So it's possible we could use this to skip the complete 
reboot, but the reboot isn't the time-consuming part of the process; the 
slow part is waiting for all the instances to cycle off so we don't cause 
spurious failures when we wipe the OVS ports.  Actually rebooting the 
nodes takes about five minutes (and it's only that long because of an old 
TripleO bug).
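
If we ever do want a middle ground, something along these lines might prune
only the dead ports and leave the in-use ones alone -- just an untested
sketch, and it assumes Brian's tag-4095 explanation covers all of them:

  # Untested sketch: remove only the OVS ports stuck on the "dead" VLAN tag.
  for port in $(ovs-vsctl --no-headings --format=csv --data=bare \
                          --columns=name find Port tag=4095); do
      echo "removing dead port ${port}"
      ovs-vsctl --if-exists del-port "${port}"
  done
  # ...then restart neutron-openvswitch-agent so the agent resyncs.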




[1]
[Unit]
Description=OpenStack Neutron Open vSwitch Cleanup Utility
After=syslog.target network.target openvswitch.service
Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service
neutron-l3-agent.service openstack-nova-compute.service

[Service]
Type=oneshot
User=neutron
ExecStart=/usr/bin/neutron-ovs-cleanup --config-file
/usr/share/neutron/neutron-dist.conf --config-file
/etc/neutron/neutron.conf --config-file
/etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
/etc/neutron/conf.d/common --config-dir
/etc/neutron/conf.d/neutron-ovs-cleanup --log-file
/var/log/neutron/ovs-cleanup.log
ExecStop=/usr/bin/neutron-ovs-cleanup --config-file
/usr/share/neutron/neutron-dist.conf --config-file
/etc/neutron/neutron.conf --config-file
/etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
/etc/neutron/conf.d/common --config-dir
/etc/neutron/conf.d/neutron-ovs-cleanup --log-file
/var/log/neutron/ovs-cleanup.log
PrivateTmp=true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target




-Brian



Oh, also worth noting that I don't think we have os-vif in this cloud
because it's so old.  There's no os-vif package installed anyway.



-Brian


I've had some offline discussions about getting someone on this cloud
to debug the problem.  Originally we decided not to pursue it since it's not
hard to work around and we didn't want to disrupt the environment by trying
to move to later OpenStack code (we're still back on Mitaka), but it was
pointed out to me this time around that from a downstream perspective we
have users on older code as well and it may be worth debugging to make sure
they don't hit similar problems.

To that end, I've left one compute node un-rebooted for debugging
purposes.  The downstream discussion is ongoing, but I'll update here if we
find anything.



Thanks,
Joe

On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec 
wrote:


Hi,

It's that magical time again.

Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-19 Thread Joe Talerico
On Tue, Dec 19, 2017 at 5:45 PM, Derek Higgins  wrote:
>
>
> On 19 December 2017 at 22:23, Brian Haley  wrote:
>>
>> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>>>
>>>
>>>
>>> On 12/19/2017 02:43 PM, Brian Haley wrote:

 On 12/19/2017 11:53 AM, Ben Nemec wrote:
>
> The reboot is done (mostly...see below).
>
> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>
>> Ben - Can you provide some links to the ovs port exhaustion issue for
>> some background?
>
>
> I don't know if we ever had a bug opened, but there's some discussion
> of it in
> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
> I've also copied Derek since I believe he was the one who found it
> originally.
>
> The gist is that after about 3 months of tripleo-ci running in this
> cloud we start to hit errors creating instances because of problems 
> creating
> OVS ports on the compute nodes.  Sometimes we see a huge number of ports 
> in
> general, other times we see a lot of ports that look like this:
>
> Port "qvod2cade14-7c"
>  tag: 4095
>  Interface "qvod2cade14-7c"
>
> Notably they all have a tag of 4095, which seems suspicious to me.  I
> don't know whether it's actually an issue though.


 Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.

 The 'qvo' here shows it's part of the VETH pair that os-vif created when
 it plugged in the VM (the other half is 'qvb'), and they're created so that
 iptables rules can be applied by neutron.  It's part of the "old" way to do
 security groups with the OVSHybridIptablesFirewallDriver, and can 
 eventually
 go away once the OVSFirewallDriver can be used everywhere (requires newer
 OVS and agent).

 I wonder if you can run the ovs_cleanup utility to clean some of these
 up?
>>>
>>>
>>> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
>>> any ports that are still in use?  Or is there a different tool I'm not aware
>>> of that can do more targeted cleanup?
>>
>>
>> Crap, I thought there was an option to just cleanup these dead devices, I
>> should have read the code, it's either neutron ports (default) or all ports.
>> Maybe that should be an option.
>
>
> IIRC neutron-ovs-cleanup was being run following the reboot as part of an
> ExecStartPre= on one of the neutron services; this is what essentially
> removed the ports for us.
>
>

There are actually unit files for cleanup (netns|ovs|lb), specifically
for ovs-cleanup [1].

Maybe this can be run to mitigate the need for a reboot?

[1]
[Unit]
Description=OpenStack Neutron Open vSwitch Cleanup Utility
After=syslog.target network.target openvswitch.service
Before=neutron-openvswitch-agent.service neutron-dhcp-agent.service
neutron-l3-agent.service openstack-nova-compute.service

[Service]
Type=oneshot
User=neutron
ExecStart=/usr/bin/neutron-ovs-cleanup --config-file
/usr/share/neutron/neutron-dist.conf --config-file
/etc/neutron/neutron.conf --config-file
/etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
/etc/neutron/conf.d/common --config-dir
/etc/neutron/conf.d/neutron-ovs-cleanup --log-file
/var/log/neutron/ovs-cleanup.log
ExecStop=/usr/bin/neutron-ovs-cleanup --config-file
/usr/share/neutron/neutron-dist.conf --config-file
/etc/neutron/neutron.conf --config-file
/etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir
/etc/neutron/conf.d/common --config-dir
/etc/neutron/conf.d/neutron-ovs-cleanup --log-file
/var/log/neutron/ovs-cleanup.log
PrivateTmp=true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
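
For what it's worth, the unit can also be kicked by hand instead of via a
reboot (a sketch, and with the same caveat Ben raised about in-use ports):

  # RemainAfterExit=yes leaves the unit "active (exited)" after boot, so a
  # manual run needs restart rather than start.
  systemctl restart neutron-ovs-cleanup.service
  systemctl status neutron-ovs-cleanup.service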
>>
>>
>>
>> -Brian
>>
>>
>>> Oh, also worth noting that I don't think we have os-vif in this cloud
>>> because it's so old.  There's no os-vif package installed anyway.
>>>

 -Brian

> I've had some offline discussions about getting someone on this cloud
> to debug the problem.  Originally we decided not to pursue it since it's 
> not
> hard to work around and we didn't want to disrupt the environment by 
> trying
> to move to later OpenStack code (we're still back on Mitaka), but it was
> pointed out to me this time around that from a downstream perspective we
> have users on older code as well and it may be worth debugging to make 
> sure
> they don't hit similar problems.
>
> To that end, I've left one compute node un-rebooted for debugging
> purposes.  The downstream discussion is ongoing, but I'll update here if 
> we
> find anything.
>
>>
>> Thanks,
>> Joe
>>
>> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec 
>> wrote:
>>>
>>> Hi,
>>>
>>> It's that magical time again.  You know the one, when we reboot rh1
>>> to avoid
>>> OVS port exhaustion. :-)
>>>
>>> If all goes well you won't even notice that this is happening, but
>>> there is
>>>

Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-19 Thread Derek Higgins
On 19 December 2017 at 22:23, Brian Haley  wrote:

> On 12/19/2017 04:00 PM, Ben Nemec wrote:
>
>>
>>
>> On 12/19/2017 02:43 PM, Brian Haley wrote:
>>
>>> On 12/19/2017 11:53 AM, Ben Nemec wrote:
>>>
 The reboot is done (mostly...see below).

 On 12/18/2017 05:11 PM, Joe Talerico wrote:

> Ben - Can you provide some links to the ovs port exhaustion issue for
> some background?
>

 I don't know if we ever had a bug opened, but there's some discussion
 of it in http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
 I've also copied Derek since I believe he was the
 one who found it originally.

 The gist is that after about 3 months of tripleo-ci running in this
 cloud we start to hit errors creating instances because of problems
 creating OVS ports on the compute nodes.  Sometimes we see a huge number of
 ports in general, other times we see a lot of ports that look like this:

 Port "qvod2cade14-7c"
  tag: 4095
  Interface "qvod2cade14-7c"

 Notably they all have a tag of 4095, which seems suspicious to me.  I
 don't know whether it's actually an issue though.

>>>
>>> Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.
>>>
>>> The 'qvo' here shows it's part of the VETH pair that os-vif created when
>>> it plugged in the VM (the other half is 'qvb'), and they're created so that
>>> iptables rules can be applied by neutron.  It's part of the "old" way to do
>>> security groups with the OVSHybridIptablesFirewallDriver, and can
>>> eventually go away once the OVSFirewallDriver can be used everywhere
>>> (requires newer OVS and agent).
>>>
>>> I wonder if you can run the ovs_cleanup utility to clean some of these
>>> up?
>>>
>>
>> As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including
>> any ports that are still in use?  Or is there a different tool I'm not
>> aware of that can do more targeted cleanup?
>>
>
> Crap, I thought there was an option to just cleanup these dead devices, I
> should have read the code, it's either neutron ports (default) or all
> ports.  Maybe that should be an option.


IIRC neutron-ovs-cleanup was being run following the reboot as part of
an ExecStartPre= on one of the neutron services; this is what essentially
removed the ports for us.
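
A quick way to check how it's actually wired up on a given node (whether it
is plain unit ordering or a real ExecStartPre= on the agent shows up here):

  # The cleanup unit itself (Before=/After= ordering, ExecStart, etc.).
  systemctl cat neutron-ovs-cleanup.service

  # Does the agent run it as an ExecStartPre=, or just order itself after it?
  systemctl cat neutron-openvswitch-agent.service | grep -iE 'ExecStartPre|After=|Requires='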



>
>
> -Brian
>
>
> Oh, also worth noting that I don't think we have os-vif in this cloud
>> because it's so old.  There's no os-vif package installed anyway.
>>
>>
>>> -Brian
>>>
>>> I've had some offline discussions about getting someone on this cloud to
 debug the problem.  Originally we decided not to pursue it since it's not
 hard to work around and we didn't want to disrupt the environment by trying
 to move to later OpenStack code (we're still back on Mitaka), but it was
 pointed out to me this time around that from a downstream perspective we
 have users on older code as well and it may be worth debugging to make sure
 they don't hit similar problems.

 To that end, I've left one compute node un-rebooted for debugging
 purposes.  The downstream discussion is ongoing, but I'll update here if we
 find anything.


> Thanks,
> Joe
>
> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec 
> wrote:
>
>> Hi,
>>
>> It's that magical time again.  You know the one, when we reboot rh1
>> to avoid
>> OVS port exhaustion. :-)
>>
>> If all goes well you won't even notice that this is happening, but
>> there is
>> the possibility that a few jobs will fail while the te-broker host is
>> rebooted so I wanted to let everyone know.  If you notice anything
>> else
>> hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know.
>> I have
>> been known to forget to restart services after the reboot.
>>
>> I'll send a followup when I'm done.
>>
>> -Ben
>>

Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-19 Thread Brian Haley

On 12/19/2017 04:00 PM, Ben Nemec wrote:



On 12/19/2017 02:43 PM, Brian Haley wrote:

On 12/19/2017 11:53 AM, Ben Nemec wrote:

The reboot is done (mostly...see below).

On 12/18/2017 05:11 PM, Joe Talerico wrote:

Ben - Can you provide some links to the ovs port exhaustion issue for
some background?


I don't know if we ever had a bug opened, but there's some discussion 
of it in 
http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html 
  I've also copied Derek since I believe he was the one who found it 
originally.


The gist is that after about 3 months of tripleo-ci running in this 
cloud we start to hit errors creating instances because of problems 
creating OVS ports on the compute nodes.  Sometimes we see a huge 
number of ports in general, other times we see a lot of ports that 
look like this:


Port "qvod2cade14-7c"
 tag: 4095
 Interface "qvod2cade14-7c"

Notably they all have a tag of 4095, which seems suspicious to me.  I 
don't know whether it's actually an issue though.


Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.

The 'qvo' here shows it's part of the VETH pair that os-vif created 
when it plugged in the VM (the other half is 'qvb'), and they're 
created so that iptables rules can be applied by neutron.  It's part 
of the "old" way to do security groups with the 
OVSHybridIptablesFirewallDriver, and can eventually go away once the 
OVSFirewallDriver can be used everywhere (requires newer OVS and agent).


I wonder if you can run the ovs_cleanup utility to clean some of these 
up?


As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including 
any ports that are still in use?  Or is there a different tool I'm not 
aware of that can do more targeted cleanup?


Crap, I thought there was an option to just clean up these dead devices;
I should have read the code: it's either neutron ports (default) or all
ports.  Maybe that should be an option.


-Brian

Oh, also worth noting that I don't think we have os-vif in this cloud 
because it's so old.  There's no os-vif package installed anyway.




-Brian

I've had some offline discussions about getting someone on this cloud 
to debug the problem.  Originally we decided not to pursue it since 
it's not hard to work around and we didn't want to disrupt the 
environment by trying to move to later OpenStack code (we're still 
back on Mitaka), but it was pointed out to me this time around that 
from a downstream perspective we have users on older code as well and 
it may be worth debugging to make sure they don't hit similar problems.


To that end, I've left one compute node un-rebooted for debugging 
purposes.  The downstream discussion is ongoing, but I'll update here 
if we find anything.




Thanks,
Joe

On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec  
wrote:

Hi,

It's that magical time again.  You know the one, when we reboot rh1 
to avoid

OVS port exhaustion. :-)

If all goes well you won't even notice that this is happening, but 
there is

the possibility that a few jobs will fail while the te-broker host is
rebooted so I wanted to let everyone know.  If you notice anything 
else
hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know. 
I have

been known to forget to restart services after the reboot.

I'll send a followup when I'm done.

-Ben



Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-19 Thread Ben Nemec



On 12/19/2017 02:43 PM, Brian Haley wrote:

On 12/19/2017 11:53 AM, Ben Nemec wrote:

The reboot is done (mostly...see below).

On 12/18/2017 05:11 PM, Joe Talerico wrote:

Ben - Can you provide some links to the ovs port exhaustion issue for
some background?


I don't know if we ever had a bug opened, but there's some discussion 
of it in 
http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html 
  I've also copied Derek since I believe he was the one who found it 
originally.


The gist is that after about 3 months of tripleo-ci running in this 
cloud we start to hit errors creating instances because of problems 
creating OVS ports on the compute nodes.  Sometimes we see a huge 
number of ports in general, other times we see a lot of ports that 
look like this:


Port "qvod2cade14-7c"
 tag: 4095
 Interface "qvod2cade14-7c"

Notably they all have a tag of 4095, which seems suspicious to me.  I 
don't know whether it's actually an issue though.


Tag 4095 is for "dead" OVS ports, it's an unused VLAN tag in the agent.

The 'qvo' here shows it's part of the VETH pair that os-vif created when 
it plugged in the VM (the other half is 'qvb'), and they're created so 
that iptables rules can be applied by neutron.  It's part of the "old" 
way to do security groups with the OVSHybridIptablesFirewallDriver, and 
can eventually go away once the OVSFirewallDriver can be used everywhere 
(requires newer OVS and agent).


I wonder if you can run the ovs_cleanup utility to clean some of these up?


As in neutron-ovs-cleanup?  Doesn't that wipe out everything, including 
any ports that are still in use?  Or is there a different tool I'm not 
aware of that can do more targeted cleanup?


Oh, also worth noting that I don't think we have os-vif in this cloud 
because it's so old.  There's no os-vif package installed anyway.
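
(Quick way to double-check, assuming an RPM-based node; empty output means
no os-vif package:)

  rpm -qa | grep -i os-vif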




-Brian

I've had some offline discussions about getting someone on this cloud 
to debug the problem.  Originally we decided not to pursue it since 
it's not hard to work around and we didn't want to disrupt the 
environment by trying to move to later OpenStack code (we're still 
back on Mitaka), but it was pointed out to me this time around that 
from a downstream perspective we have users on older code as well and 
it may be worth debugging to make sure they don't hit similar problems.


To that end, I've left one compute node un-rebooted for debugging 
purposes.  The downstream discussion is ongoing, but I'll update here 
if we find anything.




Thanks,
Joe

On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec  
wrote:

Hi,

It's that magical time again.  You know the one, when we reboot rh1 
to avoid

OVS port exhaustion. :-)

If all goes well you won't even notice that this is happening, but 
there is

the possibility that a few jobs will fail while the te-broker host is
rebooted so I wanted to let everyone know.  If you notice anything else
hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know. 
I have

been known to forget to restart services after the reboot.

I'll send a followup when I'm done.

-Ben



Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-19 Thread Brian Haley

On 12/19/2017 11:53 AM, Ben Nemec wrote:

The reboot is done (mostly...see below).

On 12/18/2017 05:11 PM, Joe Talerico wrote:

Ben - Can you provide some links to the ovs port exhaustion issue for
some background?


I don't know if we ever had a bug opened, but there's some discussion of 
it in 
http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html 
  I've also copied Derek since I believe he was the one who found it 
originally.


The gist is that after about 3 months of tripleo-ci running in this 
cloud we start to hit errors creating instances because of problems 
creating OVS ports on the compute nodes.  Sometimes we see a huge number 
of ports in general, other times we see a lot of ports that look like this:


Port "qvod2cade14-7c"
     tag: 4095
     Interface "qvod2cade14-7c"

Notably they all have a tag of 4095, which seems suspicious to me.  I 
don't know whether it's actually an issue though.


Tag 4095 is for "dead" OVS ports; it's an unused VLAN tag in the agent.

The 'qvo' here shows it's part of the VETH pair that os-vif created when 
it plugged in the VM (the other half is 'qvb'), and they're created so 
that iptables rules can be applied by neutron.  It's part of the "old" 
way to do security groups with the OVSHybridIptablesFirewallDriver, and 
can eventually go away once the OVSFirewallDriver can be used everywhere 
(requires newer OVS and agent).
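
Something like this should show whether a node is accumulating them (rough
sketch, assuming br-int is the integration bridge):

  # Total ports on the integration bridge -- a rough gauge of the exhaustion.
  ovs-vsctl list-ports br-int | wc -l

  # Just the ports parked on the dead VLAN tag.
  ovs-vsctl --no-headings --format=csv --data=bare --columns=name \
      find Port tag=4095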


I wonder if you can run the ovs_cleanup utility to clean some of these up?

-Brian

I've had some offline discussions about getting someone on this cloud to 
debug the problem.  Originally we decided not to pursue it since it's 
not hard to work around and we didn't want to disrupt the environment by 
trying to move to later OpenStack code (we're still back on Mitaka), but 
it was pointed out to me this time around that from a downstream 
perspective we have users on older code as well and it may be worth 
debugging to make sure they don't hit similar problems.


To that end, I've left one compute node un-rebooted for debugging 
purposes.  The downstream discussion is ongoing, but I'll update here if 
we find anything.




Thanks,
Joe

On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec  
wrote:

Hi,

It's that magical time again.  You know the one, when we reboot rh1 
to avoid

OVS port exhaustion. :-)

If all goes well you won't even notice that this is happening, but 
there is

the possibility that a few jobs will fail while the te-broker host is
rebooted so I wanted to let everyone know.  If you notice anything else
hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know.  
I have

been known to forget to restart services after the reboot.

I'll send a followup when I'm done.

-Ben



Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-19 Thread Alex Schultz
On Tue, Dec 19, 2017 at 9:53 AM, Ben Nemec  wrote:
> The reboot is done (mostly...see below).
>
> On 12/18/2017 05:11 PM, Joe Talerico wrote:
>>
>> Ben - Can you provide some links to the ovs port exhaustion issue for
>> some background?
>
>
> I don't know if we ever had a bug opened, but there's some discussion of it
> in
> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
> I've also copied Derek since I believe he was the one who found it
> originally.
>
> The gist is that after about 3 months of tripleo-ci running in this cloud we
> start to hit errors creating instances because of problems creating OVS
> ports on the compute nodes.  Sometimes we see a huge number of ports in
> general, other times we see a lot of ports that look like this:
>
> Port "qvod2cade14-7c"
> tag: 4095
> Interface "qvod2cade14-7c"
>
> Notably they all have a tag of 4095, which seems suspicious to me.  I don't
> know whether it's actually an issue though.
>
> I've had some offline discussions about getting someone on this cloud to
> debug the problem.  Originally we decided not to pursue it since it's not
> hard to work around and we didn't want to disrupt the environment by trying
> to move to later OpenStack code (we're still back on Mitaka), but it was
> pointed out to me this time around that from a downstream perspective we
> have users on older code as well and it may be worth debugging to make sure
> they don't hit similar problems.
>
> To that end, I've left one compute node un-rebooted for debugging purposes.
> The downstream discussion is ongoing, but I'll update here if we find
> anything.
>

I just so happened to wander across the bug from last time,
https://bugs.launchpad.net/tripleo/+bug/1719334

>
>>
>> Thanks,
>> Joe
>>
>> On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec 
>> wrote:
>>>
>>> Hi,
>>>
>>> It's that magical time again.  You know the one, when we reboot rh1 to
>>> avoid
>>> OVS port exhaustion. :-)
>>>
>>> If all goes well you won't even notice that this is happening, but there
>>> is
>>> the possibility that a few jobs will fail while the te-broker host is
>>> rebooted so I wanted to let everyone know.  If you notice anything else
>>> hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know.  I
>>> have
>>> been known to forget to restart services after the reboot.
>>>
>>> I'll send a followup when I'm done.
>>>
>>> -Ben
>>>
>>>


Re: [openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-18 Thread Joe Talerico
Ben - Can you provide some links to the ovs port exhaustion issue for
some background?

Thanks,
Joe

On Mon, Dec 18, 2017 at 10:43 AM, Ben Nemec  wrote:
> Hi,
>
> It's that magical time again.  You know the one, when we reboot rh1 to avoid
> OVS port exhaustion. :-)
>
> If all goes well you won't even notice that this is happening, but there is
> the possibility that a few jobs will fail while the te-broker host is
> rebooted so I wanted to let everyone know.  If you notice anything else
> hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know.  I have
> been known to forget to restart services after the reboot.
>
> I'll send a followup when I'm done.
>
> -Ben
>


[openstack-dev] [TripleO] Tis the season...for a cloud reboot

2017-12-18 Thread Ben Nemec

Hi,

It's that magical time again.  You know the one, when we reboot rh1 to 
avoid OVS port exhaustion. :-)


If all goes well you won't even notice that this is happening, but there 
is the possibility that a few jobs will fail while the te-broker host is 
rebooted so I wanted to let everyone know.  If you notice anything else 
hosted in rh1 is down (tripleo.org, zuul-status, etc.) let me know.  I 
have been known to forget to restart services after the reboot.


I'll send a followup when I'm done.

-Ben

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev