[openstack-dev] [TripleO][OVN] Switching the default network backend to ML2/OVN

2018-10-24 Thread Daniel Alvarez Sanchez
Hi Stackers!

The purpose of this email is to share with the community the intention
of switching the default network backend in TripleO from ML2/OVS to
ML2/OVN by changing the mechanism driver from openvswitch to ovn. This
doesn’t mean that ML2/OVS will be dropped, but users deploying
OpenStack without explicitly specifying a network driver will get
ML2/OVN by default.

OVN in Short
============

Open Virtual Network is managed under the OVS project, and was created
by the original authors of OVS. It is an attempt to re-do the ML2/OVS
control plane, using lessons learned throughout the years. It is
intended to be used in projects such as OpenStack and Kubernetes. OVN
has a different architecture, moving us away from Python agents
communicating with the Neutron API service via RabbitMQ to daemons
written in C communicating via OpenFlow and OVSDB.

OVN is built on a modern architecture that offers better foundations
for a simpler and more performant solution. What does this mean? For
example, at Red Hat we ran some preliminary tests during the Queens
cycle and found significant CPU savings due to OVN not using RabbitMQ
(compare CPU utilization during a Rally scenario using ML2/OVS [0] and
ML2/OVN [1]). We also tested API performance and found that most
operations are significantly faster with ML2/OVN. Please see more
details in the FAQ section.

Here are a few useful links about OpenStack’s integration of OVN:

* OpenStack Boston Summit talk on OVN [2]
* OpenStack networking-ovn documentation [3]
* OpenStack networking-ovn code repository [4]

How?
====

The goal is to merge this patch [5] during the Stein cycle. The patch
involves the following actions:

1. Switch the default mechanism driver from openvswitch to ovn.
2. Adapt all jobs so that they use ML2/OVN as the network backend.
3. Create legacy environment file for ML2/OVS to allow deployments based on it.
4. Flip scenario007 job from ML2/OVN to ML2/OVS so that we continue testing it.
5. Continue using ML2/OVS in the undercloud.
6. Ensure that updates/upgrades from ML2/OVS don’t break and don’t
switch automatically to the new default. As some parity gaps exist
right now, we don’t want to change the network backend automatically.
Instead, if the user wants to migrate from ML2/OVS to ML2/OVN, we’ll
provide an Ansible-based tool that will perform the operation.
More info and code at [6].

Reviews, comments and suggestions are really appreciated :)


FAQ
===

Can you talk about the advantages of OVN over ML2/OVS?
-------------------------------------------------------

If asked to describe the ML2/OVS control plane (OVS, L3, DHCP and
metadata agents using the messaging bus to sync with the Neutron API
service), one would not tend to use the term ‘simple’. It makes liberal
use of a smattering of Linux networking technologies:
* iptables
* network namespaces
* ARP manipulation
* Different forms of NAT
* keepalived, radvd, haproxy, dnsmasq
* Source-based routing
* … and of course OVS flows.

OVN simplifies this to a single process running on compute nodes, and
another process running on centralized nodes, communicating via OVSDB
and OpenFlow, ultimately setting OVS flows.

The simplified, new architecture allows us to re-do features like DVR
and L3 HA in more efficient and elegant ways. For example, L3 HA
failover is faster: it doesn’t use keepalived; instead, OVN monitors
neighbor tunnel endpoints. OVN also supports enabling DVR and L3 HA
simultaneously, something we never supported with ML2/OVS.

We also found that not depending on RPC messages for agent
communication brings a lot of benefits. In our experience, RabbitMQ
sometimes becomes a bottleneck and can be very demanding in terms of
resource utilization.


What about the undercloud?
--------------------------

ML2/OVS will still be used in the undercloud, as OVN currently has some
limitations, mainly with regard to baremetal provisioning (keep reading
about the parity gaps). We aim to convert the undercloud to ML2/OVN as
soon as possible to provide operators with a more consistent
experience.

In the short term it would, however, be possible to use the Neutron
DHCP agent to work around this limitation; in the long term we intend
to implement support for baremetal provisioning in the OVN built-in
DHCP server.


What about CI?
--------------

* networking-ovn has:
  * Devstack-based Tempest (API and scenario tests from Tempest and the
    Neutron Tempest plugin) against the latest released OVS version and
    against OVS master (thus also OVN master)
  * Devstack-based Rally
  * Grenade
  * A multinode, container-based TripleO job that installs and runs a
    basic VM connectivity scenario test
  * Support for both Python 3 and 2
* TripleO currently has OVN enabled in one quickstart featureset (fs30).

Are there any known parity issues with ML2/OVS?
------------------------------------------------

* OVN supports VLAN provider networks, but not VLAN tenant networks.
This 

Re: [openstack-dev] [tripleo] [quickstart] [networking-ovn] No more overcloud_prep-containers.sh script

2018-10-03 Thread Daniel Alvarez Sanchez
Hi Miguel,

This patch should fix it [0]. I ran into the same issues and had to manually
patch and/or generate the OVN containers myself.
Try it out and let me know if the problem persists.
To confirm that it is the same issue, check which images you have in
your local registry (ODL images may be present while OVN ones are not).

[0] https://review.openstack.org/#/c/604953/5

Cheers,
Daniel



On Wed, Oct 3, 2018 at 10:15 AM Miguel Angel Ajo Pelayo 
wrote:

> Hi folks
>
>   I was trying to deploy neutron with networking-ovn via
> tripleo-quickstart scripts on master, and this config file [1]. It doesn't
> work, overcloud deploy cries with:
>
> 1) trying to deploy ovn I end up with a 2018-10-02 17:48:12 | "2018-10-02
> 17:47:51,864 DEBUG: 26691 -- Error: image
> tripleomaster/centos-binary-ovn-controller:current-tripleo not found",
>
> it seems like the overcloud_prep-containers.sh is not there anymore (I
> guess overcloud deploy handles it automatically now? but it fails to
> generate the ovn containers for some reason)
>
> Also, if you look at [2] which are our ansible migration scripts to
> migrate ml2/ovs to ml2/networking-ovn, you will see that we make use of
> overcloud_prep-containers.sh , I guess that we will need to make sure [1]
> works and we will get [2] for free.
>
>
>
> [1]
> https://github.com/openstack/networking-ovn/blob/master/tripleo/ovn.yml
> [2]
> https://docs.openstack.org/networking-ovn/latest/install/migration.html
> --
> Miguel Ángel Ajo
> OSP / Networking DFG, OVN Squad Engineering
>


Re: [openstack-dev] [Neutron] Stepping down from Neutron core team

2018-08-31 Thread Daniel Alvarez Sanchez
Thanks a lot Kuba for all your contributions!
You've been a great mentor to me since I joined OpenStack and I'm so happy
that I got to work with you. Great engineer and even better person!
All the best, my friend!

On Fri, Aug 31, 2018 at 10:25 AM Jakub Libosvar  wrote:

> Hi all,
>
> as you might have already heard, I'm no longer involved in Neutron
> development due to some changes. Therefore I'm officially stepping down
> from the core team because I can't provide the same quality of reviews as I
> tried to do before.
>
> I'd like to thank you all for the opportunity I was given in the Neutron
> team; thank you for all I have learned over the years professionally,
> technically and personally. Tomorrow it's gonna be exactly 5 years since
> I started hacking Neutron and I must say I really enjoyed working with
> all the Neutrinos here. I had the privilege to meet most of you in person
> and that has extreme value for me. Keep on being a great community!
>
> Thank you again!
> Kuba
>


Re: [openstack-dev] [neutron] [OVN] Tempest API / Scenario tests and OVN metadata

2018-04-06 Thread Daniel Alvarez Sanchez
Hi,

Thanks Lucas for writing this down.

On Thu, Apr 5, 2018 at 11:35 AM, Lucas Alvares Gomes 
wrote:

> Hi,
>
> The tests below are failing in the tempest API / Scenario job that
> runs in the networking-ovn gate (non-voting):
>
> neutron_tempest_plugin.api.admin.test_quotas_negative.QuotasAdminNegativeTestJSON.test_create_port_when_quotas_is_full
> neutron_tempest_plugin.api.test_routers.RoutersIpV6Test.test_router_interface_status
> neutron_tempest_plugin.api.test_routers.RoutersTest.test_router_interface_status
> neutron_tempest_plugin.api.test_subnetpools.SubnetPoolsTest.test_create_subnet_from_pool_with_prefixlen
> neutron_tempest_plugin.api.test_subnetpools.SubnetPoolsTest.test_create_subnet_from_pool_with_quota
> neutron_tempest_plugin.api.test_subnetpools.SubnetPoolsTest.test_create_subnet_from_pool_with_subnet_cidr
>
> Digging a bit into it I noticed that with the exception of the two
> "test_router_interface_status" (ipv6 and ipv4) all other tests are
> failing because of the way metadata works in networking-ovn.
>
> Taking "test_create_port_when_quotas_is_full" as an example: the
> reason why it fails is that, when OVN metadata is enabled,
> networking-ovn will create a metadata port at the moment a network is
> created [0] and that will already use up the quota limit set by that
> test [1].
>
> That port will also allocate an IP from the subnet which will cause
> the rest of the tests to fail with a "No more IP addresses available
> on network ..." error.
>

With ML2/OVS we would run into the same quota problem if DHCP were
enabled for the created subnets. This means that if we modify the
current tests to enable DHCP and account for this extra port, they
would be valid for networking-ovn as well. Does that sound good, or do
we still want to isolate quotas?
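To make the accounting concrete, something like this is what I have in
mind (purely illustrative arithmetic, not tempest code; names are made up):

# Illustrative sketch only -- not the actual tempest test. Assumption: the
# backend owns exactly one port per network (the DHCP port with ML2/OVS
# when DHCP is enabled, the metadata port with ML2/OVN).
BACKEND_PORTS_PER_NETWORK = 1

def port_quota_for(user_ports, networks=1):
    # The quota set by the test must cover the ports it creates itself
    # plus the backend-owned port(s) it cannot avoid.
    return user_ports + networks * BACKEND_PORTS_PER_NETWORK

# test_create_port_when_quotas_is_full would then set the quota to 2:
# one port created by the test plus one consumed by the backend.
assert port_quota_for(user_ports=1) == 2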

>
> This is not very trivial to fix because:
>
> 1. Tempest should be backend agnostic. So, adding a conditional in the
> tempest test to check whether OVN is being used or not doesn't sound
> correct.
>
> 2. Creating a port to be used by the metadata agent is a core part of
> the design implementation for the metadata functionality [2]
>
> So, I'm sending this email to try to figure out what would be the best
> approach to deal with this problem and start working towards having
> that job to be voting in our gate. Here are some ideas:
>
> 1. Simply disable the tests that are affected by the metadata approach.
>
> 2. Disable metadata for the tempest API / Scenario tests (here's a
> test patch doing it [3])
>

IMHO, we don't want to do this, as metadata is likely to be enabled in
all clouds, whether they use ML2/OVS or OVN, so it's good to keep
exercising this part.


>
> 3. Same as 1. but also create similar tempest tests specific for OVN
> somewhere else (in the networking-ovn tree?!)
>

As we discussed on IRC I'm keen on doing this instead of getting bits in
tempest to do different things depending on the backend used. Unless
we want to enable DHCP on the subnets that these tests create :)


> What you think would be the best way to workaround this problem, any
> other ideas ?
>
> As for the "test_router_interface_status" tests that are failing
> independent of the metadata, there's a bug reporting the problem here
> [4]. So we should just fix it.
>
> [0] https://github.com/openstack/networking-ovn/blob/f3f5257fc465bbf44d589cc16e9ef7781f6b5b1d/networking_ovn/common/ovn_client.py#L1154
> [1] https://github.com/openstack/neutron-tempest-plugin/blob/35bf37d1830328d72606f9c790b270d4fda2b854/neutron_tempest_plugin/api/admin/test_quotas_negative.py#L66
> [2] https://docs.openstack.org/networking-ovn/latest/contributor/design/metadata_api.html#overview-of-proposed-approach
> [3] https://review.openstack.org/#/c/558792/
> [4] https://bugs.launchpad.net/networking-ovn/+bug/1713835
>
> Cheers,
> Lucas
>

Thanks,
Daniel



Re: [openstack-dev] [tripleo] [neutron] Current containerized neutron agents introduce a significant regression in the dataplane

2018-02-14 Thread Daniel Alvarez Sanchez
On Wed, Feb 14, 2018 at 5:40 AM, Brian Haley  wrote:

> On 02/13/2018 05:08 PM, Armando M. wrote:
>
>>
>>
>> On 13 February 2018 at 14:02, Brent Eagles  beag...@redhat.com>> wrote:
>>
>> Hi,
>>
>> The neutron agents are implemented in such a way that key
>> functionality is implemented in terms of haproxy, dnsmasq,
>> keepalived and radvd configuration. The agents manage instances of
>> these services but, by design, the parent is the top-most (pid 1).
>>
>> On baremetal this has the advantage that, while control plane
>> changes cannot be made while the agents are not available, the
>> configuration at the time the agents were stopped will work (for
>> example, VMs that are restarted can request their IPs, etc). In
>> short, the dataplane is not affected by shutting down the agents.
>>
>> In the TripleO containerized version of these agents, the supporting
>> processes (haproxy, dnsmasq, etc.) are run within the agent's
>> container so when the container is stopped, the supporting processes
>> are also stopped. That is, the behavior with the current containers
>> is significantly different than on baremetal and stopping/restarting
>> containers effectively breaks the dataplane. At the moment this is
>> being considered a blocker and unless we can find a resolution, we
>> may need to recommend running the L3, DHCP and metadata agents on
>> baremetal.
>>
>
> I didn't think the neutron metadata agent was affected but just the
> ovn-metadata agent?  Or is there a problem with the UNIX domain sockets the
> haproxy instances use to connect to it when the container is restarted?


That's right. In the ovn-metadata-agent we spawn haproxy inside the
q-ovnmeta namespace, and this is where we'll hit a problem if the
process goes away. As you said, the neutron metadata agent basically
receives the proxied requests on its UNIX socket from the haproxy
instances residing in either the q-router or q-dhcp namespaces and
forwards them to Nova.
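For those less familiar with the setup, this is roughly what the agent does
per network, reduced to a minimal sketch (illustrative only; the real agent
uses neutron's process and namespace helpers rather than a raw subprocess
call, and the names below are made up):

# Illustrative sketch, not actual agent code. Assumes the namespace and
# the haproxy config file already exist.
import subprocess

def spawn_metadata_haproxy(namespace, config_path):
    # haproxy runs inside the network namespace; if this process dies
    # (e.g. because its container is stopped), metadata requests from the
    # VMs served by that namespace stop being proxied.
    cmd = ['ip', 'netns', 'exec', namespace, 'haproxy', '-f', config_path]
    return subprocess.Popen(cmd)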


>
>
> There's quite a bit to unpack here: are you suggesting that running these
>> services in HA configuration doesn't help either with the data plane being
>> gone after a stop/restart? Ultimately this boils down to where the state is
>> persisted, and while certain agents rely on namespaces and processes whose
>> ephemeral nature is hard to persist, enough could be done to allow for a
>> non-disruptive bumping of the afore mentioned services.
>>
>
> Armando - https://review.openstack.org/#/c/542858/ (if accepted) should
> help with dataplane downtime, as sharing the namespaces lets them persist,
> which eases what the agent has to configure on the restart of a container
> (think of what the l3-agent needs to create for 1000 routers).
>
> But it doesn't address dnsmasq being unavailable when the dhcp-agent
> container is restarted like it is today.  Maybe one way around that is to
> run 2+ agents per network, but that still leaves a regression from how it
> works today.  Even with l3-ha I'm not sure things are perfect, might
> wind-up with two masters sometimes.
>
> I've seen one suggestion of putting all these processes in their own
> container instead of the agent container so they continue to run, it just
> might be invasive to the neutron code.  Maybe there is another option?


I had an idea based on that one to reduce the impact on the neutron code
and its dependency on containers. Basically, we would run dnsmasq, haproxy,
keepalived, radvd, etc. in separate containers (which makes sense as they
have independent lifecycles) and we would drive those through the Docker
socket from the neutron agents. To reduce this dependency, I thought of
having some sort of 'rootwrap-daemon-docker' which takes the commands,
checks whether it has to spawn the process in a separate container
(iptables, for example, wouldn't be the case) and, if so, uses the Docker
socket to do it.
We'd also have to monitor the PID files of those containers so we can
respawn them in case they die.

IMHO, this is far from the containers philosophy since we're using host
networking, privileged access, shared namespaces, relying on 'sidecar'
containers... but I can't think of a better way to do it.
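To make it a bit more concrete, the dispatch could look roughly like this
(a sketch only, assuming the Python Docker SDK is available on the host;
the image name and the sidecar/local split are illustrative):

# Rough sketch of the 'rootwrap-daemon-docker' idea, not an implementation.
import subprocess

import docker

SIDECAR_BINARIES = {'dnsmasq', 'haproxy', 'keepalived', 'radvd'}

def run_command(cmd):
    if cmd[0] in SIDECAR_BINARIES:
        # Long-lived processes get their own container so they survive
        # restarts of the agent container.
        client = docker.from_env()
        return client.containers.run(
            'neutron-sidecar:latest',   # hypothetical image name
            command=cmd,
            detach=True,
            network_mode='host',        # host networking, as discussed above
            privileged=True,
        )
    # Short-lived commands (e.g. iptables) keep running locally via rootwrap.
    return subprocess.check_call(cmd)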


>
> -Brian
>
>


Re: [openstack-dev] [Neutron][ovn] networking-ovn core team update

2017-12-01 Thread Daniel Alvarez Sanchez
Thanks a lot guys!
It's a pleasure to work with you all :)

Cheers,
Daniel

On Fri, Dec 1, 2017 at 5:48 PM, Miguel Angel Ajo Pelayo  wrote:

> Welcome Daniel! :)
>
> On Fri, Dec 1, 2017 at 5:45 PM, Lucas Alvares Gomes  > wrote:
>
>> Hi all,
>>
>> I would like to welcome Daniel Alvarez to the networking-ovn core team!
>>
>> Daniel has been contributing with the project for a good time already
>> and helping *a lot* with reviews and code.
>>
>> Welcome onboard man!
>>
>> Cheers,
>> Lucas
>>


[openstack-dev] [OVN] Functional tests failures

2017-11-23 Thread Daniel Alvarez Sanchez
Hi folks,

We've seen failures in the functional tests lately [0], since ovsdbapp
was bumped to 0.8.0. We're not sure if that's related but, according to
the logs, it looks like the connection to OVSDB is lost and never
recovered.

We're running the functional tests against OVS master, so I sent this
patch [1] to test them against the OVS 2.8 branch and, even though it's
a bit early to confirm, it looks like it may solve the problem.

We've had some IRC discussion around merging [1] or setting up new
jobs, but it's still unclear. I'm keen on switching our CI to a stable
release (right now our tempest job against master is non-voting while
the voting one runs against the 2.8 branch) and maybe setting up a new
Rally job against master to detect regressions and also serve as a
comparison between the two.

Thoughts?

Thanks,
Daniel


[0] https://bugs.launchpad.net/networking-ovn/+bug/1734090
[1] https://review.openstack.org/#/c/522574/


Re: [openstack-dev] [neutron][infra] Functional job failure rate at 100%

2017-08-09 Thread Daniel Alvarez Sanchez
Some more info added to Jakub's excellent report :)


A new kernel, Ubuntu-4.4.0-89.112, was tagged in the ubuntu-xenial
master branch 9 days ago (07/31/2017) [0].

From a quick look, the only commit around this function is [1].

[0]
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=64de31ed97a03ec1b86fd4f76e445506dce55b02
[1]
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?id=2ad4caea651e1cc0fc86111ece9f9d74de825b78

On Wed, Aug 9, 2017 at 3:29 PM, Jakub Libosvar  wrote:

> Daniel Alvarez and I spent some time looking at it and the culprit was
> finally found.
>
> tl;dr
>
> We updated the kernel on the machines to one containing a bug when
> creating conntrack entries, which makes the functional tests get stuck.
> More info at [4].
>
> For now, I sent a patch [5] to disable the jobs that create conntrack
> entries manually; it still needs an updated commit message. Once it
> merges, we can switch the functional job back to voting to avoid
> regressions.
>
> Is it possible to switch the image used for the Jenkins machines back to
> the older version? Any other ideas on how to deal with the kernel bug?
>
> Thanks
> Jakub
>
> [5] https://review.openstack.org/#/c/492068/1
>
> On 07/08/2017 11:52, Jakub Libosvar wrote:
> > Hi all,
> >
> > as per grafana [1] the functional job is broken. Looking at logstash [2]
> > it started happening consistently since 2017-08-03 16:27. I didn't find
> > any particular patch in Neutron that could cause it.
> >
> > The culprit is that ovsdb starts misbehaving [3] and then we retry calls
> > indefinitely. We still use 2.5.2 openvswitch as we had before. I opened
> > a bug [4] and started investigation, I'll update my findings there.
> >
> > I think at this point there is no reason to run "recheck" on your
> patches.
> >
> > Thanks,
> > Jakub
> >
> > [1] http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7
> > [2] http://bit.ly/2vdKMwy
> > [3] http://logs.openstack.org/14/488914/8/check/gate-neutron-dsvm-functional-ubuntu-xenial/75d7482/logs/openvswitch/ovsdb-server.txt.gz
> > [4] https://bugs.launchpad.net/neutron/+bug/1709032
> >
>
>


[openstack-dev] [networking-ovn] metadata agent implementation

2017-05-05 Thread Daniel Alvarez Sanchez
Hi folks,

Now that it looks like the metadata proposal is more refined [0], I'd like
to get some feedback from you on the driver implementation.

The ovn-metadata-agent in networking-ovn will be responsible for
creating the namespaces, spawning the haproxy instances and so on. It
must also implement most of the "old" neutron-metadata-agent
functionality, which listens on a UNIX socket, receives requests from
haproxy, adds some headers and forwards them to Nova. This means that
we can import/reuse a big part of the neutron code.
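As a reference, the per-request work we would need to replicate boils down
to something like this condensed sketch (illustrative only; the endpoint,
the shared-secret handling and the instance lookup are assumptions on my
side, and the real agent would reuse neutron's WSGI machinery):

# Condensed illustration, not networking-ovn code.
import hashlib
import hmac
import urllib.request

NOVA_METADATA_URL = 'http://nova-metadata-host:8775'   # hypothetical endpoint
SHARED_SECRET = b'metadata-proxy-shared-secret'        # hypothetical secret

def forward_to_nova(path, instance_id, project_id):
    # Build the request forwarded to Nova for an instance that has already
    # been resolved from the information haproxy passes along.
    req = urllib.request.Request(NOVA_METADATA_URL + path)
    req.add_header('X-Instance-ID', instance_id)
    req.add_header('X-Tenant-ID', project_id)
    req.add_header('X-Instance-ID-Signature',
                   hmac.new(SHARED_SECRET, instance_id.encode(),
                            hashlib.sha256).hexdigest())
    return urllib.request.urlopen(req)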

I wonder what you guys think about depending on the neutron tree for
the agent implementation, even though we would benefit from a lot of
code reuse. On the other hand, if we want to get rid of this
dependency, we could probably write the agent "from scratch" in C (what
about having C code in the networking-ovn repo?) and, at the same time,
it should buy us a performance boost (probably not very noticeable,
since it will respond to requests from local VMs involving a few
lookups and the processing of simple HTTP requests; talking to Nova
would take most of the time and this only happens at boot time).

I would probably aim for a Python implementation reusing/importing code
from the neutron tree, but I'm not sure how we want to deal with
changes in the neutron codebase (we're actually importing code now).
Looking forward to reading your thoughts :)

Thanks,
Daniel

[0] https://review.openstack.org/#/c/452811/


Re: [openstack-dev] [neutron] - Team photo

2017-02-20 Thread Daniel Alvarez Sanchez
+1

On Mon, Feb 20, 2017 at 7:20 PM, Bhatia, Manjeet S <
manjeet.s.bha...@intel.com> wrote:

> +1
>
>
>
> *From:* Kevin Benton [mailto:ke...@benton.pub]
> *Sent:* Friday, February 17, 2017 3:08 PM
> *To:* openstack-dev@lists.openstack.org
> *Subject:* [openstack-dev] [neutron] - Team photo
>
>
>
> Hello!
>
>
>
> Is everyone free Thursday at 11:20AM (right before lunch break) for 10
> minutes for a group photo?
>
>
>
> Cheers,
> Kevin Benton
>


Re: [openstack-dev] [neutron] Some findings while profiling instances boot

2017-02-16 Thread Daniel Alvarez Sanchez
Awesome work, Kevin!

For the DHCP notification, in my profiling I got only 10% of the CPU time
[0] without taking the waiting times into account, which is probably what
you also measured.
Your patch looks like a neat optimization :)

Also, since "get_devices_details_list_and_failed_devices()" takes quite a
long time, does it make sense to trigger this request asynchronously (same
approach you took for OVO notifier) and continue executing the iteration?
This would not result in a huge improvement but, in the case I showed in
the diagram, both 'get_device_details' can be issued at the same time
instead of one after another and, probably, freeing the iteration for
further processing on the agent side. Thoughts on this?
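Just to sketch what I mean (assuming eventlet, which the agent already
uses; the method and attribute names are taken from this thread and may not
match the actual code):

# Illustrative sketch only.
import eventlet

def process_added_ports(agent, devices_added):
    # Fire the RPC call in a greenthread so the rpc_loop iteration can keep
    # doing local work while the server builds the (slow) reply.
    gt = eventlet.spawn(
        agent.plugin_rpc.get_devices_details_list_and_failed_devices,
        agent.context, devices_added, agent.agent_id, agent.conf.host)

    # ... the rest of the iteration's work would run here ...

    # Join right before the details are actually needed.
    return gt.wait()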

Regarding the time spent on SQL queries, it looks like the server spends a
significant amount of time building them, so reducing that time would be a
nice improvement. Mike's outstanding analysis looks promising and it's
probably worth discussing.

[0] http://imgur.com/lDikZ0I



On Thu, Feb 16, 2017 at 8:23 AM, Kevin Benton <ke...@benton.pub> wrote:

> Thanks for the stats and the nice diagram. I did some profiling and I'm
> sure it's the RPC handler on the Neutron server-side behaving like garbage.
>
> There are several causes that I have a string of patches up to address
> that mainly stem from the fact that l2pop requires multiple port status
> updates to function correctly:
>
> * The DHCP notifier will trigger a notification to the DHCP agents on the
> network on a port status update. This wouldn't be too problematic on its
> own, but it does several queries for networks and segments to determine
> which agents it should talk to. Patch to address it here:
> https://review.openstack.org/#/c/434677/
>
> * The OVO notifier will also generate a notification on any port data
> model change, including the status. This is ultimately the desired
> behavior, but until we eliminate the frivolous status flipping, it's going
> to incur a performance hit. Patch here to put it asynced into the
> background so it doesn't block the port update process:
> https://review.openstack.org/#/c/434678/
>
> * A wasteful DB query in the ML2 PortContext:
> https://review.openstack.org/#/c/434679/
>
> * More unnecessary queries for the status update case in the ML2
> PortContext: https://review.openstack.org/#/c/434680/
>
> * Bulking up the DB queries rather than retrieving port details one by
> one: https://review.openstack.org/#/c/434681/
> https://review.openstack.org/#/c/434682/
>
> The top two accounted for more than 60% of the overhead in my profiling
> and they are pretty simple, so we may be able to get them into Ocata for RC
> depending on how other cores feel. If not, they should be good candidates
> for back-porting later. Some of the others start to get more invasive so we
> may be stuck.
>
> Cheers,
> Kevin Benton
>
> On Wed, Feb 15, 2017 at 12:25 PM, Jay Pipes <jaypi...@gmail.com> wrote:
>
>> On 02/15/2017 12:46 PM, Daniel Alvarez Sanchez wrote:
>>
>>> Hi there,
>>>
>>> We're trying to figure out why, sometimes, rpc_loop takes over 10
>>> seconds to process an iteration when booting instances. So we deployed
>>> devstack on a 8GB, 4vCPU VM and did some profiling on the following
>>> command:
>>>
>>> nova boot --flavor m1.nano --image cirros-0.3.4-x86_64-uec --nic
>>> net-name=private --min-count 8 instance
>>>
>>
>> Hi Daniel, thanks for posting the information here. Quick request of you,
>> though... can you try re-running the test but doing 8 separate calls to
>> nova boot instead of using the --min-count 8 parameter? I'm curious to see
>> if you notice any difference in contention/performance.
>>
>> Best,
>> -jay
>>


[openstack-dev] [neutron] Some findings while profiling instances boot

2017-02-15 Thread Daniel Alvarez Sanchez
Hi there,

We're trying to figure out why rpc_loop sometimes takes over 10 seconds
to process an iteration when booting instances. So we deployed devstack on
an 8GB, 4vCPU VM and did some profiling while running the following
command:

nova boot --flavor m1.nano --image cirros-0.3.4-x86_64-uec --nic
net-name=private --min-count 8 instance

(the private network has port_security_enabled set to False to avoid the
overhead of setting up security groups)

Logs showed that sometimes the network-vif-plugged event was sent by the
server ~12 seconds after the vif was detected by the ovsdb monitor. Usually
the first and second events come faster while the rest take longer. Further
analysis showed that rpc_loop iterations take several seconds to complete,
so if a vif is detected while iteration X is running, it won't be
processed until iteration X+1.

As an example, I've attached a simplified sequence diagram [0] to show what
happened in a particular iteration of my debug (I have full logs and pstat
files for this session for those interested). In this example, iteration 76
is going to process two ports while some of the previous spawned machines
are being managed by libvirt and so on... At the beginning of iteration 76,
a new vif is detected by ovsdb monitor but it won't be processed until 12
seconds later in iteration 77.

The profiling files show that the aggregated CPU time for the neutron
workers is 97 seconds, while the CPU time for the OVS agent is 2.1 seconds.
Most of the agent's time is spent waiting on RPC, so it looks like there is
some room for optimization and multiprocessing here.
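In case anyone wants to dig into the pstat files themselves, they can be
read with the standard library (the filename here is just an example):

# Example of inspecting one of the pstat files mentioned above.
import pstats

stats = pstats.Stats('neutron-server-worker.pstat')   # illustrative filename
stats.sort_stats('cumulative').print_stats(20)        # top 20 by cumulative time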

According to dstat.log, CPU is at ~90% and there's ~1GB of free RAM. I
can't tell whether the hypervisor was swapping or not since I didn't have
access to it.

system total-cpu-usage --memory-usage- -net/total->
 time |usr sys idl wai hiq siq| used  buff  cach  free| recv  send>
05-02 14:22:50| 89  11   0   0   0   0|5553M0  1151M 1119M|1808B 1462B>
05-02 14:22:51| 90  10   0   0   0   0|5567M0  1151M 1106M|1588B  836B>
05-02 14:22:52| 89  11   1   0   0   0|5581M0  1151M 1092M|3233B 2346B>
05-02 14:22:53| 89  10   0   0   0   0|5598M0  1151M 1075M|2676B 2038B>
05-02 14:22:54| 90  10   0   0   0   0|5600M0  1151M 1073M|  20k   14k>
05-02 14:22:55| 90   9   0   0   0   0|5601M0  1151M 1072M|  22k   16k>

Also, while having a look at the server profiling, around 33% of the time
was spent building SQL queries [1]. Mike Bayer went through this and
suggested having a look at baked queries; he also submitted a sketch of his
proposal [2].
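For those who haven't seen them, baked queries cache the query construction
and compilation step, so the Python-side string building is paid once per
query shape instead of once per call. A tiny self-contained example (generic
SQLAlchemy, not Mike's actual proposal for Neutron):

# Generic SQLAlchemy example of the baked query extension; not Neutron code.
from sqlalchemy import Column, String, bindparam, create_engine
from sqlalchemy.ext import baked
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Port(Base):
    __tablename__ = 'ports'
    id = Column(String(36), primary_key=True)

bakery = baked.bakery()

def get_port(session, port_id):
    # The lambdas act as cache keys: the SELECT is built and compiled once,
    # then reused for every call with different parameters.
    bq = bakery(lambda s: s.query(Port))
    bq += lambda q: q.filter(Port.id == bindparam('port_id'))
    return bq(session).params(port_id=port_id).first()

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
print(get_port(session, 'no-such-port'))   # None, but the query is now cached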

I wanted to share these findings with you (probably most of you knew them
already, but I'm quite new to OpenStack so it's been a really nice exercise
for me to better understand how things work) and gather your feedback about
how things can be improved. I'll also be happy to share the results and
discuss them further during the PTG next week if you think it's worthwhile.

Thanks a lot for reading and apologies for such a long email!

Cheers,
Daniel
IRC: dalvarez

[0] http://imgur.com/WQqaiYQ
[1] http://imgur.com/6KrfJUC
[2] https://review.openstack.org/430973


Re: [openstack-dev] [neutron] "Setup firewall filters only for required ports" bug

2017-01-19 Thread Daniel Alvarez Sanchez
On Wed, Jan 18, 2017 at 10:45 PM, Bernard Cafarelli 
wrote:

> Hi neutrinos,
>
> I would like your feedback on the mentioned changeset in title[1]
> (yes, added since Liberty).
>
> With this patch, we (should) skip ports with
> port_security_enabled=False or with an empty list of security groups
> when processing added ports [2]. But we found multiple problems here
>
> * Ports create with port_security_enabled=False
>
> This is the original bug that started this mail: if the FORWARD
> iptables chain has a REJECT default policy/last rule, the traffic is
> still blocked[3]. There is also a launchpad bug with similar details
> [4]
> The problem here: these ports must not be skipped, as we add specific
> firewall rules to allow all traffic. These iptables rules have the
> following comment:
> "/* Accept all packets when port security is disabled. */"
>
> With the current code, any port created with port security will not
> have these rules (and updates do not work).
> I initially sent a patch to process these ports again [5], but there
> is more (as detailed by some in the launchpad bug)
>
> * Ports with no security groups, current code
>
> There is a bug in the current agent code [6]: even with no security
> groups, the check will return true, as the security_groups key exists
> in the port details (with value "[]").
> So the port will not be skipped.
>
> * Ports with no security groups, updated code
>
> The next step was to update the checks (security groups list not empty,
> port security True or None), and test again. This time the port was
> skipped, but this showed up in openvswitch-agent.log:
> 2017-01-18 16:19:56.780 7458 INFO
> neutron.agent.linux.iptables_firewall
> [req-c49ca24f-1df8-40d7-8c48-6aab842ba34a - - - - -] Attempted to
> update port filter which is not filtered
> c2c58f8f-3b76-4c00-b792-f1726b28d2fc
> 2017-01-18 16:19:56.853 7458 INFO
> neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
> [req-c49ca24f-1df8-40d7-8c48-6aab842ba34a - - - - -] Configuration for
> devices up [u'c2c58f8f-3b76-4c00-b792-f1726b28d2fc'] and devices down
> [] completed.
>
> Which is the kind of logs we saw in the first bug report. So as an
> additional test, I tried to update this port, adding a security group.
> New log entries:
> 2017-01-18 17:36:53.164 7458 INFO neutron.agent.securitygroups_rpc
> [req-c49ca24f-1df8-40d7-8c48-6aab842ba34a - - - - -] Refresh firewall
> rules
> 2017-01-18 17:36:55.873 7458 INFO
> neutron.agent.linux.iptables_firewall
> [req-c49ca24f-1df8-40d7-8c48-6aab842ba34a - - - - -] Attempted to
> update port filter which is not filtered
> 0f2eea88-0e6a-4ea9-819c-e26eb692cb25
> 2017-01-18 17:36:58.587 7458 INFO
> neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
> [req-c49ca24f-1df8-40d7-8c48-6aab842ba34a - - - - -] Configuration for
> devices up [u'0f2eea88-0e6a-4ea9-819c-e26eb692cb25'] and devices down
> [] completed.
>
> And the iptables configuration did not change to show the newly allowed
> ports.
>
> So with a fixed check, we end up back in the same buggy situation as the
> first one.
>
> * Feedback
>
> So which course of action should we take? After checking these 3 cases
> out, I am in favour of reverting this commit entirely, as in its
> current state it does not help for ports without security groups, and
> breaks ports with port security disabled.
>
>
After having gone through the code and debugged the situation, I'm also in
favor of reverting the patch. We should explicitly set up a rule which
allows traffic for that tap device, exactly as we do when
port_security_enabled is switched from True to False. We can't rely on
traffic being implicitly allowed.

> Also, on the tests side, should we add more tests only using create
> calls (port_security tests mostly update an existing port)? How to
> make sure these iptables rules are correctly applied (the ping tests
> are not enough, especially if the host system does not reject packets
> by default)?


The tests are incomplete, so we should add either functional or
fullstack/tempest tests that validate these cases (ports created with
port_security_enabled set to False, ports created with no security groups,
etc.). I can try to do that.




> [1] https://review.openstack.org/#/c/210321/
> [2] https://github.com/openstack/neutron/blob/a66c27193573ce015c6c1234b0f2a1d86fb85a22/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1640
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1406263
> [4] https://bugs.launchpad.net/neutron/+bug/1549443
> [5] https://review.openstack.org/#/c/421832/
> [6] https://github.com/openstack/neutron/blob/a66c27193573ce015c6c1234b0f2a1d86fb85a22/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1521
>
> Thanks!
>
> --
> Bernard Cafarelli
>

Re: [openstack-dev] [Neutron] Neutron team social event in Barcelona

2016-10-17 Thread Daniel Alvarez Sanchez
Hi,

+1 here

On Mon, Oct 17, 2016 at 10:01 AM, Korzeniewski, Artur <
artur.korzeniew...@intel.com> wrote:

> +1
>
>
>
> *From:* Oleg Bondarev [mailto:obonda...@mirantis.com]
> *Sent:* Monday, October 17, 2016 9:52 AM
> *To:* OpenStack Development Mailing List (not for usage questions) <
> openstack-dev@lists.openstack.org>
> *Subject:* Re: [openstack-dev] [Neutron] Neutron team social event in
> Barcelona
>
>
>
> +1
>
>
>
> On Mon, Oct 17, 2016 at 10:23 AM, Jakub Libosvar 
> wrote:
>
> +1
>
>
>
> On 14/10/2016 20:30, Miguel Lavalle wrote:
>
> Dear Neutrinos,
>
> I am organizing a social event for the team on Thursday 27th at 19:30.
> After doing some Google research, I am proposing Raco de la Vila, which
> is located in Poblenou: http://www.racodelavila.com/en/index.htm. The
> menu is here: http://www.racodelavila.com/en/carta-racodelavila.htm
>
> It is easy to get there by subway from the Summit venue:
> https://goo.gl/maps/HjaTEcBbDUR2. I made a reservation for 25 people
> under 'Neutron' or "Miguel Lavalle". Please confirm your attendance so
> we can get a final count.
>
> Here's some reviews:
> https://www.tripadvisor.com/Restaurant_Review-g187497-
> d1682057-Reviews-Raco_De_La_Vila-Barcelona_Catalonia.html
>
> Cheers
>
> Miguel
>