Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-16 Thread Nathani, Sreedhar (APS)
Hello Salvatore,

I agree with you that we need both items to improve the scaling and performance 
of the neutron server.
I am not a developer, so I can't implement the changes myself. If somebody is 
going to implement them, I am more than happy to run the tests.

Thanks & Regards,
Sreedhar Nathani


From: Salvatore Orlando [mailto:sorla...@nicira.com]
Sent: Monday, December 16, 2013 6:18 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

Multiple RPC servers are something we should definitely look at.
I don't see a show-stopper reason why this would not work, although I recall we 
found a few caveats one should be aware of when running multiple RPC servers 
while reviewing the patch for multiple API servers (I wrote them down in some 
other ML thread; I will dig them up later). If you are thinking of implementing 
this support, you might want to sync up with Mark McClain, who is working on 
splitting the API and RPC servers.

While horizontal scaling is surely desirable, evidence we gathered from 
analysis like the one you did showed that probably we can make the interactions 
between the neutron server and the agents a lot more efficient and reliable. I 
reckon both items are needed and can be implemented independently.

Regards,
Salvatore


On 16 December 2013 12:42, Nathani, Sreedhar (APS) <sreedhar.nath...@hp.com> wrote:
Hello Salvatore,

Thanks for the updates.  All the changes you mentioned are on the agent side.

From my tests, with multiple L2 agents running and sending/requesting messages 
at the same time, the single neutron RPC server process is not able to handle 
all the load fast enough, which causes a bottleneck.

With Carl's patch (https://review.openstack.org/#/c/60082), we now support 
multiple neutron API processes.
My question is: why can't we support multiple neutron RPC server processes as well?

Horizontal scaling with multiple neutron-server hosts would be one option, but 
having support for multiple neutron RPC server processes on the same
system would be really helpful for scaling the neutron server, especially 
during concurrent instance deployments.
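
To make the idea concrete, here is a minimal, purely illustrative Python sketch
(not Neutron code; the worker function, queue and message names below are made
up): N identical worker processes drain one shared queue, which is the same
"competing consumers" effect an rpc_workers-style option would give on top of
the AMQP q-plugin queue.

    # Conceptual sketch only -- all names are illustrative.
    import multiprocessing
    import time

    def rpc_worker(worker_id, queue):
        """Pretend RPC dispatcher: pull a message, 'handle' it, repeat."""
        while True:
            method, payload = queue.get()
            if method == "stop":
                break
            time.sleep(0.05)  # stand-in for the DB work done per RPC call
            print("worker %d handled %s(%r)" % (worker_id, method, payload))

    if __name__ == "__main__":
        rpc_queue = multiprocessing.Queue()
        workers = [multiprocessing.Process(target=rpc_worker,
                                           args=(i, rpc_queue))
                   for i in range(4)]  # e.g. rpc_workers = 4
        for w in workers:
            w.start()
        # burst of messages, as produced by ~30 concurrent instance boots
        for port_id in range(30):
            rpc_queue.put(("get_dhcp_port", port_id))
        for _ in workers:
            rpc_queue.put(("stop", None))
        for w in workers:
            w.join()

With a real AMQP broker the load balancing comes for free, since multiple
consumers on the same queue already receive messages round-robin.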

Thanks & Regards,
Sreedhar Nathani


From: Salvatore Orlando [mailto:sorla...@nicira.com]
Sent: Monday, December 16, 2013 4:55 PM

To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

Hello Sreedhar,

I am focusing only on the OVS agent at the moment.
Armando fixed a few issues recently with the DHCP agent; those issues were 
triggering a perennial resync; with his fixes I reckon DHCP agent response 
times should be better.

I reckon Maru is also working on architectural improvements for the DHCP agent 
(see thread on DHCP agent reliability).

Regards,
Salvatore

On 13 December 2013 20:26, Nathani, Sreedhar (APS) <sreedhar.nath...@hp.com> wrote:
Hello All,

Update with my testing.

I have installed one more VM as a neutron-server host and configured it under the 
load balancer.
Currently I have 2 VMs running the neutron-server process (one is the controller 
and the other is a dedicated neutron-server VM).

With this configuration, during batch instance deployment with a batch size 
of 30 and a sleep time of 20 min,
180 instances could get an IP during the first boot. During the creation of 
instances 181-210, some instances could not get an IP.

This is much better than running with a single neutron server, where only 120 
instances could get an IP during the first boot in Havana.

While the instances are being created, the parent neutron-server process spends 
close to 90% of its CPU time on both servers,
while the rest of the neutron-server processes (the API workers) show very low 
CPU utilization.

I think it's a good idea to expand the current multiple neutron-server API 
processes to support RPC messages as well.

Even with the current setup (multiple neutron-server hosts), we still see RPC 
timeouts in the DHCP and L2 agents,
and the dnsmasq process still gets restarted due to SIGKILL.

Thanks & Regards,
Sreedhar Nathani

From: Nathani, Sreedhar (APS)
Sent: Friday, December 13, 2013 12:08 AM

To: OpenStack Development Mailing List (not for usage questions)
Subject: RE: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

Hello Salvatore,

Thanks for your feedback. Will the patch 
https://review.openstack.org/#/c/57420/, which you are working on for bug 
https://bugs.launchpad.net/neutron/+bug/1253993,
help to correct the OVS agent loop slowdown issue?
Does this patch also address the DHCP agent updating the host file only once a 
minute and finally sending SIGKILL to the dnsmasq process?

I have tested with Marun's patch https://review.openstack.org/#/c/61168/ 
regarding 'Send DHCP notifications regar

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-16 Thread Salvatore Orlando
Multiple RPC servers are something we should definitely look at.
I don't see a show-stopper reason why this would not work, although I
recall we found a few caveats one should be aware of when running
multiple RPC servers while reviewing the patch for multiple API servers (I
wrote them down in some other ML thread; I will dig them up later). If you are
thinking of implementing this support, you might want to sync up with Mark
McClain, who is working on splitting the API and RPC servers.

While horizontal scaling is surely desirable, evidence we gathered from
analysis like the one you did showed that probably we can make the
interactions between the neutron server and the agents a lot more efficient
and reliable. I reckon both items are needed and can be implemented
independently.

Regards,
Salvatore



On 16 December 2013 12:42, Nathani, Sreedhar (APS)
wrote:

>  Hello Salvatore,
>
>
>
> Thanks for the updates.  All the changes which you talked is from the
> agent side.
>
>
>
> From my tests,  with multiple L2 agents running and sending/requesting
> messages at the same time from the single neutron rpc server process is not
> able to handle
>
> All the load fast enough and causing the bottleneck.
>
>
>
> With the Carl’s patch (https://review.openstack.org/#/c/60082), we now
> support multiple neutron API process,
>
> My question is why can’t we support multiple neutron rpc server process as
> well?
>
>
>
> Horizontal scaling with multiple neutron-server hosts would be one option,
> but having support of multiple neutron rpc servers process in in the same
>
> System would be really helpful for the scaling of neutron server
> especially during concurrent instance deployments.
>
>
>
> Thanks & Regards,
>
> Sreedhar Nathani
>
>
>
>
>
> *From:* Salvatore Orlando [mailto:sorla...@nicira.com]
> *Sent:* Monday, December 16, 2013 4:55 PM
>
> *To:* OpenStack Development Mailing List (not for usage questions)
> *Subject:* Re: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
>
>
> Hello Sreedhar,
>
>
>
> I am focusing only on the OVS agent at the moment.
>
> Armando fixed a few issues recently with the DHCP agent; those issues were
> triggering a perennial resync; with his fixes I reckon DHCP agent response
> times should be better.
>
>
>
> I reckon Maru is also working on architectural improvements for the DHCP
> agent (see thread on DHCP agent reliability).
>
>
>
> Regards,
>
> Salvatore
>
>
>
> On 13 December 2013 20:26, Nathani, Sreedhar (APS) <
> sreedhar.nath...@hp.com> wrote:
>
> Hello All,
>
>
>
> Update with my testing.
>
>
>
> I have installed one more VM as neutron-server host and configured under
> the Load Balancer.
>
> Currently I have 2 VMs running neutron-server process (one is Controller
> and other is dedicated neutron-server VM)
>
>
>
> With this configuration during the batch instance deployment with a batch
> size of 30 and sleep time of 20min,
>
> 180 instances could get an IP during the first boot. During 181-210
> instance creation some instances could not get an IP.
>
>
>
> This is much better than when running with single neutron server where
> only 120 instances could get an IP during the first boot in Havana.
>
>
>
> When the instances are getting created, parent neutron-server process
> spending close to 90% of the cpu time on both the servers,
>
> While rest of the neutron-server process (APIs) are spending very low CPU
> utilization.
>
>
>
> I think it’s good idea to expand the current multiple neutron-server api
> process to support rpc messages as well.
>
>
>
> Even with current setup (multiple neutron-server hosts), we still see rpc
> timeouts in DHCP, L2 agents
>
> and dnsmasq process is getting restarted due to SIGKILL though.
>
>
>
> Thanks & Regards,
>
> Sreedhar Nathani
>
>
>
> *From:* Nathani, Sreedhar (APS)
> *Sent:* Friday, December 13, 2013 12:08 AM
>
>
> *To:* OpenStack Development Mailing List (not for usage questions)
>
> *Subject:* RE: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
>
>
> Hello Salvatore,
>
>
>
> Thanks for your feedback. Does the patch
> https://review.openstack.org/#/c/57420/ which you are working on bug
> https://bugs.launchpad.net/neutron/+bug/1253993
>
> will help to correct the OVS agent loop slowdown issue?
>
> Does this patch address the DHCP agent updating the host file once in a
> minute and finally sending SIGKILL to dnsmasq process?
>
>
>
> I have tested with Marun’s patch 
> htt

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-16 Thread Nathani, Sreedhar (APS)
Hello Salvatore,

Thanks for the updates.  All the changes you mentioned are on the agent side.

From my tests, with multiple L2 agents running and sending/requesting messages 
at the same time, the single neutron RPC server process is not able to handle 
all the load fast enough, which causes a bottleneck.

With Carl's patch (https://review.openstack.org/#/c/60082), we now support 
multiple neutron API processes.
My question is: why can't we support multiple neutron RPC server processes as well?

Horizontal scaling with multiple neutron-server hosts would be one option, but 
having support for multiple neutron RPC server processes on the same
system would be really helpful for scaling the neutron server, especially 
during concurrent instance deployments.

Thanks & Regards,
Sreedhar Nathani


From: Salvatore Orlando [mailto:sorla...@nicira.com]
Sent: Monday, December 16, 2013 4:55 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

Hello Sreedhar,

I am focusing only on the OVS agent at the moment.
Armando fixed a few issues recently with the DHCP agent; those issues were 
triggering a perennial resync; with his fixes I reckon DHCP agent response 
times should be better.

I reckon Maru is also working on architectural improvements for the DHCP agent 
(see thread on DHCP agent reliability).

Regards,
Salvatore

On 13 December 2013 20:26, Nathani, Sreedhar (APS) <sreedhar.nath...@hp.com> wrote:
Hello All,

Update with my testing.

I have installed one more VM as a neutron-server host and configured it under the 
load balancer.
Currently I have 2 VMs running the neutron-server process (one is the controller 
and the other is a dedicated neutron-server VM).

With this configuration, during batch instance deployment with a batch size 
of 30 and a sleep time of 20 min,
180 instances could get an IP during the first boot. During the creation of 
instances 181-210, some instances could not get an IP.

This is much better than running with a single neutron server, where only 120 
instances could get an IP during the first boot in Havana.

While the instances are being created, the parent neutron-server process spends 
close to 90% of its CPU time on both servers,
while the rest of the neutron-server processes (the API workers) show very low 
CPU utilization.

I think it's a good idea to expand the current multiple neutron-server API 
processes to support RPC messages as well.

Even with the current setup (multiple neutron-server hosts), we still see RPC 
timeouts in the DHCP and L2 agents,
and the dnsmasq process still gets restarted due to SIGKILL.

Thanks & Regards,
Sreedhar Nathani

From: Nathani, Sreedhar (APS)
Sent: Friday, December 13, 2013 12:08 AM

To: OpenStack Development Mailing List (not for usage questions)
Subject: RE: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

Hello Salvatore,

Thanks for your feedback. Will the patch 
https://review.openstack.org/#/c/57420/, which you are working on for bug 
https://bugs.launchpad.net/neutron/+bug/1253993,
help to correct the OVS agent loop slowdown issue?
Does this patch also address the DHCP agent updating the host file only once a 
minute and finally sending SIGKILL to the dnsmasq process?

I have tested with Marun's patch https://review.openstack.org/#/c/61168/ 
('Send DHCP notifications regardless of agent status'), but with this patch I 
also observed the same behavior.


Thanks & Regards,
Sreedhar Nathani

From: Salvatore Orlando [mailto:sorla...@nicira.com]
Sent: Thursday, December 12, 2013 6:21 PM

To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly


I believe your analysis is correct and in line with the findings reported in the 
bug concerning OVS agent loop slowdown.

The issue has become even more prominent with the ML2 plugin due to an 
increased number of notifications sent.

Another issue which makes delays on the DHCP agent worse is that instances send 
a discover message once a minute.

Salvatore
On 11 Dec 2013 11:50, "Nathani, Sreedhar (APS)" <sreedhar.nath...@hp.com> wrote:
Hello Peter,

Here are the tests I have done. Already have 240 instances active across all 
the 16 compute nodes. To make the tests and data collection easy,
I have done the tests on single compute node

First Test -
*   240 instances already active,  16 instances on the compute node where I 
am going to do the tests
*   deploy 10 instances concurrently using nova boot command with 
num-instances option in single compute node
*   All the instances could get IP during the instance boot time.

-   Instances are created at  2013-12-10 13:41:01
-   From the compute host, DHCP requests are sent from 13:41:20 but those 
are not reaching the DHCP 

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-16 Thread Salvatore Orlando
Hello Sreedhar,

I am focusing only on the OVS agent at the moment.
Armando fixed a few issues recently with the DHCP agent; those issues were
triggering a perennial resync; with his fixes I reckon DHCP agent response
times should be better.

I reckon Maru is also working on architectural improvements for the DHCP
agent (see thread on DHCP agent reliability).

Regards,
Salvatore


On 13 December 2013 20:26, Nathani, Sreedhar (APS)
wrote:

>  Hello All,
>
>
>
> Update with my testing.
>
>
>
> I have installed one more VM as neutron-server host and configured under
> the Load Balancer.
>
> Currently I have 2 VMs running neutron-server process (one is Controller
> and other is dedicated neutron-server VM)
>
>
>
> With this configuration during the batch instance deployment with a batch
> size of 30 and sleep time of 20min,
>
> 180 instances could get an IP during the first boot. During 181-210
> instance creation some instances could not get an IP.
>
>
>
> This is much better than when running with single neutron server where
> only 120 instances could get an IP during the first boot in Havana.
>
>
>
> When the instances are getting created, parent neutron-server process
> spending close to 90% of the cpu time on both the servers,
>
> While rest of the neutron-server process (APIs) are spending very low CPU
> utilization.
>
>
>
> I think it’s good idea to expand the current multiple neutron-server api
> process to support rpc messages as well.
>
>
>
> Even with current setup (multiple neutron-server hosts), we still see rpc
> timeouts in DHCP, L2 agents
>
> and dnsmasq process is getting restarted due to SIGKILL though.
>
>
>
> Thanks & Regards,
>
> Sreedhar Nathani
>
>
>
> *From:* Nathani, Sreedhar (APS)
> *Sent:* Friday, December 13, 2013 12:08 AM
>
> *To:* OpenStack Development Mailing List (not for usage questions)
> *Subject:* RE: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
>
>
> Hello Salvatore,
>
>
>
> Thanks for your feedback. Does the patch
> https://review.openstack.org/#/c/57420/ which you are working on bug
> https://bugs.launchpad.net/neutron/+bug/1253993
>
> will help to correct the OVS agent loop slowdown issue?
>
> Does this patch address the DHCP agent updating the host file once in a
> minute and finally sending SIGKILL to dnsmasq process?
>
>
>
> I have tested with Marun’s patch 
> https://review.openstack.org/#/c/61168/regarding ‘Send
> DHCP notifications regardless of agent status’ but this patch
>
> Also observed the same behavior.
>
>
>
>
>
> Thanks & Regards,
>
> Sreedhar Nathani
>
>
>
> *From:* Salvatore Orlando [mailto:sorla...@nicira.com]
>
> *Sent:* Thursday, December 12, 2013 6:21 PM
>
> *To:* OpenStack Development Mailing List (not for usage questions)
> *Subject:* Re: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
>
>
> I believe your analysis is correct and inline with the findings reported
> in the bug concerning OVS agent loop slowdown.
>
> The issue has become even more prominent with the ML2 plugin due to an
> increased number of notifications sent.
>
> Another issue which makes delays on the DHCP agent worse is that instances
> send a discover message once a minute.
>
> Salvatore
>
> Il 11/dic/2013 11:50 "Nathani, Sreedhar (APS)" 
> ha scritto:
>
> Hello Peter,
>
> Here are the tests I have done. Already have 240 instances active across
> all the 16 compute nodes. To make the tests and data collection easy,
> I have done the tests on single compute node
>
> First Test -
> *   240 instances already active,  16 instances on the compute node
> where I am going to do the tests
> *   deploy 10 instances concurrently using nova boot command with
> num-instances option in single compute node
> *   All the instances could get IP during the instance boot time.
>
> -   Instances are created at  2013-12-10 13:41:01
> -   From the compute host, DHCP requests are sent from 13:41:20 but
> those are not reaching the DHCP server
> Reply from the DHCP server got at 13:43:08 (A delay of 108 seconds)
> -   DHCP agent updated the host file from 13:41:06 till 13:42:54.
> Dnsmasq process got SIGHUP message every time the hosts file is updated
> -   In compute node tap devices are created between 13:41:08 and
> 13:41:18
> Security group rules are received between 13:41:45 and 13:42:56
> IP table rules were updated between 13:41:50 and 13:43:04
>
> Second Test -
> *   Deleted the newly created 10 instances.
&g

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-13 Thread Nathani, Sreedhar (APS)
Hello All,

Update with my testing.

I have installed one more VM as a neutron-server host and configured it under the 
load balancer.
Currently I have 2 VMs running the neutron-server process (one is the controller 
and the other is a dedicated neutron-server VM).

With this configuration, during batch instance deployment with a batch size 
of 30 and a sleep time of 20 min,
180 instances could get an IP during the first boot. During the creation of 
instances 181-210, some instances could not get an IP.

This is much better than running with a single neutron server, where only 120 
instances could get an IP during the first boot in Havana.

While the instances are being created, the parent neutron-server process spends 
close to 90% of its CPU time on both servers,
while the rest of the neutron-server processes (the API workers) show very low 
CPU utilization.

I think it's a good idea to expand the current multiple neutron-server API 
processes to support RPC messages as well.

Even with the current setup (multiple neutron-server hosts), we still see RPC 
timeouts in the DHCP and L2 agents,
and the dnsmasq process still gets restarted due to SIGKILL.

Thanks & Regards,
Sreedhar Nathani

From: Nathani, Sreedhar (APS)
Sent: Friday, December 13, 2013 12:08 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: RE: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

Hello Salvatore,

Thanks for your feedback. Will the patch 
https://review.openstack.org/#/c/57420/, which you are working on for bug 
https://bugs.launchpad.net/neutron/+bug/1253993,
help to correct the OVS agent loop slowdown issue?
Does this patch also address the DHCP agent updating the host file only once a 
minute and finally sending SIGKILL to the dnsmasq process?

I have tested with Marun's patch https://review.openstack.org/#/c/61168/ 
('Send DHCP notifications regardless of agent status'), but with this patch I 
also observed the same behavior.


Thanks & Regards,
Sreedhar Nathani

From: Salvatore Orlando [mailto:sorla...@nicira.com]
Sent: Thursday, December 12, 2013 6:21 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly


I believe your analysis is correct and in line with the findings reported in the 
bug concerning OVS agent loop slowdown.

The issue has become even more prominent with the ML2 plugin due to an 
increased number of notifications sent.

Another issue which makes delays on the DHCP agent worse is that instances send 
a discover message once a minute.

Salvatore
On 11 Dec 2013 11:50, "Nathani, Sreedhar (APS)" <sreedhar.nath...@hp.com> wrote:
Hello Peter,

Here are the tests I have done. Already have 240 instances active across all 
the 16 compute nodes. To make the tests and data collection easy,
I have done the tests on single compute node

First Test -
*   240 instances already active,  16 instances on the compute node where I 
am going to do the tests
*   deploy 10 instances concurrently using nova boot command with 
num-instances option in single compute node
*   All the instances could get IP during the instance boot time.

-   Instances are created at  2013-12-10 13:41:01
-   From the compute host, DHCP requests are sent from 13:41:20 but those 
are not reaching the DHCP server
Reply from the DHCP server got at 13:43:08 (A delay of 108 seconds)
-   DHCP agent updated the host file from 13:41:06 till 13:42:54. Dnsmasq 
process got SIGHUP message every time the hosts file is updated
-   In compute node tap devices are created between 13:41:08 and 13:41:18
Security group rules are received between 13:41:45 and 13:42:56
IP table rules were updated between 13:41:50 and 13:43:04

Second Test -
*   Deleted the newly created 10 instances.
*   240 instances already active,  16 instances on the compute node where I 
am going to do the tests
*   Deploy 30 instances concurrently using nova boot command with 
num-instances option in single compute node
*   None  of the instances could get the IP during the instance boot.


-   Instances are created at  2013-12-10 14:13:50

-   From the compute host, DHCP Requests are sent from  14:14:14 but those 
are not reaching the DHCP Server
(don't see any DHCP requests are reaching the DHCP server 
from the tcpdump on the network node)

-   Reply from the DHCP server only got at 14:22:10 ( A delay of 636 
seconds)

-   From the strace of the DHCP agent process, it first updated the hosts 
file at 14:14:05; after this there is a gap of close to 60 sec before
updating the next instance's address. This repeated till the 7th 
instance, which was updated at 14:19:50. The 30th instance was updated at 14:20:00

-   During the 30 instance creation, dnsmasq process got SIGHUP after the 
host file is update

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-12 Thread Nathani, Sreedhar (APS)
Hello Salvatore,

Thanks for your feedback. Will the patch 
https://review.openstack.org/#/c/57420/, which you are working on for bug 
https://bugs.launchpad.net/neutron/+bug/1253993,
help to correct the OVS agent loop slowdown issue?
Does this patch also address the DHCP agent updating the host file only once a 
minute and finally sending SIGKILL to the dnsmasq process?

I have tested with Marun's patch https://review.openstack.org/#/c/61168/ 
('Send DHCP notifications regardless of agent status'), but with this patch I 
also observed the same behavior.


Thanks & Regards,
Sreedhar Nathani

From: Salvatore Orlando [mailto:sorla...@nicira.com]
Sent: Thursday, December 12, 2013 6:21 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly


I believe your analysis is correct and in line with the findings reported in the 
bug concerning OVS agent loop slowdown.

The issue has become even more prominent with the ML2 plugin due to an 
increased number of notifications sent.

Another issue which makes delays on the DHCP agent worse is that instances send 
a discover message once a minute.

Salvatore
On 11 Dec 2013 11:50, "Nathani, Sreedhar (APS)" <sreedhar.nath...@hp.com> wrote:
Hello Peter,

Here are the tests I have done. Already have 240 instances active across all 
the 16 compute nodes. To make the tests and data collection easy,
I have done the tests on single compute node

First Test -
*   240 instances already active,  16 instances on the compute node where I 
am going to do the tests
*   deploy 10 instances concurrently using nova boot command with 
num-instances option in single compute node
*   All the instances could get IP during the instance boot time.

-   Instances are created at  2013-12-10 13:41:01
-   From the compute host, DHCP requests are sent from 13:41:20 but those 
are not reaching the DHCP server
Reply from the DHCP server got at 13:43:08 (A delay of 108 seconds)
-   DHCP agent updated the host file from 13:41:06 till 13:42:54. Dnsmasq 
process got SIGHUP message every time the hosts file is updated
-   In compute node tap devices are created between 13:41:08 and 13:41:18
Security group rules are received between 13:41:45 and 13:42:56
IP table rules were updated between 13:41:50 and 13:43:04

Second Test -
*   Deleted the newly created 10 instances.
*   240 instances already active,  16 instances on the compute node where I 
am going to do the tests
*   Deploy 30 instances concurrently using nova boot command with 
num-instances option in single compute node
*   None  of the instances could get the IP during the instance boot.


-   Instances are created at  2013-12-10 14:13:50

-   From the compute host, DHCP Requests are sent from  14:14:14 but those 
are not reaching the DHCP Server
(don't see any DHCP requests are reaching the DHCP server 
from the tcpdump on the network node)

-   Reply from the DHCP server only got at 14:22:10 ( A delay of 636 
seconds)

-   From the strace of the DHCP agent process, it first updated the hosts 
file at 14:14:05; after this there is a gap of close to 60 sec before
updating the next instance's address. This repeated till the 7th 
instance, which was updated at 14:19:50. The 30th instance was updated at 14:20:00

-   During the 30 instance creation, dnsmasq process got SIGHUP after the 
host file is updated, but at 14:19:52 it got SIGKILL and new process
   created - 14:19:52.881088 +++ killed by 
SIGKILL +++

-   In the compute node, tap devices are created between 14:14:03 and 
14:14:38
From the strace of L2 agent log, can see security group related 
messages are received from 14:14:27 till 14:20:02
During this period in the L2 agent log see many rpc timeout messages 
like below
Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC 
method: "security_group_rules_for_devices" info: ""

Because the security-group-related messages are received by this compute 
node with a delay, it takes a very long time to update the iptables rules
(they were still being updated at 14:20), which causes the DHCP 
packets to be dropped at the compute node itself without reaching the DHCP server


Here is my understanding based on the tests.
Instances are created fast, and so are their TAP devices. But there is a considerable 
delay in updating the network port details in the dnsmasq hosts file and in sending
the security-group-related info to the compute nodes, due to which the compute nodes 
are not able to update the iptables rules fast enough, which is why
instances are not able to get an IP.

I have collected the tcpdump from controller node, compute nodes + strace of 
dhcp, dnsmasq, OVS L2 agents inc

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-12 Thread Salvatore Orlando
I believe your analysis is correct and in line with the findings reported in
the bug concerning OVS agent loop slowdown.

The issue has become even more prominent with the ML2 plugin due to an
increased number of notifications sent.

Another issue which makes delays on the DHCP agent worse is that instances
send a discover message once a minute.

Salvatore
On 11 Dec 2013 11:50, "Nathani, Sreedhar (APS)" wrote:

> Hello Peter,
>
> Here are the tests I have done. Already have 240 instances active across
> all the 16 compute nodes. To make the tests and data collection easy,
> I have done the tests on single compute node
>
> First Test -
> *   240 instances already active,  16 instances on the compute node
> where I am going to do the tests
> *   deploy 10 instances concurrently using nova boot command with
> num-instances option in single compute node
> *   All the instances could get IP during the instance boot time.
>
> -   Instances are created at  2013-12-10 13:41:01
> -   From the compute host, DHCP requests are sent from 13:41:20 but
> those are not reaching the DHCP server
> Reply from the DHCP server got at 13:43:08 (A delay of 108 seconds)
> -   DHCP agent updated the host file from 13:41:06 till 13:42:54.
> Dnsmasq process got SIGHUP message every time the hosts file is updated
> -   In compute node tap devices are created between 13:41:08 and
> 13:41:18
> Security group rules are received between 13:41:45 and 13:42:56
> IP table rules were updated between 13:41:50 and 13:43:04
>
> Second Test -
> *   Deleted the newly created 10 instances.
> *   240 instances already active,  16 instances on the compute node
> where I am going to do the tests
> *   Deploy 30 instances concurrently using nova boot command with
> num-instances option in single compute node
> *   None  of the instances could get the IP during the instance boot.
>
>
> -   Instances are created at  2013-12-10 14:13:50
>
> -   From the compute host, DHCP Requests are sent from  14:14:14 but
> those are not reaching the DHCP Server
> (don't see any DHCP requests are reaching the DHCP
> server from the tcpdump on the network node)
>
> -   Reply from the DHCP server only got at 14:22:10 ( A delay of 636
> seconds)
>
> -   From the strace of the DHCP agent process, it first updated the
> hosts file at 14:14:05, after this there is a gap of close to 60 min for
> Updating next instance address, it repeated till 7th
> instance which was updated at 14:19:50.  30th instance updated at 14:20:00
>
> -   During the 30 instance creation, dnsmasq process got SIGHUP after
> the host file is updated, but at 14:19:52 it got SIGKILL and new process
>created - 14:19:52.881088 +++ killed by
> SIGKILL +++
>
> -   In the compute node, tap devices are created between 14:14:03 and
> 14:14:38
> From the strace of L2 agent log, can see security group related
> messages are received from 14:14:27 till 14:20:02
> During this period in the L2 agent log see many rpc timeout
> messages like below
> Timeout: Timeout while waiting on RPC response - topic:
> "q-plugin", RPC method: "security_group_rules_for_devices" info: ""
>
> Due to security group related messages received by this
> compute node with delay, it's taking very long time to update the iptable
> rules
> (Can see it was updated till 14:20) which is causing the
> DHCP packets to be dropped at compute node itself without reaching to DHCP
> server
>
>
> Here is my understanding based on the tests.
> Instances are creating fast and so its TAP devices. But there is a
> considerable delay in updating the network port details in dnsmasq host
> file and sending
> The security group related info to the compute nodes due to which compute
> nodes are not able to update the iptable rules fast enough which is causing
> Instance not able to get the IP.
>
> I have collected the tcpdump from controller node, compute nodes + strace
> of dhcp, dnsmasq, OVS L2 agents incase if you are interested to look at it
>
> Thanks & Regards,
> Sreedhar Nathani
>
>
> -Original Message-
> From: Peter Feiner [mailto:pe...@gridcentric.ca]
> Sent: Tuesday, December 10, 2013 10:32 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana
> compared to Quantum/Grizzly
>
> On Tue, Dec 10, 2013 at 7:48 AM, Nathani, Sreedhar (APS) <
> sreedhar.nath...@hp.com> wrote:
> > My setup has 17 L

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-11 Thread Nathani, Sreedhar (APS)
Hello Peter,

Here are the tests I have done. I already have 240 instances active across all 
16 compute nodes. To make the tests and data collection easy, 
I have done the tests on a single compute node.
 
First Test - 
*   240 instances already active,  16 instances on the compute node where I 
am going to do the tests
*   deploy 10 instances concurrently using nova boot command with 
num-instances option in single compute node
*   All the instances could get IP during the instance boot time. 
 
-   Instances are created at  2013-12-10 13:41:01
-   From the compute host, DHCP requests are sent from 13:41:20 but those 
are not reaching the DHCP server
Reply from the DHCP server got at 13:43:08 (A delay of 108 seconds)
-   DHCP agent updated the host file from 13:41:06 till 13:42:54. Dnsmasq 
process got SIGHUP message every time the hosts file is updated
-   In compute node tap devices are created between 13:41:08 and 13:41:18
Security group rules are received between 13:41:45 and 13:42:56
IP table rules were updated between 13:41:50 and 13:43:04

Second Test - 
*   Deleted the newly created 10 instances.
*   240 instances already active,  16 instances on the compute node where I 
am going to do the tests
*   Deploy 30 instances concurrently using nova boot command with 
num-instances option in single compute node
*   None  of the instances could get the IP during the instance boot.
 
 
-   Instances are created at  2013-12-10 14:13:50
 
-   From the compute host, DHCP requests are sent from 14:14:14 but those 
are not reaching the DHCP server
(I don't see any DHCP requests reaching the DHCP server 
in the tcpdump on the network node)
 
-   The reply from the DHCP server only arrived at 14:22:10 (a delay of 636 
seconds)
 
-   From the strace of the DHCP agent process, it first updated the hosts 
file at 14:14:05; after this there is a gap of close to 60 sec before 
updating the next instance's address. This repeated till the 7th 
instance, which was updated at 14:19:50. The 30th instance was updated at 14:20:00
 
-   During the 30 instance creation, dnsmasq process got SIGHUP after the 
host file is updated, but at 14:19:52 it got SIGKILL and new process
   created - 14:19:52.881088 +++ killed by 
SIGKILL +++
 
-   In the compute node, tap devices are created between 14:14:03 and 
14:14:38
From the strace of L2 agent log, can see security group related 
messages are received from 14:14:27 till 14:20:02
During this period in the L2 agent log see many rpc timeout messages 
like below
Timeout: Timeout while waiting on RPC response - topic: "q-plugin", RPC 
method: "security_group_rules_for_devices" info: ""

Because the security-group-related messages are received by this compute 
node with a delay, it takes a very long time to update the iptables rules
(they were still being updated at 14:20), which causes the DHCP 
packets to be dropped at the compute node itself without reaching the DHCP server
 
 
Here is my understanding based on the tests. 
Instances are created fast, and so are their TAP devices. But there is a considerable 
delay in updating the network port details in the dnsmasq hosts file and in sending
the security-group-related info to the compute nodes, due to which the compute nodes 
are not able to update the iptables rules fast enough, which is why
instances are not able to get an IP.

I have collected the tcpdump from the controller node and the compute nodes, plus 
straces of the DHCP agent, dnsmasq and the OVS L2 agents, in case you are interested 
in looking at them.

Thanks & Regards,
Sreedhar Nathani


-Original Message-
From: Peter Feiner [mailto:pe...@gridcentric.ca] 
Sent: Tuesday, December 10, 2013 10:32 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

On Tue, Dec 10, 2013 at 7:48 AM, Nathani, Sreedhar (APS) 
 wrote:
> My setup has 17 L2 agents (16 compute nodes, one Network node). 
> Setting the minimize_polling helped to reduce the CPU utilization by the L2 
> agents but it did not help in instances getting the IP during first boot.
>
> With the minimize_polling polling enabled less number of instances could get 
> IP than without the minimize_polling fix.
>
> Once the we reach certain number of ports(in my case 120 ports), 
> during subsequent concurrent instance deployment(30 instances), updating the 
> port details in the dnsmasq host is taking long time, which causing the delay 
> for instances getting IP address.

To figure out what the next problem is, I recommend that you determine 
precisely what "port details in the dnsmasq host [are] taking [a] long time" to 
update. Is the DHCPDISCOVER packet from the VM arriving before the dnsmasq 
proc

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-10 Thread Peter Feiner
On Tue, Dec 10, 2013 at 7:48 AM, Nathani, Sreedhar (APS)
 wrote:
> My setup has 17 L2 agents (16 compute nodes, one Network node). Setting the 
> minimize_polling helped to reduce the CPU
> utilization by the L2 agents but it did not help in instances getting the IP 
> during first boot.
>
> With the minimize_polling polling enabled less number of instances could get 
> IP than without the minimize_polling fix.
>
> Once the we reach certain number of ports(in my case 120 ports), during 
> subsequent concurrent instance deployment(30 instances),
> updating the port details in the dnsmasq host is taking long time, which 
> causing the delay for instances getting IP address.

To figure out what the next problem is, I recommend that you determine
precisely what "port details in the dnsmasq host [are] taking [a] long
time" to update. Is the DHCPDISCOVER packet from the VM arriving
before the dnsmasq process's hostsfile is updated and dnsmasq is
SIGHUP'd? Is the VM sending the DHCPDISCOVER request before its tap
device is wired to the dnsmasq process (i.e., determine the status of
the chain of bridges at the time the guest sends the DHCPDISCOVER
packet)? Perhaps the DHCPDISCOVER packet is being dropped because the
iptables rules for the VM's port haven't been instantiated when the
DHCPDISCOVER packet is sent. Or perhaps something else, such as the
replies being dropped. These are my only theories at the moment.

Anyhow, once you determine where the DHCP packets are being lost,
you'll have a much better idea of what needs to be fixed.
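
For example (the commands below are indicative only; the tap device name, the
network UUID and the interface inside the DHCP namespace are placeholders to
substitute from your own environment), capturing DHCP traffic on both ends
shows whether the DHCPDISCOVER ever leaves the compute node:

    # on the compute node, next to the VM's tap device
    tcpdump -n -e -i <tap-device> port 67 or port 68

    # on the network node, inside the DHCP namespace that dnsmasq runs in
    ip netns exec qdhcp-<network-uuid> tcpdump -n -i <dhcp-port-device> port 67 or port 68

If the request shows up in the first capture but never in the second, it is
being dropped on the way (e.g. by not-yet-installed iptables rules); if it
reaches the second but gets no reply, the problem is on the dnsmasq side.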

One suggestion I have to make your debugging less onerous is to
reconfigure your guest image's networking init script to retry DHCP
requests indefinitely. That way, you'll see the guests' DHCP traffic
when neutron eventually gets everything in order. On CirrOS, add the
following line to the eth0 stanza in /etc/network/interfaces to retry
DHCP requests 100 times every 3 seconds:

udhcpc_opts -t 100 -T 3
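
On a CirrOS image the resulting eth0 stanza would then look roughly like this
(exact contents vary between images; only the udhcpc_opts line is the addition):

    auto eth0
    iface eth0 inet dhcp
        udhcpc_opts -t 100 -T 3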

> When I deployed only 5 instances concurrently (already had 211 instances 
> active) instead of 30, all the instances are able to get the IP.
> But when I deployed 10 instances concurrently (already had 216 instances 
> active) instead of 30, none of the instances could able to get the IP

This is reminiscent of yet another problem I saw at scale. If you're
using the security group rule "VMs in this group can talk to everybody
else in this group", which is one of the defaults in devstack, you get
O(N^2) iptables rules for N VMs running on a particular host. When more
VMs are running, the openvswitch agent, which is responsible for
instantiating the iptables rules and does so somewhat laboriously with
respect to the number of rules, could take too long to configure ports
before the VMs' DHCP clients time out. However, considering that you're
seeing low CPU utilization by the openvswitch agent, I don't think
you're having this problem; since you're distributing your VMs across
numerous compute hosts, N is quite small in your case. I only saw
problems when N was > 100.
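
As a rough back-of-the-envelope illustration of that quadratic growth
(assuming, purely for the sake of the example, about one iptables rule per
(local port, group member) pair when the group allows traffic from itself):

    # Illustration only: every one of the N local ports needs an entry for
    # every one of the N group members, so the rule count a single host's
    # agent has to program grows roughly as N^2.
    def approx_rules(n_vms_on_host):
        return n_vms_on_host * n_vms_on_host

    for n in (10, 50, 100, 200):
        print("%3d VMs on a host -> ~%6d iptables rules" % (n, approx_rules(n)))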



Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-10 Thread Nathani, Sreedhar (APS)
Hello Peter,

I have merged the code of the following patches for the minimize_polling setting and 
enabled minimize_polling in all the L2 agents:
https://review.openstack.org/45676
https://review.openstack.org/45677
https://review.openstack.org/45678
https://review.openstack.org/57475/

My setup has 17 L2 agents (16 compute nodes, one network node). Setting 
minimize_polling helped to reduce the CPU
utilization of the L2 agents, but it did not help instances get an IP 
during the first boot. 

In fact, with minimize_polling enabled, fewer instances could get an IP 
than without the minimize_polling fix.

Once we reach a certain number of ports (in my case 120 ports), during 
subsequent concurrent instance deployment (30 instances),
updating the port details in the dnsmasq hosts file takes a long time, which 
delays instances getting an IP address. 

When I deployed only 5 instances concurrently (with 211 instances already 
active) instead of 30, all the instances were able to get an IP. 
But when I deployed 10 instances concurrently (with 216 instances already 
active) instead of 30, none of the instances could get an IP.

Thanks & Regards,
Sreedhar Nathani

-Original Message-
From: Nathani, Sreedhar (APS) 
Sent: Friday, December 06, 2013 12:21 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: RE: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

Hello Peter,

Thanks for the info. I will do the tests with your code changes.

What surprises me is that when I did the tests on Grizzly, up to 210 instances 
could get an IP during the first boot. 
Once we crossed 210 active instances, during the next batch some instances could 
not get an IP. As the number of active instances grew, more instances 
could not get an IP.
But once I restarted those instances, they could get an IP address. I did the tests 
close to 10 times, and this behavior was consistent every time. 

But in Havana, instances are not able to get an IP once we cross 80 instances. 
Moreover, we need to restart the dnsmasq process for instances to get an IP during 
the next reboot.

Thanks & Regards,
Sreedhar Nathani


-Original Message-
From: Peter Feiner [mailto:pe...@gridcentric.ca]
Sent: Thursday, December 05, 2013 10:57 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

On Thu, Dec 5, 2013 at 8:23 AM, Nathani, Sreedhar (APS) 
 wrote:
> Hello Marun,
>
>
>
> Please find the details about my setup and tests which i have done so 
> far
>
>
>
> Setup
>
>   - One Physical Box with 16c, 256G memory. 2 VMs created on this Box
> - One for Controller and One for Network Node
>
>   - 16x compute nodes (each has 16c, 256G memory)
>
>   - All the systems are installed with Ubuntu Precise + Havana Bits 
> from Ubuntu Cloud Archive
>
>
>
> Steps to simulate the issue
>
>   1) Concurrently create 30 Instances (m1.small) using REST API with
> mincount=30
>
>   2) sleep for 20min and repeat the step (1)
>
>
>
>
>
> Issue 1
>
> In Havana, once we cross 150 instances (5 batches x 30) during 6th 
> batch some instances are going into ERROR state
>
> due to network port not able to create and some instances are getting 
> duplicate IP address
>
>
>
> Per Maru Newby this issue might related to this bug
>
> https://bugs.launchpad.net/bugs/1192381
>
>
>
> I have done the similar with Grizzly on the same environment 2 months 
> back, where I could able to deploy close to 240 instances without any 
> errors
>
> Initially on Grizzly also seen the same behavior but with these 
> tunings based on this bug
>
> https://bugs.launchpad.net/neutron/+bug/1160442, never had issues 
> (tested more than 10 times)
>
>sqlalchemy_pool_size = 60
>
>sqlalchemy_max_overflow = 120
>
>sqlalchemy_pool_timeout = 2
>
>agent_down_time = 60
>
>report_internval = 20
>
>
>
> In Havana, I have tuned the same tunables but I could never get past
> 150+ instances. Without the tunables I could not able to get past
>
> 100 instances. We are getting many timeout errors from the DHCP agent 
> and neutron clients
>
>
>
> NOTE: After tuning the agent_down_time to 60 and report_interval to 
> 20, we no longer getting these error messages
>
>2013-12-02 11:44:43.421 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>2013-12-02 11:44:43.439 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>2013-12-02 11:44:43.452 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-05 Thread Nathani, Sreedhar (APS)
Hello Peter,

Thanks for the info. I will do the tests with your code changes.

What surprises me is that when I did the tests on Grizzly, up to 210 instances 
could get an IP during the first boot. 
Once we crossed 210 active instances, during the next batch some instances could 
not get an IP. As the number of active instances grew, more instances 
could not get an IP.
But once I restarted those instances, they could get an IP address. I did the tests 
close to 10 times, and this behavior was consistent every time. 

But in Havana, instances are not able to get an IP once we cross 80 instances. 
Moreover, we need to restart the dnsmasq process for instances to get an IP during 
the next reboot.

Thanks & Regards,
Sreedhar Nathani


-Original Message-
From: Peter Feiner [mailto:pe...@gridcentric.ca] 
Sent: Thursday, December 05, 2013 10:57 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] Performance Regression in Neutron/Havana compared 
to Quantum/Grizzly

On Thu, Dec 5, 2013 at 8:23 AM, Nathani, Sreedhar (APS) 
 wrote:
> Hello Marun,
>
>
>
> Please find the details about my setup and tests which i have done so 
> far
>
>
>
> Setup
>
>   - One Physical Box with 16c, 256G memory. 2 VMs created on this Box 
> - One for Controller and One for Network Node
>
>   - 16x compute nodes (each has 16c, 256G memory)
>
>   - All the systems are installed with Ubuntu Precise + Havana Bits 
> from Ubuntu Cloud Archive
>
>
>
> Steps to simulate the issue
>
>   1) Concurrently create 30 Instances (m1.small) using REST API with
> mincount=30
>
>   2) sleep for 20min and repeat the step (1)
>
>
>
>
>
> Issue 1
>
> In Havana, once we cross 150 instances (5 batches x 30) during 6th 
> batch some instances are going into ERROR state
>
> due to network port not able to create and some instances are getting 
> duplicate IP address
>
>
>
> Per Maru Newby this issue might related to this bug
>
> https://bugs.launchpad.net/bugs/1192381
>
>
>
> I have done the similar with Grizzly on the same environment 2 months 
> back, where I could able to deploy close to 240 instances without any 
> errors
>
> Initially on Grizzly also seen the same behavior but with these 
> tunings based on this bug
>
> https://bugs.launchpad.net/neutron/+bug/1160442, never had issues 
> (tested more than 10 times)
>
>sqlalchemy_pool_size = 60
>
>sqlalchemy_max_overflow = 120
>
>sqlalchemy_pool_timeout = 2
>
>agent_down_time = 60
>
>report_internval = 20
>
>
>
> In Havana, I have tuned the same tunables but I could never get past 
> 150+ instances. Without the tunables I could not able to get past
>
> 100 instances. We are getting many timeout errors from the DHCP agent 
> and neutron clients
>
>
>
> NOTE: After tuning the agent_down_time to 60 and report_interval to 
> 20, we no longer getting these error messages
>
>2013-12-02 11:44:43.421 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>2013-12-02 11:44:43.439 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>2013-12-02 11:44:43.452 28201 WARNING 
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>
>
>
>
> In the compute node openvswitch agent logs, we see these errors 
> repeating continuously
>
>
>
> 2013-12-04 06:46:02.081 3546 TRACE
> neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout 
> while waiting on RPC response - topic: "q-plugin", RPC method:
> "security_group_rules_for_devices" info: ""
>
> and WARNING neutron.openstack.common.rpc.amqp [-] No calling threads 
> waiting for msg_id
>
>
>
> DHCP agent has below errors
>
>
>
> 2013-12-02 15:35:19.557 22125 ERROR neutron.agent.dhcp_agent [-] 
> Unable to reload_allocations dhcp.
>
> 2013-12-02 15:35:19.557 22125 TRACE neutron.agent.dhcp_agent Timeout:
> Timeout while waiting on RPC response - topic: "q-plugin", RPC method:
> "get_dhcp_port" info: ""
>
>
>
> 2013-12-02 15:35:34.266 22125 ERROR neutron.agent.dhcp_agent [-] 
> Unable to sync network state.
>
> 2013-12-02 15:35:34.266 22125 TRACE neutron.agent.dhcp_agent Timeout:
> Timeout while waiting on RPC response - topic: "q-plugin", RPC method:
> "get_active_networks_info" info: ""
>
>
>
>
>
> In Havana, I have merged the code from this patch and set api_workers 
> to 8 (My Controller VM has 8cores/16Hyperthreads)
>
> https://review.openstack.org/#/c/37131/
>
>
>
> After this p

Re: [openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-05 Thread Peter Feiner
On Thu, Dec 5, 2013 at 8:23 AM, Nathani, Sreedhar (APS)
 wrote:
> Hello Marun,
>
>
>
> Please find the details about my setup and tests which i have done so far
>
>
>
> Setup
>
>   - One Physical Box with 16c, 256G memory. 2 VMs created on this Box - One
> for Controller and One for Network Node
>
>   - 16x compute nodes (each has 16c, 256G memory)
>
>   - All the systems are installed with Ubuntu Precise + Havana Bits from
> Ubuntu Cloud Archive
>
>
>
> Steps to simulate the issue
>
>   1) Concurrently create 30 Instances (m1.small) using REST API with
> mincount=30
>
>   2) sleep for 20min and repeat the step (1)
>
>
>
>
>
> Issue 1
>
> In Havana, once we cross 150 instances (5 batches x 30) during 6th batch
> some instances are going into ERROR state
>
> due to network port not able to create and some instances are getting
> duplicate IP address
>
>
>
> Per Maru Newby this issue might related to this bug
>
> https://bugs.launchpad.net/bugs/1192381
>
>
>
> I have done the similar with Grizzly on the same environment 2 months back,
> where I could able to deploy close to 240 instances without any errors
>
> Initially on Grizzly also seen the same behavior but with these tunings
> based on this bug
>
> https://bugs.launchpad.net/neutron/+bug/1160442, never had issues (tested
> more than 10 times)
>
>sqlalchemy_pool_size = 60
>
>sqlalchemy_max_overflow = 120
>
>sqlalchemy_pool_timeout = 2
>
>agent_down_time = 60
>
>report_internval = 20
>
>
>
> In Havana, I have tuned the same tunables but I could never get past 150+
> instances. Without the tunables I could not able to get past
>
> 100 instances. We are getting many timeout errors from the DHCP agent and
> neutron clients
>
>
>
> NOTE: After tuning the agent_down_time to 60 and report_interval to 20, we
> no longer getting these error messages
>
>2013-12-02 11:44:43.421 28201 WARNING
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>2013-12-02 11:44:43.439 28201 WARNING
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>2013-12-02 11:44:43.452 28201 WARNING
> neutron.scheduler.dhcp_agent_scheduler [-] No more DHCP agents
>
>
>
>
>
> In the compute node openvswitch agent logs, we see these errors repeating
> continuously
>
>
>
> 2013-12-04 06:46:02.081 3546 TRACE
> neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout while
> waiting on RPC response - topic: "q-plugin", RPC method:
> "security_group_rules_for_devices" info: ""
>
> and WARNING neutron.openstack.common.rpc.amqp [-] No calling threads waiting
> for msg_id
>
>
>
> DHCP agent has below errors
>
>
>
> 2013-12-02 15:35:19.557 22125 ERROR neutron.agent.dhcp_agent [-] Unable to
> reload_allocations dhcp.
>
> 2013-12-02 15:35:19.557 22125 TRACE neutron.agent.dhcp_agent Timeout:
> Timeout while waiting on RPC response - topic: "q-plugin", RPC method:
> "get_dhcp_port" info: ""
>
>
>
> 2013-12-02 15:35:34.266 22125 ERROR neutron.agent.dhcp_agent [-] Unable to
> sync network state.
>
> 2013-12-02 15:35:34.266 22125 TRACE neutron.agent.dhcp_agent Timeout:
> Timeout while waiting on RPC response - topic: "q-plugin", RPC method:
> "get_active_networks_info" info: ""
>
>
>
>
>
> In Havana, I have merged the code from this patch and set api_workers to 8
> (My Controller VM has 8cores/16Hyperthreads)
>
> https://review.openstack.org/#/c/37131/
>
>
>
> After this patch and starting 8 neutron-server worker threads, during the
> batch creation of 240 instances with 30 concurrent requests during each
> batch,
>
> 238 instances became active and 2 instances went into error. Interesting
> these 2 instances which went into error state are from the same compute
> node.
>
>
>
> Unlike earlier this time, the errors are due to 'Too Many Connections' to
> the MySQL database.
>
> 2013-12-04 17:07:59.877 21286 AUDIT nova.compute.manager
> [req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041
> 9b073211dd5c4988993341cc955e200b] [instance:
> c14596fd-13d5-482b-85af-e87077d4ed9b] Terminating instance
>
> 2013-12-04 17:08:00.578 21286 ERROR nova.compute.manager
> [req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041
> 9b073211dd5c4988993341cc955e200b] [instance:
> c14596fd-13d5-482b-85af-e87077d4ed9b] Error: Remote error: OperationalError
> (OperationalError) (1040, 'Too many connections') None None
>
>
>
> Need to back port the patch 'https://review.openstack.org/#/c/37131/' to
> address the Neutron Scaling issues in Havana.
>
> Carl already back porting this patch into Havana
> https://review.openstack.org/#/c/60082/ which is good.
>
>
>
> Issue 2
>
> Grizzly :
>
> During the concurrent instance creation in Grizzly, once we cross 210
> instances, during subsequent 30 instance creation some of
>
> the instances could not get their IP address during the first boot with in
> first few min. Instance MAC and IP Address details
>
> were updated in the dnsmasq host file but with a delay. In

[openstack-dev] Performance Regression in Neutron/Havana compared to Quantum/Grizzly

2013-12-05 Thread Nathani, Sreedhar (APS)
Hello Marun,

Please find below the details of my setup and the tests which I have done so far.

Setup
  - One Physical Box with 16c, 256G memory. 2 VMs created on this Box - One for 
Controller and One for Network Node
  - 16x compute nodes (each has 16c, 256G memory)
  - All the systems are installed with Ubuntu Precise + Havana Bits from Ubuntu 
Cloud Archive

Steps to simulate the issue
  1) Concurrently create 30 Instances (m1.small) using REST API with mincount=30
  2) sleep for 20min and repeat the step (1)
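
For reference, step 1 corresponds roughly to a CLI call like the following
(image and instance names are placeholders; the batching flag is the
num-instances option referred to later in this thread):

    nova boot --image <image-id> --flavor m1.small --num-instances 30 scale-test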


Issue 1
In Havana, once we cross 150 instances (5 batches x 30), during the 6th batch some 
instances go into ERROR state
because their network port could not be created, and some instances get duplicate 
IP addresses.

Per Maru Newby, this issue might be related to this bug:
https://bugs.launchpad.net/bugs/1192381

I did similar tests with Grizzly on the same environment 2 months back, 
where I was able to deploy close to 240 instances without any errors.
Initially Grizzly also showed the same behavior, but with these tunings based 
on this bug,
https://bugs.launchpad.net/neutron/+bug/1160442, I never had issues (tested more 
than 10 times):
   sqlalchemy_pool_size = 60
   sqlalchemy_max_overflow = 120
   sqlalchemy_pool_timeout = 2
   agent_down_time = 60
   report_interval = 20

In Havana, I have tuned the same tunables, but I could never get past 150+ 
instances; without the tunables I could not get past
100 instances. We are getting many timeout errors from the DHCP agent and 
neutron clients.

NOTE: After tuning the agent_down_time to 60 and report_interval to 20, we no 
longer get these error messages:
   2013-12-02 11:44:43.421 28201 WARNING neutron.scheduler.dhcp_agent_scheduler 
[-] No more DHCP agents
   2013-12-02 11:44:43.439 28201 WARNING neutron.scheduler.dhcp_agent_scheduler 
[-] No more DHCP agents
   2013-12-02 11:44:43.452 28201 WARNING neutron.scheduler.dhcp_agent_scheduler 
[-] No more DHCP agents


In the compute node openvswitch agent logs, we see these errors repeating 
continuously

2013-12-04 06:46:02.081 3546 TRACE 
neutron.plugins.openvswitch.agent.ovs_neutron_agent Timeout: Timeout while 
waiting on RPC response - topic: "q-plugin", RPC method: 
"security_group_rules_for_devices" info: ""
and WARNING neutron.openstack.common.rpc.amqp [-] No calling threads waiting 
for msg_id

DHCP agent has below errors

2013-12-02 15:35:19.557 22125 ERROR neutron.agent.dhcp_agent [-] Unable to 
reload_allocations dhcp.
2013-12-02 15:35:19.557 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout 
while waiting on RPC response - topic: "q-plugin", RPC method: "get_dhcp_port" 
info: ""

2013-12-02 15:35:34.266 22125 ERROR neutron.agent.dhcp_agent [-] Unable to sync 
network state.
2013-12-02 15:35:34.266 22125 TRACE neutron.agent.dhcp_agent Timeout: Timeout 
while waiting on RPC response - topic: "q-plugin", RPC method: 
"get_active_networks_info" info: ""


In Havana, I have merged the code from this patch and set api_workers to 8 (My 
Controller VM has 8cores/16Hyperthreads)
https://review.openstack.org/#/c/37131/
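
For reference, with that patch applied the change boils down to a single
neutron.conf setting (assuming the option name introduced by the patch):

    [DEFAULT]
    api_workers = 8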

After this patch, with 8 neutron-server workers started, during the batch 
creation of 240 instances with 30 concurrent requests per batch,
238 instances became active and 2 instances went into error. Interestingly, the 
2 instances which went into the error state are on the same compute node.

Unlike earlier, this time the errors are due to 'Too many connections' to the 
MySQL database.
2013-12-04 17:07:59.877 21286 AUDIT nova.compute.manager 
[req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041 
9b073211dd5c4988993341cc955e200b] [instance: 
c14596fd-13d5-482b-85af-e87077d4ed9b] Terminating instance
2013-12-04 17:08:00.578 21286 ERROR nova.compute.manager 
[req-26d64693-d1ef-40f3-8350-659e34d5b1d7 c4d609870d4447c684858216da2f8041 
9b073211dd5c4988993341cc955e200b] [instance: 
c14596fd-13d5-482b-85af-e87077d4ed9b] Error: Remote error: OperationalError 
(OperationalError) (1040, 'Too many connections') None None

We need to backport the patch https://review.openstack.org/#/c/37131/ to 
address the Neutron scaling issues in Havana.
Carl is already backporting this patch to Havana 
(https://review.openstack.org/#/c/60082/), which is good.

Issue 2
Grizzly :
During concurrent instance creation in Grizzly, once we cross 210 
instances, during the subsequent creation of 30 instances some of
the instances could not get their IP address within the first few minutes of 
the first boot. The instance MAC and IP address details
were updated in the dnsmasq hosts file, but with a delay; the instances 
eventually got their IP address.

If we rebooted an instance using 'nova reboot', the instance would get an IP address.
*   The amount of delay depends on the number of network ports and is in the 
range of 8 seconds to 2 minutes.


Havana :
But in Havana only 81 instances could get the IP during the first boot. Port is 
getting created and IP ad