Thanks for doing this. I dug into the test_volume_boot_pattern test to see what was going on.
On the first boot, Nova called Neutron to create a port at 23:29:44 and it took 441ms for Neutron to return the port to Nova.[1] Nova then plugged the interface for that port into OVS a little over 6 seconds later, at 23:29:50.[2] The Neutron agent attempted to process this on the iteration at 23:29:52 [3]; however, it didn't get the ofport populated from the OVSDB monitor... a bug![4] The agent did catch it on a retry during the next iteration two seconds later and notified the Neutron server at 23:29:54.[5] The Neutron server processed the port ACTIVE change in just under 80ms,[6] but it did not dispatch the notification to Nova until 2 seconds later, at 23:29:56,[7] due to the Nova notification batching mechanism.[8]

Total time between port create and boot is about 12 seconds: 6 in Nova and 6 in Neutron. On the Neutron side, the fix for the bug above should eliminate 2 seconds. We could probably also make the Nova notifier batching mechanism a little more aggressive, so it only batches up calls that arrive within a very short interval rather than always using 2-second buckets. The remaining 2 seconds is just the agent processing loop interval, which can be tuned with a config option, but that should be okay if it's the only bottleneck.

On the Nova side, we need to improve the 6 seconds between creating the Neutron port and plugging it into the vswitch. I can see Nova makes some other calls to Neutron in this window to list security groups and floating IPs. Maybe these can be done asynchronously, because I don't think they should block the initial part of the VM boot that plugs in the VIF.

Completely unrelated to the boot process: the entire tempest run spent ~412 seconds building and destroying Neutron resources in setup and teardown.[9] However, considering the number of tests executed, this seems reasonable, so I'm not sure we need to work on optimizing that yet.

Cheers,
Kevin Benton

1. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_45_341
2. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-n-cpu.txt.gz#_2016-03-21_23_29_50_629
3. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_52_216
4. https://bugs.launchpad.net/neutron/+bug/1560464
5. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_54_738
6. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_54_813
7. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_56_782
8. http://git.openstack.org/cgit/openstack/neutron/tree/neutron/notifiers/nova.py
9. egrep -R 'tearDown|setUp' tempest.txt.gz | grep 9696 | awk '{print $(NF)}' | ./fsum

On Mon, Mar 21, 2016 at 5:09 PM, Clark Boylan <[email protected]> wrote:
> On Mon, Mar 21, 2016, at 01:23 PM, Sean Dague wrote:
> > On 03/21/2016 04:09 PM, Clark Boylan wrote:
> > > On Mon, Mar 21, 2016, at 11:49 AM, Clark Boylan wrote:
> > >> On Mon, Mar 21, 2016, at 11:08 AM, Clark Boylan wrote:
> > >>> On Mon, Mar 21, 2016, at 09:32 AM, Armando M. wrote:
> > >>>> Do you have a better insight of job runtimes vs jobs in other
> > >>>> projects? Most of the time in the job runtime is actually spent
> > >>>> setting the infrastructure up, and I am not sure we can do
> > >>>> anything about it, unless we take this up with Infra.
> > >>>
> > >>> I haven't done a comparison yet, but let's break down the runtime
> > >>> of a recent successful neutron full run against neutron master [0].
> > >>
> > >> And now for some comparative data from the gate-tempest-dsvm-full
> > >> job [0].
> > >> This job also ran against a master change that merged and ran in
> > >> the same cloud and region as the neutron job.
> > > snip
> > >> Generally each step of this job was quicker. There were big
> > >> differences in devstack and tempest run time though. Is devstack
> > >> much slower to set up neutron when compared to nova net? For tempest
> > >> it looks like we run ~1510 tests against neutron and only ~1269
> > >> against nova net. This may account for the large difference there. I
> > >> also recall that we run ipv6 tempest tests against neutron
> > >> deployments that were inefficient and booted 2 qemu VMs per test
> > >> (not sure if that is still the case, but it illustrates that the
> > >> tests themselves may not be very quick in the neutron case).
> > >
> > > Looking at the tempest slowest tests output for each of these jobs
> > > (neutron and nova net), some tests line up really well across jobs
> > > and others do not. In order to get a better handle on the runtime for
> > > individual tests I have pushed https://review.openstack.org/295487
> > > which will run tempest serially, reducing the competition for
> > > resources between tests.
> > >
> > > Hopefully the subunit logs generated by this change can provide more
> > > insight into where we are losing time during the tempest test runs.
>
> The results are in: we have gate-tempest-dsvm-full [0] and
> gate-tempest-dsvm-neutron-full [1] job results where tempest ran
> serially to reduce resource contention and provide accurate-ish per-test
> timing data. Both of these jobs ran on the same cloud, so they should
> have comparable performance from the underlying VMs.
>
> gate-tempest-dsvm-full
> Time spent in job before tempest: 700 seconds
> Time spent running tempest: 2428 seconds
> Tempest tests run: 1269 (113 skipped)
>
> gate-tempest-dsvm-neutron-full
> Time spent in job before tempest: 789 seconds
> Time spent running tempest: 4407 seconds
> Tempest tests run: 1510 (76 skipped)
>
> All times above are wall time as recorded by Jenkins.
>
> We can also compare the 10 slowest tests in the non-neutron job against
> their runtimes in the neutron job. (Note this isn't a list of the top 10
> slowest tests in the neutron job, because that job runs extra tests.)
>
> nova net job
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern 85.232
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern 83.319
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance 50.338
> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern 43.494
> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario 40.225
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance 39.653
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete 37.720
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete 36.355
> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped 27.375
> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks 27.025
>
> neutron job
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern 110.345
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern 108.170
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance 63.852
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance 59.931
> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern 57.835
> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario 49.552
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete 40.378
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete 39.088
> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks 35.645
> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped 30.551
>
> > Subunit logs aren't the full story here. Activity in addCleanup doesn't
> > get added to the subunit time accounting for the test, which causes
> > some interesting issues when waiting for resources to delete. I would
> > be especially cautious of that on some of these.
>
> Based on this, those numbers above may not tell the whole story, but
> they do seem to tell us that in comparable circumstances neutron is
> slower than nova net. Now the sample size is tiny, but again it gives us
> somewhere to start. What is boot from volume doing in the neutron case
> that makes it so much slower? Why is shelving so much slower with
> neutron? And so on.
>
> A few seconds here and a few seconds there adds up when these operations
> are repeated a few hundred times. We can probably start to whittle the
> job runtime down by shaving that extra time off. In any case I think
> this is about as far as I can pull this thread, and will let the neutron
> team take it from here.
>
> [0]
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-full/8e64615/console.html
> [1]
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/console.html
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
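P.S. To make the "more aggressive batching" idea above concrete, here is a minimal sketch of flush-on-quiet-interval batching. Everything here is hypothetical and illustrative only (the class name, parameters, and callback shape are mine, not the actual code in neutron/notifiers/nova.py): the point is just that a batch gets flushed shortly after events stop arriving, with a cap on total delay, instead of always waiting out a fixed 2-second bucket.

```python
import threading
import time


class QuietIntervalBatcher:
    """Illustrative sketch only, not the neutron.notifiers.nova code.

    Collect events and flush them to a callback once no new event has
    arrived for `quiet_interval` seconds, but never hold a batch longer
    than `max_delay` seconds in total.
    """

    def __init__(self, flush_cb, quiet_interval=0.2, max_delay=2.0):
        self._flush_cb = flush_cb      # called with the list of batched events
        self._quiet = quiet_interval   # flush after this much silence
        self._max_delay = max_delay    # hard cap on how long a batch waits
        self._events = []
        self._first_event_at = None
        self._timer = None
        self._lock = threading.Lock()

    def add(self, event):
        with self._lock:
            now = time.monotonic()
            if not self._events:
                self._first_event_at = now
            self._events.append(event)
            # Re-arm the flush timer on every event, but never push the
            # flush past max_delay from the first event in the batch.
            if self._timer is not None:
                self._timer.cancel()
            delay = min(self._quiet,
                        self._max_delay - (now - self._first_event_at))
            self._timer = threading.Timer(max(delay, 0.0), self._flush)
            self._timer.daemon = True
            self._timer.start()

    def _flush(self):
        with self._lock:
            events, self._events = self._events, []
            self._first_event_at = None
        if events:
            self._flush_cb(events)
```

With a quiet interval well under a second, a burst of port-ACTIVE events for a single boot would still go out to Nova as one notification, while an isolated event would be dispatched almost immediately rather than sitting in a 2-second bucket.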
