Thanks for doing this. I dug into the test_volume_boot_pattern test to see what was going on.
On the first boot, Nova called Neutron to create a port at 23:29:44 and it took 441ms for Neutron to return the port to Nova.[1] Nova then plugged the interface for that port into OVS a little over 6 seconds later, at 23:29:50.[2] The Neutron agent attempted to process this on the iteration at 23:29:52 [3]; however, it didn't get the ofport populated from the OVSDB monitor... a bug![4] The agent did catch it on a retry during the next iteration two seconds later and notified the Neutron server at 23:29:54.[5] The Neutron server processed the port ACTIVE change in just under 80ms,[6] but it did not dispatch the notification to Nova until 2 seconds later, at 23:29:56,[7] due to the Nova notification batching mechanism.[8]

Total time between port create and boot is about 12 seconds: 6 in Nova and 6 in Neutron. On the Neutron side, the fix for the bug above should eliminate 2 seconds. We could probably also make the Nova notifier batching mechanism a little more aggressive, so it only batches up calls that arrive within a very short interval rather than always using 2-second buckets. The remaining 2 seconds is just the agent processing loop interval, which can be tuned with a config option, but that should be okay if it's the only bottleneck.

On the Nova side, we need to improve the 6 seconds between creating the Neutron port and plugging it into the vswitch. I can see Nova makes some other calls to Neutron in this window to list security groups and floating IPs. Maybe these can be done asynchronously, because I don't think they should block the initial part of the VM boot that plugs in the VIF.

Completely unrelated to the boot process: the entire tempest run spent ~412 seconds building and destroying Neutron resources in setup and teardown.[9] However, considering the number of tests executed, this seems reasonable, so I'm not sure we need to work on optimizing that yet.

Cheers,
Kevin Benton

1. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_45_341
2. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-n-cpu.txt.gz#_2016-03-21_23_29_50_629
3. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_52_216
4. https://bugs.launchpad.net/neutron/+bug/1560464
5. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_54_738
6. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_54_813
7. http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_56_782
8. http://git.openstack.org/cgit/openstack/neutron/tree/neutron/notifiers/nova.py
9. egrep -R 'tearDown|setUp' tempest.txt.gz | grep 9696 | awk '{print $(NF)}' | ./fsum

On Mon, Mar 21, 2016 at 5:09 PM, Clark Boylan <[email protected]> wrote:
> On Mon, Mar 21, 2016, at 01:23 PM, Sean Dague wrote:
> > On 03/21/2016 04:09 PM, Clark Boylan wrote:
> > > On Mon, Mar 21, 2016, at 11:49 AM, Clark Boylan wrote:
> > >> On Mon, Mar 21, 2016, at 11:08 AM, Clark Boylan wrote:
> > >>> On Mon, Mar 21, 2016, at 09:32 AM, Armando M. wrote:
> > >>>> Do you have a better insight of job runtimes vs jobs in other
> > >>>> projects? Most of the time in the job runtime is actually spent
> > >>>> setting the infrastructure up, and I am not sure we can do
> > >>>> anything about it, unless we take this up with Infra.
> > >>>
> > >>> I haven't done a comparison yet, but let's break down the runtime
> > >>> of a recent successful neutron full run against neutron master [0].
> > >>
> > >> And now for some comparative data from the gate-tempest-dsvm-full
> > >> job [0].
> > >> This job also ran against a master change that merged and ran in
> > >> the same cloud and region as the neutron job.
> > > snip
> > >> Generally each step of this job was quicker. There were big
> > >> differences in devstack and tempest run time though. Is devstack
> > >> much slower to set up neutron when compared to nova net? For tempest
> > >> it looks like we run ~1510 tests against neutron and only ~1269
> > >> against nova net. This may account for the large difference there. I
> > >> also recall that we run ipv6 tempest tests against neutron
> > >> deployments that were inefficient and booted 2 qemu VMs per test
> > >> (not sure if that is still the case, but it illustrates that the
> > >> tests themselves may not be very quick in the neutron case).
> > >
> > > Looking at the tempest slowest tests output for each of these jobs
> > > (neutron and nova net), some tests line up really well across jobs
> > > and others do not. In order to get a better handle on the runtime for
> > > individual tests I have pushed https://review.openstack.org/295487
> > > which will run tempest serially, reducing the competition for
> > > resources between tests.
> > >
> > > Hopefully the subunit logs generated by this change can provide more
> > > insight into where we are losing time during the tempest test runs.
>
> The results are in: we have gate-tempest-dsvm-full [0] and
> gate-tempest-dsvm-neutron-full [1] job results where tempest ran
> serially to reduce resource contention and provide accurate-ish per-test
> timing data. Both of these jobs ran on the same cloud, so they should
> have comparable performance from the underlying VMs.
>
> gate-tempest-dsvm-full
> Time spent in job before tempest: 700 seconds
> Time spent running tempest: 2428 seconds
> Tempest tests run: 1269 (113 skipped)
>
> gate-tempest-dsvm-neutron-full
> Time spent in job before tempest: 789 seconds
> Time spent running tempest: 4407 seconds
> Tempest tests run: 1510 (76 skipped)
>
> All times above are wall time as recorded by Jenkins.
>
> We can also compare the 10 slowest tests in the non-neutron job against
> their runtimes in the neutron job. (Note this isn't a list of the top 10
> slowest tests in the neutron job, because that job runs extra tests.)
>
> nova net job
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern 85.232
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern 83.319
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance 50.338
> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern 43.494
> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario 40.225
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance 39.653
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete 37.720
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete 36.355
> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped 27.375
> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks 27.025
>
> neutron job
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern 110.345
> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern 108.170
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance 63.852
> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance 59.931
> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern 57.835
> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario 49.552
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete 40.378
> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete 39.088
> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks 35.645
> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped 30.551
>
> > Subunit logs aren't the full story here. Activity in addCleanup doesn't
> > get added to the subunit time accounting for the test, which causes
> > some interesting issues when waiting for resources to delete. I would
> > be especially cautious of that on some of these.
>
> Based on this, those numbers above may not tell the whole story, but
> they do seem to tell us that in comparable circumstances neutron is
> slower than nova net. Now the sample size is tiny, but again it gives us
> somewhere to start. What is boot from volume doing in the neutron case
> that makes it so much slower? Why is shelving so much slower with
> neutron? And so on.
>
> A few seconds here and a few seconds there adds up when these operations
> are repeated a few hundred times. We can probably start to whittle the
> job runtime down by shaving that extra time off. In any case I think
> this is about as far as I can pull this thread, and will let the neutron
> team take it from here.
>
> [0]
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-full/8e64615/console.html
> [1]
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/console.html
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
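P.S. To make the "more aggressive batching" idea above concrete, here is a minimal sketch of flush-on-quiet-interval batching. Everything here is hypothetical and illustrative only (the class name, parameters, and callback shape are mine, not the actual code in neutron/notifiers/nova.py): the point is just that a batch gets flushed shortly after events stop arriving, with a cap on total delay, instead of always waiting out a fixed 2-second bucket.

```python
import threading
import time


class QuietIntervalBatcher:
    """Illustrative sketch only, not the neutron.notifiers.nova code.

    Collect events and flush them to a callback once no new event has
    arrived for `quiet_interval` seconds, but never hold a batch longer
    than `max_delay` seconds in total.
    """

    def __init__(self, flush_cb, quiet_interval=0.2, max_delay=2.0):
        self._flush_cb = flush_cb      # called with the list of batched events
        self._quiet = quiet_interval   # flush after this much silence
        self._max_delay = max_delay    # hard cap on how long a batch waits
        self._events = []
        self._first_event_at = None
        self._timer = None
        self._lock = threading.Lock()

    def add(self, event):
        with self._lock:
            now = time.monotonic()
            if not self._events:
                self._first_event_at = now
            self._events.append(event)
            # Re-arm the flush timer on every event, but never push the
            # flush past max_delay from the first event in the batch.
            if self._timer is not None:
                self._timer.cancel()
            delay = min(self._quiet,
                        self._max_delay - (now - self._first_event_at))
            self._timer = threading.Timer(max(delay, 0.0), self._flush)
            self._timer.daemon = True
            self._timer.start()

    def _flush(self):
        with self._lock:
            events, self._events = self._events, []
            self._first_event_at = None
        if events:
            self._flush_cb(events)
```

With a quiet interval well under a second, a burst of port-ACTIVE events for a single boot would still go out to Nova as one notification, while an isolated event would be dispatched almost immediately rather than sitting in a 2-second bucket.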
