Public bug reported: ENV: Neutron stable/queens (12.0.1) CentOS 7 (3.10.0-514.26.2.el7.x86_64) Ceph v10.2.9 Jewel
How to reproduce: Concurrently create 256 VMs in a single network which has 2 dhcp agents. Exception: nova-compute side: 2018-06-25 17:56:09.394 43886 DEBUG nova.compute.manager [req-22395cd8-2461-411c-9da4-7da1be23e480 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Preparing to wait for external event network-vif-plugged-6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:325 2018-06-25 18:01:10.670 43886 WARNING nova.virt.libvirt.driver [req-679f1393-777d-4b9f-9faa-410e9d7e79b2 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Timeout waiting for [('network-vif-plugged', u'6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds neutron server log: A failed nova boot, the neutron port only have one log: 2018-06-25 17:56:06.270 221045 DEBUG neutron.db.provisioning_blocks [req-4bddb839-e9d2-4188-a95a-9a67990c05c0 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object 6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73 It will never get the `Provisioning complete` notification. Furthermore, a successfully booted nova instance, for the provisioning_blocks, it also takes 40s for the DHCP: 2018-06-25 18:00:17.180 266883 DEBUG neutron.db.provisioning_blocks [req-37107d50-4777-4f56-b9a0-eaf9b69c17d9 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73 2018-06-25 18:00:57.165 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 completed by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132 2018-06-25 18:00:57.167 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning complete for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 triggered by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138 Code: For [1], seems a sync action with lock will block all port_update rpc. But for our test, there is no dhcp sync during the instance boot. For [2], this lock is the essential issue, all the port update are coming from same network, some of them will not get the lock for more than 300 seconds. [1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L424 [2] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L428 Potential solution: Add more dhcp agents for the network, for our test settings, add dhcp_agents_per_network = 10, then it works. But, such config is really not so good for the production envrionment. ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1778616 Title: DHCP agent block network-vif-plugged cause Instance failed to build Status in neutron: New Bug description: ENV: Neutron stable/queens (12.0.1) CentOS 7 (3.10.0-514.26.2.el7.x86_64) Ceph v10.2.9 Jewel How to reproduce: Concurrently create 256 VMs in a single network which has 2 dhcp agents. Exception: nova-compute side: 2018-06-25 17:56:09.394 43886 DEBUG nova.compute.manager [req-22395cd8-2461-411c-9da4-7da1be23e480 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Preparing to wait for external event network-vif-plugged-6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 prepare_for_instance_event /usr/lib/python2.7/site-packages/nova/compute/manager.py:325 2018-06-25 18:01:10.670 43886 WARNING nova.virt.libvirt.driver [req-679f1393-777d-4b9f-9faa-410e9d7e79b2 78ae27a6ab794fe39e64f57310c15b0e 70f17debbc324d81bbf76aaa2e3c1bd0 - default default] [instance: c6c1d69a-5ee6-4097-8294-8adb02b49a12] Timeout waiting for [('network-vif-plugged', u'6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds neutron server log: A failed nova boot, the neutron port only have one log: 2018-06-25 17:56:06.270 221045 DEBUG neutron.db.provisioning_blocks [req-4bddb839-e9d2-4188-a95a-9a67990c05c0 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object 6f6962b8-b1b0-48a4-99dd-ae19ec1a0f87 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73 It will never get the `Provisioning complete` notification. Furthermore, a successfully booted nova instance, for the provisioning_blocks, it also takes 40s for the DHCP: 2018-06-25 18:00:17.180 266883 DEBUG neutron.db.provisioning_blocks [req-37107d50-4777-4f56-b9a0-eaf9b69c17d9 fc06c70220e74dfd90ebb516ff0da51d 9ff72ac0624c48c390535d58e8f0b3a1 - default default] Transition to ACTIVE for port object bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 will not be triggered until provisioned by entity DHCP. add_provisioning_component /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:73 2018-06-25 18:00:57.165 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 completed by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:132 2018-06-25 18:00:57.167 266884 DEBUG neutron.db.provisioning_blocks [req-720231ac-5996-432d-b033-37c340532127 - - - - -] Provisioning complete for port bcf1767c-1b9d-48ca-a5dc-c74587fa35e5 triggered by entity DHCP. provisioning_complete /usr/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138 Code: For [1], seems a sync action with lock will block all port_update rpc. But for our test, there is no dhcp sync during the instance boot. For [2], this lock is the essential issue, all the port update are coming from same network, some of them will not get the lock for more than 300 seconds. [1] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L424 [2] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L428 Potential solution: Add more dhcp agents for the network, for our test settings, add dhcp_agents_per_network = 10, then it works. But, such config is really not so good for the production envrionment. To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1778616/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp