Public bug reported:

Looking at the Neutron Failure Rate dashboard, specifically the tempest jobs:

http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen

one can see that the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate, over 90% for the past 5 days.

Matt Riedemann did an analysis (pasted below); the summary is that setup of the 3-node job fails frequently because the third node is not discovered, which leads to a failure when that node is later used. So the first step is to change the devstack-gate (?) code to wait for all the subnodes to show up from a Nova perspective before proceeding. There was a previous attempt at a grenade change in https://review.openstack.org/#/c/426310/ that was abandoned, but that seems like a good starting point based on the analysis.

Matt's comment #1:

Looking at the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job failure, subnode-2 and subnode-3 both look OK as far as their config goes. They use the same values for nova-cpu.conf, pointing at the nova_cell1 MQ, which points at the cell1 conductor and cell1 database.

I see that the compute node records for both subnode-2 and subnode-3 are created *after* discover_hosts runs:

2017-07-25 15:06:55.991684 | + /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L777: discover_hosts

Jul 25 15:06:58.945371 ubuntu-xenial-3-node-rax-iad-10067333-744503 nova-compute[794]: INFO nova.compute.resource_tracker [None req-f69c76bf-0263-494b-8257-61617c90d799 None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744503:ubuntu-xenial-3-node-rax-iad-10067333-744503 with uuid: 1788fe0b-496c-4eda-b03a-2cf4a2733a94

Jul 25 15:07:02.323379 ubuntu-xenial-3-node-rax-iad-10067333-744504 nova-compute[827]: INFO nova.compute.resource_tracker [None req-95419fec-a2a7-467f-b167-d83755273a7a None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744504:ubuntu-xenial-3-node-rax-iad-10067333-744504 with uuid: ae3420a1-20d2-42a1-909d-fc9cf1b14248

And looking at the discover_hosts output, only subnode-2 is discovered as an unmapped host:

http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstack-gate-discover-hosts.txt.gz

The compute node from the primary host is discovered and mapped to cell1 as part of the devstack run on the primary host:

http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstacklog.txt.gz#_2017-07-25_14_50_45_831

So it seems we are simply getting lucky by discovering the compute node from subnode-2 and mapping it to cell1, while missing the compute node from subnode-3; it never gets mapped, and things fail when Tempest tries to use it. This could be a problem on any 3-node job and might not be related only to this devstack change.

Matt's comment #2:

I've gone through the dvr-ha 3-node job failure and it appears to be a latent issue that we could also hit in 2-node jobs. I even noticed in a 2-node job that the subnode's compute node record is created after we start running discover_hosts from the primary via devstack-gate, so this is a race window we already have; 3-node jobs may simply expose it more because they are slower or put more load on the control node.

If you look at the cells v2 setup guide, it even says to make sure the computes are created before running discover_hosts: https://docs.openstack.org/nova/latest/user/cells.html

"Configure and start your compute hosts. Before step 7, make sure you have compute hosts in the database by running nova service-list --binary nova-compute."

Step 7 is running 'nova-manage cell_v2 discover_hosts'.

Ideally, devstack-gate should pass a variable to devstack's discover_hosts.sh script telling it how many compute services we expect (3 in the case of the dvr-ha job). That script would then run 'nova service-list --binary nova-compute' and count the results until the expected number is reached or a timeout expires, and only then run discover_hosts. That's really what we expect from other deployment tools like triple-o and kolla. But overall I'm not finding anything in this change that's killing these jobs outright, so let's get it in.
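To make that proposed wait concrete, here is a rough sketch of what the discover_hosts step could do before the mapping pass. EXPECTED_COMPUTE_SERVICES and COMPUTE_WAIT_TIMEOUT are hypothetical names, not existing devstack or devstack-gate variables; devstack-gate would need to export something like them (3 for this job):

    # Hypothetical sketch only -- these variables do not exist today.
    # devstack-gate would export EXPECTED_COMPUTE_SERVICES=1+<number of subnodes>.
    expected=${EXPECTED_COMPUTE_SERVICES:-1}
    timeout=${COMPUTE_WAIT_TIMEOUT:-120}

    elapsed=0
    while (( elapsed < timeout )); do
        # Count registered nova-compute services (needs admin credentials).
        found=$(openstack compute service list --service nova-compute -f value -c Host | wc -l)
        if (( found >= expected )); then
            break
        fi
        echo "Waiting for nova-compute services: ${found}/${expected} registered"
        sleep 10
        (( elapsed += 10 ))
    done

    # Map whatever has registered; this last step is all the job does today.
    nova-manage cell_v2 discover_hosts --verbose

If the count never reaches the expected number, failing right here with a clear message would be a much better signal than the 'host is not mapped to any cell' error Tempest hits later.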
Matt's comment #3:

This is what I see for voting jobs that fail with the 'host is not mapped to any cell' error:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Host%5C%22%20AND%20message%3A%5C%22is%20not%20mapped%20to%20any%20cell%5C%22%20AND%20tags%3A%5C%22console%5C%22%20AND%20voting%3A1%20AND%20build_status%3A%5C%22FAILURE%5C%22&from=7d

(The encoded query is: message:"Host" AND message:"is not mapped to any cell" AND tags:"console" AND voting:1 AND build_status:"FAILURE", over the last 7 days.)

Those are all grenade multinode jobs. Likely https://review.openstack.org/#/c/426310/, or a variant thereof, would resolve it.
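As a manual check (useful when reproducing this on a local multinode devstack; it is not something the job runs today), the unmapped-host state can be confirmed and repaired from the primary node once the missing nova-compute has registered:

    # Confirm how many nova-compute services have registered; the dvr-ha
    # job expects 3 (primary plus two subnodes).
    nova service-list --binary nova-compute

    # Re-run the mapping pass; any compute service that registered after
    # the first pass gets mapped to cell1 and becomes usable by the scheduler.
    nova-manage cell_v2 discover_hosts --verbose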
** Affects: neutron
     Importance: High
     Assignee: Brian Haley (brian-haley)
         Status: Confirmed

** Tags: l3-dvr-backlog

https://bugs.launchpad.net/bugs/1707003

Title:
  gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job
  has a very high failure rate

Status in neutron:
  Confirmed