[Yahoo-eng-team] [Bug 1996995] [NEW] VMs inaccessible after live migration on certain Arista VXLAN Flood and Learn fabrics

2022-11-18 Thread Aaron S
Public bug reported:

Description
===
This is not a Nova bug per se, but rather an issue with Arista and potentially 
other network fabrics.

I have observed a case where VMs are unreachable over the network after
live migrating on certain fabrics, in this case Arista VXLAN, despite
the hypervisor sending out a number of GARP (gratuitous ARP) packets
following a live migration.

This was observed on an Arista VXLAN fabric - live migrating a VM
between hypervisors on two different switches. A live migration between
two hypervisors on the same switch is not affected.

In both cases I can see GARPs on the wire triggered by the VM being live
migrated; these packets have been observed from other hypervisors and
even from other VMs in the same VLAN on different hypervisors.

The VM becomes accessible again after a period of time, at the point the
switch ARP aging timer expires and the MAC is re-learnt on the correct switch.

This occurs on any VM - even a simple c1.m1 with no active workload,
backed by Ceph storage.

Steps to Reproduce
===

To try to prevent this from happening, I have tested the "libvirt: Add
announce-self post live-migration workaround" patch [0] - despite this,
the issue was still observed.

Create a VM: c1.m1 flavor or similar, CentOS 7 or CentOS 8, Ceph-backed
storage, no active or significant load on the VM.

Run:
`ping VM_IP | while read pong; do echo "$(date): $pong"; done`

Then:
`openstack server migrate --live TARGET_HOST VM_INSTANCE`

Expected result
===
VM live migrates and is accessible within a reasonable (<10 second) timeframe

Actual result
===
VM live migrates successfully, but ping fails until the switch ARP timer
resets (in our environment, 60-180 seconds)

Despite efforts from us and our network team, we have been unable to
determine why the VM is inaccessible. What we have noticed is that
sending a further number of announce_self commands to the QEMU monitor,
triggering more GARPs, gets the VM back into an accessible state in an
acceptable time of <5 seconds.
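
For reference, the same effect can be triggered by hand from the hypervisor
while debugging; a minimal sketch, assuming libvirt is in use and `INSTANCE`
stands for the libvirt domain name of the migrated VM (announce_self is a
QEMU HMP command):

```
# Manually ask QEMU to re-announce the guest's NICs (gratuitous ARP/RARP).
# INSTANCE is the libvirt domain name, e.g. instance-0000abcd; repeating the
# call a few times mimics the workaround described in this report.
for i in 1 2 3; do
    virsh qemu-monitor-command INSTANCE --hmp announce_self
    sleep 1
done
```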

Environment
===
Arista EOS 4.26M VXLAN fabric
OpenStack Nova Train, Ussuri, Victoria (with and without the announce-self patch [0])
Ceph Nautilus

OpenStack provider networking, using VLANs

Patch/Workaround
===
I have prepared a follow-up workaround patch which builds on the
announce-self patch [0]; we have been running it in our production
deployment.

This patch adds two configurable options and the associated code:

`enable_qemu_monitor_announce_max_retries` - this will call
announce_self a further n times, triggering more GARP packets to be
sent.

`enable_qemu_monitor_announce_retry_interval` - this is the delay used
between triggering the additional announce_self calls configured by the
option above.
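
For illustration only, this is how the tuning described below could be applied
on a compute node; a sketch, assuming the options end up in the `[workarounds]`
group of nova.conf (the group name is my assumption, not confirmed by the
patch) and that the `crudini` utility is available:

```
# Hypothetical nova.conf tuning for the follow-up workaround patch; the
# [workarounds] section name is an assumption -- check the submitted patch
# for the actual option group before relying on this.
sudo crudini --set /etc/nova/nova.conf workarounds enable_qemu_monitor_announce_max_retries 3
sudo crudini --set /etc/nova/nova.conf workarounds enable_qemu_monitor_announce_retry_interval 1
sudo systemctl restart nova-compute
```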

My tests of nearly 5000 live migrations show that the optimal settings
in our environment are 3 additional calls to qemu_announce_self with a 1
second delay - this gets our VMs accessible in 2 or 3 seconds in the
vast majority of cases, and 99% within 5 seconds after they stop
responding to ping (the point at which we determine they are
inaccessible).


I shall be submitting this patch for review by the Nova community.

0:
https://opendev.org/openstack/nova/commit/9609ae0bab30675e184d1fc63aec849c1de020d0

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: live-migration

[Yahoo-eng-team] [Bug 1939733] Fix included in openstack/neutron queens-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron queens-eol  release.

** Changed in: cloud-archive/queens
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939733

Title:
  [OSSA-2021-005] Arbitrary dnsmasq reconfiguration via extra_dhcp_opts
  (CVE-2021-40085)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Committed
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in Ubuntu Cloud Archive wallaby series:
  Fix Committed
Status in Ubuntu Cloud Archive xena series:
  New
Status in neutron:
  Fix Released
Status in OpenStack Security Advisory:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Hirsute:
  Won't Fix
Status in neutron source package in Impish:
  Fix Released

Bug description:
  The application doesn't check the input values of the extra_dhcp_opts port
  parameter, allowing a user to use a newline character. The values from
  extra_dhcp_opts are used when rendering the opts file which is passed to
  dnsmasq as a dhcp-optsfile. Considering this, an attacker can inject
  arbitrary options into that file.

  The main direct impact, in my opinion, is that an attacker can push
  arbitrary DHCP options to other instances connected to the same
  network. And because we are able to modify our own port connected to an
  external network, it is possible to push DHCP options to the instances
  of other tenants using the same external network.

  Going further, there is a known buffer overflow vulnerability in dnsmasq
  (https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=7d04e17444793a840f98a0283968b96502b112dc)
  which was not considered a security issue because an attacker cannot
  control DHCP opts in most cases, and therefore this vulnerability still
  exists in most distributions (e.g. Ubuntu 20.04.1). In our case the DHCP
  opts are exactly what the attacker can modify, so we can trigger the
  buffer overflow there. I even managed to write an exploit which leads to
  remote code execution using this buffer overflow vulnerability.

  Here is the payload to crash dnsmasq as a proof of concept:
  ```
  PUT /v2.0/ports/9db67e0f-537c-494a-a655-c8a0c518d57e HTTP/1.1
  Host: openstack
  X-Auth-Token: TOKEN
  Content-Type: application/json
  Content-Length: 170

  {"port":{
  "extra_dhcp_opts":[{"opt_name":"zzz",
  
"opt_value":"xxx\n128,aa:bb\n120,aa.cc\n128,:"
  }]}}
  ```
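
  For context, the same attacker-controlled parameter is normally set through
  the port API; a hedged sketch of the legitimate usage path (the port ID and
  option name are taken from the PoC above and are illustrative), which the
  PUT above abuses by embedding newlines in the value:

  ```
  # Legitimate way to set the parameter the PoC abuses; the values here are
  # illustrative. The flaw is that the value was not validated against
  # embedded newlines before being rendered into the dnsmasq opts file.
  openstack port set 9db67e0f-537c-494a-a655-c8a0c518d57e \
      --extra-dhcp-option name=zzz,value=xxx
  ```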

  Tested on ocata, train and victoria versions.

  Vulnerability was found by Pavel Toporkov

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1939733/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1939125] Fix included in openstack/neutron queens-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron queens-eol  release.

** Changed in: neutron/queens
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939125

Title:
  Incorrect Auto schedule new network segments notification listener

Status in neutron:
  New
Status in neutron queens series:
  Fix Released
Status in neutron rocky series:
  Fix Released
Status in neutron stein series:
  Fix Released

Bug description:
  auto_schedule_new_network_segments(), added in
  Ic9e64aa4ecdc3d56f00c26204ad931b810db7599, uses the new payload
  notification listener in old stable branches of Neutron that still use
  the old notify syntax.

  The following branches are affected: stable/stein, stable/rocky,
  stable/queens

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1939125/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1837635] Fix included in openstack/neutron queens-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron queens-eol  release.

** Changed in: cloud-archive/queens
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1837635

Title:
  HA router state change from "standby" to "master" should be delayed

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in neutron:
  Fix Released

Bug description:
  Currently, when an HA state change occurs, the agent executes a series
  of actions [1]: it updates the metadata proxy, updates the prefix
  delegation, executes the L3 extension "ha_state_change" methods, updates
  the radvd status and notifies the server of the change.

  When, in a system with more than two routers (one in "active" mode and
  the others in "standby"), a switch-over is done, the "keepalived"
  process [2] on each "standby" server will set the virtual IP on the HA
  interface and advertise it. If another router HA interface has the same
  priority (by default in Neutron, the HA instances of the same router ID
  have the same priority, 50) but a higher IP [3], the HA interface of
  this instance will have the VIPs and routes deleted and will become
  "standby" again. E.g.: [4]

  In some cases, we have detected that when the master controller is
  rebooted, the change from "standby" to "master" of the other two
  servers is detected, but the change from "master" to "standby" of the
  server with the lower IP (as commented before) is not registered by the
  server, because the Neutron server is still not accessible (the master
  controller was rebooted). This status change is sometimes lost. This is
  the situation where both "standby" servers become "master" but the
  "master"-"standby" transition of one of them is lost.

  1) INITIAL STATUS
  (overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router
  neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | id                                   | host                     | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | standby  |
  | 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | standby  |
  | edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | active   |
  +--------------------------------------+--------------------------+----------------+-------+----------+

  2) CONTROLLER 1 REBOOTED
  neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | id                                   | host                     | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | active   |
  | 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | active   |
  | edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | standby  |
  +--------------------------------------+--------------------------+----------------+-------+----------+

  
  The aim of this bug is to make this problem public and to propose a patch
  to delay the transition from "standby" to "master", letting keepalived,
  among all the instances running on the HA servers, decide which one of
  them is the "master" server.

  
  [1] 
https://github.com/openstack/neutron/blob/stable/stein/neutron/agent/l3/ha.py#L115-L134
  [2] https://www.keepalived.org/
  [3] This method is used by keepalived to define which router is predominant 
and must be master.
  [4] http://paste.openstack.org/show/754760/
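
  The same ha_state check can also be done with the openstack CLI, as the
  deprecation warning in the output above suggests; a rough equivalent,
  assuming the router from the example is simply named "router":

  ```
  # Rough openstack CLI equivalent of the neutron command above; --long adds
  # the HA State column.
  openstack network agent list --router router --long
  ```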

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1837635/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1939733] Fix included in openstack/neutron rocky-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron rocky-eol  release.

** Changed in: cloud-archive/rocky
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939733

Title:
  [OSSA-2021-005] Arbitrary dnsmasq reconfiguration via extra_dhcp_opts
  (CVE-2021-40085)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in Ubuntu Cloud Archive wallaby series:
  Fix Committed
Status in Ubuntu Cloud Archive xena series:
  New
Status in neutron:
  Fix Released
Status in OpenStack Security Advisory:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Hirsute:
  Won't Fix
Status in neutron source package in Impish:
  Fix Released

Bug description:
  The application doesn't check the input values of the extra_dhcp_opts port
  parameter, allowing a user to use a newline character. The values from
  extra_dhcp_opts are used when rendering the opts file which is passed to
  dnsmasq as a dhcp-optsfile. Considering this, an attacker can inject
  arbitrary options into that file.

  The main direct impact, in my opinion, is that an attacker can push
  arbitrary DHCP options to other instances connected to the same
  network. And because we are able to modify our own port connected to an
  external network, it is possible to push DHCP options to the instances
  of other tenants using the same external network.

  Going further, there is a known buffer overflow vulnerability in dnsmasq
  (https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=7d04e17444793a840f98a0283968b96502b112dc)
  which was not considered a security issue because an attacker cannot
  control DHCP opts in most cases, and therefore this vulnerability still
  exists in most distributions (e.g. Ubuntu 20.04.1). In our case the DHCP
  opts are exactly what the attacker can modify, so we can trigger the
  buffer overflow there. I even managed to write an exploit which leads to
  remote code execution using this buffer overflow vulnerability.

  Here is the payload to crash dnsmasq as a proof of concept:
  ```
  PUT /v2.0/ports/9db67e0f-537c-494a-a655-c8a0c518d57e HTTP/1.1
  Host: openstack
  X-Auth-Token: TOKEN
  Content-Type: application/json
  Content-Length: 170

  {"port":{
  "extra_dhcp_opts":[{"opt_name":"zzz",
  
"opt_value":"xxx\n128,aa:bb\n120,aa.cc\n128,:"
  }]}}
  ```

  Tested on ocata, train and victoria versions.

  Vulnerability was found by Pavel Toporkov

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1939733/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1939733] Fix included in openstack/neutron stein-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron stein-eol  release.

** Changed in: cloud-archive/stein
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939733

Title:
  [OSSA-2021-005] Arbitrary dnsmasq reconfiguration via extra_dhcp_opts
  (CVE-2021-40085)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in Ubuntu Cloud Archive wallaby series:
  Fix Committed
Status in Ubuntu Cloud Archive xena series:
  New
Status in neutron:
  Fix Released
Status in OpenStack Security Advisory:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Hirsute:
  Won't Fix
Status in neutron source package in Impish:
  Fix Released

Bug description:
  The application doesn't check the input values of the extra_dhcp_opts port
  parameter, allowing a user to use a newline character. The values from
  extra_dhcp_opts are used when rendering the opts file which is passed to
  dnsmasq as a dhcp-optsfile. Considering this, an attacker can inject
  arbitrary options into that file.

  The main direct impact, in my opinion, is that an attacker can push
  arbitrary DHCP options to other instances connected to the same
  network. And because we are able to modify our own port connected to an
  external network, it is possible to push DHCP options to the instances
  of other tenants using the same external network.

  Going further, there is a known buffer overflow vulnerability in dnsmasq
  (https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=7d04e17444793a840f98a0283968b96502b112dc)
  which was not considered a security issue because an attacker cannot
  control DHCP opts in most cases, and therefore this vulnerability still
  exists in most distributions (e.g. Ubuntu 20.04.1). In our case the DHCP
  opts are exactly what the attacker can modify, so we can trigger the
  buffer overflow there. I even managed to write an exploit which leads to
  remote code execution using this buffer overflow vulnerability.

  Here is the payload to crash dnsmasq as a proof of concept:
  ```
  PUT /v2.0/ports/9db67e0f-537c-494a-a655-c8a0c518d57e HTTP/1.1
  Host: openstack
  X-Auth-Token: TOKEN
  Content-Type: application/json
  Content-Length: 170

  {"port":{
  "extra_dhcp_opts":[{"opt_name":"zzz",
  
"opt_value":"xxx\n128,aa:bb\n120,aa.cc\n128,:"
  }]}}
  ```

  Tested on ocata, train and victoria versions.

  Vulnerability was found by Pavel Toporkov

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1939733/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1939125] Fix included in openstack/neutron rocky-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron rocky-eol  release.

** Changed in: neutron/rocky
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939125

Title:
  Incorrect Auto schedule new network segments notification listener

Status in neutron:
  New
Status in neutron queens series:
  Fix Released
Status in neutron rocky series:
  Fix Released
Status in neutron stein series:
  Fix Released

Bug description:
  auto_schedule_new_network_segments(), added in
  Ic9e64aa4ecdc3d56f00c26204ad931b810db7599, uses the new payload
  notification listener in old stable branches of Neutron that still use
  the old notify syntax.

  The following branches are affected: stable/stein, stable/rocky,
  stable/queens

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1939125/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1939125] Fix included in openstack/neutron stein-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron stein-eol  release.

** Changed in: neutron/stein
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939125

Title:
  Incorrect Auto schedule new network segments notification listener

Status in neutron:
  New
Status in neutron queens series:
  Fix Released
Status in neutron rocky series:
  Fix Released
Status in neutron stein series:
  Fix Released

Bug description:
  auto_schedule_new_network_segments(), added in
  Ic9e64aa4ecdc3d56f00c26204ad931b810db7599, uses the new payload
  notification listener in old stable branches of Neutron that still use
  the old notify syntax.

  The following branches are affected: stable/stein, stable/rocky,
  stable/queens

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1939125/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1837635] Fix included in openstack/neutron rocky-eol

2022-11-18 Thread OpenStack Infra
This issue was fixed in the openstack/neutron rocky-eol  release.

** Changed in: cloud-archive/rocky
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1837635

Title:
  HA router state change from "standby" to "master" should be delayed

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in neutron:
  Fix Released

Bug description:
  Currently, when an HA state change occurs, the agent executes a series
  of actions [1]: it updates the metadata proxy, updates the prefix
  delegation, executes the L3 extension "ha_state_change" methods, updates
  the radvd status and notifies the server of the change.

  When, in a system with more than two routers (one in "active" mode and
  the others in "standby"), a switch-over is done, the "keepalived"
  process [2] on each "standby" server will set the virtual IP on the HA
  interface and advertise it. If another router HA interface has the same
  priority (by default in Neutron, the HA instances of the same router ID
  have the same priority, 50) but a higher IP [3], the HA interface of
  this instance will have the VIPs and routes deleted and will become
  "standby" again. E.g.: [4]

  In some cases, we have detected that when the master controller is
  rebooted, the change from "standby" to "master" of the other two
  servers is detected, but the change from "master" to "standby" of the
  server with the lower IP (as commented before) is not registered by the
  server, because the Neutron server is still not accessible (the master
  controller was rebooted). This status change is sometimes lost. This is
  the situation where both "standby" servers become "master" but the
  "master"-"standby" transition of one of them is lost.

  1) INITIAL STATUS
  (overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router
  neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | id                                   | host                     | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | standby  |
  | 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | standby  |
  | edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | active   |
  +--------------------------------------+--------------------------+----------------+-------+----------+

  2) CONTROLLER 1 REBOOTED
  neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | id                                   | host                     | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | active   |
  | 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | active   |
  | edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | standby  |
  +--------------------------------------+--------------------------+----------------+-------+----------+

  
  The aim of this bug is to make this problem public and to propose a patch
  to delay the transition from "standby" to "master", letting keepalived,
  among all the instances running on the HA servers, decide which one of
  them is the "master" server.

  
  [1] 
https://github.com/openstack/neutron/blob/stable/stein/neutron/agent/l3/ha.py#L115-L134
  [2] https://www.keepalived.org/
  [3] This method is used by keepalived to define which router is predominant 
and must be master.
  [4] http://paste.openstack.org/show/754760/

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1837635/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1997025] [NEW] [CI] "test_live_migration_with_trunk" failing

2022-11-18 Thread Rodolfo Alonso
Public bug reported:

Several occurrences during the last day:
* 
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_578/864215/2/check/neutron-ovs-tempest-multinode-full/5780784/testr_results.html
* 
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a0/864051/6/check/neutron-ovs-tempest-multinode-full/4a0fc4f/testr_results.html
* 
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d4e/841838/69/check/neutron-ovs-tempest-multinode-full/d4e2c13/testr_results.html

Zuul build list:
https://zuul.opendev.org/t/openstack/builds?job_name=neutron-ovs-
tempest-multinode-full&skip=0

** Affects: neutron
 Importance: Critical
 Status: New

** Changed in: neutron
   Importance: Undecided => Critical

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1997025

Title:
  [CI] "test_live_migration_with_trunk" failing

Status in neutron:
  New

Bug description:
  Several occurrences during the last day:
  * 
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_578/864215/2/check/neutron-ovs-tempest-multinode-full/5780784/testr_results.html
  * 
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a0/864051/6/check/neutron-ovs-tempest-multinode-full/4a0fc4f/testr_results.html
  * 
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d4e/841838/69/check/neutron-ovs-tempest-multinode-full/d4e2c13/testr_results.html

  Zuul build list:
  https://zuul.opendev.org/t/openstack/builds?job_name=neutron-ovs-
  tempest-multinode-full&skip=0

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1997025/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1996788] Re: The virtual network is broken on the node after neutron-openvswitch-agent is restarted if RPC requests return an error for a while.

2022-11-18 Thread Bernard Cafarelli
** Tags added: ovs

** Changed in: neutron
   Status: New => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1996788

Title:
  The virtual network is broken on the node after neutron-openvswitch-
  agent is restarted if RPC requests return an error for a while.

Status in neutron:
  Opinion

Bug description:
  We ran into a problem in our OpenStack cluster where traffic does not go
through the virtual network on the node on which the neutron-openvswitch-agent
was restarted.
  We were updating from one version of OpenStack to another and by chance we
had an inconsistency between the DB and neutron-server: any port select from
the DB returned an error.
  For a while the neutron-openvswitch-agent (just after restart) couldn't get
any information via RPC in its rpc_loop iterations due to the
DB/neutron-server inconsistency.
  But after updating the database, we got a broken virtual network on the node
where the neutron-openvswitch-agent was restarted.

  It seems to me that I have found a problematic place in the logic of
neutron-ovs-agent.
  To demonstrate, it is easiest to emulate the RPC request failure from
neutron-ovs-agent to neutron-server.

  Here are the steps to reproduce on devstack setup from the master branch.
  Two nodes: node0 is controller, node1 is compute.

  0) Prepare a vxlan based network and a VM.
  [root@node0 ~]# openstack network create net1
  [root@node0 ~]# openstack subnet create sub1 --network net1 --subnet-range 
192.168.1.0/24
  [root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny 
--image cirros-0.5.2-x86_64-disk --host node1

  Just after creating the VM, there is a message in the devstack@q-agt
  logs:

  Nov 16 09:53:35 node1 neutron-openvswitch-agent[374810]: INFO
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None
  req-77753b72-cb23-4dae-b68a-7048b63faf8b None None] Assigning 1 as
  local vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466

  So, local vlan which is used on node1 for the network is `1`
  A ping from the node0 to the VM on node1 success works:

  [root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping 
192.168.1.211
  PING 192.168.1.211 (192.168.1.211) 56(84) bytes of data.
  64 bytes from 192.168.1.211: icmp_seq=1 ttl=64 time=1.86 ms
  64 bytes from 192.168.1.211: icmp_seq=2 ttl=64 time=0.891 ms

  1) Now, please don't misunderstand me: I don't want this to be read as
"I'm patching the code and then clearly something won't work"; I just want
to emulate a problem that is hard to reproduce in a normal way, but which
can happen.
  So, emulate a problem where the get_resource_by_id method (actually an
RPC-based method) returns an error just after the neutron-ovs-agent restart:

  [root@node1 neutron]# git diff
  diff --git a/neutron/agent/rpc.py b/neutron/agent/rpc.py
  index 9a133afb07..299eb25981 100644
  --- a/neutron/agent/rpc.py
  +++ b/neutron/agent/rpc.py
  @@ -327,6 +327,11 @@ class CacheBackedPluginApi(PluginApi):
  
       def get_device_details(self, context, device, agent_id, host=None,
                              agent_restarted=False):
  +        import time
  +        if not hasattr(self, '_stime'):
  +            self._stime = time.time()
  +        if self._stime + 5 > time.time():
  +            raise Exception('Emulate RPC error in get_resource_by_id call')
           port_obj = self.remote_resource_cache.get_resource_by_id(
               resources.PORT, device, agent_restarted)
           if not port_obj:

  
  Restart the neutron-openvswitch-agent and try to ping again after 1-2 minutes:

  [root@node1 ~]# systemctl restart devstack@q-agt

  [root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping 
-c 2 192.168.1.234
  PING 192.168.1.234 (192.168.1.234) 56(84) bytes of data.

  --- 192.168.1.234 ping statistics ---
  2 packets transmitted, 0 received, 100% packet loss, time 1058ms

  [root@node0 ~]#

  Ping doesn't work.
  Just after the neutron-ovs-agent restart and when the RPC starts working 
correctly, there are logs:

  Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None 
req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Assigning 2 as local vlan 
for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466
  Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO 
neutron.agent.securitygroups_rpc [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 
None None] Preparing filters for devices 
{'40d82f69-274f-4de5-84d9-6290159f288b'}
  Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO 
neutron.agent.linux.openvswitch_firewall.firewall [None 
req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Initializing port 
40d82f69-274f-4de5-84d9-6290159f288b that was already initialized.

  So, `Assigning 2 as local vlan` followed by `Initializing port ...
  that was already i

[Yahoo-eng-team] [Bug 1997090] [NEW] VMs listing with sort keys throws exception when trying to compare None values

2022-11-18 Thread Anton Kurbatov
Public bug reported:

The nova-api raises an exception on an attempt to list VMs sorted by,
for example, the task_state key.

Here are the steps to reproduce:

- create two VMs: vm1 in ACTIVE state (cell1) and vm2 in ERROR state (cell0)
- try to list servers sorted by sort_key=task_state

[root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny 
--image cirros-0.5.2-x86_64-disk
[root@node0 ~]# openstack server create vm2 --network net1 --flavor m1.xlarge 
--image cirros-0.5.2-x86_64-disk
[root@node0 ~]# openstack server list -f json --long -c ID -c 'Task State' -c 
'Status'
[
  {
"ID": "3a3927c4-9f67-4356-8a3e-a3e58cf0744e",
"Status": "ERROR",
"Task State": null
  },
  {
"ID": "9af631ec-3e59-45da-bafa-85141e3707da",
"Status": "ACTIVE",
"Task State": null
  }
]
[root@node0 ~]#
[root@node0 ~]# curl -k -H "x-auth-token: $s" 
'http://10.136.16.186/compute/v2.1/servers/detail?sort_key=task_state'
{"computeFault": {"code": 500, "message": "Unexpected API Error. Please report 
this at http://bugs.launchpad.net/nova/ and attach the Nova API log if 
possible.\n"}}[root@node0 ~]#

Traceback:

Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi [None req-59ce5d12-1c84-4c45-8b10-da863b721d6f demo 
admin] Unexpected exception in API method: TypeError: '<' not supported between 
instances of 'NoneType' and 'NoneType'
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi Traceback (most recent call last):
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/wsgi.py", 
line 664, in wrapped
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return f(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/validation/__init__.py", line 192, in wrapper
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return func(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/validation/__init__.py", line 192, in wrapper
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return func(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/validation/__init__.py", line 192, in wrapper
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return func(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   [Previous line repeated 2 more times]
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/openstack/compute/servers.py", line 143, in detail
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi servers = self._get_servers(req, is_detail=True)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/openstack/compute/servers.py", line 327, in 
_get_servers
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi instance_list = self.compute_api.get_all(elevated 
or context,
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/compute/api.py", line 
3140, in get_all
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi insts, down_cell_uuids = 
instance_list.get_instance_objects_sorted(
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/compute/instance_list.py", 
line 176, in get_instance_objects_sorted
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi instance_list = 
instance_obj._make_instance_list(ctx,
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/objects/instance.py", line 
1287, in _make_instance_list
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi for db_inst in db_inst_list:
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/compute/multi_cell_list.py", line 411, in 
get_records_sorted
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi item = next(feeder)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/usr/lib64/python3.9/heapq.py", line 353, in 
merge
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi _heapify(h)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/compute/multi_c

[Yahoo-eng-team] [Bug 1997089] [NEW] With new RBAC enabled (enforce_scope and enforce_new_defaults): some security groups aren't visible for admin user

2022-11-18 Thread Slawek Kaplonski
Public bug reported:

See failed test
tempest.api.compute.admin.test_security_groups.SecurityGroupsTestAdminJSON.test_list_security_groups_list_all_tenants_filter
in
https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_63d/614484/10/check/tempest-
full-enforce-scope-new-defaults/63d64d6/testr_results.html

Failure:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in 
wrapper
return f(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/api/compute/admin/test_security_groups.py", 
line 86, in test_list_security_groups_list_all_tenants_filter
self.assertIn(sec_group['id'], sec_group_id_list)
  File 
"/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/testtools/testcase.py",
 line 399, in assertIn
self.assertThat(haystack, Contains(needle), message)
  File 
"/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/testtools/testcase.py",
 line 480, in assertThat
raise mismatch_error
testtools.matchers._impl.MismatchError: '0596ea46-0609-4d40-b42a-e24d4882709b' 
not in ['5bb547c6-e27c-4be9-8599-dcb47b253e3e', 
'21c2add9-c4ee-40bb--42c408f677a9', '0acc8817-d8ed-44cf-8728-c43cae604c7e']

** Affects: neutron
 Importance: Undecided
 Assignee: Slawek Kaplonski (slaweq)
 Status: Confirmed


** Tags: api

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1997089

Title:
   With new RBAC enabled (enforce_scope and enforce_new_defaults): some
  security groups aren't visible for admin user

Status in neutron:
  Confirmed

Bug description:
  See failed test
  
tempest.api.compute.admin.test_security_groups.SecurityGroupsTestAdminJSON.test_list_security_groups_list_all_tenants_filter
  in
  
https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_63d/614484/10/check/tempest-
  full-enforce-scope-new-defaults/63d64d6/testr_results.html

  Failure:

  Traceback (most recent call last):
File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in 
wrapper
  return f(*func_args, **func_kwargs)
File 
"/opt/stack/tempest/tempest/api/compute/admin/test_security_groups.py", line 
86, in test_list_security_groups_list_all_tenants_filter
  self.assertIn(sec_group['id'], sec_group_id_list)
File 
"/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/testtools/testcase.py",
 line 399, in assertIn
  self.assertThat(haystack, Contains(needle), message)
File 
"/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/testtools/testcase.py",
 line 480, in assertThat
  raise mismatch_error
  testtools.matchers._impl.MismatchError: 
'0596ea46-0609-4d40-b42a-e24d4882709b' not in 
['5bb547c6-e27c-4be9-8599-dcb47b253e3e', 
'21c2add9-c4ee-40bb--42c408f677a9', '0acc8817-d8ed-44cf-8728-c43cae604c7e']

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1997089/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1997092] [NEW] Metadata service broken after minor neutron update when OVN 21.09+ is used

2022-11-18 Thread Ihar Hrachyshka
Public bug reported:

Originally reported at:
https://bugzilla.redhat.com/show_bug.cgi?id=2093901

Prerequisites:

1. OVN 21.09+ that includes 
https://github.com/ovn-org/ovn/commit/3ae8470edc648b7401433a22a9f15053cc7e666d
2. Existing metadata namespace created by OVN agent before commit 
https://review.opendev.org/c/openstack/neutron/+/768462

Steps to reproduce:
1. Neutron OVN metadata agent updated to include the patch from prereq (2).
2. Neutron OVN metadata agent is restarted. It will create a new network 
namespace to host the metadata vif. It will also remove the old vif.
3. curl http://169.254.169.254/latest/meta-data/ from a VM that is hosted on 
the same node. It fails.

This happens because the agent first creates the new vif and only then
deletes the old vif, which puts OVN into a situation where two interfaces
assigned to the same LSP exist in parallel. This scenario is considered
invalid by the OVN core team. There's a patch up for review for OVN core
to handle the situation more gracefully:
https://patchwork.ozlabs.org/project/ovn/patch/20221114092437.2807815-1-xsimo...@redhat.com/
That patch will not leave the metadata service broken, but it will trigger
a full recompute in OVN, so we should not rely on its mechanics. Instead,
Neutron should make sure that no two vifs carry the same iface-id at the
same time.

The reason this was not a problem with OVN 21.06 or earlier is that the
patch referred to in prereq (1) changed the behavior in this invalid /
undefined scenario.
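
A quick way to spot the broken state on an affected chassis is to look for
more than one OVS interface claiming the same iface-id; a sketch only, with a
placeholder UUID standing in for the metadata port's ID:

```
# Two rows returned here means two vifs carry the same iface-id, i.e. the
# invalid state described above. Replace the UUID with the metadata port ID.
ovs-vsctl --columns=name,external_ids find Interface \
    external_ids:iface-id=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
```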

** Affects: neutron
 Importance: Undecided
 Assignee: Ihar Hrachyshka (ihar-hrachyshka)
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1997092

Title:
  Metadata service broken after minor neutron update when OVN 21.09+ is
  used

Status in neutron:
  In Progress

Bug description:
  Originally reported at:
  https://bugzilla.redhat.com/show_bug.cgi?id=2093901

  Prerequisites:

  1. OVN 21.09+ that includes 
https://github.com/ovn-org/ovn/commit/3ae8470edc648b7401433a22a9f15053cc7e666d
  2. Existing metadata namespace created by OVN agent before commit 
https://review.opendev.org/c/openstack/neutron/+/768462

  Steps to reproduce:
  1. Neutron OVN metadata agent updated to include the patch from prereq (2).
  2. Neutron OVN metadata agent is restarted. It will create a new network 
namespace to host the metadata vif. It will also remove the old vif.
  3. curl http://169.254.169.254/latest/meta-data/ from a VM that is hosted on 
the same node. It fails.

  This happens because the agent first creates the new vif and only then
  deletes the old vif, which puts OVN into a situation where two interfaces
  assigned to the same LSP exist in parallel. This scenario is considered
  invalid by the OVN core team. There's a patch up for review for OVN core
  to handle the situation more gracefully:
  https://patchwork.ozlabs.org/project/ovn/patch/20221114092437.2807815-1-xsimo...@redhat.com/
  That patch will not leave the metadata service broken, but it will trigger
  a full recompute in OVN, so we should not rely on its mechanics. Instead,
  Neutron should make sure that no two vifs carry the same iface-id at the
  same time.

  The reason this was not a problem with OVN 21.06 or earlier is that the
  patch referred to in prereq (1) changed the behavior in this invalid /
  undefined scenario.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1997092/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1997094] [NEW] [ovn-octavia-provider] HM created at fully populated loadbalancer stuck in PENDING_CREATE

2022-11-18 Thread Fernando Royo
Public bug reported:

This happens when trying to create a health monitor on OVN LBs using the
API for creating fully populated load balancers, where the pool object
includes the information about the HM to be created.

The HM becomes stuck in PENDING_CREATE and is not functional at all. If I
delete it, the LB it was tied to becomes stuck in PENDING_UPDATE.

** Affects: neutron
 Importance: Undecided
 Assignee: Fernando Royo (froyoredhat)
 Status: In Progress


** Tags: ovn-octavia-provider

** Changed in: neutron
 Assignee: (unassigned) => Fernando Royo (froyoredhat)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1997094

Title:
  [ovn-octavia-provider] HM created at fully populated loadbalancer
  stuck in PENDING_CREATE

Status in neutron:
  In Progress

Bug description:
  This happens when trying to create a health monitor on OVN LBs using the
  API for creating fully populated load balancers, where the pool object
  includes the information about the HM to be created.

  The HM becomes stuck in PENDING_CREATE and is not functional at all. If
  I delete it, the LB it was tied to becomes stuck in PENDING_UPDATE.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1997094/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1996836] Re: With new RBAC enabled (enforce_scope and enforce_new_defaults): 'router:external' field is missing in network list response

2022-11-18 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/865032
Committed: 
https://opendev.org/openstack/neutron/commit/0ef4f988254457ae460f192a334ccd6776688afb
Submitter: "Zuul (22348)"
Branch:master

commit 0ef4f988254457ae460f192a334ccd6776688afb
Author: Slawek Kaplonski 
Date:   Fri Nov 18 16:04:01 2022 +0100

Remove policy rule for get_network:router:external

In the legacy RBAC rules, get of the network's router:external attribute was
available to everyone (rule:regular_user). In the new S-RBAC rules it was
made available to admin users and to PROJECT_READER. This didn't
really have the same result, as the router:external attribute wasn't visible
for networks which belong to other projects.

Networks which are set to be external are automatically shared with all
other projects, and each user from such a project should be able to check
whether any visible network is external or not.
Overall, the extra policy rule for "get_network:router:external" isn't
really necessary and this patch removes it.

Closes-Bug: #1996836
Change-Id: I5fe4a0134c6ecf5cf28e2f5d59411134546c98b0


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1996836

Title:
  With new RBAC enabled (enforce_scope and enforce_new_defaults):
  'router:external' field is missing in network list response

Status in neutron:
  Fix Released

Bug description:
  I was testing tempest with the new RBAC enabled, which means enabling
  the below options in neutron.conf:

  [oslo_policy]
  enforce_scope = True
  enforce_new_defaults = True

  
https://zuul.opendev.org/t/openstack/build/e447385546c749f8b38bc4c411088dc1/log/controller/logs/etc/neutron/neutron_conf.txt#1928

  Tempest external network tests do a network list, but the
  'router:external' field is missing from the network list response

  -
  
https://zuul.opendev.org/t/openstack/build/e447385546c749f8b38bc4c411088dc1/log/job-
  output.txt#23754

  The policy defaults for 'router:external' seem fine
  - 
https://github.com/openstack/neutron/blob/bf44e70db6219e7f3a45bd61b7dd14a31ae33bb0/neutron/conf/policies/network.py#L193

  But it seems enforce_scope is restricting it somewhere; is this check in
  the context causing it not to be returned?
  - 
https://github.com/openstack/neutron-lib/blob/9ecd5995b6c598cee931087bf13fdd166f404034/neutron_lib/context.py#L125

  We should not add system:all in neutron as system scope is not
  supported in neutron policy now.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1996836/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1997124] [NEW] Netplan/Systemd/Cloud-init/Dbus Race

2022-11-18 Thread Brett Holman
Public bug reported:

Cloud-init is seeing intermittent failures while running `netplan
apply`, which appears to be caused by a missing resource at the time of
call.

The symptom in cloud-init logs looks like:

Running ['netplan', 'apply'] resulted in stderr output: Failed to
connect system bus: No such file or directory

I think that this error[1] is likely caused by cloud-init running
netplan apply too early in boot process (before dbus is active).

Today I stumbled upon this error which was hit in MAAS[2]. We have also
hit it intermittently during tests (we didn't have a reproducer).

Realizing that this may not be a cloud-init error, but possibly a
dependency bug between dbus/systemd we decided to file this bug for
broader visibility to other projects.

I will follow up this initial report with some comments from our
discussion earlier.

[1] https://github.com/canonical/netplan/blob/main/src/dbus.c#L801
[2] 
https://discourse.maas.io/t/latest-ubuntu-20-04-image-causing-netplan-error/5970
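
A quick way to check whether a node hit this race; a triage sketch assuming
the default cloud-init log location and the standard dbus unit names:

```
# Look for the symptom in the cloud-init log, then compare its timestamp with
# when dbus actually became available during this boot.
grep "Failed to connect system bus" /var/log/cloud-init.log
journalctl -b -u dbus.service -u dbus.socket --no-pager
```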

** Affects: cloud-init
 Importance: Undecided
 Status: New

** Affects: systemd
 Importance: Undecided
 Status: New

** Also affects: systemd
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1997124

Title:
  Netplan/Systemd/Cloud-init/Dbus Race

Status in cloud-init:
  New
Status in systemd:
  New

Bug description:
  Cloud-init is seeing intermittent failures while running `netplan
  apply`, which appears to be caused by a missing resource at the time
  of call.

  The symptom in cloud-init logs looks like:

  Running ['netplan', 'apply'] resulted in stderr output: Failed to
  connect system bus: No such file or directory

  I think that this error[1] is likely caused by cloud-init running
  netplan apply too early in boot process (before dbus is active).

  Today I stumbled upon this error which was hit in MAAS[2]. We have
  also hit it intermittently during tests (we didn't have a reproducer).

  Realizing that this may not be a cloud-init error, but possibly a
  dependency bug between dbus/systemd we decided to file this bug for
  broader visibility to other projects.

  I will follow up this initial report with some comments from our
  discussion earlier.

  [1] https://github.com/canonical/netplan/blob/main/src/dbus.c#L801
  [2] 
https://discourse.maas.io/t/latest-ubuntu-20-04-image-causing-netplan-error/5970

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1997124/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1844191] Re: azure advanced networking sometimes triggers duplicate mac detection

2022-11-18 Thread Chad Smith
Upstream PR landed with a fix for this issue allowing cloud-init to ignore 
duplicate macs as seen on  mellanox subordinate devices. 
https://github.com/canonical/cloud-init/pull/1853.

We have also released this into Ubuntu Lunar 23.04 as cloud-init version
22.4-0ubuntu4.

Our plan is also to queue this up as soon as possible for our next SRU
(Stable release update).

Marking this as Fix released as it will be in the next cloud images build for 
23.04.
We will create separate bug tasks on this bug for bionic, focal, jammy and 
kinetic when we start the SRU release process for this bug.

In the meantime, https://code.launchpad.net/~cloud-init-
dev/+archive/ubuntu/daily  has development builds containing this fix
for those looking to validate this behavior before an official SRU
release to Bionic, Focal, jammy and Kinetic.

** Changed in: cloud-init
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1844191

Title:
  azure advanced networking sometimes triggers duplicate mac detection

Status in cloud-init:
  Fix Released

Bug description:
  Hi, we're still being affected by this on Azure with
  19.2-24-ge7881d5c-0ubuntu1~18.04.1 - using PACKER to build from image:
  BuildSource : Marketplace/Canonical/UbuntuServer/18.04-DAILY-LTS

  Here is the packer config:
  
  "provisioners": [
  {
"type": "shell",
"inline": [
  "while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 
'Waiting for cloud-init...'; sleep 1; done"
]
  },
  {
  "type": "ansible",
  "playbook_file": "{{user `ansible_playbook`}}",
  "user": "packer",
  "extra_arguments": [ "--extra-vars", "codeVersion={{user 
`code_version`}} managed_image_name={{user `managed_image_name`}}" ]
  },
  {
  "type": "shell",
  "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh 
'{{ .Path }}'",
  "inline_shebang": "/bin/sh -x",
  "inline": [ "/usr/sbin/waagent -force -deprovision+user && export 
HISTSIZE=0 && sync" ]
  }]
  

  Here is the playbook:
  
  ---
  - hosts: all
remote_user: ubuntu
become: yes
become_method: sudo
become_user: root

environment:
  DEBIAN_FRONTEND: noninteractive
  

  Note: we are applying `enableAcceleratedNetworking: true` to the NIC,
  anecdotally we think this is related.

  Usually our playbook has more in it (obviously) but Azure kept
  pointing fingers at us that our image was causing the problem, so I
  ran this test simply deploying a blank deprovisioned image via our
  same process.

  And here's what happens on the serial console log:

  
  [   20.337603] sh[910]: + [ -e /var/lib/cloud/instance/obj.pkl ]
  [   20.343177] sh[910]: + echo cleaning persistent cloud-init object
  [   20.349027] [  OK  ] Started Network Time Synchronization.
  [  OK  ] Reached target System Time Synchronized.
  sh[910]: cleaning persistent cloud-init object
  [   20.361066] sh[910]: + rm /var/lib/cloud/instance/obj.pkl
  [   20.412333] sh[910]: + exit 0
  [   34.282291] cloud-init[938]: Cloud-init v. 
19.2-24-ge7881d5c-0ubuntu1~18.04.1 running 'init-local' at Mon, 16 Sep 2019 
18:02:23 +. Up 32.02 seconds.
  [   34.288809] cloud-init[938]: 2019-09-16 18:02:25,262 - util.py[WARNING]: 
failed stage init-local
  [   34.423057] cloud-init[938]: failed run of stage init-local
  [   34.437716] cloud-init[938]: 

  [   34.441088] cloud-init[938]: Traceback (most recent call last):
  [   34.443719] cloud-init[938]:   File 
"/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in 
status_wrapper
  [   34.448072] cloud-init[938]: ret = functor(name, args)
  [   34.450532] cloud-init[938]:   File 
"/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 362, in main_init
  [   34.454849] cloud-init[938]: 
init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
  [   34.458725] cloud-init[938]:   File 
"/usr/lib/python3/dist-packages/cloudinit/stages.py", line 697, in 
apply_network_config
  [   34.463421] cloud-init[938]: net.wait_for_physdevs(netcfg)
  [   34.466051] cloud-init[938]:   File 
"/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in 
wait_for_physdevs
  [   34.470673] cloud-init[938]: present_macs = 
get_interfaces_by_mac().keys()
  [   34.473964] cloud-init[938]:   File 
"/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in 
get_interfaces_by_mac
  [   34.479325] cloud-init[938]: (name, ret[mac], mac))
  [   34.481838] cloud-init[938]: RuntimeError: duplicate mac found! both 
'eth0' and 'enP1s1' have mac '00:0d:3a:7c:f7:3f'
  [   34.486614] cloud-init[938]: 
-
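
  The console log is truncated here, but the RuntimeError above already shows
  the failure mode: the synthetic NIC (eth0) and the accelerated-networking
  subordinate device (enP1s1) expose the same MAC. A quick check for that
  condition on an affected guest (a sketch, using the MAC from the log above):

  ```
  # List every interface carrying the MAC from the traceback; with Azure
  # accelerated networking both the synthetic NIC and its subordinate VF show up.
  ip -o link | grep -i '00:0d:3a:7c:f7:3f'
  ```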

[Yahoo-eng-team] [Bug 1988011] Re: The allocation ratio of ram change by placement does not work

2022-11-18 Thread Launchpad Bug Tracker
[Expired for OpenStack Compute (nova) because there has been no activity
for 60 days.]

** Changed in: nova
   Status: Incomplete => Expired

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1988011

Title:
  The allocation ratio of ram change by placement does not work

Status in OpenStack Compute (nova):
  Expired

Bug description:
  I found that Nova checks ram_ratio in the compute_nodes table in order
  to determine whether the destination node has enough memory when live
  migrating an instance to a target host.

  In some cases, the RAM ratio in Nova differs from Placement when I
  change the allocation_ratio via the Placement API. And live migration
  still fails when I increase the RAM ratio via Placement if it is
  insufficient.

  I think the ratio in Nova should keep up with the value in Placement.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1988011/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1997124] Re: Netplan/Systemd/Cloud-init/Dbus Race

2022-11-18 Thread Brett Holman
> Separately we really ought to port networkd from dbus communication to
varlink such that it can be used safely on critical boot path. The rest
of the Systemd critical components are already using varlink.

+1

> did you mean to mark Ubuntu(Systemd) as affected?

Yes, I'll update that thanks.


** Also affects: systemd (Ubuntu)
   Importance: Undecided
   Status: New

** No longer affects: systemd

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1997124

Title:
  Netplan/Systemd/Cloud-init/Dbus Race

Status in cloud-init:
  New
Status in systemd package in Ubuntu:
  New

Bug description:
  Cloud-init is seeing intermittent failures while running `netplan
  apply`, which appears to be caused by a missing resource at the time
  of call.

  The symptom in cloud-init logs looks like:

  Running ['netplan', 'apply'] resulted in stderr output: Failed to
  connect system bus: No such file or directory

  I think that this error[1] is likely caused by cloud-init running
  netplan apply too early in boot process (before dbus is active).

  Today I stumbled upon this error which was hit in MAAS[2]. We have
  also hit it intermittently during tests (we didn't have a reproducer).

  Realizing that this may not be a cloud-init error, but possibly a
  dependency bug between dbus/systemd we decided to file this bug for
  broader visibility to other projects.

  I will follow up this initial report with some comments from our
  discussion earlier.

  [1] https://github.com/canonical/netplan/blob/main/src/dbus.c#L801
  [2] 
https://discourse.maas.io/t/latest-ubuntu-20-04-image-causing-netplan-error/5970

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1997124/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp