[Yahoo-eng-team] [Bug 2054799] Re: [SRU] Issue with Project administration at Cloud Admin level

2024-04-29 Thread Edward Hope-Morley
** Also affects: horizon (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Mantic)
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/bobcat
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/antelope
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/caracal
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/2054799

Title:
  [SRU] Issue with Project administration at Cloud Admin level

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  New
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  New
Status in horizon source package in Jammy:
  New
Status in horizon source package in Mantic:
  New
Status in horizon source package in Noble:
  New

Bug description:
  [Impact]

  We are not able to see the list of users and groups assigned to a
  project in Horizon.

  [Test Case]

  Please refer to the [Test steps] section below.

  [Regression Potential]

  The fix ed768ab is already in the upstream main, stable/2024.1, and
  stable/2023.2 branches, so it is a clean backport and might be helpful
  for deployments using the dashboard.

  
  [Others]

  Original Bug Description Below
  ===

  We are not able to see the list of users assigned to a project in Horizon.
  Scenario:
  - Log in as Cloud Admin
  - Set Domain Context (k8s)
  - Go to projects section
  - Click on project Permissions_Roles_Test
  - Go to Users

  Expectation: Get a table with the users assigned to this project.
  Result: Get an error - https://i.imgur.com/TminwUy.png

  [Test steps]

  1, Create an ordinary openstack test env with horizon.

  2, Prepare some test data (eg: one domain k8s, one project k8s, and
  one user k8s-admin with the role k8s-admin-role)

  openstack domain create k8s
  openstack role create k8s-admin-role
  openstack project create --domain k8s k8s
  openstack user create --project-domain k8s --project k8s --domain k8s --password password k8s-admin
  openstack role add --user k8s-admin --user-domain k8s --project k8s --project-domain k8s k8s-admin-role
  $ openstack role assignment list --project k8s --names
  +----------------+---------------+-------+---------+--------+--------+-----------+
  | Role           | User          | Group | Project | Domain | System | Inherited |
  +----------------+---------------+-------+---------+--------+--------+-----------+
  | k8s-admin-role | k8s-admin@k8s |       | k8s@k8s |        |        | False     |
  +----------------+---------------+-------+---------+--------+--------+-----------+

  3, Log in to the horizon dashboard with the admin user (eg:
  admin/openstack/admin_domain).

  4, Click 'Identity -> Domains' to set domain context to the domain
  'k8s'.

  5, Click 'Identity -> Project -> k8s project -> Users'.

  6, This is the result; it says 'Unable to display the users of this
  project' - https://i.imgur.com/TminwUy.png

  7, These are some logs

  ==> /var/log/apache2/error.log <==
  [Fri Feb 23 10:03:12.201024 2024] [wsgi:error] [pid 47342:tid 140254008985152] [remote 10.5.3.120:58978] Recoverable error: 'e900b8934d11458b8eb9db21671c1b11'
  ==> /var/log/apache2/ssl_access.log <==
  10.5.3.120 - - [23/Feb/2024:10:03:11 +] "GET /identity/07123041ee0544e0ab32e50dde780afd/detail/?tab=project_details__users HTTP/1.1" 200 1125 "https://10.5.3.120/identity/07123041ee0544e0ab32e50dde780afd/detail/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"

  [Some Analyses]

  This action will call this function in horizon [1].
  This function first gets a list of users (api.keystone.user_list) [2], then the role assignment list (api.keystone.get_project_users_roles) [3].
  Without setting a domain context, this works fine.
  However, if a domain context is set, the project displayed is in a different domain.
  The user list from [2] only contains users of the user's own domain, while the role assignment list [3] includes users in another domain, since the project is in another domain.
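
  A minimal sketch of the failing lookup (hypothetical data, not the
  actual Horizon code): building a user table from the domain-scoped
  user list and indexing it with IDs taken from the project's role
  assignments raises a KeyError for users from the other domain, which
  surfaces as the 'Recoverable error' above.

  # Hypothetical data; illustrates the cross-domain mismatch only.
  users = [{"id": "8cd8f92ac2f94149a91488ad66f02382", "name": "admin"}]  # admin domain
  assignments = [{"user_id": "e900b8934d11458b8eb9db21671c1b11",  # user in domain k8s
                  "role": "k8s-admin-role"}]

  users_by_id = {u["id"]: u for u in users}
  for a in assignments:
      try:
          print(users_by_id[a["user_id"]]["name"])
      except KeyError as exc:
          # Logged as: Recoverable error: 'e900b8934d11458b8eb9db21671c1b11'
          print("Recoverable error:", exc)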

  From horizon's debug log, here is an example of user list:
  {"users": [{"email": "juju@localhost", "id": 
"8cd8f92ac2f94149a91488ad66f02382", "name": "admin", "domain_id": 
"103a4eb1712f4eb9873240d5a7f66599", "enabled": true, 

[Yahoo-eng-team] [Bug 2017748] Re: [SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

2024-03-19 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2017748

Title:
  [SRU] OVN:  ovnmeta namespaces missing during scalability test causing
  DHCP issues

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Focal:
  New
Status in neutron source package in Jammy:
  New

Bug description:
  [Impact]

  ovnmeta- namespaces are intermittently missing, leaving their VMs unreachable

  [Test Case]
  TBD
  - Not able to reproduce this easily.

  [Where problems could occur]
  These patches are related to the ovn metadata agent on compute nodes.
  VM connectivity can possibly be affected by this patch when ovn is used.
  Binding a port to a datapath could be affected.

  [Others]

  == ORIGINAL DESCRIPTION ==

  Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650

  During a scalability test it was noted that a few VMs were having
  issues being pinged (2 out of ~5000 VMs in the test conducted). After
  some investigation it was found that the VMs in question did not
  receive a DHCP lease:

  udhcpc: no lease, failing
  FAIL
  checking http://169.254.169.254/2009-04-04/instance-id
  failed 1/20: up 181.90. request failed

  And the ovnmeta- namespaces for the networks that the VMs were booting
  from were missing. Looking into the ovn-metadata-agent.log:

  2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495

  Apparently, when the system is under stress (scalability tests) there
  are some edge cases where the metadata port information has not yet
  been propagated by OVN to the Southbound database; when the
  PortBindingChassisEvent event is handled and tries to find either the
  metadata port or the IP information on it (which is updated by
  ML2/OVN during subnet creation), it cannot be found and fails silently
  with the error shown above.

  Note that running the same tests with less concurrency did not
  trigger this issue, so it only happens when the system is overloaded.
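
  A minimal sketch of that failure mode (hypothetical names, not the
  actual neutron code): if the metadata port or its addresses have not
  yet reached the Southbound DB when the event fires, the agent gives up
  silently instead of retrying.

  # Hypothetical sketch of the failure mode, not the actual neutron code.
  def get_provision_params(metadata_port, network_id):
      """Return (mac, ips) for the network's metadata port, or None."""
      if not metadata_port or not metadata_port.get("mac") or not metadata_port.get("ips"):
          # Under load the port info may not have been propagated to the
          # Southbound DB yet; the agent logs, tears the namespace down
          # if needed, and fails silently (no retry).
          print(f"There is no metadata port for network {network_id} "
                "or it has no MAC or IP addresses configured")
          return None
      return metadata_port["mac"], metadata_port["ips"]

  # Simulating the race: the event fires before OVN propagated the port.
  assert get_provision_params({}, "9029c393-5c40-4bf2-beec-27413417eafa") is None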

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2017748/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1973347] Re: OVN revision_number infinite update loop

2024-03-01 Thread Edward Hope-Morley
** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Changed in: neutron (Ubuntu Jammy)
   Status: New => Fix Released

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/wallaby
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1973347

Title:
  OVN revision_number infinite update loop

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Jammy:
  Fix Released

Bug description:
  After the change described in
  https://mail.openvswitch.org/pipermail/ovs-dev/2022-May/393966.html
  was merged and released in stable OVN 22.03, it is possible to
  create an endless loop of revision_number updates in the external_ids of
  ports and router_ports. We have confirmed the bug in Ussuri and Yoga.
  When the problem happens, the Neutron log would look like this:

  2022-05-13 09:30:56.318 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4815
  2022-05-13 09:30:56.366 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:56.367 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
  2022-05-13 09:30:56.367 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
  2022-05-13 09:30:56.467 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4815
  2022-05-13 09:30:56.880 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:56.881 25 ... Running txn n=1 command(idx=1): UpdateLRouterPortCommand(...)
  2022-05-13 09:30:56.881 25 ... Running txn n=1 command(idx=2): SetLRouterPortInLSwitchPortCommand(...)
  2022-05-13 09:30:56.984 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4816
  2022-05-13 09:30:57.057 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:57.057 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
  2022-05-13 09:30:57.058 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
  2022-05-13 09:30:57.159 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4816
  2022-05-13 09:30:57.523 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:57.523 25 ... Running txn n=1 command(idx=1): UpdateLRouterPortCommand(...)
  2022-05-13 09:30:57.524 25 ... Running txn n=1 command(idx=2): SetLRouterPortInLSwitchPortCommand(...)
  2022-05-13 09:30:57.627 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: router_ports) to 4817
  2022-05-13 09:30:57.674 25 ... Running txn n=1 command(idx=0): CheckRevisionNumberCommand(...)
  2022-05-13 09:30:57.674 25 ... Running txn n=1 command(idx=1): SetLSwitchPortCommand(...)
  2022-05-13 09:30:57.675 25 ... Running txn n=1 command(idx=2): PgDelPortCommand(...)
  2022-05-13 09:30:57.765 25 ... Successfully bumped revision number for resource 8af189bd-c5bf-48a9-b072-3fb6c69ae592 (type: ports) to 4817

  (full version here: https://pastebin.com/raw/NLP1b6Qm).

  In our lab environment we have confirmed that the problem is gone
  after the mentioned change is rolled back.
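
  A toy model of the feedback loop (purely illustrative, not
  neutron/OVN code): each bump of the router_ports revision rewrites the
  row, which the ports handler treats as a change it must answer with
  another bump, and vice versa, so the number climbs forever.

  # Toy model of the loop; illustrative only, not neutron/OVN code.
  revision = 4815
  for _ in range(3):  # in affected deployments this never terminates
      for rtype in ("router_ports", "ports"):
          # Each write rewrites external_ids, which the other resource
          # type's handler sees as stale and answers with another bump.
          print(f"Successfully bumped revision number (type: {rtype}) to {revision}")
      revision += 1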

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1973347/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821088] Re: Virtual Interface creation failed due to duplicate entry

2024-01-31 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821088

Title:
  Virtual Interface creation failed due to duplicate entry

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  Won't Fix
Status in OpenStack Compute (nova) ussuri series:
  Won't Fix
Status in OpenStack Compute (nova) victoria series:
  Won't Fix
Status in OpenStack Compute (nova) wallaby series:
  Won't Fix
Status in OpenStack Compute (nova) xena series:
  Fix Released

Bug description:
  Seen once in a test on stable/rocky:

  http://logs.openstack.org/48/638348/1/gate/heat-functional-convg-
  mysql-lbaasv2-py35/9d70590/logs/screen-n-api.txt.gz?level=ERROR

  The traceback appears to be similar to the one reported in bug 1602357
  (which raises the possibility that
  https://bugs.launchpad.net/nova/+bug/1602357/comments/8 is relevant
  here):

  ERROR nova.api.openstack.wsgi [None req-e05ce059-71c4-437d-91e0-e4bc896acca6 demo demo] Unexpected exception in API method: nova.exception_Remote.VirtualInterfaceCreateException_Remote: Virtual Interface creation failed
  pymysql.err.IntegrityError: (1062, "Duplicate entry 'fa:16:3e:9d:18:a6/aac0ca83-b3d2-4b28-ab15-de2d3a3e6e16-0' for key 'uniq_virtual_interfaces0address0deleted'")
  oslo_db.exception.DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, "Duplicate entry 'fa:16:3e:9d:18:a6/aac0ca83-b3d2-4b28-ab15-de2d3a3e6e16-0' for key 'uniq_virtual_interfaces0address0deleted'") [SQL: 'INSERT INTO virtual_interfaces (created_at, updated_at, deleted_at, deleted, address, network_id, instance_uuid, uuid, tag) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(address)s, %(network_id)s, %(instance_uuid)s, %(uuid)s, %(tag)s)'] [parameters: {'created_at': datetime.datetime(2019, 3, 20, 16, 11, 27, 753079), 'tag': None, 'uuid': 'aac0ca83-b3d2-4b28-ab15-de2d3a3e6e16', 'deleted_at': None, 'deleted': 0, 'address': 'fa:16:3e:9d:18:a6/aac0ca83-b3d2-4b28-ab15-de2d3a3e6e16', 'network_id': None, 'instance_uuid': '890675f9-3a1e-4a07-8bed-8648cea9fbb9', 'updated_at': None}] (Background on this error at: http://sqlalche.me/e/gkpj)

  (This sequence of exceptions occurs 3 times, I assume because retrying
  is normally sufficient to fix a duplicate entry problem.)
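
  A minimal sketch of that retry pattern (hypothetical helpers, not
  nova's actual code): the unique key covers (address, deleted), so a
  retry only succeeds once the conflicting undeleted row is gone.

  # Hypothetical sketch of retry-on-duplicate, not nova's actual code.
  class DBDuplicateEntry(Exception):
      """Stand-in for oslo_db.exception.DBDuplicateEntry."""

  existing = set()  # addresses of undeleted virtual_interfaces rows

  def insert_vif(address):
      # Key uniq_virtual_interfaces0address0deleted covers (address, deleted).
      if address in existing:
          raise DBDuplicateEntry(address)
      existing.add(address)

  def create_vif_with_retry(address, attempts=3):
      for attempt in range(1, attempts + 1):
          try:
              return insert_vif(address)
          except DBDuplicateEntry:
              if attempt == attempts:  # matches the 3 tracebacks in the log
                  raise

  create_vif_with_retry("fa:16:3e:9d:18:a6")  # succeeds on the first attempt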

  The test was
  
heat_integrationtests.functional.test_cancel_update.CancelUpdateTest.test_cancel_update_server_with_port

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1821088/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2019190] Re: [RBD] Retyping of in-use boot volumes renders instances unusable (possible data corruption)

2024-01-24 Thread Edward Hope-Morley
** Also affects: cinder (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: cinder (Ubuntu Noble)
   Importance: Undecided
   Status: New

** Also affects: cinder (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** Also affects: cinder (Ubuntu Mantic)
   Importance: Undecided
   Status: New

** Also affects: cinder (Ubuntu Lunar)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2019190

Title:
  [RBD] Retyping of in-use boot volumes renders instances unusable
  (possible data corruption)

Status in Cinder:
  New
Status in Cinder wallaby series:
  New
Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  New
Status in OpenStack Compute (nova):
  Invalid
Status in cinder package in Ubuntu:
  New
Status in cinder source package in Jammy:
  New
Status in cinder source package in Lunar:
  New
Status in cinder source package in Mantic:
  New
Status in cinder source package in Noble:
  New

Bug description:
  While trying out the volume retype feature in cinder, we noticed that after an instance is
  rebooted it will not come back online and will be stuck in an error state, or, if it does come
  back online, its filesystem is corrupted.

  ## Observations

  Say there are two volume types `fast` (stored in ceph pool `volumes`) and `slow`
  (stored in ceph pool `volumes.hdd`). Before the retyping we can see that the volume,
  for example, is present in the `volumes.hdd` pool and has a watcher accessing the
  volume.

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes.hdd
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers:
  watcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456
  ```

  Starting the retyping process using the migration policy `on-demand` for that volume,
  either via the horizon dashboard or the CLI, causes the volume to be correctly
  transferred to the `volumes` pool within the ceph cluster. However, the watcher does
  not get transferred, so nobody is accessing the volume after it has been transferred.

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers: none
  ```

  Taking a look at the libvirt XML of the instance in question, one can see that the `rbd`
  volume path does not change after the retyping is completed. Therefore, if the instance
  is restarted, nova will not be able to find its volume, preventing the instance from
  starting.

   Pre retype

  ```xml
  [...]
  
  
  
  
  
  [...]
  ```

   Post retype (no change)

  ```xml
  [...]
  
  
  
  
  
  [...]
  ```
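
  The disk elements inside the two snippets above were stripped by the
  mail archive. As a purely illustrative reconstruction (libvirt's rbd
  disk format; the volume path comes from the rbd commands above, while
  the monitor host `mon0` is an assumption), the source would still name
  the old pool after the retype:

  ```xml
  <!-- Illustrative reconstruction, not the reporter's actual XML -->
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' cache='none'/>
    <!-- Still volumes.hdd/... even though the volume now lives in 'volumes' -->
    <source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9'>
      <host name='mon0' port='6789'/>
    </source>
    <target dev='vda' bus='virtio'/>
  </disk>
  ```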

  ### Possible cause

  While looking through the code that is responsible for the volume retype, we found a
  function `swap_volume` which, by our understanding, should be responsible for fixing
  the association above. As we understand it, cinder should use an internal API path to
  let nova perform this action. This doesn't seem to happen.

  (`_swap_volume`:
  https://github.com/openstack/nova/blob/stable/wallaby/nova/compute/manager.py#L7218)

  ## Further observations

  If one tries to regenerate the libvirt XML by e.g. live migrating the instance and
  rebooting the instance after, the filesystem gets corrupted.

  ## Environmental Information and possibly related reports

  We are running the latest version of TripleO Wallaby using the hardened (whole disk)
  overcloud image for the nodes.

  Cinder Volume Version: `openstack-cinder-18.2.2-0.20230219112414.f9941d2.el8.noarch`

  ### Possibly related

  - https://bugzilla.redhat.com/show_bug.cgi?id=1293440

  
  (might want to paste the above to a markdown file for better readability)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2019190/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2019190] Re: [RBD] Retyping of in-use boot volumes renders instances unusable (possible data corruption)

2024-01-19 Thread Edward Hope-Morley
Since we are using Yoga and hitting this issue I had a go at reverting
the patch there too and can confirm that it does resolve the problem.

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/antelope
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/caracal
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/bobcat
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2019190

Title:
  [RBD] Retyping of in-use boot volumes renders instances unusable
  (possible data corruption)

Status in Cinder:
  New
Status in Cinder wallaby series:
  New
Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  New
Status in OpenStack Compute (nova):
  Invalid

Bug description:
  While trying out the volume retype feature in cinder, we noticed that after an instance is
  rebooted it will not come back online and will be stuck in an error state, or, if it does come
  back online, its filesystem is corrupted.

  ## Observations

  Say there are two volume types `fast` (stored in ceph pool `volumes`) and `slow`
  (stored in ceph pool `volumes.hdd`). Before the retyping we can see that the volume,
  for example, is present in the `volumes.hdd` pool and has a watcher accessing the
  volume.

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes.hdd
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers:
  watcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456
  ```

  Starting the retyping process using the migration policy `on-demand` for that volume,
  either via the horizon dashboard or the CLI, causes the volume to be correctly
  transferred to the `volumes` pool within the ceph cluster. However, the watcher does
  not get transferred, so nobody is accessing the volume after it has been transferred.

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers: none
  ```

  Taking a look at the libvirt XML of the instance in question, one can see that the `rbd`
  volume path does not change after the retyping is completed. Therefore, if the instance
  is restarted, nova will not be able to find its volume, preventing the instance from
  starting.

   Pre retype

  ```xml
  [...]
  
  
  
  
  
  [...]
  ```

   Post retype (no change)

  ```xml
  [...]
  
  
  
  
  
  [...]
  ```

  ### Possible cause

  While looking through the code that is responsible for the volume retype, we found a
  function `swap_volume` which, by our understanding, should be responsible for fixing
  the association above. As we understand it, cinder should use an internal API path to
  let nova perform this action. This doesn't seem to happen.

  (`_swap_volume`:
  https://github.com/openstack/nova/blob/stable/wallaby/nova/compute/manager.py#L7218)

  ## Further observations

  If one tries to regenerate the libvirt XML by e.g. live migrating the instance and
  rebooting the instance after, the filesystem gets corrupted.

  ## Environmental Information and possibly related reports

  We are running the latest version of TripleO Wallaby using the hardened (whole disk)
  overcloud image for the nodes.

  Cinder Volume Version: `openstack-cinder-18.2.2-0.20230219112414.f9941d2.el8.noarch`

  ### Possibly related

  - https://bugzilla.redhat.com/show_bug.cgi?id=1293440

  
  (might want to paste the above to a markdown file for better readability)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2019190/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1998789] Re: [SRU] PooledLDAPHandler.result3 does not release pool connection back when an exception is raised

2023-10-30 Thread Edward Hope-Morley
This is now Fix Released down to victoria as per [1]

[1] https://bugs.launchpad.net/ubuntu/+source/keystone/+bug/2039176

** Changed in: cloud-archive/victoria
   Status: Fix Committed => Fix Released

** Changed in: cloud-archive/wallaby
   Status: Fix Committed => Fix Released

** Changed in: cloud-archive/xena
   Status: Fix Committed => Fix Released

** Changed in: cloud-archive/yoga
   Status: Fix Committed => Fix Released

** Changed in: cloud-archive/zed
   Status: Fix Committed => Fix Released

** Changed in: keystone (Ubuntu Lunar)
   Status: Fix Committed => Fix Released

** Changed in: keystone (Ubuntu Jammy)
   Status: Fix Committed => Fix Released

** Changed in: cloud-archive/antelope
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1998789

Title:
  [SRU] PooledLDAPHandler.result3 does not release pool connection back
  when an exception is raised

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive antelope series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Triaged
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in OpenStack Identity (keystone):
  Fix Released
Status in keystone package in Ubuntu:
  Fix Released
Status in keystone source package in Focal:
  Triaged
Status in keystone source package in Jammy:
  Fix Released
Status in keystone source package in Lunar:
  Fix Released

Bug description:
  [Impact]

  This SRU is a backport of
  https://review.opendev.org/c/openstack/keystone/+/866723 to the
  respective Ubuntu and UCA releases. The patch is merged to all the
  respective upstream branches (master & stable/[u,v,w,x,y,z]).

  This SRU intends to fix a denial-of-service bug that happens when
  keystone uses pooled ldap connections. In pooled ldap connection mode,
  keystone borrows a connection from the pool, does the LDAP operation,
  and releases it back to the pool. But if an exception or error happens
  while the LDAP connection is still borrowed, Keystone fails to release
  the connection back to the pool, hogging it forever. If this happens
  for all the pooled connections, the connection pool will be exhausted
  and Keystone will no longer be able to perform LDAP operations.

  The fix corrects this behavior by ensuring the connection is released
  back to the pool even if an exception/error happens during the LDAP
  operation.
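
  A minimal sketch of the release pattern the fix enforces (hypothetical
  pool API, not keystone's actual code): the connection must go back to
  the pool in a finally block so that an error during the operation
  cannot leak it.

  # Hypothetical pool API; illustrates the shape of the fix only.
  def search_with_pool(pool, base_dn, search_filter):
      conn = pool.acquire()  # borrow a pooled LDAP connection
      try:
          return conn.search(base_dn, search_filter)  # may raise ldap.TIMEOUT
      finally:
          pool.release(conn)  # released even on exception: no leaked connection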

  [Test Case]

  - Deploy an LDAP server of your choice
  - Fill it with enough data that the search takes more than `pool_connection_timeout` seconds
  - Define a keystone domain with the LDAP driver with the following options:

  [ldap]
  use_pool = True
  page_size = 100
  pool_connection_timeout = 3
  pool_retry_max = 3
  pool_size = 10

  - Point the domain to the LDAP server
  - Try to log in to the OpenStack dashboard, or do anything else that uses the LDAP user
  - Observe /var/log/apache2/keystone_error.log; it should contain ldap.TIMEOUT() stack traces followed by `ldappool.MaxConnectionReachedError` stack traces

  To confirm the fix, repeat the scenario and observe that
  "/var/log/apache2/keystone_error.log" does not contain
  `ldappool.MaxConnectionReachedError` stack traces and that the
  in-flight LDAP operation succeeds (e.g. OpenStack Dashboard login)

  [Regression Potential]
  The patch is quite trivial and should not affect any deployment in a negative way. The LDAP pool functionality can be disabled by setting "use_pool=False" in case of any regression.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1998789/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1998789] Re: [SRU] PooledLDAPHandler.result3 does not release pool connection back when an exception is raised

2023-10-16 Thread Edward Hope-Morley
** Changed in: cloud-archive/yoga
   Status: Fix Released => New

** Changed in: cloud-archive/zed
   Status: Fix Released => New

** Also affects: cloud-archive/antelope
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/antelope
   Status: New => Fix Released

** Also affects: keystone (Ubuntu Lunar)
   Importance: Undecided
   Status: New

** Changed in: keystone (Ubuntu Lunar)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1998789

Title:
  [SRU] PooledLDAPHandler.result3 does not release pool connection back
  when an exception is raised

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  New
Status in OpenStack Identity (keystone):
  Fix Released
Status in keystone package in Ubuntu:
  New
Status in keystone source package in Focal:
  New
Status in keystone source package in Jammy:
  New
Status in keystone source package in Lunar:
  Fix Released

Bug description:
  [Impact]

  This SRU is a backport of
  https://review.opendev.org/c/openstack/keystone/+/866723 to the
  respective Ubuntu and UCA releases. The patch is merged to all the
  respective upstream branches (master & stable/[u,v,w,x,y,z]).

  This SRU intends to fix a denial-of-service bug that happens when
  keystone uses pooled ldap connections. In pooled ldap connection mode,
  keystone borrows a connection from the pool, does the LDAP operation,
  and releases it back to the pool. But if an exception or error happens
  while the LDAP connection is still borrowed, Keystone fails to release
  the connection back to the pool, hogging it forever. If this happens
  for all the pooled connections, the connection pool will be exhausted
  and Keystone will no longer be able to perform LDAP operations.

  The fix corrects this behavior by ensuring the connection is released
  back to the pool even if an exception/error happens during the LDAP
  operation.

  [Test Case]

  - Deploy an LDAP server of your choice
  - Fill it with enough data that the search takes more than `pool_connection_timeout` seconds
  - Define a keystone domain with the LDAP driver with the following options:

  [ldap]
  use_pool = True
  page_size = 100
  pool_connection_timeout = 3
  pool_retry_max = 3
  pool_size = 10

  - Point the domain to the LDAP server
  - Try to log in to the OpenStack dashboard, or do anything else that uses the LDAP user
  - Observe /var/log/apache2/keystone_error.log; it should contain ldap.TIMEOUT() stack traces followed by `ldappool.MaxConnectionReachedError` stack traces

  To confirm the fix, repeat the scenario and observe that
  "/var/log/apache2/keystone_error.log" does not contain
  `ldappool.MaxConnectionReachedError` stack traces and that the
  in-flight LDAP operation succeeds (e.g. OpenStack Dashboard login)

  [Regression Potential]
  The patch is quite trivial and should not affect any deployment in a negative way. The LDAP pool functionality can be disabled by setting "use_pool=False" in case of any regression.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1998789/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1969971] Re: Live migrations failing due to remote host identification change

2023-10-06 Thread Edward Hope-Morley
The nova-cloud-controller charm will create hostname, fqdn and ip
address entries for each compute host. It does this using the settings
'private-address' and 'hostname' on the cloud-compute relation.
private-address will be the address resolvable from the
libvirt-migration-network (if configured), otherwise the unit's
private-address.

Here comes the problem: the hostname added to known_hosts will be taken
from the relation's 'hostname', BUT the fqdn will be resolved from
private-address. This means that if nova-compute is configured to use
network X for its management network and libvirt-migration-network is
set to a different network, the fqdn in known_hosts will be from the
latter. This is all good until nova-compute needs to do a VM resize and
the image used to build the VM no longer exists in Glance. At that point
Nova will use the instance.hostname from the database to perform an scp
from source to destination, and this fails because that hostname (the
fqdn from the management network) is not in known_hosts.

This is something that Nova should ultimately have support for, but in
the interim the suggestion is that nova-cloud-controller always adds the
management network fqdn to known_hosts.
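
A sketch of the interim suggestion (hypothetical helper names; the
charm's real code differs): when assembling a compute host's known_hosts
entries, include the management-network fqdn alongside the
migration-network one.

# Hypothetical sketch of the suggested charm behaviour.
def known_hosts_names(hostname, migration_addr, mgmt_addr, resolve_fqdn):
    names = {hostname, migration_addr, resolve_fqdn(migration_addr)}
    # Interim fix suggested above: also trust the management-network
    # fqdn, since nova uses instance.hostname (management fqdn) for the
    # resize scp when the image is gone from Glance.
    names.add(resolve_fqdn(mgmt_addr))
    return sorted(names)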

** Also affects: nova
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1969971

Title:
  Live migrations failing due to remote host identification change

Status in OpenStack Nova Cloud Controller Charm:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  I've encountered a cloud where, for some reason (maybe a redeploy of a
  compute; I'm not sure), I'm hitting this error in nova-compute.log on
  the source node for an instance migration:

  2022-04-22 10:21:17.419 3776 ERROR nova.virt.libvirt.driver [-] [instance: ] Live Migration failure: operation failed: Failed to connect to remote libvirt URI qemu+ssh:///system: Cannot recv data: @@@
  @WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
  @@@
  IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
  Someone could be eavesdropping on you right now (man-in-the-middle attack)!
  It is also possible that a host key has just been changed.
  The fingerprint for the RSA key sent by the remote host is SHA256:.
  Please contact your system administrator.
  Add correct host key in /root/.ssh/known_hosts to get rid of this message.
  Offending RSA key in /root/.ssh/known_hosts:97
    remove with:
    ssh-keygen -f "/root/.ssh/known_hosts" -R ""
  RSA host key for  has changed and you have requested strict checking.
  Host key verification failed.: Connection reset by peer: libvirt.libvirtError: operation failed: Failed to connect to remote libvirt URI qemu+ssh:///system: Cannot recv data: @@@

  This interferes with instance migration.

  There is a workaround:
  * Manually ssh to the destination node, both as the root and nova users on the source node.
  * Manually clear the offending known_hosts entries reported by the SSH command.
  * Verify that once cleared, the root and nova users are able to successfully connect via SSH.

  Obviously, this is cumbersome in the case of clouds with high numbers
  of compute nodes.  It'd be better if the charm was able to avoid this
  issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1969971/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1943639] Re: project/instances/attach_interface has O(N) scaling time complexity for opening form

2023-09-05 Thread Edward Hope-Morley
This patch was merged to master during the Yoga development cycle and is
available in the following point releases upstream:

  (yoga) 22.0.0
  (xena) 20.1.3

The Ubuntu archives currently have the following versions:

  Yoga - 4:22.1.0-0ubuntu2.1~cloud0
  Xena - 4:20.1.4-0ubuntu1~cloud1
  Wallaby - 4:19.4.0-0ubuntu1~cloud1
  Victoria - 4:18.6.4-0ubuntu1~cloud1
  Ussuri - 3:18.3.5-0ubuntu2.1

So to get this patch backported to Ussuri, it first needs to be SRUd to
Wallaby and Victoria UCA.

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/xena
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1943639

Title:
  project/instances/attach_interface has O(N) scaling time complexity
  for opening form

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  New
Status in horizon source package in Focal:
  New

Bug description:
  [ Impact ]

  The time complexity of opening the project/instances/attach_interface
  form box is O(N), where N is the number of networks in the project,
  with a large prefactor.

  This is due to

  
https://opendev.org/openstack/horizon/src/branch/master/openstack_dashboard/dashboards/project/instances/utils.py#L210

  This code loops over the networks and requests the ports associated
  with each network. For large projects this scaling behavior can become
  prohibitive.

  The patch [1] addresses this issue by reducing the number of API calls
  and hence the prefactor of the algorithm (see the sketch below).
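
  A sketch of the scaling difference (hypothetical client API, not
  Horizon's actual code; the real change is [1]): the slow path issues
  one port-list call per network, while ports can instead be fetched once
  and grouped by network_id.

  from collections import defaultdict

  # Hypothetical neutron client; illustrates O(N) vs O(1) API calls only.
  def ports_per_network_slow(client, networks):
      # One list_ports() call per network: N API round trips.
      return {net["id"]: client.list_ports(network_id=net["id"])
              for net in networks}

  def ports_per_network_fast(client, networks):
      grouped = defaultdict(list)
      for port in client.list_ports():  # a single API call
          grouped[port["network_id"]].append(port)
      return {net["id"]: grouped[net["id"]] for net in networks}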

  [ Test Plan ]

  In order to reproduce the issue, create a Nova VM and then add many
  networks. On the instances tab in the Horizon UI click on "attach
  interface" for the VM. It will take a moment for the dialog to appear.
  The exact time until the dialog appears will depend on the number of
  networks linearly.

  With [1] the time it takes for the dialog box to appear will be
  significantly shorter.

  [ Where problems could occur ]

  The patch [1] affects the "attach interface" dialog box and could
  break this UI feature in case something was wrong with the
  implementation. It is also possible that due to a bug in the
  implementation some networks are missing from the dialog.

  [1] https://review.opendev.org/c/openstack/horizon/+/866895

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1943639/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1943639] Re: project/instances/attach_interface has O(N) scaling time complexity for opening form

2023-09-05 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1943639

Title:
  project/instances/attach_interface has O(N) scaling time complexity
  for opening form

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  New
Status in horizon source package in Focal:
  New

Bug description:
  [ Impact ]

  The time complexity of opening the project/instances/attach_interface
  form box is O(N), where N is the number of networks in the project,
  with a large prefactor.

  This is due to

  
https://opendev.org/openstack/horizon/src/branch/master/openstack_dashboard/dashboards/project/instances/utils.py#L210

  This code loops over the networks and requests the ports associated
  with each network. For large projects this scaling behavior can become
  prohibitive.

  The patch [1] addresses this issue by reducing the number of API calls
  and hence the prefactor of the algorithm.

  [ Test Plan ]

  In order to reproduce the issue, create a Nova VM and then add many
  networks. On the instances tab in the Horizon UI click on "attach
  interface" for the VM. It will take a moment for the dialog to appear.
  The exact time until the dialog appears will depend on the number of
  networks linearly.

  With [1] the time it takes for the dialog box to appear will be
  significantly shorter.

  [ Where problems could occur ]

  The patch [1] affects the "attach interface" dialog box and could
  break this UI feature in case something was wrong with the
  implementation. It is also possible that due to a bug in the
  implementation some networks are missing from the dialog.

  [1] https://review.opendev.org/c/openstack/horizon/+/866895

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1943639/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1843708] Re: [SRU] Key-pair is not updated during the rebuild

2023-05-16 Thread Edward Hope-Morley
** Changed in: cloud-archive/ussuri
   Status: New => Fix Released

** Changed in: nova (Ubuntu Focal)
   Status: New => Fix Released

** Changed in: cloud-archive/train
   Status: New => Fix Released

** Changed in: cloud-archive/stein
   Status: New => Fix Released

** Changed in: cloud-archive/rocky
   Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1843708

Title:
  [SRU] Key-pair is not updated during the rebuild

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive rocky series:
  Won't Fix
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Fix Released
Status in OpenStack Compute (nova) rocky series:
  Fix Released
Status in OpenStack Compute (nova) stein series:
  Fix Released
Status in OpenStack Compute (nova) train series:
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  Fix Released
Status in nova package in Ubuntu:
  New
Status in nova source package in Bionic:
  New
Status in nova source package in Focal:
  Fix Released

Bug description:
  [ Impact ]

   * See the original bug description below

  [ Test Plan ]

   * See the original bug description below

  [ Where problems could occur ]

   * See the original bug description below

  [ Other Info ]
   
   * the fix 6a7a78a44 is already in stable/queens, but not in 17.0.13
   
   
  Original Bug Description
  ===

  When we want to rebuild an instance and change the keypair, we can specify it with:
  openstack --os-compute-api-version 2.54 server rebuild --image "Debian 10" --key-name key1 instance1

  This comes from this implementation:
  https://review.opendev.org/#/c/379128/
  https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/rebuild-keypair-reset.html

  But when rebuilding the instance, Cloud-Init will set the key in authorized_keys from
  http://169.254.169.254/openstack/latest/meta_data.json

  And this meta_data.json uses the keys from the instance_extra table,
  but the keypair is updated in the 'instances' table and not in the
  'instance_extra' table.

  So the keypair is not updated inside the VM.

  Maybe this is the function for saving the keypair, but the save() does nothing:
  https://opendev.org/openstack/nova/src/branch/master/nova/objects/instance.py#L714
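
  A conceptual sketch of the mismatch (hypothetical dicts standing in
  for the database tables, not nova's actual code): the rebuild path
  updates the key name on the instance row but never rewrites the
  keypairs blob in instance_extra, which is what the metadata service
  serves to cloud-init.

  # Hypothetical sketch of the bug's shape, not nova's actual code.
  instances = {"instance1": {"key_name": "key1"}}
  instance_extra = {"instance1": {"keypairs": [{"name": "key1"}]}}

  def rebuild(server, keypair):
      instances[server]["key_name"] = keypair["name"]  # what 'nova show' reads
      # Missing step (the bug): instance_extra is never updated, so
      # meta_data.json still serves the old key. The fix would also do:
      # instance_extra[server]["keypairs"] = [keypair]

  rebuild("instance1", {"name": "key2"})
  assert instances["instance1"]["key_name"] == "key2"
  assert instance_extra["instance1"]["keypairs"][0]["name"] == "key1"  # stale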

  Steps to reproduce
  ==

  - Deploy a DevStack
  - Boot an instance with keypair key1
  - Rebuild it with key2
  - A nova show will show the key_name key2, but the keypairs object in the instance_extra table is not updated and you cannot connect to the instance with key2

  Expected result
  ===
  Connect to the VM with the new keypair added during the rebuild call

  Actual result
  =
  The keypair added during the rebuild call is not set in the VM

  Environment
  ===
  I tested it on a Devstack deployed from master and observed this behaviour.
  NOVA : commit 5fa49cd0b8b6015aa61b4312b2ce1ae780c42c64

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1843708/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1843708] Re: [SRU] Key-pair is not updated during the rebuild

2023-05-16 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** No longer affects: cloud-archive/queens

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1843708

Title:
  [SRU] Key-pair is not updated during the rebuild

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Fix Released
Status in OpenStack Compute (nova) rocky series:
  Fix Released
Status in OpenStack Compute (nova) stein series:
  Fix Released
Status in OpenStack Compute (nova) train series:
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  Fix Released
Status in nova package in Ubuntu:
  New
Status in nova source package in Bionic:
  New
Status in nova source package in Focal:
  New

Bug description:
  [ Impact ]

   * See the original bug description below

  [ Test Plan ]

   * See the original bug description below

  [ Where problems could occur ]

   * See the original bug description below

  [ Other Info ]
   
   * the fix 6a7a78a44 is already in stable/queens, but not in 17.0.13
   
   
  Original Bug Description
  ===

  When we want to rebuild an instance and change the keypair, we can specify it with:
  openstack --os-compute-api-version 2.54 server rebuild --image "Debian 10" --key-name key1 instance1

  This comes from this implementation:
  https://review.opendev.org/#/c/379128/
  https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/rebuild-keypair-reset.html

  But when rebuilding the instance, Cloud-Init will set the key in authorized_keys from
  http://169.254.169.254/openstack/latest/meta_data.json

  And this meta_data.json uses the keys from the instance_extra table,
  but the keypair is updated in the 'instances' table and not in the
  'instance_extra' table.

  So the keypair is not updated inside the VM.

  Maybe this is the function for saving the keypair, but the save() does nothing:
  https://opendev.org/openstack/nova/src/branch/master/nova/objects/instance.py#L714

  Steps to reproduce
  ==

  - Deploy a DevStack
  - Boot an instance with keypair key1
  - Rebuild it with key2
  - A nova show will show the key_name key2, but the keypairs object in the instance_extra table is not updated and you cannot connect to the instance with key2

  Expected result
  ===
  Connect to the VM with the new keypair added during the rebuild call

  Actual result
  =
  The keypair added during the rebuild call is not set in the VM

  Environment
  ===
  I tested it on a Devstack deployed from master and observed this behaviour.
  NOVA : commit 5fa49cd0b8b6015aa61b4312b2ce1ae780c42c64

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1843708/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1951296] Re: OVN db sync script fails with OVN schema that has label column in ACL table

2023-03-31 Thread Edward Hope-Morley
** Description changed:

+ [Impact]
+ Backport fix to Focal/Ussuri so that neutron-ovn-db-sync-util does not trip up when it finds ovn ACL table entries with a "label" column that does not exist in neutron db.
+ 
+ 
+ [Test Plan]
+  * Deploy Openstack Ussuri
+  * Create a network with security groups
+  * Create an instance using this network so that ports get tied to SGs
+  * Go to neutron-api unit (neutron-server) and do the following
+  * cp /etc/neutron/neutron.conf /etc/neutron/neutron.conf.no_keystone_authtoken
+  * remove "auth_section = keystone_authtoken" in the [nova] section of neutron.conf.no_keystone_authtoken
+  * run 'neutron-ovn-db-sync-util --config-file /etc/neutron/neutron.conf.no_keystone_authtoken --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --ovn-neutron_sync_mode repair'
+  * the above should not produce any errors like the following:
+  * the above should not produce any errors like the following:
+ 
+ RuntimeError: ACL ... already exists
+ 
+ [Regression Potential]
+ There is no regression potential expected with this patch.
+ 
+ --
+ 
  OVN introduced a new column in the ACL table. The column name is label, and
  when running the db-sync script, we compare the ACLs generated by the ovn mech
  driver from the Neutron DB with the actual ACLs in the OVN DB. Because of
  the new label column, everything looks like a new ACL, because the column
  differs from what Neutron generated. Thus the script attempts to create an
  ACL that already exists.
  
- b'Traceback (most recent call last):'
- b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", line 181, in func'
- b'return f(self, *args, **kwargs)'
- b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", line 181, in func'
- b'return f(self, *args, **kwargs)'
- b'  File "/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py", line 1547, in test_ovn_nb_sync_repair'
- b"self._test_ovn_nb_sync_helper('repair')"
- b'  File "/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py", line 1543, in _test_ovn_nb_sync_helper'
- b'self._sync_resources(mode)'
- b'  File "/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py", line 1523, in _sync_resources'
- b'nb_synchronizer.do_sync()'
- b'  File "/home/cloud-user/networking-ovn/networking_ovn/ovn_db_sync.py", line 104, in do_sync'
- b'self.sync_acls(ctx)'
- b'  File "/home/cloud-user/networking-ovn/networking_ovn/ovn_db_sync.py", line 288, in sync_acls'
- b'txn.add(self.ovn_api.pg_acl_add(**acla))'
- b'  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__'
- b'next(self.gen)'
- b'  File "/home/cloud-user/networking-ovn/networking_ovn/ovsdb/impl_idl_ovn.py", line 230, in transaction'
- b'yield t'
- b'  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__'
- b'next(self.gen)'
- b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/api.py", line 110, in transaction'
- b'del self._nested_txns_map[cur_thread_id]'
- b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/api.py", line 61, in __exit__'
- b'self.result = self.commit()'
- b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 65, in commit'
- b'raise result.ex'
- b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 131, in run'
- b'txn.results.put(txn.do_commit())'
- b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 93, in do_commit'
- b'command.run_idl(txn)'
- b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/schema/ovn_northbound/commands.py", line 124, in run_idl'
- b'self.direction, self.priority, self.match))'
- b'RuntimeError: ACL (from-lport, 1001, inport == @neutron_pg_drop && ip) already exists'
+ b'Traceback (most recent call last):'
+ b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", line 181, in func'
+ b'return f(self, *args, **kwargs)'
+ b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", line 181, in func'
+ b'return f(self, *args, **kwargs)'
+ b'  File "/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py", line 1547, in test_ovn_nb_sync_repair'
+ b"self._test_ovn_nb_sync_helper('repair')"
+ b'  File "/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py", line 1543, in _test_ovn_nb_sync_helper'
+ b'self._sync_resources(mode)'
+ b'  File "/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py", line 1523, in _sync_resources'
+ b'nb_synchronizer.do_sync()'
+ b'  File 
[Yahoo-eng-team] [Bug 1951296] Re: OVN db sync script fails with OVN schema that has label column in ACL table

2023-03-28 Thread Edward Hope-Morley
** Changed in: cloud-archive/xena
   Status: New => Fix Released

** Changed in: cloud-archive/wallaby
   Status: New => Fix Released

** Changed in: cloud-archive/victoria
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1951296

Title:
  OVN db sync script fails with OVN schema that has label column in ACL
  table

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in neutron:
  Fix Released

Bug description:
  OVN introduced a new column in the ACL table. The column name is label,
  and when running the db-sync script we compare the ACLs generated by the
  OVN mech driver from the Neutron DB with the actual ACLs in the OVN DB.
  Because of the new label column, every ACL looks like a new one, since
  the column differs from what Neutron generated. Thus the script attempts
  to create ACLs that already exist.
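
  For illustration, a minimal sketch of the comparison problem; the dicts
  and helper below are illustrative, not the actual networking-ovn sync
  code:

  def acl_key(acl, known=('direction', 'priority', 'match', 'action')):
      """Compare ACLs only on the columns Neutron generates, so an
      OVN-side column such as 'label' cannot make every ACL look new."""
      return tuple(acl.get(c) for c in known)

  neutron_acl = {'direction': 'from-lport', 'priority': 1001,
                 'match': 'inport == @neutron_pg_drop && ip',
                 'action': 'drop'}
  ovn_acl = dict(neutron_acl, label=0)  # OVN schema added 'label'

  assert neutron_acl != ovn_acl                    # naive diff: "new" ACL
  assert acl_key(neutron_acl) == acl_key(ovn_acl)  # column-aware: same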

  b'Traceback (most recent call last):'
  b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", 
line 181, in func'
  b'return f(self, *args, **kwargs)'
  b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", 
line 181, in func'
  b'return f(self, *args, **kwargs)'
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py",
 line 1547, in test_ovn_nb_sync_repair'
  b"self._test_ovn_nb_sync_helper('repair')"
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py",
 line 1543, in _test_ovn_nb_sync_helper'
  b'self._sync_resources(mode)'
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py",
 line 1523, in _sync_resources'
  b'nb_synchronizer.do_sync()'
  b'  File "/home/cloud-user/networking-ovn/networking_ovn/ovn_db_sync.py", 
line 104, in do_sync'
  b'self.sync_acls(ctx)'
  b'  File "/home/cloud-user/networking-ovn/networking_ovn/ovn_db_sync.py", 
line 288, in sync_acls'
  b'txn.add(self.ovn_api.pg_acl_add(**acla))'
  b'  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__'
  b'next(self.gen)'
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/ovsdb/impl_idl_ovn.py", line 
230, in transaction'
  b'yield t'
  b'  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__'
  b'next(self.gen)'
  b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/api.py", line 
110, in transaction'
  b'del self._nested_txns_map[cur_thread_id]'
  b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/api.py", line 
61, in __exit__'
  b'self.result = self.commit()'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py",
 line 65, in commit'
  b'raise result.ex'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py",
 line 131, in run'
  b'txn.results.put(txn.do_commit())'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py",
 line 93, in do_commit'
  b'command.run_idl(txn)'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/schema/ovn_northbound/commands.py",
 line 124, in run_idl'
  b'self.direction, self.priority, self.match))'
  b'RuntimeError: ACL (from-lport, 1001, inport == @neutron_pg_drop && ip) 
already exists'

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1951296/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1951296] Re: OVN db sync script fails with OVN schema that has label column in ACL table

2023-03-28 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/zed
   Status: New => Fix Released

** Changed in: cloud-archive/yoga
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1951296

Title:
  OVN db sync script fails with OVN schema that has label column in ACL
  table

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in neutron:
  Fix Released

Bug description:
  OVN introduced a new column in the ACL table. The column name is label,
  and when running the db-sync script we compare the ACLs generated by the
  OVN mech driver from the Neutron DB with the actual ACLs in the OVN DB.
  Because of the new label column, every ACL looks like a new one, since
  the column differs from what Neutron generated. Thus the script attempts
  to create ACLs that already exist.

  b'Traceback (most recent call last):'
  b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", 
line 181, in func'
  b'return f(self, *args, **kwargs)'
  b'  File "/usr/local/lib/python3.6/site-packages/neutron/tests/base.py", 
line 181, in func'
  b'return f(self, *args, **kwargs)'
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py",
 line 1547, in test_ovn_nb_sync_repair'
  b"self._test_ovn_nb_sync_helper('repair')"
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py",
 line 1543, in _test_ovn_nb_sync_helper'
  b'self._sync_resources(mode)'
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/tests/functional/test_ovn_db_sync.py",
 line 1523, in _sync_resources'
  b'nb_synchronizer.do_sync()'
  b'  File "/home/cloud-user/networking-ovn/networking_ovn/ovn_db_sync.py", 
line 104, in do_sync'
  b'self.sync_acls(ctx)'
  b'  File "/home/cloud-user/networking-ovn/networking_ovn/ovn_db_sync.py", 
line 288, in sync_acls'
  b'txn.add(self.ovn_api.pg_acl_add(**acla))'
  b'  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__'
  b'next(self.gen)'
  b'  File 
"/home/cloud-user/networking-ovn/networking_ovn/ovsdb/impl_idl_ovn.py", line 
230, in transaction'
  b'yield t'
  b'  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__'
  b'next(self.gen)'
  b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/api.py", line 
110, in transaction'
  b'del self._nested_txns_map[cur_thread_id]'
  b'  File "/usr/local/lib/python3.6/site-packages/ovsdbapp/api.py", line 
61, in __exit__'
  b'self.result = self.commit()'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py",
 line 65, in commit'
  b'raise result.ex'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py",
 line 131, in run'
  b'txn.results.put(txn.do_commit())'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py",
 line 93, in do_commit'
  b'command.run_idl(txn)'
  b'  File 
"/usr/local/lib/python3.6/site-packages/ovsdbapp/schema/ovn_northbound/commands.py",
 line 124, in run_idl'
  b'self.direction, self.priority, self.match))'
  b'RuntimeError: ACL (from-lport, 1001, inport == @neutron_pg_drop && ip) 
already exists'

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1951296/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1890244] Re: nova scheduler should ignore removed groups

2023-03-17 Thread Edward Hope-Morley
** Changed in: cloud-archive/yoga
   Status: Fix Committed => New

** Also affects: nova (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Lunar)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Kinetic)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Jammy)
   Importance: Undecided
   Status: New

** No longer affects: nova (Ubuntu Lunar)

** Changed in: nova (Ubuntu Kinetic)
   Status: New => Fix Released

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/zed
   Status: New => Fix Released

** No longer affects: cloud-archive/ussuri

** No longer affects: cloud-archive/victoria

** No longer affects: cloud-archive/wallaby

** No longer affects: cloud-archive/xena

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1890244

Title:
  nova scheduler should ignore removed groups

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in nova package in Ubuntu:
  New
Status in nova source package in Jammy:
  New
Status in nova source package in Kinetic:
  Fix Released

Bug description:
  Description
  ===========
  We created a server group and started some instances in it.

  Later we removed the server group.

  Some time later, we had to evacuate these instances, but this failed,
  because the scheduler removed all available hosts during filtering.

  Steps to reproduce
  ==================
  * create a server group
  * start some instances in this group
  * delete the server group
  * ( hard poweroff your hypervisor )
  * evacuate the instances

  Expected result
  ===============
  The instances are evacuated

  Actual result
  =============
  The instances run into ERROR-state, because the server group is not found.

  Environment
  ===========
  * Kolla deployed OpenStack Train
  * Ubuntu 18.04 / KVM + Libvirt

  Logs & Configs
  ==============

  scheduler tells:

   Filtering removed all hosts for the request with instance ID
  'adddf2c9-0252-4463-a97c-f1ec209d9f49'. Filter results:
  ['AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start:
  2, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)',
  'ImagePropertiesFilter: (start: 2, end: 2)',
  'ServerGroupAntiAffinityFilter: (start: 2, end: 2)',
  'ServerGroupAffinityFilter: (start: 2, end: 0)']

  instance show:

   | fault | {'code': 404, 'created': '2020-08-04T06:13:41Z', 'message':
  'Instance group 7e84dc57-de05-4c92-9e3b-6e2d06c1d85b could not be
  found.'} |
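
  For illustration, a minimal sketch of the mitigation the title asks for
  (skip a group that no longer exists instead of failing the request);
  the names below are stand-ins, not the actual nova scheduler code:

  class InstanceGroupNotFound(Exception):
      """Stand-in for nova's InstanceGroupNotFound exception."""

  def get_group_hint(load_group, group_uuid):
      # load_group is any callable that raises InstanceGroupNotFound for
      # a deleted group; returning None lets the affinity filters be
      # skipped instead of erroring the whole evacuation.
      try:
          return load_group(group_uuid)
      except InstanceGroupNotFound:
          return None

  groups = {}  # the server group was deleted

  def load_group(uuid):
      if uuid not in groups:
          raise InstanceGroupNotFound(uuid)
      return groups[uuid]

  assert get_group_hint(load_group, "7e84dc57") is None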

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1890244/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1939723] Re: neutron-ovn-db-sync generates insufficient flow

2023-03-13 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/zed
   Status: New => Fix Released

** Changed in: cloud-archive/yoga
   Status: New => Fix Released

** Changed in: cloud-archive/wallaby
   Status: New => Fix Released

** Changed in: cloud-archive/victoria
   Status: New => Fix Released

** Changed in: cloud-archive/xena
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939723

Title:
  neutron-ovn-db-sync generates insufficient flow

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in neutron:
  Fix Released

Bug description:
  In OpenStack version Victoria, neutron-ovn-db-sync generates insufficient
  flows for ports that have no security group or have port security disabled.
  ---> As a result, the port is not connected to the network.
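
  A short reproducer sketch in the style of other scripts on this list;
  the cloud entry, network name and openstacksdk attribute spellings are
  assumptions:

  import openstack

  conn = openstack.connect(cloud="devstack-admin-demo")
  network = conn.network.find_network("private")
  port = conn.network.create_port(
      network_id=network.id,
      name="p-nosec",
      is_port_security_enabled=False,
      security_group_ids=[])
  print(port.id)

  After creating the port, run the db-sync script (e.g.
  neutron-ovn-db-sync-util in repair mode) and check whether the port
  still passes traffic; per this bug it does not.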

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1939723/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1982284] Re: libvirt live migration sometimes fails with "libvirt.libvirtError: internal error: migration was active, but no RAM info was set"

2023-03-09 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/xena
   Status: New => Fix Released

** Changed in: cloud-archive/yoga
   Status: New => Fix Released

** Changed in: cloud-archive/zed
   Status: New => Fix Released

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1982284

Title:
  libvirt live migration sometimes fails with "libvirt.libvirtError:
  internal error: migration was active, but no RAM info was set"

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  In Progress
Status in OpenStack Compute (nova) ussuri series:
  In Progress
Status in OpenStack Compute (nova) victoria series:
  In Progress
Status in OpenStack Compute (nova) wallaby series:
  In Progress
Status in OpenStack Compute (nova) xena series:
  Fix Released
Status in OpenStack Compute (nova) yoga series:
  Fix Released
Status in OpenStack Compute (nova) zed series:
  Fix Released

Bug description:
  We have seen this downstream where live migration randomly fails with
  the following error [1]:

libvirt.libvirtError: internal error: migration was active, but no
  RAM info was set

  Discussion on [1] gravitated toward a possible race condition issue in
  qemu around the query-migrate command [2]. The query-migrate command
  is used (indirectly) by the libvirt driver during monitoring of live
  migrations [3][4][5].

  While searching for info about this error, I found a thread on libvir-
  list from the past [6] where someone else encountered the same error
  and for them it happened if they called query-migrate *after* a live
  migration had completed.

  Based on this, it seemed possible that our live migration monitoring
  thread sometimes races and calls jobStats() after the migration has
  completed, resulting in this error being raised and the migration
  being considered failed when it was actually complete.

  A patch has since been proposed and committed [7] to address the
  possible issue.

  Meanwhile, on our side in nova, we can mitigate this problematic
  behavior by catching the specific error from libvirt and ignoring it
  so that a live migration in this situation will be considered
  completed by the libvirt driver.

  Doing this would improve the experience for users that are hitting
  this error and getting erroneous live migration failures.
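
  For illustration, a minimal sketch of that mitigation using the
  libvirt-python bindings (not the merged nova patch):

  import libvirt

  RACE_MSG = 'migration was active, but no RAM info was set'

  def job_stats_or_completed(dom):
      """Return dom.jobStats(), treating the known race as completion."""
      try:
          return dom.jobStats()
      except libvirt.libvirtError as ex:
          if RACE_MSG in (ex.get_error_message() or ''):
              # The migration already finished; report it as completed
              # instead of failing the whole live migration.
              return {'type': libvirt.VIR_DOMAIN_JOB_COMPLETED}
          raise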

  [1] https://bugzilla.redhat.com/show_bug.cgi?id=2074205
  [2] 
https://qemu.readthedocs.io/en/latest/interop/qemu-qmp-ref.html#qapidoc-1848
  [3] 
https://github.com/openstack/nova/blob/bcb96f362ab12e297f125daa5189fb66345b4976/nova/virt/libvirt/driver.py#L10123
  [4] 
https://github.com/openstack/nova/blob/bcb96f362ab12e297f125daa5189fb66345b4976/nova/virt/libvirt/guest.py#L655
  [5] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetJobStats
  [6] https://listman.redhat.com/archives/libvir-list/2021-January/213631.html
  [7] 
https://github.com/qemu/qemu/commit/552de79bfdd5e9e53847eb3c6d6e4cd898a4370e

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1982284/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1890244] Re: nova scheduler should ignore removed groups

2022-12-16 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1890244

Title:
  nova scheduler should ignore removed groups

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in Ubuntu Cloud Archive yoga series:
  In Progress
Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  We created a server group and started some instances in it.

  Later we removed the server group.

  Some time later, we had to evacuate these instances, but this failed,
  because the scheduler removed all available hosts during filtering.

  Steps to reproduce
  ==================
  * create a server group
  * start some instances in this group
  * delete the server group
  * ( hard poweroff your hypervisor )
  * evacuate the instances

  Expected result
  ===============
  The instances are evacuated

  Actual result
  =============
  The instances run into ERROR-state, because the server group is not found.

  Environment
  ===========
  * Kolla deployed OpenStack Train
  * Ubuntu 18.04 / KVM + Libvirt

  Logs & Configs
  ==============

  scheduler tells:

   Filtering removed all hosts for the request with instance ID
  'adddf2c9-0252-4463-a97c-f1ec209d9f49'. Filter results:
  ['AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start:
  2, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)',
  'ImagePropertiesFilter: (start: 2, end: 2)',
  'ServerGroupAntiAffinityFilter: (start: 2, end: 2)',
  'ServerGroupAffinityFilter: (start: 2, end: 0)']

  instance show:

   | fault | {'code': 404, 'created': '2020-08-04T06:13:41Z', 'message':
  'Instance group 7e84dc57-de05-4c92-9e3b-6e2d06c1d85b could not be
  found.'} |

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1890244/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1961013] Re: [stable][ovn] frequent OVN DB leader changes increase rate of Neutron API errors

2022-09-27 Thread Edward Hope-Morley
These patches are merged and released in Ubuntu.

** Changed in: neutron
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1961013

Title:
  [stable][ovn] frequent OVN DB leader changes increase rate of Neutron
  API errors

Status in neutron:
  Fix Released

Bug description:
  Open vSwitch v2.16 introduces a change of behavior for clustered
  databases: before the leader creates a snapshot of the database, it will
  transfer leadership to a different server.

  For versions of Neutron where the Southbound DB IDL requires a connection
  to the leader this will cause frequent reconnection. In a loaded system
  this will subsequently lead to an increased rate of API errors for the
  neutron-server and delays for the neutron-ovn-metadata agent, which
  manifest themselves as instances not getting metadata.

  In the main branch Neutron has recently changed to not require a connection to
  the leader for the Southbound DB [0].

  End users can choose to use more recent versions of OVS and OVN than what was
  available at the time of release of past OpenStack releases, so we would like
  to have this change backported to Ussuri.

  This bug is to track the backport of the change.

  0: https://review.opendev.org/c/openstack/neutron/+/803268

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1961013/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1873091] Re: [RFE] Neutron ports dns_assignment does not match the designate DNS records for Neutron port

2022-09-20 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/victoria
   Status: New => Fix Released

** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Patch added: "lp1873091-ussuri.debdiff"
   
https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1873091/+attachment/5617403/+files/lp1873091-ussuri.debdiff

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1873091

Title:
  [RFE] Neutron ports dns_assignment does not match the designate DNS
  records for Neutron port

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New

Bug description:
  The Neutron port dns_assignment doesn't match the Designate DNS records
  assigned to the Neutron port,

  as explained in the link below:
  https://docs.openstack.org/neutron/pike/admin/config-dns-int.html

  when a user creates a neutron port using the command below
  neutron port-create 37aaff3a-6047-45ac-bf4f-a825e56fd2b3 \
--dns-name my-vm --dns_domain port-domain.org.

  The actual output for dns_assignment is:
  {"hostname": "my-vm", "ip_address": "203.0.113.9", "fqdn": 
"my-vm.example.org."}
  {"hostname": "my-vm", "ip_address": "2001:db8:10::9", "fqdn": 
"my-vm.example.org."}

  and the Designate DNS records is 
  67a8e83d-7e3c-4fb1-9261-0481318bb7b5 | A| my-vm.port-domain.org.  | 
203.0.113.9  
  5a4f671c-9969-47aa-82e1-e05754021852 |  | my-vm.port-domain.org.  | 
2001:db8:10::9 

  while the expected output for dns_assignment is:
  {"hostname": "my-vm", "ip_address": "203.0.113.9", "fqdn": 
"my-vm.port-domain.org."}
  {"hostname": "my-vm", "ip_address": "2001:db8:10::9", "fqdn": 
"my-vm.port-domain.org."}

  
  Most likely the dns_domain is currently taken from the Neutron network
  dns_domain or from the neutron dns_domain configuration option.

  A good approach would be to always keep the dns_assignment of a Neutron
  port in sync with the Designate DNS records whenever Designate is used.
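
  For illustration, a short openstacksdk sketch of the mismatch; the
  cloud entry and network name are assumptions:

  import openstack

  conn = openstack.connect(cloud="devstack-admin-demo")
  net = conn.network.find_network("private")
  port = conn.network.create_port(
      network_id=net.id, dns_name="my-vm", dns_domain="port-domain.org.")
  for entry in port.dns_assignment:
      # Designate gets my-vm.port-domain.org., but the fqdn printed here
      # uses the network/config dns_domain (e.g. my-vm.example.org.)
      print(entry["fqdn"])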

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1873091/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1989986] Re: changing dns_domain on the charm does not propagate to ovn

2022-09-16 Thread Edward Hope-Morley
By extension, the ovn dns entries are also not updated if I set/change
domain on the network itself:

$ openstack network set --dns-domain "testlab2.stsstack.qa.1ss." private
$ openstack network show private -c dns_domain -f value
testlab2.stsstack.qa.1ss.

New networks do get the correct domain. Looks like this might actually
be a neutron bug then.
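
For illustration, a quick check in the style of other reproducers on
this list; the cloud entry and names are assumptions:

import subprocess

import openstack

conn = openstack.connect(cloud="devstack-admin-demo")
net = conn.network.find_network("private")
conn.network.update_network(net, dns_domain="testlab2.stsstack.qa.1ss.")

# The DNS table in ovn-central still shows the old domain for existing
# networks, which is the behavior reported above:
out = subprocess.check_output("sudo ovn-nbctl list dns", shell=True)
print(out.decode("utf-8"))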

** Also affects: neutron
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1989986

Title:
  changing dns_domain on the charm does not propagate to ovn

Status in OpenStack Neutron API Charm:
  New
Status in neutron:
  New

Bug description:
  If I deploy Focal Ussuri with the Openstack charms and set a domain on
  the neutron-api dns-domain config and then subsequently change it to
  another value, the dns table entries in ovn-central do not get updated
  i.e.

  # ovn-nbctl list dns
  _uuid   : 9ed2e4db-e262-4745-9c5b-0d808269431d
  external_ids: {ls_name=neutron-a193dccd-1e21-4ee3-be93-a88e86f0d2c4}
  records : 
{"160.21.168.192.in-addr.arpa"=focal-124153.testlab.stsstack.qa.1ss, 
focal-124153="192.168.21.160", 
focal-124153.testlab.stsstack.qa.1ss="192.168.21.160"}

  should be using domain "testlab2.stsstack.qa.1ss" since that is what I
  set on the charm and is set in the config:

  # grep -r dns /etc/neutron/neutron.conf 
  dns_domain = testlab2.stsstack.qa.1ss.

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-api/+bug/1989986/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1973276] Re: OVN port loses its virtual type after port update

2022-08-23 Thread Edward Hope-Morley
** Changed in: cloud-archive/zed
   Status: New => Fix Released

** Changed in: cloud-archive/yoga
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1973276

Title:
  OVN port loses its virtual type after port update

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New

Bug description:
  Bug found in Octavia (master)

  Octavia creates at least 2 ports for each load balancer:
  - the VIP port, it is down, it keeps/stores the IP address of the LB
  - the VRRP port, plugged into a VM, it has the VIP address in the 
allowed-address list (and the VIP address is configured on the interface in the 
VM)

  When sending an ARP request for the VIP address, the VRRP port should
  reply with its mac-address.

  In OVN the VIP port is marked as "type: virtual".

  But when the VIP port is updated, it loses its "type: virtual" status
  and that breaks the ARP resolution (OVN replies to the ARP request by
  sending the mac-address of the VIP port - which is not used/down).

  Quick reproducer that simulates the Octavia behavior:

  
  ===

  import subprocess
  import time

  import openstack

  conn = openstack.connect(cloud="devstack-admin-demo")

  network = conn.network.find_network("public")

  sg = conn.network.find_security_group('sg')
  if not sg:
      sg = conn.network.create_security_group(name='sg')

  # VIP port: kept DOWN, it holds the LB address (OVN marks it virtual).
  vip_port = conn.network.create_port(
      name="lb-vip",
      network_id=network.id,
      device_id="lb-1",
      device_owner="me",
      is_admin_state_up=False)

  vip_address = [
      fixed_ip['ip_address']
      for fixed_ip in vip_port.fixed_ips
      if '.' in fixed_ip['ip_address']][0]

  # VRRP port: carries the VIP address in its allowed-address pairs.
  vrrp_port = conn.network.create_port(
      name="lb-vrrp",
      device_id="vrrp",
      device_owner="vm",
      network_id=network.id)
  vrrp_port = conn.network.update_port(
      vrrp_port,
      allowed_address_pairs=[
          {"ip_address": vip_address,
           "mac_address": vrrp_port.mac_address}])

  time.sleep(1)

  output = subprocess.check_output(
      f"sudo ovn-nbctl show | grep -A2 'port {vip_port.id}'",
      shell=True)
  output = output.decode('utf-8')

  if 'type: virtual' in output:
      print("Port is virtual, this is ok.")
      print(output)

  # Any update of the VIP port triggers the bug: the type is lost.
  conn.network.update_port(
      vip_port,
      security_group_ids=[sg.id])

  time.sleep(1)

  output = subprocess.check_output(
      f"sudo ovn-nbctl show | grep -A2 'port {vip_port.id}'",
      shell=True)
  output = output.decode('utf-8')

  if 'type: virtual' not in output:
      print("Port is not virtual, this is an issue.")
      print(output)

  ===

  
  In my env (devstack master on c9s):
  $ python3 /mnt/host/virtual_port_issue.py
  Port is virtual, this is ok.
  port e0fe2894-e306-42d9-8c5e-6e77b77659e2 (aka lb-vip)
  type: virtual
  addresses: ["fa:16:3e:93:00:8f 172.24.4.111 2001:db8::178"]

  Port is not virtual, this is an issue.
  port e0fe2894-e306-42d9-8c5e-6e77b77659e2 (aka lb-vip)
  addresses: ["fa:16:3e:93:00:8f 172.24.4.111 2001:db8::178"]
  port 8ec36278-82b1-436b-bc5e-ea03ef22192f

  
  In Octavia, the "type: virtual" is _sometimes_ back after other updates
  of the ports, but in some cases the LB is unreachable.

  (and "ovn-nbctl lsp-set-type <port> virtual" fixes the LB)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1973276/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1973276] Re: OVN port loses its virtual type after port update

2022-08-23 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1973276

Title:
  OVN port loses its virtual type after port update

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New

Bug description:
  Bug found in Octavia (master)

  Octavia creates at least 2 ports for each load balancer:
  - the VIP port, it is down, it keeps/stores the IP address of the LB
  - the VRRP port, plugged into a VM, it has the VIP address in the 
allowed-address list (and the VIP address is configured on the interface in the 
VM)

  When sending an ARP request for the VIP address, the VRRP port should
  reply with its mac-address.

  In OVN the VIP port is marked as "type: virtual".

  But when the VIP port is updated, it loses its "type: virtual" status
  and that breaks the ARP resolution (OVN replies to the ARP request by
  sending the mac-address of the VIP port - which is not used/down).

  Quick reproducer that simulates the Octavia behavior:

  
  ===

  import subprocess
  import time

  import openstack

  conn = openstack.connect(cloud="devstack-admin-demo")

  network = conn.network.find_network("public")

  sg = conn.network.find_security_group('sg')
  if not sg:
      sg = conn.network.create_security_group(name='sg')

  # VIP port: kept DOWN, it holds the LB address (OVN marks it virtual).
  vip_port = conn.network.create_port(
      name="lb-vip",
      network_id=network.id,
      device_id="lb-1",
      device_owner="me",
      is_admin_state_up=False)

  vip_address = [
      fixed_ip['ip_address']
      for fixed_ip in vip_port.fixed_ips
      if '.' in fixed_ip['ip_address']][0]

  # VRRP port: carries the VIP address in its allowed-address pairs.
  vrrp_port = conn.network.create_port(
      name="lb-vrrp",
      device_id="vrrp",
      device_owner="vm",
      network_id=network.id)
  vrrp_port = conn.network.update_port(
      vrrp_port,
      allowed_address_pairs=[
          {"ip_address": vip_address,
           "mac_address": vrrp_port.mac_address}])

  time.sleep(1)

  output = subprocess.check_output(
      f"sudo ovn-nbctl show | grep -A2 'port {vip_port.id}'",
      shell=True)
  output = output.decode('utf-8')

  if 'type: virtual' in output:
      print("Port is virtual, this is ok.")
      print(output)

  # Any update of the VIP port triggers the bug: the type is lost.
  conn.network.update_port(
      vip_port,
      security_group_ids=[sg.id])

  time.sleep(1)

  output = subprocess.check_output(
      f"sudo ovn-nbctl show | grep -A2 'port {vip_port.id}'",
      shell=True)
  output = output.decode('utf-8')

  if 'type: virtual' not in output:
      print("Port is not virtual, this is an issue.")
      print(output)

  ===

  
  In my env (devstack master on c9s):
  $ python3 /mnt/host/virtual_port_issue.py
  Port is virtual, this is ok.
  port e0fe2894-e306-42d9-8c5e-6e77b77659e2 (aka lb-vip)
  type: virtual
  addresses: ["fa:16:3e:93:00:8f 172.24.4.111 2001:db8::178"]

  Port is not virtual, this is an issue.
  port e0fe2894-e306-42d9-8c5e-6e77b77659e2 (aka lb-vip)
  addresses: ["fa:16:3e:93:00:8f 172.24.4.111 2001:db8::178"]
  port 8ec36278-82b1-436b-bc5e-ea03ef22192f

  
  In Octavia, the "type: virtual" is _sometimes_ back after other updates
  of the ports, but in some cases the LB is unreachable.

  (and "ovn-nbctl lsp-set-type <port> virtual" fixes the LB)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1973276/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1852610] Re: [SRU] API allows source compute service/node deletion while instances are pending a resize confirm/revert

2022-07-18 Thread Edward Hope-Morley
** Changed in: cloud-archive
   Status: New => Fix Committed

** Changed in: cloud-archive
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1852610

Title:
  [SRU] API allows source compute service/node deletion while instances
  are pending a resize confirm/revert

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive queens series:
  In Progress
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Fix Committed
Status in OpenStack Compute (nova) rocky series:
  Fix Committed
Status in OpenStack Compute (nova) stein series:
  Fix Committed
Status in OpenStack Compute (nova) train series:
  Fix Committed
Status in nova package in Ubuntu:
  Invalid
Status in nova source package in Bionic:
  In Progress

Bug description:
  [Impact]

   * API will allow deleting a source compute service which has
  migration-based allocations for the source node resource provider and
  pending instance resizes involving the source node.

   * Backporting the fix will improve application resilience in this
  case.

  [Test Case]

   1. create a server on host1
   2. resize or cold migrate it to a dest host2
   3. delete the compute service for host1

   At this point the resource provider for host1 is orphaned.

   4. try to confirm/revert the resize of the server which will fail
  because the compute node for host1 is gone and this results in the
  server going to ERROR status

  [Where problems could occur]

   * This change introduces an exception condition in the API and
  prevents the erroneous deletion of compute services which would result
  in orphaned state.

   * As such we should expect to see altered behavior from the API as
  detailed in api-ref/source/os-services.inc

   * If problems were to occur they would manifest in behavior that is
  different from both the original behavior of the API and the new
  behavior.

  --- Original Description ---
  This is split off from bug 1829479 which is about deleting a compute service 
which had servers evacuated from it which will orphan resource providers in 
placement.

  A similar scenario exists where the API will allow deleting a source
  compute service which has migration-based allocations for the source
  node resource provider and pending instance resizes involving the
  source node. A simple scenario is:

  1. create a server on host1
  2. resize or cold migrate it to a dest host2
  3. delete the compute service for host1

  At this point the resource provider for host1 is orphaned.

  4. try to confirm/revert the resize of the server which will fail
  because the compute node for host1 is gone and this results in the
  server going to ERROR status

  Based on the discussion in this mailing list thread:

  http://lists.openstack.org/pipermail/openstack-
  discuss/2019-November/010843.html

  We should probably have the DELETE /os-services/{service_id} API block
  trying to delete a service that has pending migrations.
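
  For illustration, a minimal sketch of such a guard; the names below are
  hypothetical, not the actual nova API code:

  class HTTPConflict(Exception):
      """Stand-in for the 409 response the API would return."""

  def assert_service_deletable(service_host, in_progress_migrations):
      # in_progress_migrations: iterable of dicts for migrations still
      # pending confirm/revert; any that involve this host block delete.
      blocked = [m for m in in_progress_migrations
                 if service_host in (m.get('source_compute'),
                                     m.get('dest_compute'))]
      if blocked:
          raise HTTPConflict('%d in-progress migrations involve host %s'
                             % (len(blocked), service_host))

  assert_service_deletable('host3', [])  # no pending migrations: ok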

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1852610/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1947127] Re: [SRU] Some DNS extensions not working with OVN

2022-07-11 Thread Edward Hope-Morley
Added U/V/W for backport.

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/xena
   Status: New => Fix Released

** Changed in: neutron (Ubuntu Impish)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1947127

Title:
  [SRU] Some DNS extensions not working with OVN

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Impish:
  Fix Released
Status in neutron source package in Jammy:
  Fix Released
Status in neutron source package in Kinetic:
  Fix Released

Bug description:
  [Impact]

  On a fresh devstack install with the q-dns service enabled from the
  neutron devstack plugin, some features still don't work, e.g.:

  $ openstack subnet set private-subnet --dns-publish-fixed-ip
  BadRequestException: 400: Client Error for url: 
https://10.250.8.102:9696/v2.0/subnets/9f50c79e-6396-4c5b-be92-f64aa0f25beb, 
Unrecognized attribute(s) 'dns_publish_fixed_ip'

  $ openstack port create p1 --network private --dns-name p1 --dns-domain a.b.
  BadRequestException: 400: Client Error for url: 
https://10.250.8.102:9696/v2.0/ports, Unrecognized attribute(s) 'dns_domain'

  The reason seems to be that
  
https://review.opendev.org/c/openstack/neutron/+/686343/31/neutron/common/ovn/extensions.py
  only added dns_domain_keywords, but not e.g. dns_domain_ports as
  supported by OVN

  [Test Case]

  Create a normal OpenStack neutron test environment to see if we can
  successfully run the following commands:

  openstack subnet set private_subnet --dns-publish-fixed-ip
  openstack port create p1 --network private --dns-name p1 --dns-domain a.b.
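
  The same checks can be scripted via openstacksdk (a sketch; the cloud
  entry and attribute spellings are assumptions):

  import openstack

  conn = openstack.connect(cloud="devstack-admin-demo")
  subnet = conn.network.find_subnet("private_subnet")
  conn.network.update_subnet(subnet, dns_publish_fixed_ip=True)

  net = conn.network.find_network("private")
  conn.network.create_port(network_id=net.id, name="p1",
                           dns_name="p1", dns_domain="a.b.")

  Both calls should succeed once the missing extensions are advertised.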

  [Regression Potential]

  The fix has merged into the upstream stable/xena branch [1]; this is
  just an SRU into the 19.1.0 branch of UCA xena (the fix is already in
  20.0.0, so it is already in jammy, kinetic and focal-yoga). It is a
  clean backport and might be helpful for deployments migrating to OVN.

  [1] https://review.opendev.org/c/openstack/neutron/+/838650

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1947127/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1979089] [NEW] l3ha router delete race condition if state changes to master

2022-06-17 Thread Edward Hope-Morley
Public bug reported:

We have hit an issue whereby nodes running Neutron Ussuri with ML2 L3HA
are occasionally running into the following:

On a 3 node neutron-gateway with approx 280 HA routers, the l3-agent
sometimes gets into a state where it is repeatedly trying to spawn a
metadata proxy (haproxy) for a router that no longer exists and fails
because the namespace is no longer there. This happens thousands of
times a day and basically blocks the l3-agent from processing other
updates. The error looks like:

2022-05-21 06:26:12.882 30127 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.
conf', 'ip', 'netns', 'exec', 'qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3', 
'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/57837a95-ed3b
-4a1b-9393-1374a8c744c3.conf'] create_process 
/usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88
2022-05-21 06:26:13.116 30127 ERROR neutron.agent.linux.utils [-] Exit code: 1; 
Stdin: ; Stdout: ; Stderr: Cannot open network namespace 
"qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3": No such file or directory

Some background: when the l3-agent starts, it subscribes callback methods
to certain events. One of those events is "before_delete" [1] and the
method is neutron.agent.metadata.driver.before_router_removed. The idea
here is that when an update to delete a router is received, the agent
first executes this method, which kills the "neutron metadata-proxy
monitor" that watches the haproxy pid and respawns haproxy if it dies.
A successful callback execution looks like:

2022-05-21 03:05:54.676 30127 DEBUG neutron_lib.callbacks.manager 
[req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - 
- -] Notify callbacks 
['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] 
for router, before_delete _notify_loop 
/usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-21 03:05:54.676 30127 DEBUG neutron.agent.linux.utils 
[req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - 
- -] Running command: ['sudo', '/usr/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'kill', '-15', '26363'] create_process 
/usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88

And an unsuccessful one looks like:

2022-05-10 23:36:10.480 30127 INFO neutron.agent.l3.ha [-] Router 
57837a95-ed3b-4a1b-9393-1374a8c744c3 transitioned to master on agent 
sgdemr0114bp007
...
2022-05-10 23:36:10.646 30127 DEBUG neutron_lib.callbacks.manager 
[req-6bfaa057-0ab9-450c-b27f-d4008fd7f9f1 - a87539ab4d2e4423b28ae6634e0d9c25 - 
- -] Notify callbacks 
['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] 
for router, before_delete _notify_loop 
/usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-10 23:36:10.853 30127 DEBUG neutron.agent.l3.ha [-] Spawning metadata 
proxy for router 57837a95-ed3b-4a1b-9393-1374a8c744c3 _update_metadata_proxy 
/usr/lib/python3/dist-packages/neutron/agent/l3/ha.py:219

The difference being that instead of killing the proxy monitor it is
actually spawning a new one! The other thing to notice is that in the
second case it isn't servicing the same request (it has "[-]"). I looked
at the code and found that this is because, when the router transitions
to master, the l3-agent puts the thread to eventlet.sleep(2) [2] and
then proceeds with the update, i.e. spawning the metadata proxy; while
it was asleep it started to process the before_delete callback but then
got pre-empted.

So this looks like a simple race condition and occurs if a router
transitions to master at the same time as being deleted.
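
For illustration, a toy of that interleaving using plain eventlet (not
neutron code):

import eventlet
eventlet.monkey_patch()

def transition_to_master():
    eventlet.sleep(2)              # the l3-agent yields here [2]
    print("spawn metadata proxy")  # runs after the delete already ran

def before_router_removed():
    print("kill proxy monitor / delete namespace")

t1 = eventlet.spawn(transition_to_master)
t2 = eventlet.spawn(before_router_removed)
t1.wait()
t2.wait()

The delete callback runs inside the sleep window, after which the proxy
is respawned for a router whose namespace no longer exists.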

A simple interim workaround is to manually create the non-existent
namespace, which allows the respawn to succeed; the callback then
hopefully gets to run and deletes it again to clean up. Alternatively,
restart your neutron-l3-agent service.

[1] 
https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/metadata/driver.py#L186
[2] 
https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/l3/ha.py#L149

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1979089

Title:
  l3ha router delete race condition if state changes to master

Status in neutron:
  New

Bug description:
  We have hit an issue whereby nodes running Neutron Ussuri with ML2
  L3HA are occasionally running into the following:

  On a 3 node neutron-gateway with approx 280 HA routers, the l3-agent
  sometimes gets into a state where it is repeatedly trying to spawn a
  metadata proxy (haproxy) for a router that no longer exists and fails
  because the namespace is no longer there. This happens thousands of
  times a day and basically blocks the l3-agent from 

[Yahoo-eng-team] [Bug 1947127] Re: [SRU] Some DNS extensions not working with OVN

2022-05-30 Thread Edward Hope-Morley
this is releases in all but xena which will be available in the Ubuntu
Cloud Archive in the upcoming 19.0.3 stable release

** Changed in: cloud-archive/xena
   Status: Fix Released => New

** Changed in: cloud-archive/yoga
   Status: New => Fix Released

** Changed in: neutron (Ubuntu Jammy)
   Status: New => Fix Released

** Changed in: neutron (Ubuntu Kinetic)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1947127

Title:
  [SRU] Some DNS extensions not working with OVN

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive xena series:
  New
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Impish:
  New
Status in neutron source package in Jammy:
  Fix Released
Status in neutron source package in Kinetic:
  Fix Released

Bug description:
  [Impact]

  On a fresh devstack install with the q-dns service enabled from the
  neutron devstack plugin, some features still don't work, e.g.:

  $ openstack subnet set private-subnet --dns-publish-fixed-ip
  BadRequestException: 400: Client Error for url: 
https://10.250.8.102:9696/v2.0/subnets/9f50c79e-6396-4c5b-be92-f64aa0f25beb, 
Unrecognized attribute(s) 'dns_publish_fixed_ip'

  $ openstack port create p1 --network private --dns-name p1 --dns-domain a.b.
  BadRequestException: 400: Client Error for url: 
https://10.250.8.102:9696/v2.0/ports, Unrecognized attribute(s) 'dns_domain'

  The reason seems to be that
  
https://review.opendev.org/c/openstack/neutron/+/686343/31/neutron/common/ovn/extensions.py
  only added dns_domain_keywords, but not e.g. dns_domain_ports as
  supported by OVN

  [Test Case]

  Create a normal OpenStack neutron test environment to see if we can
  successfully run the following commands:

  openstack subnet set private_subnet --dns-publish-fixed-ip
  openstack port create p1 --network private --dns-name p1 --dns-domain a.b.

  [Regression Potential]

  The fix has merged into the upstream stable/xena branch [1]; this is
  just an SRU into the 19.1.0 branch of UCA xena (the fix is already in
  20.0.0, so it is already in jammy, kinetic and focal-yoga). It is a
  clean backport and might be helpful for deployments migrating to OVN.

  [1] https://review.opendev.org/c/openstack/neutron/+/838650

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1947127/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1965297] [NEW] l3ha don't set backup qg ports down

2022-03-17 Thread Edward Hope-Morley
Public bug reported:

The history to this request is as follows: bug 1916024 fixed an issue
but subsequently had to be reverted due to a regression that it
introduced (see bug 1927868), so the original issue can once again
present itself, in that keepalived is unable to send GARP on the qg port
until the port is marked as UP by neutron, which in loaded environments
can sometimes take longer than keepalived will wait (e.g. when an
l3-agent is restarted on a host that has hundreds of routers). The
reason why qg- ports are marked as DOWN is the patch landed as part of
bug 1859832, and as I understand it there is now consensus from upstream
[1] to revert that patch as well; a better solution is needed to fix
that particular issue. I have not found a bug open yet for the revert,
hence why I am opening this one.

[1]
https://meetings.opendev.org/meetings/neutron_drivers/2022/neutron_drivers.2022-03-04-14.03.log.txt

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1965297

Title:
  l3ha don't set backup qg ports down

Status in neutron:
  New

Bug description:
  The history to this request is as follows: bug 1916024 fixed an issue
  but subsequently had to be reverted due to a regression that it
  introduced (see bug 1927868), so the original issue can once again
  present itself, in that keepalived is unable to send GARP on the qg
  port until the port is marked as UP by neutron, which in loaded
  environments can sometimes take longer than keepalived will wait (e.g.
  when an l3-agent is restarted on a host that has hundreds of routers).
  The reason why qg- ports are marked as DOWN is the patch landed as part
  of bug 1859832, and as I understand it there is now consensus from
  upstream [1] to revert that patch as well; a better solution is needed
  to fix that particular issue. I have not found a bug open yet for the
  revert, hence why I am opening this one.

  [1]
  
https://meetings.opendev.org/meetings/neutron_drivers/2022/neutron_drivers.2022-03-04-14.03.log.txt

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1965297/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1906375] Re: [L3] router HA port concurrently deleting

2022-03-11 Thread Edward Hope-Morley
The fix for this is merged [1] and released, so I'll update the status
accordingly.

[1]
https://github.com/openstack/neutron/commit/91eb3d8346a8964aa046d1e016d571056de868de

** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1906375

Title:
  [L3] router HA port concurrently deleting

Status in neutron:
  Fix Released

Bug description:
  A router HA port may be deleted concurrently while the plugin is trying
  to update it; a PortNotFound exception is then raised. The error was
  found on a Rocky deployment, but the master branch has the same code.
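
  For illustration, a sketch of the kind of guard that addresses this;
  the merged fix may differ in detail:

  from neutron_lib import constants
  from neutron_lib import exceptions as n_exc

  def set_ha_ports_down(core_plugin, context, ports):
      """Mark HA network ports DOWN, tolerating concurrent deletion."""
      for port in ports:
          try:
              core_plugin.update_port(
                  context, port['id'],
                  {'port': {'status': constants.PORT_STATUS_DOWN}})
          except n_exc.PortNotFound:
              continue  # port deleted concurrently; nothing to update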

  
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server 
[req-86012433-ab6b-41f5-bbba-411ec3e1d973 - - - - -] Exception during message 
handling: PortNotFound: Port 3f838c59-e84a-49de-a381-f3328d47a69f could not be 
found.
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server Traceback (most 
recent call last):
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 166, in 
_process_incoming
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server res = 
self.dispatcher.dispatch(message)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, 
in dispatch
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server return 
self._do_dispatch(endpoint, method, ctxt, args)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, 
in _do_dispatch
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server result = 
func(ctxt, **new_args)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 93, 
in update_all_ha_network_port_statuses
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server context, p, 
{'port': {'status': constants.PORT_STATUS_DOWN}})
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron/common/utils.py", line 632, in inner
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server return 
f(self, context, *args, **kwargs)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron/db/api.py", line 123, in wrapped
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server return 
method(*args, **kwargs)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 140, in wrapped
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server setattr(e, 
'_RETRY_EXCEEDED', True)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server 
self.force_reraise()
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in 
force_reraise
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server 
six.reraise(self.type_, self.value, self.tb)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 136, in wrapped
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server return 
f(*args, **kwargs)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server ectxt.value 
= e.inner_exc
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server 
self.force_reraise()
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in 
force_reraise
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server 
six.reraise(self.type_, self.value, self.tb)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server return 
f(*args, **kwargs)
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron_lib/db/api.py", line 183, in wrapped
  2020-12-01 10:52:46.738 62077 ERROR oslo_messaging.rpc.server 
LOG.debug("Retry wrapper got retriable exception: 

[Yahoo-eng-team] [Bug 1939604] Re: [SRU] Cannot create 1vcpu instance with multiqueue image, vif_type=tap (calico)

2022-01-18 Thread Edward Hope-Morley
** Changed in: cloud-archive/ussuri
   Status: Triaged => Fix Released

** Changed in: cloud-archive/victoria
   Status: Triaged => Fix Released

** Changed in: cloud-archive/wallaby
   Status: Triaged => Fix Released

** Changed in: nova (Ubuntu Focal)
   Status: Triaged => Fix Released

** Changed in: nova (Ubuntu Hirsute)
   Status: Triaged => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1939604

Title:
  [SRU] Cannot create 1vcpu instance with multiqueue image, vif_type=tap
  (calico)

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) ussuri series:
  Fix Released
Status in OpenStack Compute (nova) victoria series:
  Fix Released
Status in OpenStack Compute (nova) wallaby series:
  Fix Released
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Focal:
  Fix Released
Status in nova source package in Hirsute:
  Fix Released

Bug description:
  Tested on stable/wallaby

  The fix for bug #1893263, which enabled vif_type=tap (calico use case)
  devices to support multiqueue in nova, also caused a regression: creating
  an instance with a multiqueue image and a flavor with only one VCPU now
  fails with the error below in the logs.

  The problem can easily be avoided by not using 1-vCPU flavors with
  multiqueue images (they wouldn't make sense anyway), i.e. by using
  non-multiqueue images when the flavor has one VCPU, but that is a bad
  user experience: users shouldn't need to be concerned about flavor+image
  combinations.

  Steps to reproduce are the same as for bug #1893263, but using a 1-vCPU
  flavor + multiqueue metadata on the image.
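
  For illustration, a hedged sketch of the guard the fix needs (hypothetical
  helper, not nova's actual code): multiqueue only makes sense with more than
  one vCPU, so a 1-vCPU flavor should fall back to a plain tap device.

    def tap_multiqueue_count(flavor_vcpus, image_hw_vif_multiqueue):
        # Return the queue count to request, or None to create a plain
        # (single-queue) tap device. Requesting multiqueue with a single
        # queue is what makes the kernel reject the tap creation with
        # "Invalid argument" (EINVAL).
        if not image_hw_vif_multiqueue or flavor_vcpus < 2:
            return None
        return flavor_vcpus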

  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager 
[req-99e80890-6c99-4015-91b6-ef99e6be3fa7 ea7dfe225d48428c860321498e184739 
8833157a5d244727a74017e5f8729312 - 0373963ccb0042da8306b35775521d60 
0373963ccb0042da8306b35775521d60] [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] Failed to build and run instance: 
libvirt.libvirtError: Unable to create tap device tap73a105b8-82: Invalid 
argument
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] Traceback (most recent call last):
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b]   File 
"/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2366, in 
_build_and_run_instance
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] self.driver.spawn(context, instance, 
image_meta,
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b]   File 
"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3885, in 
spawn
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] self._create_guest_with_network(
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b]   File 
"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6961, in 
_create_guest_with_network
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] self._cleanup_failed_start(
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b]   File 
"/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] self.force_reraise()
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b]   File 
"/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in 
force_reraise
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] raise self.value
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b]   File 
"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6930, in 
_create_guest_with_network
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b] guest = self._create_guest(
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 
505b68b4-498c-4ea9-85ce-8be0c305ec4b]   File 
"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 6863, in 
_create_guest
  2021-08-11 17:36:44.317 376565 ERROR nova.compute.manager [instance: 

[Yahoo-eng-team] [Bug 1934912] Re: Router update fails for ports with allowed_address_pairs containing IP range in CIDR notation

2021-09-30 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Impish)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Hirsute)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1934912

Title:
  Router update fails for ports with allowed_address_pairs containing IP
  range in CIDR notation

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Hirsute:
  New
Status in neutron source package in Impish:
  New

Bug description:
  With https://review.opendev.org/c/openstack/neutron/+/792791, neutron built 
from branch `stable/train` fails to update routers with ports whose 
`allowed_address_pairs` contain an IP address range in CIDR notation, i.e.:
  ```
  openstack port show 135515bf-6cdf-45d7-affa-c775d2a43ce1 -f value -c 
allowed_address_pairs
  [{'mac_address': 'fa:16:3e:1e:c4:f1', 'ip_address': '192.168.0.0/16'}]
  ```

  I could not find definitive information on whether this is an allowed
  value for allowed_address_pairs, but at least the openstack/magnum
  project makes use of it.

  Once the above is set, neutron-l3-agent logs the errors shown in
  http://paste.openstack.org/show/807237/ and connectivity to all
  resources behind the router stops.

  Steps to reproduce:
  Set up openstack environment with neutron build from git branch stable/train 
with OVS, DVR and router HA in a multinode deployment on ubuntu bionic.

  Create a test environment:
  openstack network create test
  openstack subnet create --network test --subnet-range 10.0.0.0/24 test
  openstack router create --ha --distributed test
  openstack router set --external-gateway  test
  openstack router add subnet test test
  openstack server create --image  --flavor m1.small 
--security-group  --network test test
  openstack security group create icmp
  openstack security group rule create --protocol icmp --ingress icmp
  openstack server add security group test icmp
  openstack floating ip create 
  openstack server add floating ip test 
  ping 
  openstack port set --allowed-address ip-address=192.168.0.0/16 
  ping 

  Observe loss of ping after setting allowed_address_pairs.
  Revert https://review.opendev.org/c/openstack/neutron/+/792791 and redeploy 
neutron
  ping 
  Observe reestablishment of the connection.

  Please let me know if you need any other information.


  +

  SRU:

  [Impact]
  VMs with a floating IP are unreachable from external networks

  [Test Case]
  Create a test environment on bionic ussuri
  openstack network create test
  openstack subnet create --network test --subnet-range 10.0.0.0/24 test
  openstack router create --ha --distributed test
  openstack router set --external-gateway  test
  openstack router add subnet test test
  openstack server create --image  --flavor m1.small 
--security-group  --network test test
  openstack security group create icmp
  openstack security group rule create --protocol icmp --ingress icmp
  openstack server add security group test icmp
  openstack floating ip create 
  openstack server add floating ip test 
  ping 
  openstack port set --allowed-address ip-address=192.168.0.0/16 
  openstack router set --disable 
  openstack router set --enable 
  ping 

  # ping should be successful after router is enabled.

  [Regression Potential]
  The only possibilities for an allowed_address_pair value are a plain IP or a 
CIDR. There is no chance of garbage values since the value is validated during 
port update with allowed_address_pair. The edge case of an IP in CIDR notation 
such as /32 is already covered by the common_utils.is_cidr_host() function. 
All the upstream CI builds up to stable/ussuri are successful.
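
  For reference, a simplified sketch of the host-vs-range distinction that
  check relies on (behaviour approximated from the description above, not
  copied from neutron):

    import netaddr

    def is_cidr_host(cidr):
        # '192.168.0.1/32' -> True  (a single host written in CIDR form)
        # '192.168.0.0/16' -> False (a real address range)
        net = netaddr.IPNetwork(cidr)
        max_prefixlen = 32 if net.version == 4 else 128
        return net.prefixlen == max_prefixlen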

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1934912/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

[Yahoo-eng-team] [Bug 1915480] Re: DeviceManager's fill_dhcp_udp_checksums assumes IPv6 available

2021-09-17 Thread Edward Hope-Morley
** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1915480

Title:
  DeviceManager's fill_dhcp_udp_checksums assumes IPv6 available

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Committed
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New

Bug description:
  The following code in DeviceManager's fill_dhcp_udp_checksums assumes
  IPv6 is always enabled:

  iptables_mgr = iptables_manager.IptablesManager(use_ipv6=True,
  namespace=namespace)

  When iptables_mgr.apply() is later called, an attempt to add the UDP
  checksum rule for DHCP is done via iptables-save/iptables-restore and
  if IPv6 has been disabled on a hypervisor (eg, by setting
  `ipv6.disable=1` on the kernel command line) then a many-line error 
  occurs in the DHCP agent logfile.

  There should be a way of telling the agent that IPv6 is disabled and
  as such, it should ignore trying to set up the UDP checksum rule for
  IPv6. This can be easily achieved given that IptablesManager already
  has support for disabling it.
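
  A hedged sketch of that kind of change (whether the fix uses the
  oslo.utils helper netutils.is_ipv6_enabled() is an assumption; namespace
  comes from the surrounding DeviceManager context):

    from oslo_utils import netutils
    from neutron.agent.linux import iptables_manager

    # Only generate ip6tables rules when the host actually has IPv6;
    # is_ipv6_enabled() returns False when booted with ipv6.disable=1.
    iptables_mgr = iptables_manager.IptablesManager(
        use_ipv6=netutils.is_ipv6_enabled(),
        namespace=namespace)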

  We've seen this on Rocky on Ubuntu Bionic but it appears the issue
  still exists on the master branch.

  =
  Ubuntu SRU details:

  [Impact]
  See above.

  [Test Case]
  Deploy openstack on a hypervisor with IPv6 disabled.
  Create a network which has a subnetwork with DHCP enabled.
  Search the `neutron-dhcp-agent.log` (with debug log enabled) and check if 
there are any `ip6tables-restore` commands.

  [Regression Potential]
  Minimal.
  Users who were relying on the setting always being true could be affected.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1915480/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1915480] Re: DeviceManager's fill_dhcp_udp_checksums assumes IPv6 available

2021-09-16 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1915480

Title:
  DeviceManager's fill_dhcp_udp_checksums assumes IPv6 available

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Committed
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New

Bug description:
  The following code in DeviceManager's fill_dhcp_udp_checksums assumes
  IPv6 is always enabled:

  iptables_mgr = iptables_manager.IptablesManager(use_ipv6=True,
  namespace=namespace)

  When iptables_mgr.apply() is later called, an attempt to add the UDP
  checksum rule for DHCP is done via iptables-save/iptables-restore and
  if IPv6 has been disabled on a hypervisor (eg, by setting
  `ipv6.disable=1` on the kernel command line) then a many-line error 
  occurs in the DHCP agent logfile.

  There should be a way of telling the agent that IPv6 is disabled and
  as such, it should ignore trying to set up the UDP checksum rule for
  IPv6. This can be easily achieved given that IptablesManager already
  has support for disabling it.

  We've seen this on Rocky on Ubuntu Bionic but it appears the issue
  still exists on the master branch.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1915480/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1900851] Re: Cannot Create Port with Fixed IP Address

2021-06-25 Thread Edward Hope-Morley
** Changed in: cloud-archive/xena
   Status: New => Fix Released

** Changed in: cloud-archive/wallaby
   Status: New => Fix Released

** Changed in: cloud-archive/victoria
   Status: New => Fix Released

** Changed in: horizon (Ubuntu Impish)
   Status: New => Fix Released

** Changed in: horizon (Ubuntu Hirsute)
   Status: New => Fix Released

** Changed in: horizon (Ubuntu Groovy)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1900851

Title:
  Cannot Create Port with Fixed IP Address

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  Fix Released
Status in horizon source package in Focal:
  New
Status in horizon source package in Groovy:
  Fix Released
Status in horizon source package in Hirsute:
  Fix Released
Status in horizon source package in Impish:
  Fix Released

Bug description:
  With Ussuri on Ubuntu 20.04, I can create a port with a fixed IP address
  via the CLI, but I cannot do the same via the Horizon GUI. I see an error
  like the following in /var/log/apache2/error.log:

  openstack_dashboard.dashboards.project.networks.ports.workflows Failed
  to create a port for network 91f04dfb-7f69-4050-8b3b-142ee555ae55:
  dictionary keys changed during iteration

  On closer inspection, I found that Horizon never sends the create-port
  request to neutron, so I think it is a Horizon problem. Is this the
  expected result, or is this a Horizon bug? Is this related to policy?
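
  For context, "dictionary keys changed during iteration" is the
  RuntimeError Python 3.8 raises when a dict's key set is mutated while the
  dict is being iterated. A minimal illustration of the failure class and
  the usual fix (illustrative only, not Horizon's actual code):

    params = {'network_id': 'uuid', 'binding:vnic_type': 'normal'}

    # Broken under Python 3.8: renaming keys while iterating the live view
    # keeps the size constant but still changes the key set, raising
    # "RuntimeError: dictionary keys changed during iteration".
    #   for key in params:
    #       if ':' in key:
    #           params[key.replace(':', '_')] = params.pop(key)

    # Fixed: iterate over a snapshot of the keys instead.
    for key in list(params):
        if ':' in key:
            params[key.replace(':', '_')] = params.pop(key)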

  The following debug logs may be related too.

  [Wed Oct 21 17:48:06.123807 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] DEBUG neutronclient.client GET 
call to neutron for http://10.7.55.18:9696/v2.0/extensions used request id 
req-95db8d1f-387b-492b-aff6-8238f09e504d
  [Wed Oct 21 17:48:06.125925 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] DEBUG django.template Exception 
while resolving variable 'add_to_field' in template 
'horizon/common/_workflow.html'.
  [Wed Oct 21 17:48:06.126064 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] 
django.template.base.VariableDoesNotExist: Failed lookup for key [add_to_field] 
in [{'True': True, 'False': False, 'None': None}, {'csrf_token': 
._get_val at 0x7f54c8e30f70>>, 
'LANGUAGES': (('cs', 'Czech'), ('de', 'German'), ('en', 'English'), ('en-au', 
'Australian English'), ('en-gb', 'British English'), ('eo', 'Esperanto'), 
('es', 'Spanish'), ('fr', 'French'), ('id', 'Indonesian'), ('it', 'Italian'), 
('ja', 'Japanese'), ('ko', 'Korean (Korea)'), ('pl', 'Polish'), ('pt-br', 
'Portuguese (Brazil)'), ('ru', 'Russian'), ('tr', 'Turkish'), ('zh-cn', 
'Simplified Chinese'), ('zh-tw', 'Chinese (Taiwan)')), 'LANGUAGE_CODE': 'en', 
'LANGUAGE_BIDI': False, 'request': , 
'MEDIA_URL': '/horizon/media/', 'STATIC_URL': '/horizon/static/', 'messages': 
, 'DEFAULT_MESSAGE_LEVELS': {'DEBUG': 10, 'INFO': 20, 'SUCCESS': 
25, 'WARNING': 30, 'ERROR': 40}, 'HORIZON_CONFIG': , 'True': True, 'False': False, 'authorized_tenants': 
[http://10.7.55.18:5000/v3/projects/84725e39c7a9462495e2cb6ae0cd111b'}, 
name=admin, options={}, parent_id=default, tags=[]>], 'keystone_providers': 
{'support': False}, 'regions': {'support': False, 'current': {'endpoint': 
'http://10.7.55.18:5000/v3/', 'name': 'Default Region'}, 'available': []}, 
'WEBROOT': '/horizon/', 'USER_MENU_LINKS': [{'name': 'OpenStack RC File', 
'icon_classes': ['fa-download'], 'url': 'horizon:project:api_access:openrc'}], 
'LOGOUT_URL': '/horizon/auth/logout/', 'profiler_enabled': False, 'JS_CATALOG': 
'horizon+openstack_dashboard'}, {}, {'network_id': 
'91f04dfb-7f69-4050-8b3b-142ee555ae55', 'view': 
, 'modal_backdrop': 'static', 'workflow': , 'REDIRECT_URL': None, 'layout': ['modal'], 'modal': True}, 
{'entry_point': 'create_info'}]

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1900851/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1900851] Re: Cannot Create Port with Fixed IP Address

2021-06-25 Thread Edward Hope-Morley
** Also affects: cloud-archive/xena
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/wallaby
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Hirsute)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Impish)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Groovy)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1900851

Title:
  Cannot Create Port with Fixed IP Address

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in Ubuntu Cloud Archive wallaby series:
  New
Status in Ubuntu Cloud Archive xena series:
  New
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  New
Status in horizon source package in Focal:
  New
Status in horizon source package in Groovy:
  New
Status in horizon source package in Hirsute:
  New
Status in horizon source package in Impish:
  New

Bug description:
  With Ussuri on Ubuntu 20.04, I can create a port with a fixed IP address
  via the CLI, but I cannot do the same via the Horizon GUI. I see an error
  like the following in /var/log/apache2/error.log:

  openstack_dashboard.dashboards.project.networks.ports.workflows Failed
  to create a port for network 91f04dfb-7f69-4050-8b3b-142ee555ae55:
  dictionary keys changed during iteration

  On closer inspection, I found that Horizon never sends the create-port
  request to neutron, so I think it is a Horizon problem. Is this the
  expected result, or is this a Horizon bug? Is this related to policy?

  The following debug logs may be related too.

  [Wed Oct 21 17:48:06.123807 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] DEBUG neutronclient.client GET 
call to neutron for http://10.7.55.18:9696/v2.0/extensions used request id 
req-95db8d1f-387b-492b-aff6-8238f09e504d
  [Wed Oct 21 17:48:06.125925 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] DEBUG django.template Exception 
while resolving variable 'add_to_field' in template 
'horizon/common/_workflow.html'.
  [Wed Oct 21 17:48:06.126064 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] 
django.template.base.VariableDoesNotExist: Failed lookup for key [add_to_field] 
in [{'True': True, 'False': False, 'None': None}, {'csrf_token': 
._get_val at 0x7f54c8e30f70>>, 
'LANGUAGES': (('cs', 'Czech'), ('de', 'German'), ('en', 'English'), ('en-au', 
'Australian English'), ('en-gb', 'British English'), ('eo', 'Esperanto'), 
('es', 'Spanish'), ('fr', 'French'), ('id', 'Indonesian'), ('it', 'Italian'), 
('ja', 'Japanese'), ('ko', 'Korean (Korea)'), ('pl', 'Polish'), ('pt-br', 
'Portuguese (Brazil)'), ('ru', 'Russian'), ('tr', 'Turkish'), ('zh-cn', 
'Simplified Chinese'), ('zh-tw', 'Chinese (Taiwan)')), 'LANGUAGE_CODE': 'en', 
'LANGUAGE_BIDI': False, 'request': , 
'MEDIA_URL': '/horizon/media/', 'STATIC_URL': '/horizon/static/', 'messages': 
, 'DEFAULT_MESSAGE_LEVELS': {'DEBUG': 10, 'INFO': 20, 'SUCCESS': 
25, 'WARNING': 30, 'ERROR': 40}, 'HORIZON_CONFIG': , 'True': True, 'False': False, 'authorized_tenants': 
[http://10.7.55.18:5000/v3/projects/84725e39c7a9462495e2cb6ae0cd111b'}, 
name=admin, options={}, parent_id=default, tags=[]>], 'keystone_providers': 
{'support': False}, 'regions': {'support': False, 'current': {'endpoint': 
'http://10.7.55.18:5000/v3/', 'name': 'Default Region'}, 'available': []}, 
'WEBROOT': '/horizon/', 'USER_MENU_LINKS': [{'name': 'OpenStack RC File', 
'icon_classes': ['fa-download'], 'url': 'horizon:project:api_access:openrc'}], 
'LOGOUT_URL': '/horizon/auth/logout/', 'profiler_enabled': False, 'JS_CATALOG': 
'horizon+openstack_dashboard'}, {}, {'network_id': 
'91f04dfb-7f69-4050-8b3b-142ee555ae55', 'view': 
, 'modal_backdrop': 'static', 'workflow': , 'REDIRECT_URL': None, 'layout': ['modal'], 'modal': True}, 
{'entry_point': 'create_info'}]

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1900851/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1900851] Re: Cannot Create Port with Fixed IP Address

2021-06-24 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1900851

Title:
  Cannot Create Port with Fixed IP Address

Status in Ubuntu Cloud Archive:
  New
Status in OpenStack Dashboard (Horizon):
  Fix Released

Bug description:
  With Ussuri on Ubuntu 20.04, I can create a port with a fixed IP address
  via the CLI, but I cannot do the same via the Horizon GUI. I see an error
  like the following in /var/log/apache2/error.log:

  openstack_dashboard.dashboards.project.networks.ports.workflows Failed
  to create a port for network 91f04dfb-7f69-4050-8b3b-142ee555ae55:
  dictionary keys changed during iteration

  On closer inspection, I found that Horizon never sends the create-port
  request to neutron, so I think it is a Horizon problem. Is this the
  expected result, or is this a Horizon bug? Is this related to policy?

  The following debug logs may be related too.

  [Wed Oct 21 17:48:06.123807 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] DEBUG neutronclient.client GET 
call to neutron for http://10.7.55.18:9696/v2.0/extensions used request id 
req-95db8d1f-387b-492b-aff6-8238f09e504d
  [Wed Oct 21 17:48:06.125925 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] DEBUG django.template Exception 
while resolving variable 'add_to_field' in template 
'horizon/common/_workflow.html'.
  [Wed Oct 21 17:48:06.126064 2020] [wsgi:error] [pid 3095280:tid 
140002354386688] [remote 192.168.202.12:60886] 
django.template.base.VariableDoesNotExist: Failed lookup for key [add_to_field] 
in [{'True': True, 'False': False, 'None': None}, {'csrf_token': 
._get_val at 0x7f54c8e30f70>>, 
'LANGUAGES': (('cs', 'Czech'), ('de', 'German'), ('en', 'English'), ('en-au', 
'Australian English'), ('en-gb', 'British English'), ('eo', 'Esperanto'), 
('es', 'Spanish'), ('fr', 'French'), ('id', 'Indonesian'), ('it', 'Italian'), 
('ja', 'Japanese'), ('ko', 'Korean (Korea)'), ('pl', 'Polish'), ('pt-br', 
'Portuguese (Brazil)'), ('ru', 'Russian'), ('tr', 'Turkish'), ('zh-cn', 
'Simplified Chinese'), ('zh-tw', 'Chinese (Taiwan)')), 'LANGUAGE_CODE': 'en', 
'LANGUAGE_BIDI': False, 'request': , 
'MEDIA_URL': '/horizon/media/', 'STATIC_URL': '/horizon/static/', 'messages': 
, 'DEFAULT_MESSAGE_LEVELS': {'DEBUG': 10, 'INFO': 20, 'SUCCESS': 
25, 'WARNING': 30, 'ERROR': 40}, 'HORIZON_CONFIG': , 'True': True, 'False': False, 'authorized_tenants': 
[http://10.7.55.18:5000/v3/projects/84725e39c7a9462495e2cb6ae0cd111b'}, 
name=admin, options={}, parent_id=default, tags=[]>], 'keystone_providers': 
{'support': False}, 'regions': {'support': False, 'current': {'endpoint': 
'http://10.7.55.18:5000/v3/', 'name': 'Default Region'}, 'available': []}, 
'WEBROOT': '/horizon/', 'USER_MENU_LINKS': [{'name': 'OpenStack RC File', 
'icon_classes': ['fa-download'], 'url': 'horizon:project:api_access:openrc'}], 
'LOGOUT_URL': '/horizon/auth/logout/', 'profiler_enabled': False, 'JS_CATALOG': 
'horizon+openstack_dashboard'}, {}, {'network_id': 
'91f04dfb-7f69-4050-8b3b-142ee555ae55', 'view': 
, 'modal_backdrop': 'static', 'workflow': , 'REDIRECT_URL': None, 'layout': ['modal'], 'modal': True}, 
{'entry_point': 'create_info'}]

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1900851/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1929832] Re: stable/ussuri py38 support for keepalived-state-change monitor

2021-06-24 Thread Edward Hope-Morley
This has been released to the ussuri cloud archive (which is currently
on 2:16.3.2-0ubuntu3~cloud0), so marking Fix Released.

** Changed in: cloud-archive/ussuri
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1929832

Title:
  stable/ussuri py38 support for keepalived-state-change monitor

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in neutron:
  In Progress
Status in neutron package in Ubuntu:
  Invalid
Status in neutron source package in Focal:
  Fix Released

Bug description:
  [Impact]
  Please see the original bug description. Without this fix, the neutron-l3-agent 
is unable to tear down an HA router and leaves it partially configured on every 
node it was running on.

  [Test Plan]
  * deploy Openstack ussuri on Ubuntu Focal
  * enable L3 HA
  * create a router and vm on network attached to router
  * disable or delete the router and check for errors like the one below
  * ensure that the following line exists in /etc/neutron/rootwrap.d/l3.filters:

  kill_keepalived_monitor_py38: KillFilter, root, python3.8, -15, -9

  -

  The victoria release of Openstack received patch [1] which allows the
  neutron-l3-agent to SIGKILL or SIGTERM the keepalived-state-change
  monitor when running under py38. This patch is needed in Ussuri for
  users running py38, so we need to backport it.
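
  For context: rootwrap KillFilter entries match on the executable name of
  the target process, so a monitor started by the python3.8 interpreter is
  not matched by a pre-existing python3 filter. A hedged excerpt of the
  l3.filters shape (the [Filters] header is standard rootwrap; the py38
  line matches the test plan above):

    [Filters]
    # Allow the l3-agent to send SIGTERM/SIGKILL to a keepalived-state-change
    # monitor whose interpreter binary is python3.8.
    kill_keepalived_monitor_py38: KillFilter, root, python3.8, -15, -9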

  The consequence of not having this is that you get the following when
  you delete or disable a router:

  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
[req-8c69af29-8f9c-4721-9cba-81ff4e9be92c - 9320f5ac55a04fb280d9ceb0b1106a6e - 
- -] Error while deleting router ab63ccd8-1197-48d0-815e-31adc40e5193: 
neutron_lib.exceptions.ProcessExecutionError: Exit code: 99; Stdin: ; Stdout: ; 
Stderr: /usr/bin/neutron-rootwrap: Unauthorized command: kill -15 2516433 (no 
filter matched)
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent Traceback (most 
recent call last):
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 512, in 
_safe_router_removed
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self._router_removed(ri, router_id)
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 548, in 
_router_removed
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self.router_info[router_id] = ri
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self.force_reraise()
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in 
force_reraise
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
six.reraise(self.type_, self.value, self.tb)
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/six.py", line 703, in reraise
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent raise value
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 545, in 
_router_removed
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent ri.delete()
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/dvr_edge_router.py", line 236, 
in delete
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
super(DvrEdgeRouter, self).delete()
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/ha_router.py", line 492, in 
delete
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self.destroy_state_change_monitor(self.process_monitor)
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/ha_router.py", line 438, in 
destroy_state_change_monitor
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
pm.disable(sig=str(int(signal.SIGTERM)))
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/linux/external_process.py", line 
113, in disable
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
utils.execute(cmd, run_as_root=self.run_as_root)
  2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/linux/utils.py", line 147, in 
execute
  2021-05-26 02:11:44.653 3457514 

[Yahoo-eng-team] [Bug 1931244] [NEW] ovn sriov broken from ussuri onwards

2021-06-08 Thread Edward Hope-Morley
Public bug reported:

I have an Openstack Ussuri 16.3.2 deployment using OVN. When I create a
vm with one or more sriov ports it fails with:

2021-06-08 11:38:31.939 526862 WARNING nova.virt.libvirt.driver [req-
c4be797e-7d7e-4e73-8406-f74ae51db192 696c98b722a44d229e16b6d6474a27d4
0b9102977dcc4d4ab662b48494bb3110 - 2e0bf6ec95c047d986a61f7570222149
2e0bf6ec95c047d986a61f7570222149] [instance: 7ab9b374-51eb-
4e94-8055-c69e8a7d76b3] Timeout waiting for [('network-vif-plugged',
'c2b7c68d-c465-4ca2-869a-59bc73b2b595'), ('network-vif-plugged',
'a50de16a-29ac-4dca-9cb6-0247a932fbf3')] for instance with vm_state
building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds

A bit of analysis shows that nova-compute did its thing and sits there
waiting on network-vif-plugged. The sriov-agent then notices the newly
configured VFs and sends a get_devices_details_list() RPC call to neutron,
and neutron never responds. Reverting to 16.3.1 fixes the issue. Taking
a closer look at 16.3.2 by reverting patches led to [1] as the culprit.

[1]
https://github.com/openstack/neutron/commit/7cf9597570f288d27768dc5ff7be04824d09f8bc

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1931244

Title:
  ovn sriov broken from ussuri onwards

Status in neutron:
  New

Bug description:
  I have an Openstack Ussuri 16.3.2 deployment using OVN. When I create
  a vm with one or more sriov ports it fails with:

  2021-06-08 11:38:31.939 526862 WARNING nova.virt.libvirt.driver [req-
  c4be797e-7d7e-4e73-8406-f74ae51db192 696c98b722a44d229e16b6d6474a27d4
  0b9102977dcc4d4ab662b48494bb3110 - 2e0bf6ec95c047d986a61f7570222149
  2e0bf6ec95c047d986a61f7570222149] [instance: 7ab9b374-51eb-
  4e94-8055-c69e8a7d76b3] Timeout waiting for [('network-vif-plugged',
  'c2b7c68d-c465-4ca2-869a-59bc73b2b595'), ('network-vif-plugged',
  'a50de16a-29ac-4dca-9cb6-0247a932fbf3')] for instance with vm_state
  building and task_state spawning.: eventlet.timeout.Timeout: 300
  seconds

  A bit of analysis shows that nova-compute did its thing and sits there
  waiting on network-vif-plugged. The sriov-agent then notices the newly
  configured VFs and sends a get_devices_details_list() RPC call to neutron,
  and neutron never responds. Reverting to 16.3.1 fixes the issue.
  Taking a closer look at 16.3.2 by reverting patches led to [1] as the
  culprit.

  [1]
  
https://github.com/openstack/neutron/commit/7cf9597570f288d27768dc5ff7be04824d09f8bc

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1931244/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1929832] [NEW] stable/ussuri py38 support for keepalived-state-change monitor

2021-05-27 Thread Edward Hope-Morley
Public bug reported:

The victoria release of Openstack received patch [1] which allows the
neutron-l3-agent to SIGKILL or SIGTERM the keepalived-state-change
monitor when running under py38. This patch is needed in Ussuri for
users running with py38 so we need to backport it.

The consequence of not having this is that you get the following when
you delete or disable a router:

2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
[req-8c69af29-8f9c-4721-9cba-81ff4e9be92c - 9320f5ac55a04fb280d9ceb0b1106a6e - 
- -] Error while deleting router ab63ccd8-1197-48d0-815e-31adc40e5193: 
neutron_lib.exceptions.ProcessExecutionError: Exit code: 99; Stdin: ; Stdout: ; 
Stderr: /usr/bin/neutron-rootwrap: Unauthorized command: kill -15 2516433 (no 
filter matched)
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent Traceback (most 
recent call last):
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 512, in 
_safe_router_removed
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self._router_removed(ri, router_id)
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 548, in 
_router_removed
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self.router_info[router_id] = ri
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self.force_reraise()
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in 
force_reraise
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
six.reraise(self.type_, self.value, self.tb)
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/six.py", line 703, in reraise
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent raise value
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/agent.py", line 545, in 
_router_removed
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent ri.delete()
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/dvr_edge_router.py", line 236, 
in delete
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
super(DvrEdgeRouter, self).delete()
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/ha_router.py", line 492, in 
delete
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
self.destroy_state_change_monitor(self.process_monitor)
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/l3/ha_router.py", line 438, in 
destroy_state_change_monitor
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
pm.disable(sig=str(int(signal.SIGTERM)))
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/linux/external_process.py", line 
113, in disable
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
utils.execute(cmd, run_as_root=self.run_as_root)
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent   File 
"/usr/lib/python3/dist-packages/neutron/agent/linux/utils.py", line 147, in 
execute
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent raise 
exceptions.ProcessExecutionError(msg,
2021-05-26 02:11:44.653 3457514 ERROR neutron.agent.l3.agent 
neutron_lib.exceptions.ProcessExecutionError: Exit code: 99; Stdin: ; Stdout: ; 
Stderr: /usr/bin/neutron-rootwrap: Unauthorized command: kill -15 2516433 (no 
filter matched)

This results in the router being deleted from neutron but not from the node.
In my case I had both a qrouter and an snat namespace left with IPs still
configured, as well as my fip ip rule allocation still present in
/var/lib/neutron/fip-priorities.

[1]
https://github.com/openstack/neutron/commit/4fb505891ee32ae41247f1d7a48b7455b342840e

** Affects: cloud-archive
 Importance: Undecided
 Status: Invalid

** Affects: cloud-archive/ussuri
 Importance: High
 Status: Triaged

** Affects: neutron
 Importance: Undecided
 Assignee: Edward Hope-Morley (hopem)
 Status: In Progress

** Affects: neutron (Ubuntu)
 Importance: Undecided
 Status: Invalid

** Affects: neutron (Ubuntu Focal)
 Importance: High
 Status: Triaged

** Changed in: neutron
   Status: New => In Progress

** Changed in: neutron
 Assignee: (unassigned) => Edward Hope-Morley (hopem)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.

[Yahoo-eng-team] [Bug 1929821] [NEW] [dvr] misleading fip rule priority not found error message

2021-05-27 Thread Edward Hope-Morley
Public bug reported:

The fix for bug 1891673 added an error log like "Rule priority not found
for floating ip a.b.c.d", emitted when the ip rule priority information
needed to configure a floating IP cannot be found in the fip-priorities
file. This error message can be misleading: not all floating IPs have or
need a priority allocation, since only those configured in qrouter
namespaces need one, while FIPs for unbound ports are configured in the
snat namespace (DVR). We should gate retrieving that information on
whether the fixed_ip associated with the FIP is bound, to avoid this
misleading error message.
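
A hedged sketch of the proposed gating (hypothetical names; the real agent
code differs):

    import logging

    LOG = logging.getLogger(__name__)

    def log_missing_fip_priority(fip, priorities):
        # FIPs for unbound ports are configured in the snat namespace (DVR)
        # and never receive a rule priority, so a missing entry is expected
        # for them and should not be logged as an error.
        if not fip.get('port_bound'):
            return
        if fip['floating_ip_address'] not in priorities:
            LOG.error("Rule priority not found for floating ip %s",
                      fip['floating_ip_address'])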

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1929821

Title:
  [dvr] misleading fip rule priority not found error message

Status in neutron:
  New

Bug description:
  The fix for bug 1891673 added an error log like "Rule priority not
  found for floating ip a.b.c.d" such that if the ip rule priority
  information needed to configure a floating ip could not be found (in
  the fip-priorities file) the error message is logged. This error
  message can be misleading since not all floating ips will have or need
  a priority allocation since only those configured in qrouter
  namespaces need them but not fips for unbound ports that are
  configured in the snat ns (dvr). We should gate retrieving that
  information on whether the fixed_ip associated with fip is bound or
  not to avoid having this misleading error messages.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1929821/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1907686] Re: ovn: instance unable to retrieve metadata

2021-05-24 Thread Edward Hope-Morley
** Changed in: cloud-archive/victoria
   Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1907686

Title:
  ovn: instance unable to retrieve metadata

Status in charm-ovn-chassis:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Won't Fix
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in neutron:
  Invalid
Status in openvswitch package in Ubuntu:
  Fix Released
Status in openvswitch source package in Focal:
  Fix Released
Status in openvswitch source package in Groovy:
  Fix Released
Status in openvswitch source package in Hirsute:
  Fix Released

Bug description:
  [Impact]
  Cloud instances are unable to retrieve metadata on startup.

  [Test Case]
  Deploy OpenStack with OVN/OVS
  Restart OVN central controllers
  Create a new instance
  Instance will fail to retrieve metadata with the message from the original 
bug report displayed in the metadata agent log on the local hypervisor

  [Regression Potential]
  The fix for this issue is included in the upstream 2.13.3 release of OVS.
  The fix ensures that SSL-related connection issues are correctly handled in 
python3-ovs, avoiding an issue where the connection to the OVN SB IDL is reset 
and never recreated.
  The OVN drivers use python3-ovsdbapp, which in turn bases off code provided by 
python3-ovs.

  
  [Original Bug Report]
  Ubuntu:focal
  OpenStack: ussuri
  Instance port: hardware offloaded

  instance created, attempts to access metadata - metadata agent can't
  resolve the port/network combination:

  2020-12-10 15:00:18.258 4732 INFO neutron.agent.ovn.metadata.agent [-] Port 
d65418a6-d0e9-47e6-84ba-3d02fe75131a in datapath 
37706e4d-ce2a-4d81-8c61-3fd12437a0a7 bound to our ch
  assis
  2020-12-10 15:00:31.672 8062 ERROR neutron.agent.ovn.metadata.server [-] No 
port found in network 37706e4d-ce2a-4d81-8c61-3fd12437a0a7 with IP address 
10.5.1.155
  2020-12-10 15:00:31.673 8062 INFO eventlet.wsgi.server [-] 10.5.1.155, 
"GET /openstack HTTP/1.1" status: 404  len: 297 time: 0.0043790
  2020-12-10 15:00:34.639 8062 ERROR neutron.agent.ovn.metadata.server [-] No 
port found in network 37706e4d-ce2a-4d81-8c61-3fd12437a0a7 with IP address 
10.5.1.155
  2020-12-10 15:00:34.639 8062 INFO eventlet.wsgi.server [-] 10.5.1.155, 
"GET /openstack HTTP/1.1" status: 404  len: 297 time: 0.0040138

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-ovn-chassis/+bug/1907686/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1926653] Re: [ovn] ml2/ovn may time out connecting to ovsdb server and stays dead in the water

2021-05-17 Thread Edward Hope-Morley
** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1926653

Title:
  [ovn] ml2/ovn may time out connecting to ovsdb server and stays dead
  in the water

Status in neutron:
  Fix Released

Bug description:
  Right now, the IDL connections between ml2/ovn and the ovsdb server are
  not resilient enough when connecting. It doesn't make sense to give up
  on that connection, since ml2/ovn is useless without it.

  If ovsdb-server is slow and takes more than the timeout in seconds,
  reconnecting after every partial download and starting over is not going
  to make things better. That is particularly likely to happen when the
  OVN DB is very large.

  This work is also tracked under Bugzilla:
  https://bugzilla.redhat.com/1955271
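
  A hedged sketch of the retry-forever approach (hypothetical helper; the
  actual fix lives in the ovsdbapp/neutron connection code):

    import time

    def connect_idl_with_retry(connect, timeout=180, backoff=5):
        # Keep retrying rather than giving up: ml2/ovn cannot operate
        # without the OVN DB connection, and with a very large database
        # the initial download can legitimately exceed any fixed timeout.
        while True:
            try:
                return connect(timeout)
            except TimeoutError:
                time.sleep(backoff)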

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1926653/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1849098] Re: ovs agent is stuck with OVSFWTagNotFound when dealing with unbound port

2021-05-11 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1849098

Title:
  ovs agent is stuck with OVSFWTagNotFound when dealing with unbound
  port

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  Fix Committed

Bug description:
  [Impact]

  Somehow a port becomes unbound; neutron-openvswitch-agent then raises
  OVSFWTagNotFound, and creating a new instance fails.

  [Test Plan]
  1. deploy bionic openstack env
  2. launch one instance
  3. modify neutron-openvswitch-agent code inside nova-compute
  - https://pastebin.ubuntu.com/p/nBRKkXmjx8/
  4. restart neutron-openvswitch-agent
  5. check the logs for many 'Cannot get tag for port ...' errors
  6. launch another instance.
  7. It fails after vif_plugging_timeout, with "virtual interface creation 
failed"

  [Where problems could occur]
  While no regressions are expected, if they did occur it would be when getting 
or creating a VIF port.
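
  A hedged sketch of the tolerant lookup involved (simplified from the
  traceback below; the real driver raises OVSFWTagNotFound and the fix
  changes how unbound ports are handled):

    def get_tag_from_other_config(other_config):
        # An unbound port has no 'tag' in its other_config ({} in the log
        # below), so return None instead of raising and aborting the whole
        # rpc_loop iteration; the caller then skips firewall setup for it.
        try:
            return int(other_config['tag'])
        except (TypeError, KeyError, ValueError):
            return None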

  [Others]

  Original description.

  neutron-openvswitch-agent encounters an unbound port:

  2019-10-17 11:32:21.868 135 WARNING
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-
  aae68b42-a99f-4bb3-bcf6-a6d3c4ca9e31 - - - - -] Device
  ef34215f-e099-4fd0-935f-c9a42951d166 not defined on plugin or binding
  failed

  Later when applying firewall rules:

  2019-10-17 11:32:21.901 135 INFO neutron.agent.securitygroups_rpc 
[req-aae68b42-a99f-4bb3-bcf6-a6d3c4ca9e31 - - - - -] Preparing filters for 
devices {'ef34215f-e099-4fd0-935f-c9a42951d166', 
'e9c97cf0-1a5e-4d77-b57b-0ba474d12e29', 'fff1bb24-6423-4486-87c4-1fe17c552cca', 
'2e20f9ee-bcb5-445c-b31f-d70d276d45c9', '03a60047-cb07-42a4-8b49-619d5982a9bd', 
'a452cea2-deaf-4411-bbae-ce83870cbad4', '79b03e5c-9be0-4808-9784-cb4878c3dbd5', 
'9b971e75-3c1b-463d-88cf-3f298105fa6e'}
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-aae68b42-a99f-4bb3-bcf6-a6d3c4ca9e31 - - - - -] Error while processing VIF 
ports: neutron.agent.linux.openvswitch_firewall.exceptions.OVSFWTagNotFound: 
Cannot get tag for port o-hm0 from its other_config: {}
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most 
recent call last):
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.6/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py",
 line 530, in get_or_create_ofport
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent of_port = 
self.sg_port_map.ports[port_id]
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent KeyError: 
'ef34215f-e099-4fd0-935f-c9a42951d166'
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent During handling 
of the above exception, another exception occurred:
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most 
recent call last):
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.6/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py",
 line 81, in get_tag_from_other_config
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return 
int(other_config['tag'])
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent KeyError: 'tag'
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent During handling 
of the above exception, another exception occurred:
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most 
recent call last):
  2019-10-17 11:32:21.906 135 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File 
"/var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 2280, in rpc_loop
  2019-10-17 11:32:21.906 135 ERROR 

[Yahoo-eng-team] [Bug 1883089] Re: [L3] floating IP failed to bind due to no agent gateway port(fip-ns)

2021-04-27 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/victoria
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1883089

Title:
  [L3] floating IP failed to bind due to no agent gateway port(fip-ns)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  Fix Released
Status in neutron source package in Hirsute:
  Fix Released
Status in neutron source package in Impish:
  Fix Released

Bug description:
  Patch [1] introduced a DB unique-constraint binding for the L3
  agent gateway. In some extreme cases the DvrFipGatewayPortAgentBinding
  row is in the DB while the gateway port is not. The current code path only
  checks for the binding's existence, so it passes a "None" port to the
  following code path, which results in an AttributeError.

  [1] https://review.opendev.org/#/c/702547/
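
  A hedged sketch of the shape of the fix (hypothetical names; the real
  l3_dvr_db code differs): verify the gateway port itself, not just the
  DvrFipGatewayPortAgentBinding row, before handing it on.

    def ensure_fip_agent_gw_port(plugin, context, network_id, host):
        port = plugin.get_agent_gw_port(context, network_id, host)
        if port is None:
            # The binding may exist while the port does not (the extreme
            # case above); (re)create the gateway port instead of passing
            # None down the call path and triggering an AttributeError.
            port = plugin.create_agent_gw_port(context, network_id, host)
        return port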

  Exception log:

  2020-06-11 15:39:28.361 1285214 INFO neutron.db.l3_dvr_db [None 
req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Floating IP Agent Gateway 
port for network 3fcb7702-ae0b-46b4-807f-8ae94d656dd3 does not exist on host 
host-compute-1. Creating one.
  2020-06-11 15:39:28.370 1285214 DEBUG neutron.db.l3_dvr_db [None 
req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Floating IP Agent Gateway 
port for network 3fcb7702-ae0b-46b4-807f-8ae94d656dd3 already exists on host 
host-compute-1. Probably it was just created by other worker. 
create_fip_agent_gw_port_if_not_exists 
/usr/lib/python2.7/site-packages/neutron/db/l3_dvr_db.py:927
  2020-06-11 15:39:28.390 1285214 DEBUG neutron.db.l3_dvr_db [None 
req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Floating IP Agent Gateway 
port None found for the destination host: host-compute-1 
create_fip_agent_gw_port_if_not_exists 
/usr/lib/python2.7/site-packages/neutron/db/l3_dvr_db.py:933
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server [None 
req-d6a41187-2495-46bf-a424-ab7195c0ecb1 - - - - -] Exception during message 
handling: AttributeError: 'NoneType' object has no attribute 'get'
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server Traceback 
(most recent call last):
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 170, in 
_process_incoming
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server res = 
self.dispatcher.dispatch(message)
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, 
in dispatch
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server return 
self._do_dispatch(endpoint, method, ctxt, args)
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, 
in _do_dispatch
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server result = 
func(ctxt, **new_args)
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron/db/api.py", line 91, in wrapped
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server 
setattr(e, '_RETRY_EXCEEDED', True)
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server 
self.force_reraise()
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in 
force_reraise
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server 
six.reraise(self.type_, self.value, self.tb)
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/neutron/db/api.py", line 87, in wrapped
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server return 
f(*args, **kwargs)
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_db/api.py", line 147, in wrapper
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server 
ectxt.value = e.inner_exc
  2020-06-11 15:39:28.391 1285214 ERROR oslo_messaging.rpc.server   File 
"/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-06-11 

[Yahoo-eng-team] [Bug 1916022] Re: L3HA Race condition during startup of the agent may cause inconsistent router's states

2021-03-25 Thread Edward Hope-Morley
This has already been fix-released all the way back to stable/train -
https://review.opendev.org/q/I2cc58c30cf844ee0ecf0611ecdec430086464790 -
so I will update LP to reflect that.

** Changed in: neutron
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1916022

Title:
  L3HA Race condition during startup of the agent may cause inconsistent
  router's states

Status in neutron:
  Fix Released

Bug description:
  I observed this issue in Tobiko jobs, e.g.
  https://5f31a0f7dc56e4b42a89-207bd119fd0c3b58e9c78074b243256d.ssl.cf2.rackcdn.com/776284/2/check/devstack-tobiko-gate-multinode/257fd87/tobiko_results_05_verify_resources_scenario.html

  The problem is with HA routers. When the neutron-l3-agent and then
keepalived are killed on the node which is master, a new node becomes master
but the VIP address isn't removed from the qrouter namespace on the old one.
  Some other node then becomes the new master, as keepalived on the running
nodes did its job.
  When the stopped agent is started again, it first calls update_initial_state()
https://github.com/openstack/neutron/blob/90309cf6e2f3ed5ae6d5f4cca3c5351c2ac67a13/neutron/agent/l3/ha_router.py#L159
  which enqueues a state change event, possibly with the "primary" state (the
old state from before the agent and keepalived went down). Immediately after
that, it also spawns the state change monitor, and that monitor enqueues a
state change event as well. This one may already carry the correct "backup"
state, but since a "primary" state change was already scheduled for
processing, the new one is dropped.
  This leaves 2 nodes in the "primary" state.

  I think that calling update_initial_state() isn't really needed, as the
  state change monitor handles notification of the initial state right
  after the process starts.
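
  To make the dropped-event mechanics concrete, here is a minimal,
  runnable sketch (an assumption about the queueing behaviour, not
  neutron's actual implementation): a queue that de-duplicates pending
  state changes per router keeps the first enqueued state and drops the
  later, possibly correct, one.

  ```python
  import queue

  class StateChangeQueue:
      """De-duplicates pending state changes per router (illustrative)."""
      def __init__(self):
          self._pending = {}      # router_id -> state awaiting processing
          self._q = queue.Queue()

      def enqueue(self, router_id, state):
          if router_id in self._pending:
              # A state change for this router is already scheduled; the
              # newer event (possibly the correct "backup") is dropped.
              return
          self._pending[router_id] = state
          self._q.put(router_id)

  q = StateChangeQueue()
  q.enqueue("r1", "primary")  # stale state from update_initial_state()
  q.enqueue("r1", "backup")   # correct state from the monitor -- dropped
  assert q._pending["r1"] == "primary"
  ```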

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1916022/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1920975] Re: neutron dvr should lower proxy_delay when using proxy_arp

2021-03-23 Thread Edward Hope-Morley
This can alternatively be fixed easily by changing the default via the
charm's sysctl config option, but of course that would not fix existing
fip namespaces.

** Description changed:

- Neutron DVR uses arp_proxy in fip namespaces to respond to arp requests
+ Neutron DVR uses proxy_arp in fip namespaces to respond to arp requests
  for instance floating ips. In doing so it is susceptible to a random
  delay up to by default 800ms which is added to the time taken to respond
  to an arp request that has to be proxied i.e.
  
  # ip netns exec fip-a297543b-9ef9-4bd5-b1ca-e85a726c1726 sysctl 
net.ipv4.{conf.fg-51f3e07b-2d.proxy_arp,neigh.fg-51f3e07b-2d.proxy_delay}
  net.ipv4.conf.fg-51f3e07b-2d.proxy_arp = 1
  net.ipv4.neigh.fg-51f3e07b-2d.proxy_delay = 80
  
  The result of this is seen when e.g. you ping a vm fip and the first
  request takes significantly longer than subsequent requests:
  
  $ ping -c 5 10.5.150.90
  PING 10.5.150.90 (10.5.150.90) 56(84) bytes of data.
  64 bytes from 10.5.150.90: icmp_seq=1 ttl=60 time=491 ms
  64 bytes from 10.5.150.90: icmp_seq=2 ttl=60 time=1.08 ms
  64 bytes from 10.5.150.90: icmp_seq=3 ttl=60 time=1.39 ms
  64 bytes from 10.5.150.90: icmp_seq=4 ttl=60 time=1.16 ms
  64 bytes from 10.5.150.90: icmp_seq=5 ttl=60 time=1.03 ms
  
  --- 10.5.150.90 ping statistics ---
  5 packets transmitted, 5 received, 0% packet loss, time 4007ms
  rtt min/avg/max/mdev = 1.034/99.157/491.134/195.988 ms
  
  To repro again simply delete arp entry for fip from fip ns of source
  compute host.
  
  By kernel standards this behaviour is by-design when using the default
  settings but some workloads may be impacted by this initial delay
  especially e.g. in loaded environments where the arp caches are under
  strain and hitting gc_thresh limits.

** Also affects: charm-neutron-openvswitch
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1920975

Title:
  neutron dvr should lower proxy_delay when using proxy_arp

Status in OpenStack neutron-openvswitch charm:
  New
Status in neutron:
  In Progress

Bug description:
  Neutron DVR uses proxy_arp in fip namespaces to respond to arp
  requests for instance floating ips. In doing so it is susceptible to a
  random delay up to by default 800ms which is added to the time taken
  to respond to an arp request that has to be proxied i.e.

  # ip netns exec fip-a297543b-9ef9-4bd5-b1ca-e85a726c1726 sysctl 
net.ipv4.{conf.fg-51f3e07b-2d.proxy_arp,neigh.fg-51f3e07b-2d.proxy_delay}
  net.ipv4.conf.fg-51f3e07b-2d.proxy_arp = 1
  net.ipv4.neigh.fg-51f3e07b-2d.proxy_delay = 80

  The result of this is seen when e.g. you ping a vm fip and the first
  request takes significantly longer than subsequent requests:

  $ ping -c 5 10.5.150.90
  PING 10.5.150.90 (10.5.150.90) 56(84) bytes of data.
  64 bytes from 10.5.150.90: icmp_seq=1 ttl=60 time=491 ms
  64 bytes from 10.5.150.90: icmp_seq=2 ttl=60 time=1.08 ms
  64 bytes from 10.5.150.90: icmp_seq=3 ttl=60 time=1.39 ms
  64 bytes from 10.5.150.90: icmp_seq=4 ttl=60 time=1.16 ms
  64 bytes from 10.5.150.90: icmp_seq=5 ttl=60 time=1.03 ms

  --- 10.5.150.90 ping statistics ---
  5 packets transmitted, 5 received, 0% packet loss, time 4007ms
  rtt min/avg/max/mdev = 1.034/99.157/491.134/195.988 ms

  To reproduce it again, simply delete the arp entry for the fip from
  the fip namespace of the source compute host.

  By kernel standards this behaviour is by-design when using the default
  settings but some workloads may be impacted by this initial delay
  especially e.g. in loaded environments where the arp caches are under
  strain and hitting gc_thresh limits.
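
  As a rough sketch of the kind of change proposed (an assumption, not
  the actual neutron or charm fix; the namespace and interface names are
  illustrative, and the command needs root plus an existing namespace),
  lowering proxy_delay in the fip namespace removes the initial delay:

  ```python
  import subprocess

  ns = "fip-a297543b-9ef9-4bd5-b1ca-e85a726c1726"  # illustrative namespace
  iface = "fg-51f3e07b-2d"                         # illustrative fg- port

  # Set proxy_delay to 0 so proxied ARP replies are sent immediately.
  subprocess.run(
      ["ip", "netns", "exec", ns,
       "sysctl", "-w", f"net.ipv4.neigh.{iface}.proxy_delay=0"],
      check=True,
  )
  ```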

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1920975/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1920975] [NEW] neutron dvr should lower proxy_delay when using proxy_arp

2021-03-23 Thread Edward Hope-Morley
Public bug reported:

Neutron DVR uses proxy_arp in fip namespaces to respond to arp requests
for instance floating ips. In doing so it is susceptible to a random delay
of up to 800ms (by default), which is added to the time taken to respond
to an arp request that has to be proxied, i.e.

# ip netns exec fip-a297543b-9ef9-4bd5-b1ca-e85a726c1726 sysctl 
net.ipv4.{conf.fg-51f3e07b-2d.proxy_arp,neigh.fg-51f3e07b-2d.proxy_delay}
net.ipv4.conf.fg-51f3e07b-2d.proxy_arp = 1
net.ipv4.neigh.fg-51f3e07b-2d.proxy_delay = 80

The result of this is seen when e.g. you ping a vm fip and the first
request takes significantly longer than subsequent requests:

$ ping -c 5 10.5.150.90
PING 10.5.150.90 (10.5.150.90) 56(84) bytes of data.
64 bytes from 10.5.150.90: icmp_seq=1 ttl=60 time=491 ms
64 bytes from 10.5.150.90: icmp_seq=2 ttl=60 time=1.08 ms
64 bytes from 10.5.150.90: icmp_seq=3 ttl=60 time=1.39 ms
64 bytes from 10.5.150.90: icmp_seq=4 ttl=60 time=1.16 ms
64 bytes from 10.5.150.90: icmp_seq=5 ttl=60 time=1.03 ms

--- 10.5.150.90 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4007ms
rtt min/avg/max/mdev = 1.034/99.157/491.134/195.988 ms

To reproduce it again, simply delete the arp entry for the fip from the
fip namespace of the source compute host.

By kernel standards this behaviour is by-design when using the default
settings but some workloads may be impacted by this initial delay
especially e.g. in loaded environments where the arp caches are under
strain and hitting gc_thresh limits.

** Affects: charm-neutron-openvswitch
 Importance: Undecided
 Status: New

** Affects: neutron
 Importance: Undecided
 Status: New

** Summary changed:

- neutron dvr should lower arp_delay when using arp_proxy
+ neutron dvr should lower proxy_delay when using arp_proxy

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1920975

Title:
  neutron dvr should lower proxy_delay when using proxy_arp

Status in OpenStack neutron-openvswitch charm:
  New
Status in neutron:
  New

Bug description:
  Neutron DVR uses proxy_arp in fip namespaces to respond to arp
  requests for instance floating ips. In doing so it is susceptible to a
  random delay of up to 800ms (by default), which is added to the time
  taken to respond to an arp request that has to be proxied, i.e.

  # ip netns exec fip-a297543b-9ef9-4bd5-b1ca-e85a726c1726 sysctl 
net.ipv4.{conf.fg-51f3e07b-2d.proxy_arp,neigh.fg-51f3e07b-2d.proxy_delay}
  net.ipv4.conf.fg-51f3e07b-2d.proxy_arp = 1
  net.ipv4.neigh.fg-51f3e07b-2d.proxy_delay = 80

  The result of this is seen when e.g. you ping a vm fip and the first
  request takes significantly longer than subsequent requests:

  $ ping -c 5 10.5.150.90
  PING 10.5.150.90 (10.5.150.90) 56(84) bytes of data.
  64 bytes from 10.5.150.90: icmp_seq=1 ttl=60 time=491 ms
  64 bytes from 10.5.150.90: icmp_seq=2 ttl=60 time=1.08 ms
  64 bytes from 10.5.150.90: icmp_seq=3 ttl=60 time=1.39 ms
  64 bytes from 10.5.150.90: icmp_seq=4 ttl=60 time=1.16 ms
  64 bytes from 10.5.150.90: icmp_seq=5 ttl=60 time=1.03 ms

  --- 10.5.150.90 ping statistics ---
  5 packets transmitted, 5 received, 0% packet loss, time 4007ms
  rtt min/avg/max/mdev = 1.034/99.157/491.134/195.988 ms

  To reproduce it again, simply delete the arp entry for the fip from
  the fip namespace of the source compute host.

  By kernel standards this behaviour is by-design when using the default
  settings but some workloads may be impacted by this initial delay
  especially e.g. in loaded environments where the arp caches are under
  strain and hitting gc_thresh limits.

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1920975/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1916761] Re: [dvr] bound port permanent arp entries never deleted

2021-03-18 Thread Edward Hope-Morley
** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1916761

Title:
  [dvr] bound port permanent arp entries never deleted

Status in Ubuntu Cloud Archive:
  Fix Committed
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Focal:
  Fix Committed
Status in neutron source package in Groovy:
  Fix Committed
Status in neutron source package in Hirsute:
  Fix Released

Bug description:
  [Impact]

  See the original bug description, but in short: commit b3a42cddc5
  removed all the arp management code in favour of using the
  arp_responder, but missed the fact that DVR floating ips don't use the
  arp_responder. As a result it was possible to end up with permanent arp
  entries in qrouter namespaces such that, if you created a new port with
  the same IP as that of a previous port for which there is an arp entry,
  a fip associated with the new port would never be accessible until the
  stale arp entry was manually deleted. This patch adds the reverted code
  back in.

  [Test Plan]

* deploy Openstack Ussuri
* create port P1 with address A1 and create vm on node C1 with this port
* associate floating ip with P1 and ping it
* observe REACHABLE or PERMANENT arp entry for A1 in qrouter arp cache
* delete vm and port
* ensure arp entry for A1 in qrouter arp cache is deleted
* create port P2 with address A1 and create vm on node C1 with this port
* associate floating ip with P2 and ping it

  [Where problems could occur]

  No problems are anticipated from re-introducing this code. This code
  uses RPC notifications and will therefore incur some extra amqp load,
  but this is not anticipated to be a problem; it was not considered one
  when the code existed prior to removal.

  --

  With Openstack Ussuri using dvr-snat I do the following:

* create port P1 with address A1 and create vm on node C1 with this port
* associate floating ip with P1 and ping it
* observe REACHABLE arp entry for A1 in qrouter arp cache
* so far so good
* restart the neutron-l3-agent
* observe REACHABLE arp entry for A1 is now PERMANENT
* delete vm and port
* create port P2 with address A1 and create vm on node C1 with this port
* vm is unreachable since arp cache contains PERMANENT entry for old port 
P1 mac/ip combo

  If I don't restart the l3-agent, once I have deleted the port its arp
  entry goes REACHABLE -> STALE and will either be replaced or time out
  as expected, but once it is set to PERMANENT it will never disappear,
  which means any future use of that ip address (by a port with a
  different mac) will not work until that entry is manually deleted.
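
  A hedged, illustrative way to spot such leftover entries (not part of
  neutron; the namespace name is hypothetical and the commands need
  root):

  ```python
  import subprocess

  ns = "qrouter-example"  # hypothetical qrouter namespace name

  out = subprocess.run(
      ["ip", "netns", "exec", ns, "ip", "neigh", "show"],
      capture_output=True, text=True, check=True,
  ).stdout

  # PERMANENT entries never age out; any hit here is a stale candidate.
  for line in out.splitlines():
      if "PERMANENT" in line:
          print("stale candidate:", line)
  ```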

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1916761/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1887405] Re: Race condition while processing security_groups_member_updated events (ipset)

2021-03-12 Thread Edward Hope-Morley
** Package changed: ubuntu => neutron (Ubuntu)

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Hirsute)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Groovy)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1887405

Title:
  Race condition while processing security_groups_member_updated events
  (ipset)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  New
Status in neutron source package in Hirsute:
  New

Bug description:
  # Summary

  Race condition while processing security_groups_member_updated events
  (ipset)

  # Overview

  We have a customer that uses heat templates to deploy large
  environments (e.g. 21 instances) with a significant number of security
  groups (e.g. 60) that use bi-directional remote group references for
  both ingress and egress filtering.  These heat stacks are deployed
  using a CI pipeline and intermittently suffer from application layer
  failures due to broken network connectivity.  We found that this was
  caused by the ipsets used to implement remote_group memberships
  missing IPs from their member lists.  Troubleshooting suggests this is
  caused by a race condition, which I've attempted to describe in detail
  below.

  Version: `54e1a6b1bc378c0745afc03987d0fea241b826ae` (HEAD of
  stable/rocky as of Jan 26, 2020), though I suspect this issue persists
  through master.

  I'm working on getting some multi-node environments deployed (I don't
  think it's possible to reproduce this with a single hypervisor) and
  hope to provide reproduction steps on Rocky and master soon.  I wanted
  to get this report submitted as-is with the hopes that an experienced
  Neutron dev might be able to spot possible solutions or provide
  diagnostic insight that I am not yet able to produce.

  I suspect this report may be easier to read with some markdown, so
  please feel free to read it in a gist:
  https://gist.github.com/cfarquhar/20fddf2000a83216021bd15b512f772b

  Also, this diagram is probably critical to following along:
  https://user-images.githubusercontent.com/1253665/87317744-0a75b180-c4ed-11ea-9bad-085019c0f954.png

  # Race condition symptoms

  Given the following security groups/rules:

  ```
  | secgroup name | secgroup id                          | direction | remote group                         | dest port |
  |---------------|--------------------------------------|-----------|--------------------------------------|-----------|
  | server        | fcd6cf12-2ac9-4704-9208-7c6cb83d1a71 | ingress   | b52c8c54-b97a-477d-8b68-f4075e7595d9 | 9092      |
  | client        | b52c8c54-b97a-477d-8b68-f4075e7595d9 | egress    | fcd6cf12-2ac9-4704-9208-7c6cb83d1a71 | 9092      |
  ```

  And the following instances:

  ```
  | instance name | hypervisor | ip  | secgroup assignment |
  |---------------|------------|-------------|---------------------|
  | server01  | compute01  | 192.168.0.1 | server  |
  | server02  | compute02  | 192.168.0.2 | server  |
  | server03  | compute03  | 192.168.0.3 | server  |
  | client01  | compute04  | 192.168.0.4 | client  |
  ```

  We would expect to find the following ipset representing the `server`
  security group members on `compute04`:

  ```
  # ipset list NIPv4fcd6cf12-2ac9-4704-9208-
  Name: NIPv4fcd6cf12-2ac9-4704-9208-
  Type: hash:net
  Revision: 6
  Header: family inet hashsize 1024 maxelem 65536
  Size in memory: 536
  References: 4
  Number of entries: 3
  Members:
  192.168.0.1
  192.168.0.2
  192.168.0.3
  ```

  What we actually get when the race condition is triggered is an
  incomplete list of members in the ipset.  The member list could
  contain anywhere between zero and two of the expected IPs.
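
  The following minimal, runnable sketch (an assumption about the
  mechanism, not neutron's actual code) shows how a member-update event
  that arrives before the port is registered, as detailed in the next
  section, ends up silently discarded:

  ```python
  class FirewallDriver:
      """Illustrative stand-in for IptablesFirewallDriver."""
      def __init__(self):
          self.ports = {}          # port_id -> port (filled by port_update)
          self.ipset_members = {}  # secgroup_id -> set of member IPs

      def security_group_member_updated(self, sg_id, members):
          uses_group = any(sg_id in p.get("security_group_source_groups", [])
                           for p in self.ports.values())
          if not uses_group:
              # No registered port references this remote group yet, so
              # the update is dropped and the ipset stays incomplete.
              return
          self.ipset_members[sg_id] = set(members)

  fw = FirewallDriver()
  # Event arrives between port_update steps 12 and 22: port not registered.
  fw.security_group_member_updated("sg-server", ["192.168.0.1", "192.168.0.2"])
  assert "sg-server" not in fw.ipset_members  # update effectively lost
  ```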

  # Triggering the race condition

  The problem occurs when `security_group_member_updated` events arrive between 
`port_update` steps 12 and 22 (see diagram and process details below).  
- `port_update` step 12 retrieves the remote security groups' member lists, 
which are not necessarily complete yet.
- `port_update` step 22 adds the port to `IptablesFirewallDriver.ports()`.

  This results in `security_group_member_updated` step 3 looking for the
  port to apply the updated member list 

[Yahoo-eng-team] [Bug 1887405] Re: Race condition while processing security_groups_member_updated events (ipset)

2021-03-12 Thread Edward Hope-Morley
Fix released and backported to stable/victoria and stable/ussuri -
https://review.opendev.org/q/I6e4abe13a7541c21399466c5eb0d61ff5780c887

** Changed in: neutron
   Status: In Progress => Fix Released

** Also affects: ubuntu
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1887405

Title:
  Race condition while processing security_groups_member_updated events
  (ipset)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  New
Status in neutron source package in Hirsute:
  New

Bug description:
  # Summary

  Race condition while processing security_groups_member_updated events
  (ipset)

  # Overview

  We have a customer that uses heat templates to deploy large
  environments (e.g. 21 instances) with a significant number of security
  groups (e.g. 60) that use bi-directional remote group references for
  both ingress and egress filtering.  These heat stacks are deployed
  using a CI pipeline and intermittently suffer from application layer
  failures due to broken network connectivity.  We found that this was
  caused by the ipsets used to implement remote_group memberships
  missing IPs from their member lists.  Troubleshooting suggests this is
  caused by a race condition, which I've attempted to describe in detail
  below.

  Version: `54e1a6b1bc378c0745afc03987d0fea241b826ae` (HEAD of
  stable/rocky as of Jan 26, 2020), though I suspect this issue persists
  through master.

  I'm working on getting some multi-node environments deployed (I don't
  think it's possible to reproduce this with a single hypervisor) and
  hope to provide reproduction steps on Rocky and master soon.  I wanted
  to get this report submitted as-is with the hopes that an experienced
  Neutron dev might be able to spot possible solutions or provide
  diagnostic insight that I am not yet able to produce.

  I suspect this report may be easier to read with some markdown, so
  please feel free to read it in a gist:
  https://gist.github.com/cfarquhar/20fddf2000a83216021bd15b512f772b

  Also, this diagram is probably critical to following along:
  https://user-images.githubusercontent.com/1253665/87317744-0a75b180-c4ed-11ea-9bad-085019c0f954.png

  # Race condition symptoms

  Given the following security groups/rules:

  ```
  | secgroup name | secgroup id                          | direction | remote group                         | dest port |
  |---------------|--------------------------------------|-----------|--------------------------------------|-----------|
  | server        | fcd6cf12-2ac9-4704-9208-7c6cb83d1a71 | ingress   | b52c8c54-b97a-477d-8b68-f4075e7595d9 | 9092      |
  | client        | b52c8c54-b97a-477d-8b68-f4075e7595d9 | egress    | fcd6cf12-2ac9-4704-9208-7c6cb83d1a71 | 9092      |
  ```

  And the following instances:

  ```
  | instance name | hypervisor | ip  | secgroup assignment |
  |---------------|------------|-------------|---------------------|
  | server01  | compute01  | 192.168.0.1 | server  |
  | server02  | compute02  | 192.168.0.2 | server  |
  | server03  | compute03  | 192.168.0.3 | server  |
  | client01  | compute04  | 192.168.0.4 | client  |
  ```

  We would expect to find the following ipset representing the `server`
  security group members on `compute04`:

  ```
  # ipset list NIPv4fcd6cf12-2ac9-4704-9208-
  Name: NIPv4fcd6cf12-2ac9-4704-9208-
  Type: hash:net
  Revision: 6
  Header: family inet hashsize 1024 maxelem 65536
  Size in memory: 536
  References: 4
  Number of entries: 3
  Members:
  192.168.0.1
  192.168.0.2
  192.168.0.3
  ```

  What we actually get when the race condition is triggered is an
  incomplete list of members in the ipset.  The member list could
  contain anywhere between zero and two of the expected IPs.

  # Triggering the race condition

  The problem occurs when `security_group_member_updated` events arrive between 
`port_update` steps 12 and 22 (see diagram and process details below).  
- `port_update` step 12 retrieves the remote security groups' member lists, 
which are not necessarily complete yet.
- `port_update` step 22 adds the port to `IptablesFirewallDriver.ports()`.

  This results in `security_group_member_updated` step 3 looking for the
  port to apply the updated member list to (in
  `IptablesFirewallDriver.ports()`) BEFORE it has been added by
  `port_update`'s step 22. This causes the membership update event to
  effectively be discarded.  We are then left with whatever the remote
  security group's member list was when the `port_update` 

[Yahoo-eng-team] [Bug 1832021] Re: Checksum drop of metadata traffic on isolated networks with DPDK

2021-03-11 Thread Edward Hope-Morley
** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/stein
   Status: New => Fix Released

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1832021

Title:
  Checksum drop of metadata traffic on isolated networks with DPDK

Status in OpenStack neutron-openvswitch charm:
  Fix Released
Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New

Bug description:
  [Impact]

  When using an isolated network on provider networks for tenants
  (meaning without virtual routers: DVR or a network node), metadata
  access occurs in the qdhcp ip netns rather than the qrouter netns.

  The following options are set in the dhcp_agent.ini file:
  force_metadata = True
  enable_isolated_metadata = True

  VMs on the provider tenant network are unable to access metadata, as
  packets are dropped due to checksum errors.

  [Test Plan]

  1. Create an OpenStack deployment with DPDK options enabled and
  'enable-local-dhcp-and-metadata: true' in neutron-openvswitch. A
  sample, simple 3 node bundle can be found here[1].

  2. Create an external flat network and subnet:

  openstack network show dpdk_net || \
openstack network create --provider-network-type flat \
 --provider-physical-network physnet1 dpdk_net \
 --external

  openstack subnet show dpdk_net || \
  openstack subnet create --allocation-pool 
start=10.230.58.100,end=10.230.58.200 \
  --subnet-range 10.230.56.0/21 --dhcp --gateway 
10.230.56.1 \
  --dns-nameserver 10.230.56.2 \
  --ip-version 4 --network dpdk_net dpdk_subnet

  
  3. Create an instance attached to that network. The instance must have a 
flavor that uses huge pages.

  openstack flavor create --ram 8192 --disk 50 --vcpus 4 m1.dpdk
  openstack flavor set m1.dpdk --property hw:mem_page_size=large

  openstack server create --wait --image xenial --flavor m1.dpdk --key-
  name testkey --network dpdk_net i1

  4. Log into the instance host and check the instance console. The
  instance will hang into the boot and show the following message:

  2020-11-20 09:43:26,790 - openstack.py[DEBUG]: Failed reading optional
  path http://169.254.169.254/openstack/2015-10-15/user_data due to:
  HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out.
  (read timeout=10.0)

  5. Apply the fix in all computes, restart the DHCP agents in all
  computes and create the instance again.

  6. No errors should be shown and the instance quickly boots.

  
  [Where problems could occur]

  * This change is only exercised if datapath_type and ovs_use_veth are
  set. Those settings are mostly used for DPDK environments. The core of
  the fix is to toggle off checksum offloading on the DHCP namespace
  interfaces (an illustrative sketch of the toggle follows below). This
  has the drawback of adding some overhead to packet processing for DHCP
  traffic, but given that DHCP does not move much data, this should be a
  minor problem.

  * Future changes on the syntax of the ethtool command could cause
  regressions
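
  As an illustration only (a hedged sketch, not the charm or neutron
  code; the namespace and interface names are hypothetical and the
  commands need root), the fix boils down to an ethtool toggle like
  this:

  ```python
  import subprocess

  ns = "qdhcp-3fcb7702-ae0b-46b4-807f-8ae94d656dd3"  # hypothetical namespace
  iface = "tap-example0"                             # hypothetical DHCP port

  # Disable TX checksum offload so metadata/DHCP packets carry valid
  # checksums instead of being dropped by the receiver.
  subprocess.run(
      ["ip", "netns", "exec", ns, "ethtool", "-K", iface, "tx", "off"],
      check=True,
  )
  ```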


  [Other Info]

   * None


  [1] https://gist.github.com/sombrafam/e0741138773e444960eb4aeace6e3e79

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1832021/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1896734] Re: A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack

2021-03-09 Thread Edward Hope-Morley
Neutron fix is released upstream and backported all the way to Train.

** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1896734

Title:
  A privsep daemon spawned by neutron-openvswitch-agent hangs when debug
  logging is enabled (large number of registered NICs) - an RPC response
  is too large for msgpack

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in neutron:
  Fix Released
Status in oslo.privsep:
  New
Status in python-oslo.privsep package in Ubuntu:
  New

Bug description:
  When there is a large amount of netdevs registered in the kernel and
  debug logging is enabled, neutron-openvswitch-agent and the privsep
  daemon spawned by it hang since the RPC call result sent by the
  privsep daemon over a unix socket exceeds the message sizes that the
  msgpack library can handle.

  The impact of this is that enabling debug logging on the cloud
  completely stalls neutron-openvswitch-agents and makes them "dead"
  from the Neutron server perspective.

  The issue is summarized in detail in comment #5
  https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
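
  As a rough illustration of the failure mode (an assumption, not the
  exact oslo.privsep code path), msgpack's streaming Unpacker enforces a
  maximum buffer size and rejects payloads beyond it:

  ```python
  import msgpack
  from msgpack.exceptions import BufferFull

  # Roughly a megabyte of serialized data, standing in for a huge RPC reply.
  payload = msgpack.packb(["x" * 1024] * 1024)

  unpacker = msgpack.Unpacker(max_buffer_size=64 * 1024)
  try:
      unpacker.feed(payload)
  except BufferFull:
      print("payload exceeds the unpacker's buffer limit")
  ```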

  
  Old Description

  While trying to debug a different issue, I encountered a situation
  where privsep hangs in the process of handling a request from neutron-
  openvswitch-agent when debug logging is enabled (juju debug-log
  neutron-openvswitch=true):

  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12

  The issue gets reproduced reliably in the environment where I
  encountered it on all units. As a result, neutron-openvswitch-agent
  services hang while waiting for a response from the privsep daemon and
  do not progress past basic initialization. They never post any state
  back to the Neutron server and thus are marked dead by it.

  The processes, though, are shown as "active (running)" by systemd,
  which adds to the confusion since they do indeed start from systemd's
  perspective.

  systemctl --no-pager status neutron-openvswitch-agent.service
  ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin 
Agent
     Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; 
enabled; vendor preset: enabled)
     Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
   Main PID: 247772 (/usr/bin/python)
  Tasks: 4 (limit: 9830)
     CGroup: /system.slice/neutron-openvswitch-agent.service
     ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent 
--config-file=/etc/neutron/neutron.conf 
--config-file=/etc/neutron/plugins/ml2/openvswitch_…og
     └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file 
/etc/neutron/neutron.conf --config-file 
/etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck

  

  An strace shows that the privsep daemon tries to receive input from fd
  3 which is the unix socket it uses to communicate with the client.
  However, this is just one thread out of many spawned by the privsep
  daemon, so it is unlikely to be the root cause (there are 65 threads
  there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/)

  # there is one extra neutron-openvvswitch-agent running in a LXD container 
which can be ignored here (there is an octavia unit on the node which has a 
neutron-openvswitch subordinate)
  root@node2:~# ps -eo pid,user,args --sort user | grep -P 
'privsep.*openvswitch'
   860690 10   /usr/bin/python3 /usr/bin/privsep-helper --config-file 
/etc/neutron/neutron.conf --config-file 
/etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context 
neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
   248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file 
/etc/neutron/neutron.conf --config-file 
/etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context 
neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
   363905 root grep --color=auto -P privsep.*openvswitch

  root@node2:~# strace -f -p 248453 2>&1
  [pid 248786] futex(0x7f6a6401c1d0, 
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0x 
  [pid 248475] futex(0x7f6a6c024590, 
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0x 
  [pid 248473] futex(0x7f6a746d9fd0, 
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0x 
  [pid 248453] recvfrom(3,

  root@node2:~# lsof -p 248453  | grep 3u
  privsep-h 248453 root3u  unix 0x8e6d8abdec00  0t0 356522977 
type=STREAM

  root@node2:~# ss -pax | grep 356522977
  u_str ESTAB   00  

[Yahoo-eng-team] [Bug 1916761] [NEW] [dvr] bound port permanent arp entries never deleted

2021-02-24 Thread Edward Hope-Morley
Public bug reported:

With Openstack Ussuri using dvr-snat I do the following:

  * create port P1 with address A1 and create vm on node C1 with this port
  * associate floating ip with P1 and ping it
  * observe REACHABLE arp entry for A1 in qrouter arp cache
  * so far so good
  * restart the neutron-l3-agent
  * observe REACHABLE arp entry for A1 is now PERMANENT
  * delete vm and port
  * create port P2 with address A1 and create vm on node C1 with this port
  * vm is unreachable since arp cache contains PERMANENT entry for old port P1 
mac/ip combo

If I don't restart the l3-agent, once I have deleted the port its arp
entry goes REACHABLE -> STALE and will either be replaced or time out as
expected, but once it is set to PERMANENT it will never disappear, which
means any future use of that ip address (by a port with a different mac)
will not work until that entry is manually deleted.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1916761

Title:
  [dvr] bound port permanent arp entries never deleted

Status in neutron:
  New

Bug description:
  With Openstack Ussuri using dvr-snat I do the following:

* create port P1 with address A1 and create vm on node C1 with this port
* associate floating ip with P1 and ping it
* observe REACHABLE arp entry for A1 in qrouter arp cache
* so far so good
* restart the neutron-l3-agent
* observe REACHABLE arp entry for A1 is now PERMANENT
* delete vm and port
* create port P2 with address A1 and create vm on node C1 with this port
* vm is unreachable since arp cache contains PERMANENT entry for old port 
P1 mac/ip combo

  If I don't restart the l3-agent, once I have deleted the port its arp
  entry goes REACHABLE -> STALE and will either be replaced or time out
  as expected, but once it is set to PERMANENT it will never disappear,
  which means any future use of that ip address (by a port with a
  different mac) will not work until that entry is manually deleted.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1916761/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1894843] Re: [dvr_snat] Router update deletes rfp interface from qrouter even when VM port is present on this host

2021-02-24 Thread Edward Hope-Morley
** Description changed:

+ [Impact]
+ When neutron schedules snat namespaces it sometimes deletes the rfp interface 
from qrouter namespaces which breaks external network (fip) connectivity. The 
fix prevents this from happening.
+ 
+ [Test Case]
+  * deploy Openstack (Ussuri or above) with dvr_snat enabled in compute hosts.
+  * ensure min. 2 compute hosts
+  * create one ext network and one private network
+  * add private subnet to router and ext as gateway
+  * check which compute has the snat ns (ip netns| grep snat)
+  * create a vm on each compute host
+  * check that qrouter ns on both computes has rfp interface
+  * ip netns| grep qrouter; ip netns exec  ip a s| grep rfp
+  * disable and re-enable router
+  * openstack router set --disable ;  openstack router set --enable 

+  * check again
+  * ip netns| grep qrouter; ip netns exec  ip a s| grep rfp
+ 
+ [Regression Potential]
+ This patch is in fact restoring expected behaviour and is not expected to
+ introduce any new regressions.
+ 
+ -
+ 
  Hello,
  
  In the case of dvr_snat l3 agents are deployed on hypervisors there can
  be race condition. The agent creates snat namespaces on each scheduled
  host and removes them at second step. At this second step agent removes
  the rfp interface from qrouter even when there is VM with floating IP on
  the host.
  
  When VM is deployed at the time of second step we can lost external
  access to VMs floating IP. The issue can be reproduced by hand:
  
  1. Create tenant network and router with external gateway
  2. Create VM with floating ip
  3. Ensure that VM on the hypervisor without snat-* namespace
  4. Set the router to disabled state (openstack router set --disable )
  5. Set the router to enabled state (openstack router set --enabled )
  6. The external access to VMs FIP have lost because L3 agent creates the 
qrouter namespace without rfp interface.
  
- 
  Environment:
  
  1. Neutron with ML2 OVS plugin.
  2. L3 agents in dvr_snat mode on each hypervisor
  3. openstack-neutron-common-15.1.1-0.2020061910.7d97420.el8ost.noarch

** Changed in: neutron (Ubuntu Hirsute)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1894843

Title:
  [dvr_snat] Router update deletes rfp interface from qrouter even when
  VM port is present on this host

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  New
Status in neutron source package in Hirsute:
  Fix Released

Bug description:
  [Impact]
  When neutron schedules snat namespaces it sometimes deletes the rfp interface 
from qrouter namespaces which breaks external network (fip) connectivity. The 
fix prevents this from happening.

  [Test Case]
   * deploy Openstack (Ussuri or above) with dvr_snat enabled in compute hosts.
   * ensure min. 2 compute hosts
   * create one ext network and one private network
   * add private subnet to router and ext as gateway
   * check which compute has the snat ns (ip netns| grep snat)
   * create a vm on each compute host
   * check that qrouter ns on both computes has rfp interface
   * ip netns| grep qrouter; ip netns exec  ip a s| grep rfp
   * disable and re-enable router
   * openstack router set --disable ;  openstack router set --enable 

   * check again (a scripted version of this check is sketched below)
   * ip netns| grep qrouter; ip netns exec  ip a s| grep rfp
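
  A hedged sketch automating that check (an assumption, not part of the
  SRU; namespace discovery is simplified and the exec step needs root):

  ```python
  import subprocess

  def qrouter_namespaces():
      out = subprocess.run(["ip", "netns", "list"],
                           capture_output=True, text=True, check=True).stdout
      return [line.split()[0] for line in out.splitlines()
              if line.startswith("qrouter-")]

  for ns in qrouter_namespaces():
      links = subprocess.run(["ip", "netns", "exec", ns, "ip", "a", "s"],
                             capture_output=True, text=True, check=True).stdout
      status = "ok" if "rfp-" in links else "MISSING rfp interface"
      print(f"{ns}: {status}")
  ```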

  [Regression Potential]
  This patch is in fact restoring expected behaviour and is not expected to
  introduce any new regressions.

  -

  Hello,

  In the case where dvr_snat l3 agents are deployed on hypervisors,
  there can be a race condition. The agent creates snat namespaces on
  each scheduled host and removes them in a second step. In this second
  step the agent removes the rfp interface from the qrouter namespace
  even when a VM with a floating IP is present on the host.

  When a VM is deployed at the time of the second step, we can lose
  external access to the VM's floating IP. The issue can be reproduced
  by hand:

  1. Create tenant network and router with external gateway
  2. Create VM with floating ip
  3. Ensure that VM on the hypervisor without snat-* namespace
  4. Set the router to disabled state (openstack router set --disable )
  5. Set the router to enabled state (openstack router set --enable )
  6. The external access to VMs FIP have lost because L3 agent creates the 
qrouter namespace without rfp interface.

  Environment:

  1. Neutron with ML2 OVS plugin.
  2. L3 agents in dvr_snat mode on each hypervisor
  3. 

[Yahoo-eng-team] [Bug 1894843] Re: [dvr_snat] Router update deletes rfp interface from qrouter even when VM port is present on this host

2021-02-24 Thread Edward Hope-Morley
** Changed in: neutron
   Status: In Progress => Fix Released

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Hirsute)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1894843

Title:
  [dvr_snat] Router update deletes rfp interface from qrouter even when
  VM port is present on this host

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  New
Status in neutron source package in Hirsute:
  New

Bug description:
  Hello,

  In the case where dvr_snat l3 agents are deployed on hypervisors,
  there can be a race condition. The agent creates snat namespaces on
  each scheduled host and removes them in a second step. In this second
  step the agent removes the rfp interface from the qrouter namespace
  even when a VM with a floating IP is present on the host.

  When a VM is deployed at the time of the second step, we can lose
  external access to the VM's floating IP. The issue can be reproduced
  by hand:

  1. Create tenant network and router with external gateway
  2. Create VM with floating ip
  3. Ensure that VM on the hypervisor without snat-* namespace
  4. Set the router to disabled state (openstack router set --disable )
  5. Set the router to enabled state (openstack router set --enable )
  6. The external access to VMs FIP have lost because L3 agent creates the 
qrouter namespace without rfp interface.

  
  Environment:

  1. Neutron with ML2 OVS plugin.
  2. L3 agents in dvr_snat mode on each hypervisor
  3. openstack-neutron-common-15.1.1-0.2020061910.7d97420.el8ost.noarch

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1894843/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1869808] Re: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

2021-02-18 Thread Edward Hope-Morley
** Changed in: cloud-archive/victoria
   Status: New => Fix Released

** Changed in: cloud-archive/ussuri
   Status: New => Fix Released

** Changed in: cloud-archive/train
   Status: New => Fix Released

** Changed in: cloud-archive/stein
   Status: New => Fix Released

** Changed in: neutron (Ubuntu Hirsute)
   Status: New => Fix Released

** Changed in: neutron (Ubuntu Groovy)
   Status: New => Fix Released

** Changed in: neutron (Ubuntu Focal)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1869808

Title:
  reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Groovy:
  Fix Released
Status in neutron source package in Hirsute:
  Fix Released

Bug description:
  We are using Openstack Neutron 13.0.6 and it is deployed using
  OpenStack-helm.

  I tested pinging servers in the same vlan while rebooting the
  neutron-ovs-agent. The result shows:

  root@mgt01:~# openstack server list
  
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+
  | ID                                   | Name     | Status | Networks             | Image               | Flavor  |
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+
  | 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4  | Cirros 0.4.0 64-bit | m1.tiny |
  | 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+

  $ ping 172.31.10.4
  PING 172.31.10.4 (172.31.10.4): 56 data bytes
  ..
  64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
  64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <
  64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
  64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
  64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
  64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
  64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
  64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

  As one can see, packet seq 62 is lost, I believe, while the ovs agent
  is rebooting.

  Right now, I suspect that this code is refreshing flow table rules
  even though it is not necessary:
  https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229

  When I dump the flows on the phys bridge, I can see the duration
  rewinding to 0, which suggests the flow has been deleted and created
  again:

  """   duration=secs
The  time,  in  seconds,  that  the entry has been in the table.
secs includes as much precision as the switch provides, possibly
to nanosecond resolution.
  """

  root@compute01:~# ovs-ofctl dump-flows br-floating
  ...
   cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409, priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
                              ^-- this value resets
  ...

  IMO, rebooting the ovs-agent should not affect the data plane.
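
  A hedged way to observe the symptom (an illustrative script, not part
  of neutron; it assumes ovs-ofctl is available and the bridge name
  matches): if any flow's duration on the phys bridge drops across an
  agent restart, that flow was deleted and re-created.

  ```python
  import re
  import subprocess

  def flow_durations(bridge="br-floating"):
      out = subprocess.run(["ovs-ofctl", "dump-flows", bridge],
                           capture_output=True, text=True, check=True).stdout
      return [float(m.group(1))
              for m in re.finditer(r"duration=([0-9.]+)s", out)]

  before = flow_durations()
  input("restart neutron-openvswitch-agent, then press Enter... ")
  after = flow_durations()
  if after and before and min(after) < min(before):
      print("at least one flow was re-created during the restart")
  ```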

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1869808/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1869808] Re: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

2021-02-17 Thread Edward Hope-Morley
** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Hirsute)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Bionic)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1869808

Title:
  reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  New
Status in neutron source package in Hirsute:
  New

Bug description:
  We are using Openstack Neutron 13.0.6 and it is deployed using
  OpenStack-helm.

  I tested pinging servers in the same vlan while rebooting the
  neutron-ovs-agent. The result shows:

  root@mgt01:~# openstack server list
  
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+
  | ID                                   | Name     | Status | Networks             | Image               | Flavor  |
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+
  | 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4  | Cirros 0.4.0 64-bit | m1.tiny |
  | 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+

  $ ping 172.31.10.4
  PING 172.31.10.4 (172.31.10.4): 56 data bytes
  ..
  64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
  64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <
  64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
  64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
  64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
  64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
  64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
  64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

  As one can see, packet seq 62 is lost, I believe, while the ovs agent
  is rebooting.

  Right now, I suspect that this code is refreshing flow table rules
  even though it is not necessary:
  https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229

  When I dump the flows on the phys bridge, I can see the duration
  rewinding to 0, which suggests the flow has been deleted and created
  again:

  """   duration=secs
The  time,  in  seconds,  that  the entry has been in the table.
secs includes as much precision as the switch provides, possibly
to nanosecond resolution.
  """

  root@compute01:~# ovs-ofctl dump-flows br-floating
  ...
   cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409, priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
                              ^-- this value resets
  ...

  IMO, rebooting the ovs-agent should not affect the data plane.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1869808/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1869808] Re: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

2021-02-17 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1869808

Title:
  reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  New
Status in neutron source package in Hirsute:
  New

Bug description:
  We are using Openstack Neutron 13.0.6 and it is deployed using
  OpenStack-helm.

  I tested pinging servers in the same vlan while rebooting the
  neutron-ovs-agent. The result shows:

  root@mgt01:~# openstack server list
  
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+
  | ID                                   | Name     | Status | Networks             | Image               | Flavor  |
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+
  | 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4  | Cirros 0.4.0 64-bit | m1.tiny |
  | 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |
  +--------------------------------------+----------+--------+----------------------+---------------------+---------+

  $ ping 172.31.10.4
  PING 172.31.10.4 (172.31.10.4): 56 data bytes
  ..
  64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
  64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <
  64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
  64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
  64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
  64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
  64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
  64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

  As one can see, packet seq 62 is lost, I believe, while the ovs agent
  is rebooting.

  Right now, I suspect that this code is refreshing flow table rules
  even though it is not necessary:
  https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229

  When I dump the flows on the phys bridge, I can see the duration
  rewinding to 0, which suggests the flow has been deleted and created
  again:

  """   duration=secs
The  time,  in  seconds,  that  the entry has been in the table.
secs includes as much precision as the switch provides, possibly
to nanosecond resolution.
  """

  root@compute01:~# ovs-ofctl dump-flows br-floating
  ...
   cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409, priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
                              ^-- this value resets
  ...

  IMO, restarting the ovs-agent should not affect the data plane.
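
  A simple way to watch for these resets across an agent restart (a
  hedged sketch; it assumes the bridge and flow from the dump above):

  root@compute01:~# while true; do
  >   ovs-ofctl dump-flows br-floating table=2 | grep mod_vlan_vid:105
  >   sleep 1
  > done

  If the agent merely re-asserted identical flows, duration would keep
  growing; a drop back to ~0s means the flow was deleted and re-added.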

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1869808/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1874733] Re: [OVN] Stale ports can be present in OVN NB leading to metadata errors

2021-02-04 Thread Edward Hope-Morley
** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1874733

Title:
  [OVN] Stale ports can be present in OVN NB leading to metadata errors

Status in neutron:
  Fix Released

Bug description:
  Right now, there's a chance that deleting a port in Neutron with
  ML2/OVN actually deletes the object from Neutron DB while leaving a
  stale port in the OVN NB database.

  This can happen when deleting a port [0] raises a RowNotFound
  exception. While it may look like this means the port no longer existed
  in OVN NB, the truth is that the current port_delete function can throw
  that exception for different reasons (especially against OVN < 2.10,
  where Address Sets were used instead of Port Groups).

  Such an exception can be observed, for example, if some ACL or Address
  Set doesn't exist [1][2], amongst others. In this case, the revision
  number of the object will be deleted [3] and the port will remain stale
  forever in OVN NB (it'll be skipped by the maintenance task).

  One of the main impacts of this issue is that the OVN NB database will
  grow and accumulate stale objects that go undetected (they will only be
  detected by the neutron-ovn-db-sync script). But most importantly,
  multiple ports in the same OVN Logical Switch may have the same IP
  addresses, and this causes legitimate ports to be left without
  metadata.

  As per the metadata agent code here [4], if more than one port in the
  same network has the same IP address, a 404 will be returned to the
  instance upon requesting metadata.
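
  For illustration, the lookup in [4] behaves roughly like the following
  (a simplified, hypothetical sketch with made-up names, not the actual
  agent code):

  from webob import exc

  def find_instance_port(network_ports, instance_ip):
      ports = [p for p in network_ports if instance_ip in p.addresses]
      if len(ports) != 1:
          # zero or duplicate matches: the agent cannot tell which
          # instance is asking, so the request is answered with a 404
          raise exc.HTTPNotFound()
      return ports[0]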

  The workaround is running the neutron-db-sync script in repair mode to
  get rid of the stale ports.
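
  With the Ubuntu packages that is typically something like (config file
  paths vary per deployment):

  neutron-ovn-db-sync-util --config-file /etc/neutron/neutron.conf \
      --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
      --ovn-neutron_sync_mode repair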

  A proper fix would involve a better granularity of the exceptions that
  can happen around a port deletion, acting accordingly upon each of
  them. In the worst case, we won't delete the revision number if the
  port still exists, leaving it up to the Maintenance task to fix it
  later on (< 5 minutes). Ideally, we should identify all possible code
  paths and delete the port from OVN whenever possible, even if some
  other associated operation fails (with proper logging).
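
  A hedged sketch of that direction (helper names are hypothetical; this
  is not the actual ovn_client code):

  from ovsdbapp.backend.ovs_idl import idlutils

  def _cleanup_acls_and_address_sets(nb_idl, port_id):
      """Placeholder for the ACL/Address Set cleanup done in [1][2]."""

  def delete_port(nb_idl, db_rev, context, port_id):
      try:
          # only a RowNotFound raised here means the LSP itself is gone
          nb_idl.lsp_del(port_id).execute(check_error=True)
      except idlutils.RowNotFound:
          pass
      try:
          _cleanup_acls_and_address_sets(nb_idl, port_id)
      except idlutils.RowNotFound:
          # a stale ACL/Address Set must not abort the deletion: log it
          # and carry on, the port itself is gone
          pass
      # only now is it safe to drop the revision number; the LSP no
      # longer exists, so nothing is left for the maintenance task
      db_rev.delete_revision(context, port_id)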

  
  Also, this scenario seems to be more likely under a high concurrency of API 
operations (such as heat) and possibly when Port Groups are not supported by 
the schema (OVN < 2.10).

  Daniel Alvarez

  
  [0] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L719
  [1] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L680
  [2] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L690
  [3] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L722
  [4] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/agent/ovn/metadata/server.py#L86

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1874733/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1893263] Re: [SRU] Cannot create instance with multiqueue image and vif_type=tap (calico)

2020-10-28 Thread Edward Hope-Morley
** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1893263

Title:
  [SRU] Cannot create instance with multiqueue image and vif_type=tap
  (calico)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in OpenStack Compute (nova):
  Fix Released
Status in nova package in Ubuntu:
  New

Bug description:
  When using calico, the vif_type is tap, therefore when the instance is
  being created, the method plug_tap() is invoked, which creates the tap
  device prior to launching the instance.

  That tap device is currently always created without multiqueue as per
  [1]. When libvirt creates the instance, the XML definition
  "queues=" clashes with the fact that the pre-existing tap interface
  doesn't have multiqueue enabled, and therefore errors out with the
  exception below. The code at [2] already handles multiqueue, but it is
  never invoked with multiqueue=True.
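
  A hedged sketch of how plug_tap() could pass the flag through (the
  derivation below is hypothetical, not the actual nova patch):

  from nova.privsep import linux_net

  def plug_tap(instance, vif):
      # hypothetical: derive the flag from the image property used in the
      # reproduction steps below (hw_vif_multiqueue_enabled) and the
      # flavor's vCPU count
      props = instance.image_meta.properties
      multiqueue = (props.get('hw_vif_multiqueue_enabled', False)
                    and instance.flavor.vcpus > 1)
      # create_tap_dev() at [2] already accepts multiqueue; today it is
      # simply never called with multiqueue=True
      linux_net.create_tap_dev(vif['devname'], multiqueue=multiqueue)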

  Alternatively, as a current workaround, if the instance is shut down
  through virsh, or rebooted through nova, the tap device is removed and
  then created again by libvirt instead, allowing the tap device to be
  set up with multiqueue appropriately if its XML is manually edited.
  This begs the question of why the plug_tap() method needs to pre-create
  the interface at all, if libvirt does so regardless of plug_tap() when
  the VM is rebooted.

  Steps to reproduce:

  1) Ubuntu bionic + devstack master + follow instructions at [3]
  2) wget https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img
  3) openstack image create bionic-mq --file bionic-server-cloudimg-amd64.img --property hw_vif_multiqueue_enabled=True
  4) openstack image create bionic --file bionic-server-cloudimg-amd64.img
  5) ssh-keygen
  6) openstack keypair create key1 --public-key ~/.ssh/id_rsa.pub
  7) openstack flavor create --vcpu 2 --ram 1024 --disk 10 --public --id 10 test_flavor
  8) openstack server create --network calico --flavor test_flavor --image bionic --key-name key1 no-mq

  instance is created successfully

  9) ip a

  6: tapcc353751-13:  mtu 1500 qdisc
  fq_codel state UP group default qlen 1000

  10) sudo virsh edit 1

  add "" to the interface section

  11) openstack server reboot no-mq

  wait a few secs

  12) ip a

  7: tapcc353751-13:  mtu 1500 qdisc mq
  state UNKNOWN group default qlen 1000

  13) ssh to the instance and run "sudo ethtool -l "

  Combined:   2

  14) openstack server delete no-mq

  15) openstack server create --network calico --flavor test_flavor
  --image bionic-mq --key-name key1 mq

  the instance fails to be created; the log shows the stack trace below.

  [1] 
https://github.com/openstack/nova/blob/f521f4dbace0e35bedd089369da6f6969da5ca32/nova/virt/libvirt/vif.py#L701
  [2] 
https://github.com/openstack/nova/blob/f521f4dbace0e35bedd089369da6f6969da5ca32/nova/privsep/linux_net.py#L109
  [3] 
https://docs.projectcalico.org/getting-started/openstack/installation/devstack

  Aug 27 18:58:38 devstack nova-compute[7968]: ERROR nova.compute.manager [None req-71d40776-0fa7-466e-9060-11472b5bce42 admin admin] [instance: 69a0a527-9c33-432f-8889-c421ae8aebb4] Instance failed to spawn: libvirt.libvirtError: Unable to create tap device tapb6021dd0-fd: Invalid argument
  Aug 27 18:58:38 devstack nova-compute[7968]: ERROR nova.compute.manager [instance: 69a0a527-9c33-432f-8889-c421ae8aebb4] Traceback (most recent call last):
  Aug 27 18:58:38 devstack nova-compute[7968]: ERROR nova.compute.manager [instance: 69a0a527-9c33-432f-8889-c421ae8aebb4]   File "/opt/stack/nova/nova/compute/manager.py", line 2628, in _build_resources
  Aug 27 18:58:38 devstack nova-compute[7968]: ERROR nova.compute.manager [instance: 69a0a527-9c33-432f-8889-c421ae8aebb4] yield resources
  Aug 27 18:58:38 devstack nova-compute[7968]: ERROR nova.compute.manager [instance: 69a0a527-9c33-432f-8889-c421ae8aebb4]   File "/opt/stack/nova/nova/compute/manager.py", line 2401, in _build_and_run_instance
  Aug 27 18:58:38 devstack nova-compute[7968]: ERROR nova.compute.manager [instance: 69a0a527-9c33-432f-8889-c421ae8aebb4] accel_info=accel_info)
  Aug 27 18:58:38 devstack nova-compute[7968]: ERROR nova.compute.manager [instance: 69a0a527-9c33-432f-8889-c421ae8aebb4]   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 3701, in spawn
  Aug 27 18:58:38 devstack 

[Yahoo-eng-team] [Bug 1881157] Re: [OVS][FW] Remote SG IDs left behind when a SG is removed

2020-10-24 Thread Edward Hope-Morley
** Changed in: cloud-archive/train
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1881157

Title:
  [OVS][FW] Remote SG IDs left behind when a SG is removed

Status in Ubuntu Cloud Archive:
  Fix Committed
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in neutron:
  New
Status in neutron package in Ubuntu:
  Fix Committed
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Groovy:
  Fix Committed

Bug description:
  [Impact]

  neutron does not remove all trace of remote sg conj ids when deleting
  a security group.

  [Test Case]

   * deploy openstack (no particular feature needed)
   * create two networks N1, N2 with security groups SG1, SG2 respectively
   * SG2 must have a custom ingress tcp rule from remote SG1
   * create a vm on each network, make a note of their fixed_ip then delete 
those vms
   * on compute host running VM2 do the following:
   * sudo ovs-ofctl dump-flows br-int table=82 | grep 
   * sudo ovs-ofctl dump-flows br-int table=82 | egrep "conjunction\([0-9]+,2/2\)"
   * the above should not return anything

  [Regression Potential]
  Since the flows being deleted belong to deleted ports, their deletion
is not expected to have a noticeable impact. But, as this bug describes,
their existence could have an unexpected impact on ports that have a
security group that happens to share the same conjunction id.

  -

  When a SG is no longer used by any port in the OVS agent, it is marked
  to be deleted. This deletion process is done in [1].

  The SG deletion process consists of removing any reference to this SG
  from the firewall and the SG port map. The firewall removes this SG in
  [2].

  The information of a SG is stored in:
  - ConjIPFlowManager.conj_id_map = ConjIdMap(). This class stores the
  conjunction IDs (conj_ids) in a dictionary using the following keys:
    ConjIdMap.id_map[(sg_id, remote_sg_id, direction, ethertype, conj_ids)] = conj_id_XXX

  - ConjIPFlowManager.conj_ids is a nested dictionary, built in the
  following way:
    self.conj_ids[vlan_tag][(direction, ethertype)][remote_sg_id] = set([conj_id_1, conj_id_2, ...])

  When a SG is removed, its references should be deleted both from
  "conj_id_map" and "conj_ids". They are correctly removed from
  "conj_id_map" in [3], but they are not deleted properly from
  "conj_ids". Instead of the current logic, we should walk through the
  nested dictionary and remove any entry whose "remote_sg_id" equals the
  SG ID being removed (see the sketch below).

  The current implementation leaves some "remote_sg_id" entries behind in
  the nested dictionary "conj_ids". That could cause:
  - A memory leak in the OVS agent, which keeps those unneeded remote SGs
  in memory.
  - An increase in the complexity of the OVS rules, adding those unused
  SGs (actually the conj_ids related to those SGs).
  - A security breach between SGs if the conj_ids left behind by an
  unused SG are deleted and reused again (the FW stores the unused
  conj_ids to be recycled in later rules).
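
  A minimal sketch of that walk (simplified to the dictionary shape
  described above; not the actual firewall code):

  def sg_removed(conj_ids, sg_id):
      """Drop every conj_ids entry whose remote SG is the deleted SG."""
      for by_dir_eth in conj_ids.values():        # keyed by vlan_tag
          for remote_map in by_dir_eth.values():  # keyed by (direction, ethertype)
              remote_map.pop(sg_id, None)         # sg_id acting as remote_sg_id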

  
[1]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L731
  
[2]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L399
  
[3]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L296

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1881157/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1881157] Re: [OVS][FW] Remote SG IDs left behind when a SG is removed

2020-09-22 Thread Edward Hope-Morley
** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: neutron (Ubuntu Groovy)
   Status: New => Fix Committed

** Changed in: cloud-archive/victoria
   Status: Fix Released => Fix Committed

** Changed in: neutron (Ubuntu Focal)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1881157

Title:
  [OVS][FW] Remote SG IDs left behind when a SG is removed

Status in Ubuntu Cloud Archive:
  Fix Committed
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in neutron:
  New
Status in neutron package in Ubuntu:
  Fix Committed
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Groovy:
  Fix Committed

Bug description:
  When a SG is no longer used by any port in the OVS agent, it is marked
  to be deleted. This deletion process is done in [1].

  The SG deletion process consists of removing any reference to this SG
  from the firewall and the SG port map. The firewall removes this SG in
  [2].

  The information of a SG is stored in:
  - ConjIPFlowManager.conj_id_map = ConjIdMap(). This class stores the
  conjunction IDs (conj_ids) in a dictionary using the following keys:
    ConjIdMap.id_map[(sg_id, remote_sg_id, direction, ethertype, conj_ids)] = conj_id_XXX

  - ConjIPFlowManager.conj_ids is a nested dictionary, built in the
  following way:
    self.conj_ids[vlan_tag][(direction, ethertype)][remote_sg_id] = set([conj_id_1, conj_id_2, ...])

  When a SG is removed, its references should be deleted both from
  "conj_id_map" and "conj_ids". They are correctly removed from
  "conj_id_map" in [3], but they are not deleted properly from
  "conj_ids". Instead of the current logic, we should walk through the
  nested dictionary and remove any entry whose "remote_sg_id" equals the
  SG ID being removed.

  The current implementation leaves some "remote_sg_id" entries behind in
  the nested dictionary "conj_ids". That could cause:
  - A memory leak in the OVS agent, which keeps those unneeded remote SGs
  in memory.
  - An increase in the complexity of the OVS rules, adding those unused
  SGs (actually the conj_ids related to those SGs).
  - A security breach between SGs if the conj_ids left behind by an
  unused SG are deleted and reused again (the FW stores the unused
  conj_ids to be recycled in later rules).

  
  
[1]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L731
  
[2]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L399
  
[3]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L296

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1881157/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1881157] Re: [OVS][FW] Remote SG IDs left behind when a SG is removed

2020-09-21 Thread Edward Hope-Morley
As things stand, this is already available in Ussuri uca (16.2.0), and
will be available in the upcoming 15.2.0 Train uca point release. Stein
and Queens have no existing upstream tag that contains the fix so will
require an SRU.

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/ussuri
   Status: New => Fix Released

** Changed in: cloud-archive/victoria
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1881157

Title:
  [OVS][FW] Remote SG IDs left behind when a SG is removed

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in neutron:
  New

Bug description:
  When a SG is no longer used by any port in the OVS agent, it is marked
  to be deleted. This deletion process is done in [1].

  The SG deletion process consists of removing any reference to this SG
  from the firewall and the SG port map. The firewall removes this SG in
  [2].

  The information of a SG is stored in:
  - ConjIPFlowManager.conj_id_map = ConjIdMap(). This class stores the
  conjunction IDs (conj_ids) in a dictionary using the following keys:
    ConjIdMap.id_map[(sg_id, remote_sg_id, direction, ethertype, conj_ids)] = conj_id_XXX

  - ConjIPFlowManager.conj_ids is a nested dictionary, built in the
  following way:
    self.conj_ids[vlan_tag][(direction, ethertype)][remote_sg_id] = set([conj_id_1, conj_id_2, ...])

  When a SG is removed, its references should be deleted both from
  "conj_id_map" and "conj_ids". They are correctly removed from
  "conj_id_map" in [3], but they are not deleted properly from
  "conj_ids". Instead of the current logic, we should walk through the
  nested dictionary and remove any entry whose "remote_sg_id" equals the
  SG ID being removed.

  The current implementation leaves some "remote_sg_id" entries behind in
  the nested dictionary "conj_ids". That could cause:
  - A memory leak in the OVS agent, which keeps those unneeded remote SGs
  in memory.
  - An increase in the complexity of the OVS rules, adding those unused
  SGs (actually the conj_ids related to those SGs).
  - A security breach between SGs if the conj_ids left behind by an
  unused SG are deleted and reused again (the FW stores the unused
  conj_ids to be recycled in later rules).

  
  
[1]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L731
  
[2]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L399
  
[3]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L296

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1881157/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1881157] Re: [OVS][FW] Remote SG IDs left behind when a SG is removed

2020-09-17 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1881157

Title:
  [OVS][FW] Remote SG IDs left behind when a SG is removed

Status in Ubuntu Cloud Archive:
  New
Status in neutron:
  New

Bug description:
  When a SG is no longer used by any port in the OVS agent, it is marked
  to be deleted. This deletion process is done in [1].

  The SG deletion process consists of removing any reference to this SG
  from the firewall and the SG port map. The firewall removes this SG in
  [2].

  The information of a SG is stored in:
  - ConjIPFlowManager.conj_id_map = ConjIdMap(). This class stores the
  conjunction IDs (conj_ids) in a dictionary using the following keys:
    ConjIdMap.id_map[(sg_id, remote_sg_id, direction, ethertype, conj_ids)] = conj_id_XXX

  - ConjIPFlowManager.conj_ids is a nested dictionary, built in the
  following way:
    self.conj_ids[vlan_tag][(direction, ethertype)][remote_sg_id] = set([conj_id_1, conj_id_2, ...])

  When a SG is removed, its references should be deleted both from
  "conj_id_map" and "conj_ids". They are correctly removed from
  "conj_id_map" in [3], but they are not deleted properly from
  "conj_ids". Instead of the current logic, we should walk through the
  nested dictionary and remove any entry whose "remote_sg_id" equals the
  SG ID being removed.

  The current implementation leaves some "remote_sg_id" entries behind in
  the nested dictionary "conj_ids". That could cause:
  - A memory leak in the OVS agent, which keeps those unneeded remote SGs
  in memory.
  - An increase in the complexity of the OVS rules, adding those unused
  SGs (actually the conj_ids related to those SGs).
  - A security breach between SGs if the conj_ids left behind by an
  unused SG are deleted and reused again (the FW stores the unused
  conj_ids to be recycled in later rules).

  
  
[1]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L731
  
[2]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L399
  
[3]https://github.com/openstack/neutron/blob/118930f03d31f157f8c7a9e6c57122ecea8982b9/neutron/agent/linux/openvswitch_firewall/firewall.py#L296

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1881157/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1891673] Re: qrouter ns ip rules not deleted when fip removed from vm

2020-09-09 Thread Edward Hope-Morley
** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Groovy)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Patch added: "lp1891673-focal-ussuri.debdiff"
   
https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1891673/+attachment/5408960/+files/lp1891673-focal-ussuri.debdiff

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1891673

Title:
  qrouter ns ip rules not deleted when fip removed from vm

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  In Progress
Status in Ubuntu Cloud Archive rocky series:
  In Progress
Status in Ubuntu Cloud Archive stein series:
  In Progress
Status in Ubuntu Cloud Archive train series:
  In Progress
Status in Ubuntu Cloud Archive ussuri series:
  In Progress
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New
Status in neutron source package in Bionic:
  New
Status in neutron source package in Focal:
  New
Status in neutron source package in Groovy:
  New

Bug description:
  [Impact]

  neutron-l3-agent restart causes partial loss of fip information such
  that fip removal from vm results in ip rules left behind which breaks
  external network access for that vm.

  [Test Case]

  * deploy openstack with dvr enabled
  * create distributed router, network etc
  * create a vm and attach a floating ip
  * go to compute host on which vm is running and restart neutron-l3-agent
  * tail -f /var/log/neutron/neutron-l3-agent.log until it settles
  * remove fip from vm
  * run https://gist.github.com/dosaboy/eca8dcd4560f68d856f465ca8382c58b on 
that compute node
  * should return with "nothing to do"

  [Regression Potential]
  none expected

  [Other Info]
  The patched neutron l3 agent will reload info for *used* floating ips
when restarted, BUT if there are ip rules left behind from fips removed
prior to using a patched neutron then manual cleanup is still required,
and for that you can use
https://gist.github.com/dosaboy/eca8dcd4560f68d856f465ca8382c58b.
   
  --

  With Bionic Stein using dvr_snat, if I add a floating ip to a vm and
  then remove it, the corresponding ip rules in the associated qrouter
  ns local to the instance are not deleted, which results in no longer
  being able to reach the external network because packets are still
  sent to the fip namespace (via rfp-/fpr-). E.g. on my compute host
  running a vm whose address is 192.168.21.28, for which I have removed
  the fip, I still see:

  # ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip rule list
  0:  from all lookup local
  32765:  from 192.168.21.28 lookup 16
  32766:  from all lookup main
  32767:  from all lookup default
  3232240897: from 192.168.21.1/24 lookup 3232240897
  3232241231: from 192.168.22.79/24 lookup 3232241231

  And table 16 leads to:

  # ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip route show 
table 16
  default via 169.254.109.249 dev rfp-5e45608f-3

  Which results in the instance no longer being able to reach the
  external network (packets are never sent to the snat- ns in my case).

  The workaround is to delete that ip rule but neutron should be taking
  care of this. Looks like the culprit is in
  neutron/agent/l3/dvr_local_router.py:floating_ip_removed_dist

  Note that the NAT rules were successfully removed from iptables so
  looks like it is just this bit that is left behind.
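
  For the example above, the manual cleanup amounts to (using the
  addresses shown; adjust to your router namespace and fixed ip):

  # ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 \
        ip rule del from 192.168.21.28 lookup 16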

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1891673/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1891673] Re: qrouter ns ip rules not deleted when fip removed from vm

2020-09-07 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/victoria
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1891673

Title:
  qrouter ns ip rules not deleted when fip removed from vm

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in Ubuntu Cloud Archive victoria series:
  New
Status in neutron:
  In Progress

Bug description:
  With Bionic Stein using dvr_snat, if I add a floating ip to a vm and
  then remove it, the corresponding ip rules in the associated qrouter
  ns local to the instance are not deleted, which results in no longer
  being able to reach the external network because packets are still
  sent to the fip namespace (via rfp-/fpr-). E.g. on my compute host
  running a vm whose address is 192.168.21.28, for which I have removed
  the fip, I still see:

  # ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip rule list
  0:  from all lookup local 
  32765:  from 192.168.21.28 lookup 16 
  32766:  from all lookup main 
  32767:  from all lookup default 
  3232240897: from 192.168.21.1/24 lookup 3232240897 
  3232241231: from 192.168.22.79/24 lookup 3232241231

  And table 16 leads to:

  # ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip route show 
table 16
  default via 169.254.109.249 dev rfp-5e45608f-3

  Which results in the instance no longer being able to reach the
  external network (packets are never sent to the snat- ns in my case).

  The workaround is to delete that ip rule but neutron should be taking
  care of this. Looks like the culprit is in
  neutron/agent/l3/dvr_local_router.py:floating_ip_removed_dist

  Note that the NAT rules were successfully removed from iptables so
  looks like it is just this bit that is left behind.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1891673/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1892200] [NEW] Make keepalived healthcheck more configurable

2020-08-19 Thread Edward Hope-Morley
Public bug reported:

Since the Newton release, users of HA routers have had a keepalived
healthcheck that fails if it doesn't get a response to a single ping or
if the expected tenant network address is not configured in the local
namespace being watched. While this works for most cases where an
environment is stable it appears to produce a lot of instability as soon
as an environment gets loaded or a node fails and transitions/failovers
occur. An example of this appears to be where transitions of the MASTER
to a new node take a little longer than they should. For example we have
seen in the field that under heavy load a node can, for a very short
period of time, have the external network address that keepalived is
tracking be configured on two interfaces/hosts at once and while neutron
is still doing its garp updates it is possible that a ping from the new
master router can fail to get a response for 50% of requests since the
switch may still send the reply to either the new master or the old one.

In order to avoid transient problems like this from causing further
instability we would like to be able to make the healthcheck a little
more tolerant of transient issues. Currently the healthcheck script is
generated by Neutron for each router and its contents are not
configurable. It would be great to be able to change e.g. the number of
pings that it will do before declaring a failure.
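
For reference, the generated check is roughly of this shape (a hedged
sketch, details vary by release; the interface name and addresses below
are hypothetical), and the ask is to make the hard-coded single ping
configurable:

#!/bin/bash -eu
# per-router healthcheck script written by the l3 agent (sketch)
ip addr show dev qg-hypothetical | grep 203.0.113.10 || exit 0
ping -c 1 -w 1 203.0.113.1 >/dev/null 2>&1 || exit 1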

** Affects: neutron
 Importance: Undecided
 Status: New

** Description changed:

- Since the Newton release we have had a keepalived healthcheck that fails
- if it doesn't get a response to a single ping or if the expected tenant
- network address is not configured in the local namespace being watched.
- While this works for most cases where an environment is stable it
- appears to produce a lot of instability as soon as an environment gets
- loaded or a node fails and transitions/failovers occur. An example of
- this appears to be where transitions of the MASTER to a new node take a
- little longer than they should. For example we have seen in the field
- that under heavy load a node can, for a very short period of time, have
- the external network address that keepalived is tracking be configured
- on two interfaces/hosts at once and while neutron is still doing its
- garp updates it is possible that a ping from the new master router can
- fail to get a response for 50% of requests since the switch may still
- send the reply to either the new master or the old one.
+ Since the Newton release, users of HA routers have had a keepalived
+ healthcheck that fails if it doesn't get a response to a single ping or
+ if the expected tenant network address is not configured in the local
+ namespace being watched. While this works for most cases where an
+ environment is stable it appears to produce a lot of instability as soon
+ as an environment gets loaded or a node fails and transitions/failovers
+ occur. An example of this appears to be where transitions of the MASTER
+ to a new node take a little longer than they should. For example we have
+ seen in the field that under heavy load a node can, for a very short
+ period of time, have the external network address that keepalived is
+ tracking be configured on two interfaces/hosts at once and while neutron
+ is still doing its garp updates it is possible that a ping from the new
+ master router can fail to get a response for 50% of requests since the
+ switch may still send the reply to either the new master or the old one.
  
  In order to avoid transient problems like this from causing further
  instability we would like to be able to make the healthcheck a little
  more tolerant of transient issues. Currently the healthcheck script is
  generated by Neutron for each router and its contents are not
  configurable. It would be great to be able to change e.g. the number of
  pings that it will do before declaring a failure.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1892200

Title:
  Make keepalived healthcheck more configurable

Status in neutron:
  New

Bug description:
  Since the Newton release, users of HA routers have had a keepalived
  healthcheck that fails if it doesn't get a response to a single ping
  or if the expected tenant network address is not configured in the
  local namespace being watched. While this works for most cases where
  an environment is stable it appears to produce a lot of instability as
  soon as an environment gets loaded or a node fails and
  transitions/failovers occur. An example of this appears to be where
  transitions of the MASTER to a new node take a little longer than they
  should. For example we have seen in the field that under heavy load a
  node can, for a very short period of time, have the external network
  address that keepalived is tracking be configured on two
  interfaces/hosts at once and while 

[Yahoo-eng-team] [Bug 1891673] [NEW] qrouter ns ip rules not deleted when fip removed from vm

2020-08-14 Thread Edward Hope-Morley
Public bug reported:

With Bionic Stein using dvr_snat, if I add a floating ip to a vm and
then remove it, the corresponding ip rules in the associated qrouter ns
local to the instance are not deleted, which results in no longer being
able to reach the external network because packets are still sent to the
fip namespace (via rfp-/fpr-). E.g. on my compute host running a vm
whose address is 192.168.21.28, for which I have removed the fip, I
still see:

# ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip rule list
0:  from all lookup local 
32765:  from 192.168.21.28 lookup 16 
32766:  from all lookup main 
32767:  from all lookup default 
3232240897: from 192.168.21.1/24 lookup 3232240897 
3232241231: from 192.168.22.79/24 lookup 3232241231

And table 16 leads to:

# ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip route show 
table 16
default via 169.254.109.249 dev rfp-5e45608f-3

Which results in the instance no longer being able to reach the external
network (packets are never sent to the snat- ns in my case).

The workaround is to delete that ip rule but neutron should be taking
care of this. Looks like the culprit is in
neutron/agent/l3/dvr_local_router.py:floating_ip_removed_dist

Note that the NAT rules were successfully removed from iptables so looks
like it is just this bit that is left behind.

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: sts

** Tags added: sts

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1891673

Title:
  qrouter ns ip rules not deleted when fip removed from vm

Status in neutron:
  New

Bug description:
  With Bionic Stein using dvr_snat, if I add a floating ip to a vm and
  then remove it, the corresponding ip rules in the associated qrouter
  ns local to the instance are not deleted, which results in no longer
  being able to reach the external network because packets are still
  sent to the fip namespace (via rfp-/fpr-). E.g. on my compute host
  running a vm whose address is 192.168.21.28, for which I have removed
  the fip, I still see:

  # ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip rule list
  0:  from all lookup local 
  32765:  from 192.168.21.28 lookup 16 
  32766:  from all lookup main 
  32767:  from all lookup default 
  3232240897: from 192.168.21.1/24 lookup 3232240897 
  3232241231: from 192.168.22.79/24 lookup 3232241231

  And table 16 leads to:

  # ip netns exec qrouter-5e45608f-33d4-41bf-b3ba-915adf612e65 ip route show 
table 16
  default via 169.254.109.249 dev rfp-5e45608f-3

  Which results in the instance no longer being able to reach the
  external network (packets are never sent to the snat- ns in my case).

  The workaround is to delete that ip rule but neutron should be taking
  care of this. Looks like the culprit is in
  neutron/agent/l3/dvr_local_router.py:floating_ip_removed_dist

  Note that the NAT rules were successfully removed from iptables so
  looks like it is just this bit that is left behind.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1891673/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1852221] Re: ovs-vswitchd needs to be forced to reconfigure after adding protocols to bridges

2020-06-02 Thread Edward Hope-Morley
** No longer affects: openvswitch (Ubuntu Focal)

** No longer affects: openvswitch (Ubuntu Eoan)

** Also affects: neutron (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: openvswitch (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: openvswitch (Ubuntu Eoan)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Eoan)
   Importance: Undecided
   Status: New

** Changed in: neutron (Ubuntu Focal)
   Status: New => Fix Released

** No longer affects: openvswitch (Ubuntu Eoan)

** No longer affects: openvswitch (Ubuntu Focal)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1852221

Title:
  ovs-vswitchd needs to be forced to reconfigure after adding protocols
  to bridges

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in kolla-ansible:
  New
Status in neutron:
  New
Status in openvswitch:
  New
Status in neutron package in Ubuntu:
  New
Status in openvswitch package in Ubuntu:
  Confirmed
Status in neutron source package in Eoan:
  New
Status in neutron source package in Focal:
  Fix Released

Bug description:
  [Impact]
  When the neutron native ovs driver creates bridges it will sometimes
apply/modify the supported openflow protocols on that bridge. The Open
vSwitch versions shipped with Train and Ussuri don't support this, which
results in OF protocol mismatches when neutron performs operations on
that bridge. The patch we are backporting here ensures that all protocol
versions are set on the bridge at the point of create/init.

  [Test Case]
   * deploy Openstack Train
   * go to a compute host and do: sudo ovs-ofctl -O OpenFlow14 dump-flows br-int
   * ensure you do not see "negotiation failed" errors

  [Regression Potential]
   * this patch ensures that newly created Neutron ovs bridges have
OpenFlow 1.0, 1.3 and 1.4 set on them. Neutron already supports these,
so no change in behaviour is expected. The patch will not impact bridges
that already exist (so it will not fix them either if they are affected).

  --

  As part of programming Open vSwitch, Neutron will add to the set of
  protocols that bridges support [0].

  However, the Open vSwitch `ovs-vswitchd` process does not appear to
  always update its perspective of which protocol versions it should
  support for bridges:

  # ovs-ofctl -O OpenFlow14 dump-flows br-int
  2019-11-12T12:52:56Z|1|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: 
version negotiation failed (we support version 0x05, peer supports version 0x01)
  ovs-ofctl: br-int: failed to connect to socket (Broken pipe)

  # systemctl restart ovsdb-server
  # ovs-ofctl -O OpenFlow14 dump-flows br-int
   cookie=0x84ead4b79da3289a, duration=1.576s, table=0, n_packets=0, n_bytes=0, 
priority=65535,vlan_tci=0x0fff/0x1fff actions=drop
   cookie=0x84ead4b79da3289a, duration=1.352s, table=0, n_packets=0, n_bytes=0, 
priority=5,in_port="int-br-ex",dl_dst=fa:16:3f:69:2e:c6 actions=goto_table:4
  ...
  (Success)

  The restart of the `ovsdb-server` process above will make `ovs-
  vswitchd` reassess its configuration.

  0:
  
https://github.com/openstack/neutron/blob/0fa7e74ebb386b178d36ae684ff04f03bdd6cb0d/neutron/agent/common/ovs_lib.py#L281
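
  For reference, the protocol list the backport applies is equivalent to
  the following, and each version can be verified individually:

  # ovs-vsctl set bridge br-int protocols=OpenFlow10,OpenFlow13,OpenFlow14
  # for v in OpenFlow10 OpenFlow13 OpenFlow14; do
  >   ovs-ofctl -O $v dump-flows br-int >/dev/null && echo "$v OK"
  > done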

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1852221/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1852221] Re: ovs-vswitchd needs to be forced to reconfigure after adding protocols to bridges

2020-05-31 Thread Edward Hope-Morley
** Description changed:

+ [Impact]
+ When the neutron native ovs driver creates bridges it will sometimes 
apply/modify the supported openflow protocols on that bridge. The OpenVswitch 
versions shipped with Train and Ussuri don't support this which results in OF 
protocol mismatches when neutron performs operations on that bridge. The patch 
we are backporting here ensures that all protocol versions are set on the 
bridge at the point on create/init.
+ 
+ [Test Case]
+  * deploy Openstack Train
+  * go to a compute host and do: sudo ovs-ofctl -O OpenFlow14 dump-flows br-int
+  * ensure you do not see "negotiation failed" errors
+ 
+ [Regression Potential]
+  * this patch is ensuring that newly created Neutron ovs bridges have 
OpenFlow 1.0, 1.3 and 1.4 set on them. Neutron already supports these so is not 
expected to have any change in behaviour. The patch will not impact bridges 
that already exist (so will not fix them either if they are affected).
+ 
+ --
+ 
  As part of programming OpenvSwitch, Neutron will add to which protocols
  bridges support [0].
  
  However, the Open vSwitch `ovs-vswitchd` process does not appear to
  always update its perspective of which protocol versions it should
  support for bridges:
  
  # ovs-ofctl -O OpenFlow14 dump-flows br-int
  2019-11-12T12:52:56Z|1|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: 
version negotiation failed (we support version 0x05, peer supports version 0x01)
  ovs-ofctl: br-int: failed to connect to socket (Broken pipe)
  
  # systemctl restart ovsdb-server
  # ovs-ofctl -O OpenFlow14 dump-flows br-int
   cookie=0x84ead4b79da3289a, duration=1.576s, table=0, n_packets=0, n_bytes=0, 
priority=65535,vlan_tci=0x0fff/0x1fff actions=drop
   cookie=0x84ead4b79da3289a, duration=1.352s, table=0, n_packets=0, n_bytes=0, 
priority=5,in_port="int-br-ex",dl_dst=fa:16:3f:69:2e:c6 actions=goto_table:4
  ...
  (Success)
  
  The restart of the `ovsdb-server` process above will make `ovs-vswitchd`
  reassess its configuration.
  
- 
- 0: 
https://github.com/openstack/neutron/blob/0fa7e74ebb386b178d36ae684ff04f03bdd6cb0d/neutron/agent/common/ovs_lib.py#L281
+ 0:
+ 
https://github.com/openstack/neutron/blob/0fa7e74ebb386b178d36ae684ff04f03bdd6cb0d/neutron/agent/common/ovs_lib.py#L281

** Patch added: "lp1852221-eoan-train.debdiff"
   
https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1852221/+attachment/5379057/+files/lp1852221-eoan-train.debdiff

** Also affects: openvswitch (Ubuntu Eoan)
   Importance: Undecided
   Status: New

** Also affects: openvswitch (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Changed in: openvswitch (Ubuntu Focal)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1852221

Title:
  ovs-vswitchd needs to be forced to reconfigure after adding protocols
  to bridges

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in kolla-ansible:
  New
Status in neutron:
  New
Status in openvswitch:
  New
Status in openvswitch package in Ubuntu:
  Confirmed
Status in openvswitch source package in Eoan:
  New
Status in openvswitch source package in Focal:
  Fix Released

Bug description:
  [Impact]
  When the neutron native ovs driver creates bridges it will sometimes
apply/modify the supported openflow protocols on that bridge. The Open
vSwitch versions shipped with Train and Ussuri don't support this, which
results in OF protocol mismatches when neutron performs operations on
that bridge. The patch we are backporting here ensures that all protocol
versions are set on the bridge at the point of create/init.

  [Test Case]
   * deploy Openstack Train
   * go to a compute host and do: sudo ovs-ofctl -O OpenFlow14 dump-flows br-int
   * ensure you do not see "negotiation failed" errors

  [Regression Potential]
   * this patch ensures that newly created Neutron ovs bridges have
OpenFlow 1.0, 1.3 and 1.4 set on them. Neutron already supports these,
so no change in behaviour is expected. The patch will not impact bridges
that already exist (so it will not fix them either if they are affected).

  --

  As part of programming Open vSwitch, Neutron will add to the set of
  protocols that bridges support [0].

  However, the Open vSwitch `ovs-vswitchd` process does not appear to
  always update its perspective of which protocol versions it should
  support for bridges:

  # ovs-ofctl -O OpenFlow14 dump-flows br-int
  2019-11-12T12:52:56Z|1|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: 
version negotiation failed (we support version 0x05, peer supports 

[Yahoo-eng-team] [Bug 1852221] Re: ovs-vswitchd needs to be forced to reconfigure after adding protocols to bridges

2020-05-31 Thread Edward Hope-Morley
** Changed in: cloud-archive/ussuri
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1852221

Title:
  ovs-vswitchd needs to be forced to reconfigure after adding protocols
  to bridges

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in kolla-ansible:
  New
Status in neutron:
  New
Status in openvswitch:
  New
Status in openvswitch package in Ubuntu:
  Confirmed

Bug description:
  As part of programming Open vSwitch, Neutron will add to the set of
  protocols that bridges support [0].

  However, the Open vSwitch `ovs-vswitchd` process does not appear to
  always update its perspective of which protocol versions it should
  support for bridges:

  # ovs-ofctl -O OpenFlow14 dump-flows br-int
  2019-11-12T12:52:56Z|1|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: 
version negotiation failed (we support version 0x05, peer supports version 0x01)
  ovs-ofctl: br-int: failed to connect to socket (Broken pipe)

  # systemctl restart ovsdb-server
  # ovs-ofctl -O OpenFlow14 dump-flows br-int
   cookie=0x84ead4b79da3289a, duration=1.576s, table=0, n_packets=0, n_bytes=0, 
priority=65535,vlan_tci=0x0fff/0x1fff actions=drop
   cookie=0x84ead4b79da3289a, duration=1.352s, table=0, n_packets=0, n_bytes=0, 
priority=5,in_port="int-br-ex",dl_dst=fa:16:3f:69:2e:c6 actions=goto_table:4
  ...
  (Success)

  The restart of the `ovsdb-server` process above will make `ovs-
  vswitchd` reassess its configuration.

  
  0: 
https://github.com/openstack/neutron/blob/0fa7e74ebb386b178d36ae684ff04f03bdd6cb0d/neutron/agent/common/ovs_lib.py#L281

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1852221/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1852221] Re: ovs-vswitchd needs to be forced to reconfigure after adding protocols to bridges

2020-05-31 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ussuri
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1852221

Title:
  ovs-vswitchd needs to be forced to reconfigure after adding protocols
  to bridges

Status in OpenStack neutron-openvswitch charm:
  Invalid
Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in Ubuntu Cloud Archive ussuri series:
  New
Status in kolla-ansible:
  New
Status in neutron:
  New
Status in openvswitch:
  New
Status in openvswitch package in Ubuntu:
  Confirmed

Bug description:
  As part of programming Open vSwitch, Neutron will add to the set of
  protocols that bridges support [0].

  However, the Open vSwitch `ovs-vswitchd` process does not appear to
  always update its perspective of which protocol versions it should
  support for bridges:

  # ovs-ofctl -O OpenFlow14 dump-flows br-int
  2019-11-12T12:52:56Z|1|vconn|WARN|unix:/var/run/openvswitch/br-int.mgmt: 
version negotiation failed (we support version 0x05, peer supports version 0x01)
  ovs-ofctl: br-int: failed to connect to socket (Broken pipe)

  # systemctl restart ovsdb-server
  # ovs-ofctl -O OpenFlow14 dump-flows br-int
   cookie=0x84ead4b79da3289a, duration=1.576s, table=0, n_packets=0, n_bytes=0, 
priority=65535,vlan_tci=0x0fff/0x1fff actions=drop
   cookie=0x84ead4b79da3289a, duration=1.352s, table=0, n_packets=0, n_bytes=0, 
priority=5,in_port="int-br-ex",dl_dst=fa:16:3f:69:2e:c6 actions=goto_table:4
  ...
  (Success)

  The restart of the `ovsdb-server` process above will make `ovs-
  vswitchd` reassess its configuration.

  
  0: 
https://github.com/openstack/neutron/blob/0fa7e74ebb386b178d36ae684ff04f03bdd6cb0d/neutron/agent/common/ovs_lib.py#L281

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1852221/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1662324] Re: linux bridge agent disables ipv6 before adding an ipv6 address

2020-05-26 Thread Edward Hope-Morley
** Also affects: cloud-archive/mitaka
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1662324

Title:
  linux bridge agent disables ipv6 before adding an ipv6 address

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  In Progress

Bug description:
  [Impact]
  When using linuxbridge, after creating a network and an interface on
ext-net, disable_ipv6 is 1, so the linuxbridge-agent doesn't add an ipv6
address properly to the newly created bridge.

  [Test Case]

  1. deploy basic mitaka env
  2. create external network(ext-net)
  3. create ipv6 network and interface to ext-net
  4. check if related bridge has ipv6 ip
  - no ipv6 originally
  or
  - cat /proc/sys/net/ipv6/conf/[BRIDGE]/disable_ipv6

  After this commit, I was able to see the ipv6 address properly.
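
  On a healthy bridge the check from step 4 looks like this (a hedged
  example; the bridge name is taken from the agent log below):

  # cat /proc/sys/net/ipv6/conf/brqe1623c94-1f/disable_ipv6
  0
  # ip -6 addr show dev brqe1623c94-1f
  ...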

  [Regression]
  You need to restart neutron-linuxbridge-agent, so a short downtime may
be needed.

  [Others]

  -- original description --

  Summary:
  
  I have a dual-stack NIC with only an IPv6 SLAAC and link local address 
plumbed. This is the designated provider network nic. When I create a network 
and then a subnet, the linux bridge agent first disables IPv6 on the bridge and 
then tries to add the IPv6 address from the NIC to the bridge. Since IPv6 was 
disabled on the bridge, this fails with 'RTNETLINK answers: Permission denied'. 
My intent was to create an IPv4 subnet over this interface with floating IPv4 
addresses for assignment to VMs via this command:
    openstack subnet create --network provider \
  --allocation-pool start=10.54.204.200,end=10.54.204.217 \
  --dns-nameserver 69.252.80.80 --dns-nameserver 69.252.81.81 \
  --gateway 10.54.204.129 --subnet-range 10.54.204.128/25 provider

  I don't know why the agent is disabling IPv6 (I wish it wouldn't),
  that's probably the problem. However, if the agent knows to disable
  IPv6 it should also know not to try to add an IPv6 address.

  Details:
  
  Version: Newton on CentOS 7.3 minimal (CentOS-7-x86_64-Minimal-1611.iso) as 
per these instructions: http://docs.openstack.org/newton/install-guide-rdo/

  Seemingly relevant section of /var/log/neutron/linuxbridge-agent.log:
  2017-02-06 15:09:20.863 1551 INFO 
neutron.plugins.ml2.drivers.linuxbridge.agent.arp_protect 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Skipping ARP spoofing 
rules for port 'tap3679987e-ce' because it has port security disabled
  2017-02-06 15:09:20.863 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Running command: ['ip', 
'-o', 'link', 'show', 'tap3679987e-ce'] create_process 
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
  2017-02-06 15:09:20.870 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Exit code: 0 execute 
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:146
  2017-02-06 15:09:20.871 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Running command: ['ip', 
'addr', 'show', 'eno1', 'scope', 'global'] create_process 
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
  2017-02-06 15:09:20.878 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Exit code: 0 execute 
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:146
  2017-02-06 15:09:20.879 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Running command: ['ip', 
'route', 'list', 'dev', 'eno1', 'scope', 'global'] create_process 
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
  2017-02-06 15:09:20.885 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Exit code: 0 execute 
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:146
  2017-02-06 15:09:20.886 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Running command (rootwrap 
daemon): ['ip', 'link', 'set', 'brqe1623c94-1f', 'up'] execute_rootwrap_daemon 
/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:105
  2017-02-06 15:09:20.895 1551 DEBUG 
neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Starting bridge 
brqe1623c94-1f for subinterface eno1 ensure_bridge 
/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:367
  2017-02-06 15:09:20.895 1551 DEBUG neutron.agent.linux.utils 
[req-4917c507-369e-4a36-a381-e8b287cbc988 - - - - -] Running command (rootwrap 
daemon): ['brctl', 'addbr', 'brqe1623c94-1f'] execute_rootwrap_daemon 

[Yahoo-eng-team] [Bug 1826114] Re: Errors creating users and projects

2020-01-07 Thread Edward Hope-Morley
** Also affects: horizon (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Disco)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Focal)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Eoan)
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1826114

Title:
  Errors creating users and projects

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  New
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  New
Status in horizon source package in Disco:
  New
Status in horizon source package in Eoan:
  New
Status in horizon source package in Focal:
  New

Bug description:
  After a fresh install of OpenStack Stein using Juju Charms, I'm getting
  errors and items not appearing when creating new users and projects.
  Here are several tests I made:

  When creating ONLY a new user in a domain different than admin_domain,
  the user is created successfully but is not shown in the user list in
  the GUI, even after a refresh/logout. Using the CLI shows that it was
  created successfully.

  When creating ONLY a project in a domain different than admin_domain,
  the project is created successfully but is not shown in the project
  list in the GUI, even after a refresh/logout. Using the CLI shows that
  it was created successfully.

  When creating a user in a domain different than admin_domain, the
  project list in the form is empty.

  When creating a user AND a project in any domain (using the "+" icon
  in the user creation form), after creating the project the GUI hangs
  on a "Working..." spinning wheel and the following error appears on
  the web console:

  Uncaught SyntaxError: Unexpected token < in JSON at position 53
  at JSON.parse ()
  at Function.jQuery.parseJSON (439e79e74c16.js:567)
  at Function.jQuery.parseJSON (439e79e74c16.js:720)
  at processServerSuccess (c1b8ea0c3a19.js:80)
  at Object.success (c1b8ea0c3a19.js:84)
  at fire (439e79e74c16.js:210)
  at Object.fireWith [as resolveWith] (439e79e74c16.js:216)
  at done (439e79e74c16.js:626)
  at XMLHttpRequest.callback (439e79e74c16.js:653)

  After a refresh, the project is created but not the user.

  When creating a new user and a project separately in admin_domain
  everything works OK. After that I can see both and assign the project
  to the user.

  In all the pages I load from Horizon I get the following warning on
  the web console:

  439e79e74c16.js:700 JQMIGRATE: $(html) HTML strings must start with '<' 
character
  migrateWarn @ 439e79e74c16.js:700
  jQuery.fn.init @ 439e79e74c16.js:714
  jQuery @ 439e79e74c16.js:15
  success @ c1b8ea0c3a19.js:204
  fire @ 439e79e74c16.js:210
  fireWith @ 439e79e74c16.js:216
  done @ 439e79e74c16.js:626
  callback @ 439e79e74c16.js:653
  439e79e74c16.js:700 console.trace
  migrateWarn @ 439e79e74c16.js:700
  jQuery.fn.init @ 439e79e74c16.js:714
  jQuery @ 439e79e74c16.js:15
  success @ c1b8ea0c3a19.js:204
  fire @ 439e79e74c16.js:210
  fireWith @ 439e79e74c16.js:216
  done @ 439e79e74c16.js:626
  callback @ 439e79e74c16.js:653

  
  ==

  [Impact]

  New users and projects created in a domain other than admin_domain are
  not displayed in the dashboard.
  Creating a project from the user creation form hangs and never returns.

  This patch fixes both issues mentioned above.

  [Test Case]

  Tests involved 3 cases
  - Creation of Project in different domain context
  - Creation of User in different domain context
  - Creation of Project in Create User form

  1. Reproducing the issue

  1a. Login to horizon as admin user
  1b. Go to page Identity->Domains and select the "SetDomainContext"
  button for a domain other than default or admin

  1c. Go to page Identity->Projects
  1d. Click on Create Project, enter new project details and click on Finish
  1e. The page gets refreshed but the new project is not displayed

  1f. Go to page Identity->Users
  1g. Click on Create User, enter new user details and click on Create User 
button in the form
  1h. The page gets refreshed but the new user is not displayed

  1i. Go to page Identity->Users
  1j. Click on Create User and enter new user details
  1k. Click the "+" symbol under Primary project in the form to create a
  new project. The spinning wheel rotates forever.

  2. Install the package with fixed code

  3. Confirm the bug has been fixed

  3a. Repeat steps 1a-1d. New project should be displayed in the table
  3b. Repeat steps 

[Yahoo-eng-team] [Bug 1837635] Re: HA router state change from "standby" to "master" should be delayed

2019-11-11 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Tags added: sts

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1837635

Title:
  HA router state change from "standby" to "master" should be delayed

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in neutron:
  Fix Released

Bug description:
  Currently, when an HA state change occurs, the agent executes a series
  of actions [1]: it updates the metadata proxy, updates the prefix
  delegation, executes the L3 extension "ha_state_change" methods,
  updates the radvd status and notifies the server.

  When a switch-over is done in a system with more than two routers (one
  in "active" mode and the others in "standby"), the "keepalived" process
  [2] on each "standby" server will set the virtual IP on the HA
  interface and advert it. If another router's HA interface has the same
  priority (by default in Neutron, the HA instances of the same router ID
  have the same priority, 50) but a higher IP [3], the HA interface of
  this instance will have its VIPs and routes deleted and will become
  "standby" again. E.g.: [4]
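
  To illustrate the tie-break just described, here is a small sketch of
  the VRRP election rule (higher priority wins; with equal priorities the
  higher primary IP wins); this is a simplification of keepalived's
  behaviour, not its code, and the addresses are illustrative:

    import ipaddress

    def wins_election(local_prio, local_ip, peer_prio, peer_ip):
        if local_prio != peer_prio:
            return local_prio > peer_prio
        # Equal priorities (Neutron default is 50): higher IP wins.
        return ipaddress.ip_address(local_ip) > ipaddress.ip_address(peer_ip)

    # Two HA instances of the same router with the default priority:
    print(wins_election(50, '169.254.192.5', 50, '169.254.192.4'))  # True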

  In some cases, we have detected that when the master controller is
  rebooted, the change from "standby" to "master" of the other two
  servers is detected, but the change from "master" to "standby" of the
  server with the lower IP (as mentioned before) is not registered,
  because the Neutron server is still not accessible (the master
  controller was rebooted). This status change is sometimes lost. This is
  the situation where both "standby" servers become "master" but the
  "master"-to-"standby" transition of one of them is lost.

  1) INITIAL STATUS
  (overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router
  neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | id                                   | host                     | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | standby  |
  | 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | standby  |
  | edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | active   |
  +--------------------------------------+--------------------------+----------------+-------+----------+

  2) CONTROLLER 1 REBOOTED
  neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | id                                   | host                     | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------+----------------+-------+----------+
  | 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | active   |
  | 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | active   |
  | edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | standby  |
  +--------------------------------------+--------------------------+----------------+-------+----------+

  
  The aim of this bug is to make this problem public and to propose a
  patch that delays the transition from "standby" to "master", letting
  keepalived decide, among all the instances running on the HA servers,
  which one of them is the "master".

  
  [1] 
https://github.com/openstack/neutron/blob/stable/stein/neutron/agent/l3/ha.py#L115-L134
  [2] https://www.keepalived.org/
  [3] This method is used by keepalived to define which router is predominant 
and must be master.
  [4] http://paste.openstack.org/show/754760/

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1837635/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1840465] Re: Fails to list security groups if one or more exists without rules

2019-10-03 Thread Edward Hope-Morley
** Also affects: horizon (Ubuntu Eoan)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1840465

Title:
  Fails to list security groups if one or more exists without rules

Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  New
Status in horizon source package in Bionic:
  New
Status in horizon source package in Disco:
  New
Status in horizon source package in Eoan:
  New

Bug description:
  Horizon 14.0.2 (rocky)
  If a security group without any rules exists, listing security groups
  fails with a KeyError.

  Traceback (most recent call last):
File 
"/usr/share/openstack-dashboard/openstack_dashboard/api/rest/utils.py", line 
127, in _wrapped
  data = function(self, request, *args, **kw)
File 
"/usr/share/openstack-dashboard/openstack_dashboard/api/rest/network.py", line 
44, in get
  security_groups = api.neutron.security_group_list(request)
File "/usr/lib/python2.7/site-packages/horizon/utils/memoized.py", line 95, 
in wrapped
  value = cache[key] = func(*args, **kwargs)
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 1641, in security_group_list
  return SecurityGroupManager(request).list(**params)
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 372, in list
  return self._list(**params)
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 359, in _list
  return [SecurityGroup(sg) for sg in secgroups.get('security_groups')]
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 240, in __init__
  for rule in sg['security_group_rules']]
  KeyError: 'security_group_rules'
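
  A defensive fix would treat a missing 'security_group_rules' key as an
  empty list; this is a hedged sketch of that idea, not necessarily the
  exact upstream patch:

    class SecurityGroup(object):
        def __init__(self, sg):
            self.id = sg.get('id')
            # .get() avoids the KeyError when the key is absent.
            self.rules = list(sg.get('security_group_rules', []))

    groups = [{'id': 'sg-1', 'security_group_rules': []},
              {'id': 'sg-2'}]  # default rules deleted, key absent
    print([SecurityGroup(sg).rules for sg in groups])  # no KeyError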

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1840465/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1840465] Re: Fails to list security groups if one or more exists without rules

2019-10-03 Thread Edward Hope-Morley
** Also affects: horizon (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Disco)
   Importance: Undecided
   Status: New

** Also affects: horizon (Ubuntu Bionic)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1840465

Title:
  Fails to list security groups if one or more exists without rules

Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  New
Status in horizon source package in Bionic:
  New
Status in horizon source package in Disco:
  New
Status in horizon source package in Eoan:
  New

Bug description:
  Horizon 14.0.2 (rocky)
  If a security group without any rules exists, listing security groups
  fails with a KeyError.

  Traceback (most recent call last):
File 
"/usr/share/openstack-dashboard/openstack_dashboard/api/rest/utils.py", line 
127, in _wrapped
  data = function(self, request, *args, **kw)
File 
"/usr/share/openstack-dashboard/openstack_dashboard/api/rest/network.py", line 
44, in get
  security_groups = api.neutron.security_group_list(request)
File "/usr/lib/python2.7/site-packages/horizon/utils/memoized.py", line 95, 
in wrapped
  value = cache[key] = func(*args, **kwargs)
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 1641, in security_group_list
  return SecurityGroupManager(request).list(**params)
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 372, in list
  return self._list(**params)
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 359, in _list
  return [SecurityGroup(sg) for sg in secgroups.get('security_groups')]
File "/usr/share/openstack-dashboard/openstack_dashboard/api/neutron.py", 
line 240, in __init__
  for rule in sg['security_group_rules']]
  KeyError: 'security_group_rules'

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1840465/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1843056] Re: Error listing security groups after default rules deletion

2019-10-03 Thread Edward Hope-Morley
*** This bug is a duplicate of bug 1840465 ***
https://bugs.launchpad.net/bugs/1840465

This has been backported to Rocky, so it should not be hard to get it
into Queens. I will mark this as a duplicate of 1840465 and we can use
that bug to request the backport to Q.

** This bug has been marked a duplicate of bug 1840465
   Fails to list security groups if one or more exists without rules

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1843056

Title:
  Error listing security groups after default rules deletion

Status in OpenStack Dashboard (Horizon):
  Confirmed

Bug description:
  Issue occurred in Queens.

  Steps to reproduce the issue:

  - create new security group
  - delete default rules

  When no rule is present, Horizon displays an error when trying to
  list security groups. Error in apache:

  Recoverable error: 'security_group_rules'

  The empty security group must be edited or deleted via the CLI to fix
  the listing in Horizon.

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1843056/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1751923] Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

2019-08-29 Thread Edward Hope-Morley
** Changed in: nova (Ubuntu Disco)
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1751923

Title:
  _heal_instance_info_cache periodic task bases on port list from nova
  db, not from neutron server

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in nova package in Ubuntu:
  New
Status in nova source package in Bionic:
  New
Status in nova source package in Disco:
  Fix Released

Bug description:
  Description
  ===

  During the periodic task _heal_instance_info_cache, the
  instance_info_caches are not updated using instance port_ids taken
  from neutron, but from the nova DB.

  Sometimes, perhaps because of some race condition, it's possible to
  lose some ports from instance_info_caches. The periodic task
  _heal_instance_info_cache should clean this up (add missing records),
  but in fact it's not working this way.

  How it looks now?
  =

  _heal_instance_info_cache during crontask:

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525

  is using network_api to get instance_nw_info (instance_info_caches):

  try:
  # Call to network API to get instance info.. this will
  # force an update to the instance's info_cache
  self.network_api.get_instance_nw_info(context, instance)

  self.network_api.get_instance_nw_info() is listed below:

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377

  and it uses _build_network_info_model() without networks and port_ids
  parameters (because we're not adding any new interface to instance):

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356

  Next: _gather_port_ids_and_networks() generates the list of instance
  networks and port_ids:

    networks, port_ids = self._gather_port_ids_and_networks(
  context, instance, networks, port_ids, client)

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393

  As we can see, _gather_port_ids_and_networks() takes the port list
  from the DB:

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176

  And that's it. When we lose a port, it's not possible to add it back
  with this periodic task. The only way is to clear the device_id field
  on the neutron port object and re-attach the interface using `nova
  interface-attach`.
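
  A hedged sketch of the behaviour the reporter expects: ask neutron for
  the authoritative port list (filtered by device_id) instead of trusting
  the port IDs already cached in the nova DB. Authentication details are
  elided:

    from neutronclient.v2_0 import client as neutron_client

    def fresh_port_ids(instance_uuid, **auth_kwargs):
        neutron = neutron_client.Client(**auth_kwargs)
        ports = neutron.list_ports(device_id=instance_uuid)['ports']
        return [port['id'] for port in ports]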

  When the interface is missing and there is no port configured on the
  compute host (for example after a compute reboot), the interface is not
  added to the instance and, from neutron's point of view, the port state
  is DOWN.

  When the interface is missing from the cache and we hard-reboot the
  instance, it is not added as a tap interface in the XML file, so we
  don't have the network on the host.

  Steps to reproduce
  ==
  1. Spawn devstack
  2. Spawn VM inside devstack with multiple ports (for example also from 2 
different networks)
  3. Update the DB row, drop one interface from interfaces_list
  4. Hard-Reboot the instance
  5. See that nova list shows instance without one address, but nova 
interface-list shows all addresses
  6. See that one port is missing in instance xml files
  7. In theory _heal_instance_info_cache should fix these things, but it
  relies on cached data, not on a fresh list of instance ports taken from
  neutron.

  Reproduced Example
  ==
  1. Spawn VM with 1 private network port
  nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
  2. Attach ports to have 2 private and 2 public interfaces
  nova list:
  | a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | - | Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |

  So we see 4 ports:
  stack@mjozefcz-devstack-ptg:~$ nova interface-list 
a64ed18d-9868-4bf0-90d3-d710d278922d
  
++--+--+---+---+
  | Port State | Port ID  | Net ID  
 | IP addresses  | MAC Addr 
 |
  

[Yahoo-eng-team] [Bug 1751923] Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

2019-08-29 Thread Edward Hope-Morley
** Also affects: nova (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Disco)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/stein
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1751923

Title:
  _heal_instance_info_cache periodic task bases on port list from nova
  db, not from neutron server

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in nova package in Ubuntu:
  New
Status in nova source package in Bionic:
  New
Status in nova source package in Disco:
  New

Bug description:
  Description
  ===

  During the periodic task _heal_instance_info_cache, the
  instance_info_caches are not updated using instance port_ids taken
  from neutron, but from the nova DB.

  Sometimes, perhaps because of some race condition, it's possible to
  lose some ports from instance_info_caches. The periodic task
  _heal_instance_info_cache should clean this up (add missing records),
  but in fact it's not working this way.

  How it looks now?
  =

  _heal_instance_info_cache during crontask:

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525

  is using network_api to get instance_nw_info (instance_info_caches):

  try:
  # Call to network API to get instance info.. this will
  # force an update to the instance's info_cache
  self.network_api.get_instance_nw_info(context, instance)

  self.network_api.get_instance_nw_info() is listed below:

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377

  and it uses _build_network_info_model() without networks and port_ids
  parameters (because we're not adding any new interface to instance):

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356

  Next: _gather_port_ids_and_networks() generates the list of instance
  networks and port_ids:

    networks, port_ids = self._gather_port_ids_and_networks(
  context, instance, networks, port_ids, client)

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393

  As we can see, _gather_port_ids_and_networks() takes the port list
  from the DB:

  
https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176

  And that's it. When we lose a port, it's not possible to add it back
  with this periodic task. The only way is to clear the device_id field
  on the neutron port object and re-attach the interface using `nova
  interface-attach`.

  When the interface is missing and there is no port configured on the
  compute host (for example after a compute reboot), the interface is not
  added to the instance and, from neutron's point of view, the port state
  is DOWN.

  When the interface is missing from the cache and we hard-reboot the
  instance, it is not added as a tap interface in the XML file, so we
  don't have the network on the host.

  Steps to reproduce
  ==
  1. Spawn devstack
  2. Spawn VM inside devstack with multiple ports (for example also from 2 
different networks)
  3. Update the DB row, drop one interface from interfaces_list
  4. Hard-Reboot the instance
  5. See that nova list shows instance without one address, but nova 
interface-list shows all addresses
  6. See that one port is missing in instance xml files
  7. In theory _heal_instance_info_cache should fix these things, but it
  relies on cached data, not on a fresh list of instance ports taken from
  neutron.

  Reproduced Example
  ==
  1. Spawn VM with 1 private network port
  nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private test-2
  2. Attach ports to have 2 private and 2 public interfaces
  nova list:
  | a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | -  | 
Running | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; 
private=fdda:5d77:e18e:0:f816:3eff:fee8:, 

[Yahoo-eng-team] [Bug 1633120] Re: [SRU] Nova scheduler tries to assign an already-in-use SRIOV QAT VF to a new instance

2019-08-01 Thread Edward Hope-Morley
Mitaka not backportable so abandoning:

$ git-deps -e mitaka-eol 5c5a6b93a07b0b58f513396254049c17e2883894^!
c2c3b97259258eec3c98feabde3b411b519eae6e

$ git-deps -e mitaka-eol c2c3b97259258eec3c98feabde3b411b519eae6e^!
a023c32c70b5ddbae122636c26ed32e5dcba66b2
74fbff88639891269f6a0752e70b78340cf87e9a
e83842b80b73c451f78a4bb9e7bd5dfcebdefcab
1f259e2a9423a4777f79ca561d5e6a74747a5019
b01187eede3881f72addd997c8fd763ddbc137fc
49d9433c62d74f6ebdcf0832e3a03e544b1d6c83


** Changed in: cloud-archive/mitaka
   Status: Triaged => Won't Fix

** Changed in: nova (Ubuntu Xenial)
   Status: Triaged => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1633120

Title:
  [SRU] Nova scheduler tries to assign an already-in-use SRIOV QAT VF to
  a new instance

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Won't Fix
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) ocata series:
  Fix Committed
Status in OpenStack Compute (nova) pike series:
  Fix Committed
Status in OpenStack Compute (nova) queens series:
  Fix Committed
Status in OpenStack Compute (nova) rocky series:
  Fix Committed
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Xenial:
  Won't Fix
Status in nova source package in Bionic:
  Fix Released
Status in nova source package in Cosmic:
  Fix Released
Status in nova source package in Disco:
  Fix Released
Status in nova source package in Eoan:
  Fix Released

Bug description:
  [Impact]
  This patch is required to prevent nova from accidentally marking
  pci_device allocations as deleted when it incorrectly reads the
  passthrough whitelist.
  [Test Case]
  * deploy openstack (any version that supports sriov)
  * single compute configured for sriov with at least one device in
  pci_passthrough_whitelist
  * create a vm and attach sriov port
  * remove device from pci_passthrough_whitelist and restart nova-compute
  * check that pci_devices allocations have not been marked as deleted

  [Regression Potential]
  None anticipated
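
  As a hedged sketch of the final test step, one could query the nova DB
  directly (credentials are placeholders, and the soft-delete 'deleted'
  column is assumed, matching the mysql query quoted later in this
  report):

    import pymysql

    conn = pymysql.connect(host='localhost', user='root', db='nova')
    with conn.cursor() as cur:
        # Allocated devices must never be flagged as (soft) deleted.
        cur.execute("SELECT address, instance_uuid FROM pci_devices "
                    "WHERE instance_uuid IS NOT NULL AND deleted != 0")
        rows = cur.fetchall()
    assert not rows, 'allocations wrongly marked deleted: %r' % (rows,)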
  
  Upon trying to create a VM instance (say A) with one QAT VF, it fails
  with the following error: “Requested operation is not valid: PCI device
  :88:04.7 is in use by driver QEMU, domain instance-0081”. Please note
  that PCI device :88:04.7 is already assigned to another VM (say B).
  We have installed the openstack-mitaka release on a CentOS 7 system. It
  has two Intel QAT devices, with 32 VF devices available per QAT
  Device/DH895xCC device. Out of 64 VFs, only 8 are allocated (to VM
  instances) and the rest should be available.
  But the nova scheduler tries to assign an already-in-use SRIOV VF to a
  new instance, and the instance fails. It appears that the nova database
  is not tracking which VFs have already been taken. But if I shut down
  VM B, then VM A boots up, and vice versa. Note that both VM instances
  cannot run simultaneously because of the aforesaid issue.

  We should always be able to create as many instances with the
  requested PCI devices as there are available VFs.

  Please feel free to let me know if additional information is needed.
  Can anyone please suggest why it tries to assign same PCI device which
  has been assigned already? Is there any way to resolve this issue?
  Thank you in advance for your support and help.

  [root@localhost ~(keystone_admin)]# lspci -d:435
  83:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
  88:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
  [root@localhost ~(keystone_admin)]#

  [root@localhost ~(keystone_admin)]# lspci -d:443 | grep "QAT Virtual 
Function" | wc -l
  64
  [root@localhost ~(keystone_admin)]#

  [root@localhost ~(keystone_admin)]# mysql -u root nova -e "SELECT hypervisor_hostname, address, instance_uuid, status FROM pci_devices JOIN compute_nodes ON compute_nodes.id=compute_node_id" | grep :88:04.7
  localhost                   :88:04.7  e10a76f3-e58e-4071-a4dd-7a545e8000de  allocated
  localhost                   :88:04.7  c3dbac90-198d-4150-ba0f-a80b912d8021  allocated
  localhost                   :88:04.7  c7f6adad-83f0-4881-b68f-6d154d565ce3  allocated
  localhost.nfv.benunets.com  :88:04.7  0c3c11a5-f9a4-4f0d-b120-40e4dde843d4  allocated
  [root@localhost ~(keystone_admin)]#

  [root@localhost ~(keystone_admin)]# grep -r e10a76f3-e58e-4071-a4dd-7a545e8000de /etc/libvirt/qemu
  /etc/libvirt/qemu/instance-0081.xml:  e10a76f3-e58e-4071-a4dd-7a545e8000de
  

[Yahoo-eng-team] [Bug 1821594] Re: [SRU] Error in confirm_migration leaves stale allocations and 'confirming' migration state

2019-07-15 Thread Edward Hope-Morley
** Changed in: nova/stein
   Status: Fix Committed => Fix Released

** Changed in: cloud-archive/stein
   Status: Fix Committed => Fix Released

** Changed in: nova (Ubuntu Disco)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821594

Title:
  [SRU] Error in confirm_migration leaves stale allocations and
  'confirming' migration state

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Fix Committed
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  Triaged
Status in OpenStack Compute (nova) queens series:
  Fix Committed
Status in OpenStack Compute (nova) rocky series:
  Fix Committed
Status in OpenStack Compute (nova) stein series:
  Fix Released
Status in nova package in Ubuntu:
  Fix Committed
Status in nova source package in Bionic:
  Fix Committed
Status in nova source package in Cosmic:
  Fix Released
Status in nova source package in Disco:
  Fix Released
Status in nova source package in Eoan:
  Fix Committed

Bug description:
  Description:

  When performing a cold migration, if an exception is raised by the
  driver during confirm_migration (this runs in the source node), the
  migration record is stuck in "confirming" state and the allocations
  against the source node are not removed.

  The instance is fine at the destination at this stage, but the source
  host has allocations that cannot be cleaned without going to the
  database or invoking the Placement API via curl. After several
  migration attempts that fail in the same spot, the source node fills
  up with these allocations, which prevent new instances from being
  created on, or instances migrated to, this node.

  When confirm_migration fails in this stage, the migrating instance can
  be saved through a hard reboot or a reset state to active.

  Steps to reproduce:

  Unfortunately, I don't have logs of the real root cause of the problem
  inside driver.confirm_migration running libvirt driver. However, the
  stale allocations and migration status problem can be easily
  reproduced by raising an exception in libvirt driver's
  confirm_migration method, and it would affect any driver.

  Expected results:

  Discussed this issue with efried and mriedem over #openstack-nova on
  March 25th, 2019. They confirmed that allocations not being cleared up
  is a bug.

  Actual results:

  Instance is fine at the destination after a reset-state. Source node
  has stale allocations that prevent new instances from being
  created/migrated to the source node. Migration record is stuck in
  "confirming" state.

  Environment:

  I verified this bug on on pike, queens and stein branches. Running
  libvirt KVM driver.

  ===

  [Impact]

  If users attempting to perform cold migrations face any issue while
  the virt driver is running the "Confirm Migration" step, the failure
  leaves stale allocation records in the database and migration records
  stuck in the "confirming" state. The stale allocations are not cleaned
  up by nova, consuming the user's quota indefinitely.

  This bug was confirmed from pike to stein release, and a fix was
  implemented for queens, rocky and stein. It should be backported to
  those releases to prevent the issue from reoccurring.

  This fix prevents new stale allocations being left over, by cleaning
  them up immediately when the failures occur. At the moment, the users
  affected by this bug have to clean their previous stale allocations
  manually.
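
  As a hedged sketch of that manual cleanup, a stale allocation can be
  removed with the Placement API's DELETE /allocations/{consumer_uuid}
  call; the endpoint and token are taken from the environment, as in the
  curl-based steps below:

    import os
    import requests

    def delete_allocation(consumer_uuid):
        # consumer_uuid is the migration (or instance) uuid holding the
        # stale allocation on the source node.
        resp = requests.delete(
            '%s/allocations/%s' % (os.environ['ENDPOINT'], consumer_uuid),
            headers={'X-Auth-Token': os.environ['TOKEN']})
        resp.raise_for_status()  # placement returns 204 on success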

  [Test Case]

  1. Reproducing the bug

  1a. Inject failure

  The root cause for this problem may vary for each driver and
  environment, so to reproduce the bug, it is necessary first to inject
  a failure in the driver's confirm_migration method to cause an
  exception to be raised.

  An example when using libvirt is to add a line:

  raise Exception("TEST")

  in
  
https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012

  1b. Restart nova-compute service: systemctl restart nova-compute

  1c. Create a VM

  1d. Then, invoke a cold migration: "openstack server migrate {id}"

  1e. Wait for instance status: VERIFY_RESIZE

  1f. Invoke "openstack server resize {id} --confirm"

  1g. Wait for instance status: ERROR

  1h. Check migration stuck in "confirming" status: nova migration-list

  1i. Check allocations, you should see 2 allocations, one with the VM
  ID, the other with the migration uuid

  export ENDPOINT=
  export TOKEN=`openstack token issue| grep ' id '| awk '{print $4}'`
  for id in $(curl 

[Yahoo-eng-team] [Bug 1633120] Re: [SRU] Nova scheduler tries to assign an already-in-use SRIOV QAT VF to a new instance

2019-07-04 Thread Edward Hope-Morley
** Summary changed:

- Nova scheduler tries to assign an already-in-use SRIOV QAT VF to a new 
instance
+ [SRU] Nova scheduler tries to assign an already-in-use SRIOV QAT VF to a new 
instance

** Description changed:

+ [Impact]
+ This patch is required to prevent nova from accidentally marking pci_device 
allocations as deleted when it incorrectly reads the passthrough whitelist 
+ 
+ [Test Case]
+ * deploy openstack (any version that supports sriov)
+ * single compute configured for sriov with at least once device in 
pci_passthrough_whitelist
+ * create a vm and attach sriov port
+ * remove device from pci_passthrough_whitelist and restart nova-compute
+ * check that pci_devices allocations have not been marked as deleted
+ 
+ [Regression Potential]
+ None anticipated
+ 
  Upon trying to create VM instance (Say A) with one QAT VF, it fails with the 
following error i.e., “Requested operation is not valid: PCI device 
:88:04.7 is in use by driver QEMU, domain instance-0081”. Please note 
that, PCI device :88:04.7 is already being assigned to another VM (Say B) . 
 We have installed openstack-mitaka release on CentO7 system. It has two Intel 
QAT devices. There are 32 VF devices available per QAT Device/DH895xCC device 
Out of 64 VFs, only 8 VFs are allocated (to VM instances) and rest should be 
available.
- But the nova scheduler tries to assign an already-in-use SRIOV VF to a new 
instance and instance fails. It appears that the nova database is not tracking 
which VF's have already been taken. But if I shut down VM B instance, then 
other instance VM A boots up and vice-versa. Note that, both the VM instances 
cannot run simultaneously because of the aforesaid issue. 
+ But the nova scheduler tries to assign an already-in-use SRIOV VF to a new 
instance and instance fails. It appears that the nova database is not tracking 
which VF's have already been taken. But if I shut down VM B instance, then 
other instance VM A boots up and vice-versa. Note that, both the VM instances 
cannot run simultaneously because of the aforesaid issue.
  
  We should always be able to create as many instances with the requested
  PCI devices as there are available VFs.
  
  Please feel free to let me know if additional information is needed. Can
  anyone please suggest why it tries to assign same PCI device which has
  been assigned already? Is there any way to resolve this issue? Thank you
  in advance for your support and help.
  
  [root@localhost ~(keystone_admin)]# lspci -d:435
  83:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
  88:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
  [root@localhost ~(keystone_admin)]#
  
- 
  [root@localhost ~(keystone_admin)]# lspci -d:443 | grep "QAT Virtual 
Function" | wc -l
  64
  [root@localhost ~(keystone_admin)]#
-  
-  
+ 
  [root@localhost ~(keystone_admin)]# mysql -u root nova -e "SELECT 
hypervisor_hostname, address, instance_uuid, status FROM pci_devices JOIN 
compute_nodes oncompute_nodes.id=compute_node_id" | grep :88:04.7
  localhost  :88:04.7e10a76f3-e58e-4071-a4dd-7a545e8000deallocated
  localhost  :88:04.7c3dbac90-198d-4150-ba0f-a80b912d8021allocated
  localhost  :88:04.7c7f6adad-83f0-4881-b68f-6d154d565ce3allocated
  localhost.nfv.benunets.com :88:04.7
0c3c11a5-f9a4-4f0d-b120-40e4dde843d4allocated
  [root@localhost ~(keystone_admin)]#
-  
+ 
  [root@localhost ~(keystone_admin)]# grep -r 
e10a76f3-e58e-4071-a4dd-7a545e8000de /etc/libvirt/qemu
  /etc/libvirt/qemu/instance-0081.xml:  
e10a76f3-e58e-4071-a4dd-7a545e8000de
  /etc/libvirt/qemu/instance-0081.xml:  e10a76f3-e58e-4071-a4dd-7a545e8000de
  /etc/libvirt/qemu/instance-0081.xml:  
  /etc/libvirt/qemu/instance-0081.xml:  
  /etc/libvirt/qemu/instance-0081.xml:  
  [root@localhost ~(keystone_admin)]#
  [root@localhost ~(keystone_admin)]# grep -r 
0c3c11a5-f9a4-4f0d-b120-40e4dde843d4 /etc/libvirt/qemu
  /etc/libvirt/qemu/instance-00ab.xml:  
0c3c11a5-f9a4-4f0d-b120-40e4dde843d4
  /etc/libvirt/qemu/instance-00ab.xml:  0c3c11a5-f9a4-4f0d-b120-40e4dde843d4
  /etc/libvirt/qemu/instance-00ab.xml:  
  /etc/libvirt/qemu/instance-00ab.xml:  
  /etc/libvirt/qemu/instance-00ab.xml:  
  [root@localhost ~(keystone_admin)]#
-  
- On the controller, , it appears there are duplicate PCI device entries in the 
Database:
-  
+ 
+ On the controller, , it appears there are duplicate PCI device entries
+ in the Database:
+ 
  MariaDB [nova]> select hypervisor_hostname,address,count(*) from pci_devices 
JOIN compute_nodes on compute_nodes.id=compute_node_id group by 
hypervisor_hostname,address having count(*) > 1;
  +-+--+--+
  | hypervisor_hostname | address  | count(*) |
  +-+--+--+
  | localhost  | 

[Yahoo-eng-team] [Bug 1821594] Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state

2019-06-12 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Eoan)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Cosmic)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Disco)
   Importance: Undecided
   Status: New

** Tags added: sts-sru-needed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821594

Title:
  [SRU] Error in confirm_migration leaves stale allocations and
  'confirming' migration state

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  New
Status in Ubuntu Cloud Archive train series:
  Fix Committed
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  Triaged
Status in OpenStack Compute (nova) queens series:
  Fix Committed
Status in OpenStack Compute (nova) rocky series:
  Fix Committed
Status in OpenStack Compute (nova) stein series:
  Fix Committed
Status in nova package in Ubuntu:
  Fix Committed
Status in nova source package in Bionic:
  New
Status in nova source package in Cosmic:
  New
Status in nova source package in Disco:
  New
Status in nova source package in Eoan:
  Fix Committed

Bug description:
  Description:

  When performing a cold migration, if an exception is raised by the
  driver during confirm_migration (this runs in the source node), the
  migration record is stuck in "confirming" state and the allocations
  against the source node are not removed.

  The instance is fine at the destination at this stage, but the source
  host has allocations that cannot be cleaned without going to the
  database or invoking the Placement API via curl. After several
  migration attempts that fail in the same spot, the source node fills
  up with these allocations, which prevent new instances from being
  created on, or instances migrated to, this node.

  When confirm_migration fails in this stage, the migrating instance can
  be saved through a hard reboot or a reset state to active.

  Steps to reproduce:

  Unfortunately, I don't have logs of the real root cause of the problem
  inside driver.confirm_migration running libvirt driver. However, the
  stale allocations and migration status problem can be easily
  reproduced by raising an exception in libvirt driver's
  confirm_migration method, and it would affect any driver.

  Expected results:

  Discussed this issue with efried and mriedem over #openstack-nova on
  March 25th, 2019. They confirmed that allocations not being cleared up
  is a bug.

  Actual results:

  Instance is fine at the destination after a reset-state. Source node
  has stale allocations that prevent new instances from being
  created/migrated to the source node. Migration record is stuck in
  "confirming" state.

  Environment:

  I verified this bug on on pike, queens and stein branches. Running
  libvirt KVM driver.

  ===

  [Impact]

  If users attempting to perform cold migrations face any issue while
  the virt driver is running the "Confirm Migration" step, the failure
  leaves stale allocation records in the database and migration records
  stuck in the "confirming" state. The stale allocations are not cleaned
  up by nova, consuming the user's quota indefinitely.

  This bug was confirmed from pike to stein release, and a fix was
  implemented for queens, rocky and stein. It should be backported to
  those releases to prevent the issue from reoccurring.

  This fix prevents new stale allocations being left over, by cleaning
  them up immediately when the failures occur. At the moment, the users
  affected by this bug have to clean their previous stale allocations
  manually.

  [Test Case]

  The root cause for this problem may vary for each driver and
  environment, so to reproduce the bug, it is necessary first to inject
  a failure in the driver's confirm_migration method to cause an
  exception to be raised.

  An example when using libvirt is to add a line:

  raise Exception("TEST")

  in
  
https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012
  and then restart the nova-compute service.

  Then, invoke a cold migration through "openstack 

[Yahoo-eng-team] [Bug 1722584] Re: Return traffic from metadata service may get dropped by hypervisor due to wrong checksum

2019-06-10 Thread Edward Hope-Morley
** Description changed:

- We have a problem with the metadata service not being responsive, when
- it is proxied in the router namespace on some of our networking nodes
- after upgrading to Ocata (Running on CentOS 7.4, with the RDO packages).
+ [Impact]
+ Prior addition of code to add checksum rules was found to cause problems with 
newer kernels. Patch subsequently reverted so this request is to backport those 
patches to the ubuntu archives.
  
+ [Test Case]
+ * deploy openstack (>= queens)
+ * create router/network/instance (dvr=false,l3ha=false)
+ * go to router ns on neutron-gateway and check that the following returns 
nothing
+ sudo ip netns exec qrouter- iptables -t mangle -S| grep '--sport 9697 -j 
CHECKSUM --checksum-fill'
+ 
+ [Regression Potential]
+ None expected
+ 
+ [Other Info]
+ This revert patch does not remove rules added by the original patch so manual 
cleanup of those old rules is required.
+ 
+ -
+ We have a problem with the metadata service not being responsive, when it is 
proxied in the router namespace on some of our networking nodes after upgrading 
to Ocata (Running on CentOS 7.4, with the RDO packages).
  
  The instance routes traffic for 169.254.169.254 to its default gateway.
  Default gateway is an OpenStack router in a namespace on a networking node.
  
  - Traffic gets sent from the guest,
  - to the router,
  - iptables routes it to the metadata proxy service,
  - response packet gets routed back, leaving the namespace
  - Hypervisor gets the packet in
  - Checksum of packet is wrong, and the packet gets dropped before putting it 
on the bridge
  
- 
- Based on the following bug 
https://bugs.launchpad.net/openstack-ansible/+bug/1483603, we found that adding 
the following iptable rule in the router namespace made this work again: 
'iptables -t mangle -I POSTROUTING -p tcp --sport 9697 -j CHECKSUM 
--checksum-fill'
+ Based on the following bug https://bugs.launchpad.net/openstack-
+ ansible/+bug/1483603, we found that adding the following iptable rule in
+ the router namespace made this work again: 'iptables -t mangle -I
+ POSTROUTING -p tcp --sport 9697 -j CHECKSUM --checksum-fill'
  
  (NOTE: The rule from the 1st comment to the bug did solve access to the
  metadata service, but the lack of precision introduced other problems
  with the network)

** Description changed:

  [Impact]
  Prior addition of code to add checksum rules was found to cause problems with 
newer kernels. Patch subsequently reverted so this request is to backport those 
patches to the ubuntu archives.
  
  [Test Case]
  * deploy openstack (>= queens)
  * create router/network/instance (dvr=false,l3ha=false)
  * go to router ns on neutron-gateway and check that the following returns 
nothing
- sudo ip netns exec qrouter- iptables -t mangle -S| grep '--sport 9697 -j 
CHECKSUM --checksum-fill'
+ sudo ip netns exec qrouter- iptables -t mangle -S| grep '\--sport 9697 -j 
CHECKSUM --checksum-fill'
  
  [Regression Potential]
  None expected
  
  [Other Info]
  This revert patch does not remove rules added by the original patch so manual 
cleanup of those old rules is required.
  
  -
  We have a problem with the metadata service not being responsive, when it is 
proxied in the router namespace on some of our networking nodes after upgrading 
to Ocata (Running on CentOS 7.4, with the RDO packages).
  
  The instance routes traffic for 169.254.169.254 to its default gateway.
  Default gateway is an OpenStack router in a namespace on a networking node.
  
  - Traffic gets sent from the guest,
  - to the router,
  - iptables routes it to the metadata proxy service,
  - response packet gets routed back, leaving the namespace
  - Hypervisor gets the packet in
  - Checksum of packet is wrong, and the packet gets dropped before putting it 
on the bridge
  
  Based on the following bug https://bugs.launchpad.net/openstack-
  ansible/+bug/1483603, we found that adding the following iptable rule in
  the router namespace made this work again: 'iptables -t mangle -I
  POSTROUTING -p tcp --sport 9697 -j CHECKSUM --checksum-fill'
  
  (NOTE: The rule from the 1st comment to the bug did solve access to the
  metadata service, but the lack of precision introduced other problems
  with the network)
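
A hedged Python sketch of the verification in the [Test Case] above (the
router id is a placeholder; on a fixed deployment the rule should be
absent, though rules added by the original patch still need manual
cleanup):

  import subprocess

  def checksum_rule_present(router_id):
      out = subprocess.check_output(
          ['ip', 'netns', 'exec', 'qrouter-%s' % router_id,
           'iptables', '-t', 'mangle', '-S'])
      return b'--sport 9697 -j CHECKSUM --checksum-fill' in out

  print(checksum_rule_present('ROUTER-ID'))  # expect False after the fix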

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/train
   Importance: Undecided
   Status: New

** Summary changed:

- Return traffic from metadata service may get dropped by hypervisor due to 
wrong checksum
+ [SRU] Return traffic from metadata service may get dropped by hypervisor 

[Yahoo-eng-team] [Bug 1816468] Re: Acceleration cinder - glance with ceph not working

2019-06-07 Thread Edward Hope-Morley
** Changed in: cinder (Ubuntu Eoan)
   Status: Triaged => Fix Released

** No longer affects: cinder (Ubuntu Eoan)

** Changed in: cinder (Ubuntu Disco)
   Status: Triaged => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1816468

Title:
  Acceleration cinder - glance with ceph not working

Status in Cinder:
  Fix Released
Status in Ubuntu Cloud Archive:
  Triaged
Status in Ubuntu Cloud Archive rocky series:
  Triaged
Status in Ubuntu Cloud Archive stein series:
  Triaged
Status in Ubuntu Cloud Archive train series:
  Triaged
Status in OpenStack Compute (nova):
  In Progress
Status in cinder package in Ubuntu:
  Fix Released
Status in nova package in Ubuntu:
  Triaged
Status in cinder source package in Cosmic:
  Triaged
Status in nova source package in Cosmic:
  Triaged
Status in cinder source package in Disco:
  Fix Released
Status in nova source package in Disco:
  Triaged
Status in nova source package in Eoan:
  Triaged

Bug description:
  When using cinder and glance with ceph, the code supports creating
  volumes from images INSIDE the ceph environment as copy-on-write
  volumes. This option saves space in the ceph cluster and increases the
  speed of instance spawning, because the volume is created directly in
  ceph.   <= THIS IS NOT WORKING IN PY3

  If this function is not enabled, the image is copied to the compute
  host, converted, a volume is created and uploaded to ceph (which is
  time consuming, of course).

  The problem is that even if glance-cinder acceleration is turned on,
  the code executes as if it were disabled, i.e. the same as above: copy
  the image, create a volume, upload it to ceph... BUT it should create a
  copy-on-write volume inside ceph internally. <= THIS IS A BUG IN PY3
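
  For reference, a hedged sketch of the copy-on-write path that should be
  taken when the acceleration works, using the python rbd bindings (the
  image/volume names are placeholders; the pool names match the configs
  below, and glance is assumed to have created its usual protected 'snap'
  snapshot on the image):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        src = cluster.open_ioctx('images')    # glance pool
        dst = cluster.open_ioctx('volumes')   # cinder pool
        try:
            # Clone the image snapshot as a copy-on-write child volume,
            # avoiding the download/convert/upload round trip.
            rbd.RBD().clone(src, 'IMAGE_ID', 'snap', dst,
                            'volume-VOLUME_ID',
                            features=rbd.RBD_FEATURE_LAYERING)
        finally:
            src.close()
            dst.close()
    finally:
        cluster.shutdown()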

  Glance config ( controller ):

  [DEFAULT]
  show_image_direct_url = true   <= this has to be set to true to 
reproduce issue
  workers = 7
  transport_url = rabbit://openstack:openstack@openstack-db
  [cors]
  [database]
  connection = mysql+pymysql://glance:Eew7shai@openstack-db:3306/glance
  [glance_store]
  stores = file,rbd
  default_store = rbd
  filesystem_store_datadir = /var/lib/glance/images
  rbd_store_pool = images
  rbd_store_user = images
  rbd_store_ceph_conf = /etc/ceph/ceph.conf
  [image_format]
  [keystone_authtoken]
  auth_url = http://openstack-ctrl:35357
  project_name = service
  project_domain_name = default
  username = glance
  user_domain_name = default
  password = Eew7shai
  www_authenticate_uri = http://openstack-ctrl:5000
  auth_uri = http://openstack-ctrl:35357
  cache = swift.cache
  region_name = RegionOne
  auth_type = password
  [matchmaker_redis]
  [oslo_concurrency]
  lock_path = /var/lock/glance
  [oslo_messaging_amqp]
  [oslo_messaging_kafka]
  [oslo_messaging_notifications]
  [oslo_messaging_rabbit]
  [oslo_messaging_zmq]
  [oslo_middleware]
  [oslo_policy]
  [paste_deploy]
  flavor = keystone
  [store_type_location_strategy]
  [task]
  [taskflow_executor]
  [profiler]
  enabled = true
  trace_sqlalchemy = true
  hmac_keys = secret
  connection_string = redis://127.0.0.1:6379
  trace_wsgi_transport = True
  trace_message_store = True
  trace_management_store = True

  Cinder conf (controller) : 
  root@openstack-controller:/tmp# cat /etc/cinder/cinder.conf | grep -v '^#' | 
awk NF 
  [DEFAULT]
  my_ip = 192.168.10.15
  glance_api_servers = http://openstack-ctrl:9292
  auth_strategy = keystone
  enabled_backends = rbd
  osapi_volume_workers = 7
  debug = true
  transport_url = rabbit://openstack:openstack@openstack-db
  [backend]
  [backend_defaults]
  rbd_pool = volumes
  rbd_user = volumes1
  rbd_secret_uuid = b2efeb49-9844-475b-92ad-5df4a3e1300e
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  [barbican]
  [brcd_fabric_example]
  [cisco_fabric_example]
  [coordination]
  [cors]
  [database]
  connection = mysql+pymysql://cinder:EeRe3ahx@openstack-db:3306/cinder
  [fc-zone-manager]
  [healthcheck]
  [key_manager]
  [keystone_authtoken]
  auth_url = http://openstack-ctrl:35357
  project_name = service
  project_domain_name = default
  username = cinder
  user_domain_name = default
  password = EeRe3ahx
  www_authenticate_uri = http://openstack-ctrl:5000
  auth_uri = http://openstack-ctrl:35357
  cache = swift.cache
  region_name = RegionOne
  auth_type = password
  [matchmaker_redis]
  [nova]
  [oslo_concurrency]
  lock_path = /var/lock/cinder
  [oslo_messaging_amqp]
  [oslo_messaging_kafka]
  [oslo_messaging_notifications]
  [oslo_messaging_rabbit]
  [oslo_messaging_zmq]
  [oslo_middleware]
  [oslo_policy]
  [oslo_reports]
  [oslo_versionedobjects]
  [sample_remote_file_source]
  [service_user]
  [ssl]
  [vault]
  [lvm]
  volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
  volume_group = cinder-volumes
  iscsi_protocol = iscsi
  iscsi_helper = tgtadm
  [profiler]
  enabled = true
  trace_sqlalchemy = true
  

[Yahoo-eng-team] [Bug 1816468] Re: Acceleration cinder - glance with ceph not working

2019-06-06 Thread Edward Hope-Morley
This also needs fixing in Nova (rbd imagebackend)

** Also affects: nova
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1816468

Title:
  Acceleration cinder - glance with ceph not working

Status in Cinder:
  Fix Released
Status in OpenStack Compute (nova):
  In Progress

Bug description:
  When using Cinder and Glance with Ceph, the code supports creating
  volumes from images INSIDE the Ceph environment as copy-on-write
  volumes. This saves space in the Ceph cluster and speeds up instance
  spawning, because the volume is created directly in Ceph.   <=
  THIS IS NOT WORKING IN PY3

  If this feature is not enabled, the image is copied to the compute
  host, converted, used to create a volume, and then uploaded to Ceph
  (which is, of course, time consuming).

  The problem is that even when glance-cinder acceleration is turned
  on, the code behaves as if it were disabled: the same copy image,
  create volume, upload to Ceph sequence as above, when it should
  instead create a copy-on-write volume inside Ceph. <= THIS IS A BUG
  IN PY3
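
  As a rough, illustrative sketch (not the actual Cinder RBD driver
  code): with show_image_direct_url enabled, Glance exposes an image
  location like rbd://<fsid>/<pool>/<image>/<snap>, and the driver
  clones only when the fsid matches the local cluster. The names below
  are assumptions, and the bytes-vs-str comment reflects one plausible
  way such a check can silently fail under py3.

  def is_cloneable(image_location, cluster_fsid):
      # Expect rbd://<fsid>/<pool>/<image>/<snapshot>
      prefix = 'rbd://'
      if not image_location.startswith(prefix):
          return False
      pieces = image_location[len(prefix):].split('/')
      if len(pieces) != 4:
          return False
      location_fsid = pieces[0]
      # Under py3 an fsid read from librados may arrive as bytes;
      # 'abcd' == b'abcd' is False, so the clone path would be
      # silently skipped unless the value is decoded to str first.
      if isinstance(cluster_fsid, bytes):
          cluster_fsid = cluster_fsid.decode('utf-8')
      return location_fsid == cluster_fsid

  print(is_cloneable('rbd://abcd/images/img1/snap', b'abcd'))  # True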

  Glance config ( controller ):

  [DEFAULT]
  show_image_direct_url = true   <= this has to be set to true to reproduce the issue
  workers = 7
  transport_url = rabbit://openstack:openstack@openstack-db
  [cors]
  [database]
  connection = mysql+pymysql://glance:Eew7shai@openstack-db:3306/glance
  [glance_store]
  stores = file,rbd
  default_store = rbd
  filesystem_store_datadir = /var/lib/glance/images
  rbd_store_pool = images
  rbd_store_user = images
  rbd_store_ceph_conf = /etc/ceph/ceph.conf
  [image_format]
  [keystone_authtoken]
  auth_url = http://openstack-ctrl:35357
  project_name = service
  project_domain_name = default
  username = glance
  user_domain_name = default
  password = Eew7shai
  www_authenticate_uri = http://openstack-ctrl:5000
  auth_uri = http://openstack-ctrl:35357
  cache = swift.cache
  region_name = RegionOne
  auth_type = password
  [matchmaker_redis]
  [oslo_concurrency]
  lock_path = /var/lock/glance
  [oslo_messaging_amqp]
  [oslo_messaging_kafka]
  [oslo_messaging_notifications]
  [oslo_messaging_rabbit]
  [oslo_messaging_zmq]
  [oslo_middleware]
  [oslo_policy]
  [paste_deploy]
  flavor = keystone
  [store_type_location_strategy]
  [task]
  [taskflow_executor]
  [profiler]
  enabled = true
  trace_sqlalchemy = true
  hmac_keys = secret
  connection_string = redis://127.0.0.1:6379
  trace_wsgi_transport = True
  trace_message_store = True
  trace_management_store = True

  Cinder conf (controller) : 
  root@openstack-controller:/tmp# cat /etc/cinder/cinder.conf | grep -v '^#' | awk NF
  [DEFAULT]
  my_ip = 192.168.10.15
  glance_api_servers = http://openstack-ctrl:9292
  auth_strategy = keystone
  enabled_backends = rbd
  osapi_volume_workers = 7
  debug = true
  transport_url = rabbit://openstack:openstack@openstack-db
  [backend]
  [backend_defaults]
  rbd_pool = volumes
  rbd_user = volumes1
  rbd_secret_uuid = b2efeb49-9844-475b-92ad-5df4a3e1300e
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  [barbican]
  [brcd_fabric_example]
  [cisco_fabric_example]
  [coordination]
  [cors]
  [database]
  connection = mysql+pymysql://cinder:EeRe3ahx@openstack-db:3306/cinder
  [fc-zone-manager]
  [healthcheck]
  [key_manager]
  [keystone_authtoken]
  auth_url = http://openstack-ctrl:35357
  project_name = service
  project_domain_name = default
  username = cinder
  user_domain_name = default
  password = EeRe3ahx
  www_authenticate_uri = http://openstack-ctrl:5000
  auth_uri = http://openstack-ctrl:35357
  cache = swift.cache
  region_name = RegionOne
  auth_type = password
  [matchmaker_redis]
  [nova]
  [oslo_concurrency]
  lock_path = /var/lock/cinder
  [oslo_messaging_amqp]
  [oslo_messaging_kafka]
  [oslo_messaging_notifications]
  [oslo_messaging_rabbit]
  [oslo_messaging_zmq]
  [oslo_middleware]
  [oslo_policy]
  [oslo_reports]
  [oslo_versionedobjects]
  [sample_remote_file_source]
  [service_user]
  [ssl]
  [vault]
  [lvm]
  volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
  volume_group = cinder-volumes
  iscsi_protocol = iscsi
  iscsi_helper = tgtadm
  [profiler]
  enabled = true
  trace_sqlalchemy = true
  hmac_keys = secret
  connection_string = redis://127.0.0.1:6379
  trace_wsgi_transport = True
  trace_message_store = True
  trace_management_store = True
  [rbd]
  volume_driver = cinder.volume.drivers.rbd.RBDDriver
  rbd_pool = volumes
  rbd_ceph_conf = /etc/ceph/ceph.conf
  rbd_user = volumes1
  image_volume_cache_enabled = True
  volume_clear = zero
  rbd_max_clone_depth = 5
  rbd_flatten_volume_from_snapshot = False

  
  cinder conf compute node : 

  root@openstack-compute2:~# cat /etc/cinder/cinder.conf | grep -v '^#' | awk NF
  [DEFAULT]
  my_ip = 192.168.10.6
  glance_api_servers = http://openstack-ctrl:9292
  auth_strategy = 

[Yahoo-eng-team] [Bug 1744079] Re: [SRU] disk over-commit still not correctly calculated during live migration

2019-05-23 Thread Edward Hope-Morley
** Changed in: cloud-archive
   Status: Fix Committed => Fix Released

** Tags removed: sts-sru-needed
** Tags added: sts-sru-done

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1744079

Title:
  [SRU] disk over-commit still not correctly calculated during live
  migration

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Released
Status in Ubuntu Cloud Archive ocata series:
  Fix Released
Status in Ubuntu Cloud Archive pike series:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  In Progress
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Xenial:
  Fix Released
Status in nova source package in Bionic:
  Fix Released
Status in nova source package in Cosmic:
  Fix Released
Status in nova source package in Disco:
  Fix Released

Bug description:
  [Impact]
  nova compares disk space with the disk_available_least field, which can 
be negative due to overcommit.

  So the migration may fail because of a "Migration pre-check error:
  Unable to migrate dfcd087a-5dff-439d-8875-2f702f081539: Disk of
  instance is too large(available on destination host:-3221225472 <
  need:22806528)" when trying a migration to another compute that has
  plenty of free space on its disk.

  [Test Case]
  Deploy an openstack environment. Make sure there is a negative 
disk_available_least and an adequate free_disk_gb on one test compute node, then 
migrate a VM to it with disk-overcommit (openstack server migrate --live 
 --block-migration --disk-overcommit ). You will 
see the above migration pre-check error.

  This is the formula to compute disk_available_least and free_disk_gb.

  disk_free_gb = disk_info_dict['free']
  disk_over_committed = self._get_disk_over_committed_size_total()
  available_least = disk_free_gb * units.Gi - disk_over_committed
  data['disk_available_least'] = available_least / units.Gi

  The following command can be used to query the value of
  disk_available_least

  nova hypervisor-show  |grep disk
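
  As a rough numeric illustration of the formula above (the numbers are
  made up), overcommit can push disk_available_least below zero even
  though free_disk_gb is healthy:

  Gi = 1024 ** 3
  disk_free_gb = 100                 # 100 GiB actually free on disk
  disk_over_committed = 150 * Gi     # 150 GiB promised to instances
  available_least = disk_free_gb * Gi - disk_over_committed
  print(available_least / Gi)        # -50.0 -> the pre-check refuses
                                     # the host despite real free space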

  Steps to Reproduce:
  1. set disk_allocation_ratio config option > 1.0 
  2. qemu-img resize cirros-0.3.0-x86_64-disk.img +40G
  3. glance image-create --disk-format qcow2 ...
  4. boot VMs based on resized image
  5. we see disk_available_least becomes negative

  [Regression Potential]
  Minimal - we're just changing from the following line:

  disk_available_gb = dst_compute_info['disk_available_least']

  to the following code:

  if disk_over_commit:
      disk_available_gb = dst_compute_info['free_disk_gb']
  else:
      disk_available_gb = dst_compute_info['disk_available_least']

  When overcommit is enabled, disk_available_least can be negative, so
  we should use free_disk_gb instead, by backporting the following two
  fixes.

  
https://git.openstack.org/cgit/openstack/nova/commit/?id=e097c001c8e0efe8879da57264fcb7bdfdf2
  
https://git.openstack.org/cgit/openstack/nova/commit/?id=e2cc275063658b23ed88824100919a6dfccb760d

  This is the code path for check_can_live_migrate_destination:

  _migrate_live(os-migrateLive API, migrate_server.py) -> migrate_server
  -> _live_migrate -> _build_live_migrate_task ->
  _call_livem_checks_on_host -> check_can_live_migrate_destination

  BTW, Red Hat also has the same bug -
  https://bugzilla.redhat.com/show_bug.cgi?id=1477706

  
  [Original Bug Report]
  Change I8a705114d47384fcd00955d4a4f204072fed57c2 (written by me... sigh) 
addressed a bug which prevented live migration to a target host with 
overcommitted disk when made with microversion <2.25. It achieved this, but the 
fix is still not correct. We now do:

  if disk_over_commit:
      disk_available_gb = dst_compute_info['local_gb']

  Unfortunately local_gb is *total* disk, not available disk. We
  actually want free_disk_gb. Fun fact: due to the way we calculate this
  for filesystems, without taking into account reserved space, this can
  also be negative.

  The test we're currently running is: could we fit this guest's
  allocated disks on the target if the target disk was empty. This is at
  least better than it was before, as we don't spuriously fail early. In
  fact, we're effectively disabling a test which is disabled for
  microversion >=2.25 anyway. IOW we should fix it, but it's probably
  not a high priority.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1744079/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

[Yahoo-eng-team] [Bug 1606741] Re: Metadata service for instances is unavailable when the l3-agent on the compute host is dvr_snat mode

2019-05-10 Thread Edward Hope-Morley
** Changed in: cloud-archive/stein
   Status: New => Fix Released

** Description changed:

+ [Impact] 
+ Currently if you deploy Openstack with dvr and l3ha enabled (and > 1 compute 
host) only instances that are booted on the compute host that is running the VR 
master will have access to metadata. This patch ensures that both master and 
slave VRs have an associated haproxy ns-metadata process running local to the 
compute host.
+ 
+ [Test Case]
+ * deploy Openstack with dvr and l3ha enabled with 2 compute hosts
+ * create an ubuntu instance on each compute host
+ * check that both are able to access the metadata api (i.e. cloud-init 
completes successfully)
+ * verify that there is an ns-metadata haproxy process running on each compute 
host
+ 
+ [Regression Potential] 
+ None anticipated
+  
+ =
+ 
  In my mitaka environment, there are five nodes here, including
  controller, network1, network2, computer1, computer2 node. I start
  l3-agents with dvr_snat mode in all network and compute nodes and set
  enable_metadata_proxy to true in l3-agent.ini. It works well for most
  neutron services except for the metadata proxy service. When I run the
  command "curl http://169.254.169.254" in an instance booted from
  cirros, it returns "curl: couldn't connect to host" and the instance
  can't fetch metadata on its first boot.
  
  * Pre-conditions: start l3-agent with dvr_snat mode in all computer and
  network nodes and set enable_metadata_proxy to true in l3-agent.ini.
  
  * Step-by-step reproduction steps:
  1.create a network and a subnet under this network;
  2.create a router;
  3.add the subnet to the router
  4.create an instance with cirros (or other images) on this subnet
  5.open the console for this instance and run command 'curl 
http://169.254.169.254' in bash, waiting for result.
  
  * Expected output: this command should return the true metadata info
  with the command  'curl http://169.254.169.254'
  
  * Actual output:  the command actually returns "curl: couldn't connect
  to host"
  
  * Version:
    ** Mitaka
    ** All hosts are centos7

** Tags added: sts-sru-needed

** Summary changed:

- Metadata service for instances is unavailable when the l3-agent on the 
compute host  is dvr_snat mode
+ [SRU] Metadata service for instances is unavailable when the l3-agent on the 
compute host  is dvr_snat mode

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606741

Title:
  [SRU] Metadata service for instances is unavailable when the l3-agent
  on the compute host  is dvr_snat mode

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  New
Status in neutron source package in Cosmic:
  New
Status in neutron source package in Disco:
  New
Status in neutron source package in Eoan:
  Fix Released

Bug description:
  [Impact] 
  Currently if you deploy Openstack with dvr and l3ha enabled (and > 1 compute 
host) only instances that are booted on the compute host that is running the VR 
master will have access to metadata. This patch ensures that both master and 
slave VRs have an associated haproxy ns-metadata process running local to the 
compute host.

  [Test Case]
  * deploy Openstack with dvr and l3ha enabled with 2 compute hosts
  * create an ubuntu instance on each compute host
  * check that both are able to access the metadata api (i.e. cloud-init 
completes successfully)
  * verify that there is an ns-metadata haproxy process running on each compute 
host

  [Regression Potential] 
  None anticipated
   
  =

  In my mitaka environment, there are five nodes here, including
  controller, network1, network2, computer1, computer2 node. I start
  l3-agents with dvr_snat mode in all network and compute nodes and set
  enable_metadata_proxy to true in l3-agent.ini. It works well for most
  neutron services except for the metadata proxy service. When I run the
  command "curl http://169.254.169.254" in an instance booted from
  cirros, it returns "curl: couldn't connect to host" and the instance
  can't fetch metadata on its first boot.

  * Pre-conditions: start l3-agent with dvr_snat mode in all computer
  and network nodes and set enable_metadata_proxy to true in
  l3-agent.ini.

  * Step-by-step reproduction steps:
  1.create a network and a subnet under this network;
  2.create a router;
  3.add the subnet to the router
  4.create an instance with cirros (or other images) on this subnet
  5.open the console for 

[Yahoo-eng-team] [Bug 1606741] Re: Metadata service for instances is unavailable when the l3-agent on the compute host is dvr_snat mode

2019-05-10 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/stein
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/queens
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/rocky
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606741

Title:
  [SRU] Metadata service for instances is unavailable when the l3-agent
  on the compute host  is dvr_snat mode

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  New
Status in Ubuntu Cloud Archive rocky series:
  New
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  New

Bug description:
  [Impact] 
  Currently if you deploy Openstack with dvr and l3ha enabled (and > 1 compute 
host) only instances that are booted on the compute host that is running the VR 
master will have access to metadata. This patch ensures that both master and 
slave VRs have an associated haproxy ns-metadata process running local to the 
compute host.

  [Test Case]
  * deploy Openstack with dvr and l3ha enabled with 2 compute hosts
  * create an ubuntu instance on each compute host
  * check that both are able to access the metadata api (i.e. cloud-init 
completes successfully)
  * verify that there is an ns-metadata haproxy process running on each compute 
host (see the sketch below)
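
  A hedged sketch of that last verification step, matching processes
  with pgrep (the exact haproxy command line is an assumption here, so
  adjust the pattern to what your deployment actually runs):

  import subprocess

  def metadata_proxy_running(router_id):
      # Look for a haproxy whose command line mentions the router UUID;
      # neutron spawns one metadata-proxy haproxy per HA router.
      result = subprocess.run(['pgrep', '-af', 'haproxy'],
                              capture_output=True, text=True)
      return router_id in result.stdout

  print(metadata_proxy_running('ROUTER_UUID'))  # expect True on each host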

  [Regression Potential] 
  None anticipated
   
  =

  In my mitaka environment, there are five nodes here, including
  controller, network1, network2, computer1, computer2 node. I start
  l3-agents with dvr_snat mode in all network and compute nodes and set
  enable_metadata_proxy to true in l3-agent.ini. It works well for most
  neutron services except for the metadata proxy service. When I run the
  command "curl http://169.254.169.254" in an instance booted from
  cirros, it returns "curl: couldn't connect to host" and the instance
  can't fetch metadata on its first boot.

  * Pre-conditions: start l3-agent with dvr_snat mode in all computer
  and network nodes and set enable_metadata_proxy to true in
  l3-agent.ini.

  * Step-by-step reproduction steps:
  1.create a network and a subnet under this network;
  2.create a router;
  3.add the subnet to the router
  4.create an instance with cirros (or other images) on this subnet
  5.open the console for this instance and run command 'curl 
http://169.254.169.254' in bash, waiting for result.

  * Expected output: this command should return the true metadata info
  with the command  'curl http://169.254.169.254'

  * Actual output:  the command actually returns "curl: couldn't connect
  to host"

  * Version:
    ** Mitaka
    ** All hosts are centos7

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1606741/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1744079] Re: [SRU] disk over-commit still not correctly calculated during live migration

2019-04-16 Thread Edward Hope-Morley
** Also affects: cloud-archive/mitaka
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1744079

Title:
  [SRU] disk over-commit still not correctly calculated during live
  migration

Status in Ubuntu Cloud Archive:
  Fix Committed
Status in Ubuntu Cloud Archive mitaka series:
  New
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in Ubuntu Cloud Archive pike series:
  Fix Committed
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  In Progress
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Xenial:
  Triaged
Status in nova source package in Bionic:
  Fix Released
Status in nova source package in Cosmic:
  Fix Released
Status in nova source package in Disco:
  Fix Released

Bug description:
  [Impact]
  nova compares disk space with the disk_available_least field, which can 
be negative due to overcommit.

  So the migration may fail because of a "Migration pre-check error:
  Unable to migrate dfcd087a-5dff-439d-8875-2f702f081539: Disk of
  instance is too large(available on destination host:-3221225472 <
  need:22806528)" when trying a migration to another compute that has
  plenty of free space on its disk.

  [Test Case]
  Deploy an openstack environment. Make sure there is a negative 
disk_available_least and an adequate free_disk_gb on one test compute node, then 
migrate a VM to it with disk-overcommit (openstack server migrate --live 
 --block-migration --disk-overcommit ). You will 
see the above migration pre-check error.

  This is the formula to compute disk_available_least and free_disk_gb.

  disk_free_gb = disk_info_dict['free']
  disk_over_committed = self._get_disk_over_committed_size_total()
  available_least = disk_free_gb * units.Gi - disk_over_committed
  data['disk_available_least'] = available_least / units.Gi

  The following command can be used to query the value of
  disk_available_least

  nova hypervisor-show  |grep disk

  Steps to Reproduce:
  1. set disk_allocation_ratio config option > 1.0 
  2. qemu-img resize cirros-0.3.0-x86_64-disk.img +40G
  3. glance image-create --disk-format qcow2 ...
  4. boot VMs based on resized image
  5. we see disk_available_least becomes negative

  [Regression Potential]
  Minimal - we're just changing from the following line:

  disk_available_gb = dst_compute_info['disk_available_least']

  to the following code:

  if disk_over_commit:
      disk_available_gb = dst_compute_info['free_disk_gb']
  else:
      disk_available_gb = dst_compute_info['disk_available_least']

  When overcommit is enabled, disk_available_least can be negative, so
  we should use free_disk_gb instead, by backporting the following two
  fixes.

  
https://git.openstack.org/cgit/openstack/nova/commit/?id=e097c001c8e0efe8879da57264fcb7bdfdf2
  
https://git.openstack.org/cgit/openstack/nova/commit/?id=e2cc275063658b23ed88824100919a6dfccb760d

  This is the code path for check_can_live_migrate_destination:

  _migrate_live(os-migrateLive API, migrate_server.py) -> migrate_server
  -> _live_migrate -> _build_live_migrate_task ->
  _call_livem_checks_on_host -> check_can_live_migrate_destination

  BTW, Red Hat also has the same bug -
  https://bugzilla.redhat.com/show_bug.cgi?id=1477706

  
  [Original Bug Report]
  Change I8a705114d47384fcd00955d4a4f204072fed57c2 (written by me... sigh) 
addressed a bug which prevented live migration to a target host with 
overcommitted disk when made with microversion <2.25. It achieved this, but the 
fix is still not correct. We now do:

  if disk_over_commit:
      disk_available_gb = dst_compute_info['local_gb']

  Unfortunately local_gb is *total* disk, not available disk. We
  actually want free_disk_gb. Fun fact: due to the way we calculate this
  for filesystems, without taking into account reserved space, this can
  also be negative.

  The test we're currently running is: could we fit this guest's
  allocated disks on the target if the target disk was empty. This is at
  least better than it was before, as we don't spuriously fail early. In
  fact, we're effectively disabling a test which is disabled for
  microversion >=2.25 anyway. IOW we should fix it, but it's probably
  not a high priority.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1744079/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1681627] Re: Page not found error on refreshing browser (in AngularJS-based detail page)

2019-04-02 Thread Edward Hope-Morley
Pike patch will be included in PR done in bug 1822192

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/pike
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/ocata
   Importance: Undecided
   Status: New

** Changed in: cloud-archive/pike
   Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1681627

Title:
  Page not found error on refreshing browser (in AngularJS-based detail
  page)

Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive ocata series:
  New
Status in Ubuntu Cloud Archive pike series:
  Fix Committed
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in Zun UI:
  Fix Released

Bug description:
  Once I get into the container detail view, refreshing the browser will
  show a page not found error:

    The current URL, ngdetails/OS::Zun::Container/c54ba416-a955-45b2-848b-aee57b748e08, didn't match any of these

  Full output: http://paste.openstack.org/show/605296/
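
  A hedged sketch of the general class of fix: give Django a catch-all
  route for ngdetails/... so a hard refresh serves the single-page app
  shell and the Angular router resolves the detail view client-side.
  The pattern and template name below are assumptions, not Horizon's
  actual code:

  from django.conf.urls import url
  from django.views.generic import TemplateView

  urlpatterns = [
      # Any ngdetails/... path returns the SPA shell instead of a 404;
      # the client-side router then restores the detail view.
      url(r'^ngdetails/',
          TemplateView.as_view(template_name='angular_shell.html')),
  ]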

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1681627/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1818614] Re: Various L3HA functional tests fails often

2019-03-25 Thread Edward Hope-Morley
** Description changed:

+ [Impact]
+ Need to get this added to the Ubuntu packages in order to safeguard against 
missed VRRP transitions due to ip -o monitor not running at the time the 
transition occurs. We have seen many cases in the field where neutron routers 
end up as active on multiple l3 agents (via neutron api) which leads to a 
number of problems.
+ 
+ [Test Case]
+ * deploy Openstack (any version that supports l3ha)
+ * create HA router with max-l3-agents=2
+ * check neutron l3-agent-list-hosting-router for master location
+ * on both hosts that are running the l3-agent do
+ 
+ pid=`pgrep -f "/usr/bin/neutron-keepalived-state-change 
--router_id=$ROUTER_UUID"`
+ ps -f --ppid $pid
+ pkill -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"
+ ps -f --ppid $pid <<< this should return nothing now
+ pkill -f "/var/lib/neutron/ha_confs/$ROUTER_UUID/keepalived.conf"
+ 
+ * without this patch you should now see both agents reporting the router as 
"active"
+ * with the patch this should not happen (once neutron-keepalived-state-change 
has been restarted)
+ 
+ [Regression Potential]
+ 
+ 
+ 
  Recently many L3 HA related functional tests are failing.
  The common thing in all those errors is the fact that it fails when waiting for 
l3 ha router to become master.
  
  Example stack trace:
  
  ft2.12: 
neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ha_router_lifecycle_StringException:
 Traceback (most recent call last):
-   File "neutron/tests/base.py", line 174, in func
- return f(self, *args, **kwargs)
-   File "neutron/tests/base.py", line 174, in func
- return f(self, *args, **kwargs)
-   File "neutron/tests/functional/agent/l3/test_ha_router.py", line 81, in 
test_ha_router_lifecycle
- self._router_lifecycle(enable_ha=True, router_info=router_info)
-   File "neutron/tests/functional/agent/l3/framework.py", line 274, in 
_router_lifecycle
- common_utils.wait_until_true(lambda: router.ha_state == 'master')
-   File "neutron/common/utils.py", line 690, in wait_until_true
- raise WaitTimeout(_("Timed out after %d seconds") % timeout)
+   File "neutron/tests/base.py", line 174, in func
+ return f(self, *args, **kwargs)
+   File "neutron/tests/base.py", line 174, in func
+ return f(self, *args, **kwargs)
+   File "neutron/tests/functional/agent/l3/test_ha_router.py", line 81, in 
test_ha_router_lifecycle
+ self._router_lifecycle(enable_ha=True, router_info=router_info)
+   File "neutron/tests/functional/agent/l3/framework.py", line 274, in 
_router_lifecycle
+ common_utils.wait_until_true(lambda: router.ha_state == 'master')
+   File "neutron/common/utils.py", line 690, in wait_until_true
+ raise WaitTimeout(_("Timed out after %d seconds") % timeout)
  neutron.common.utils.WaitTimeout: Timed out after 60 seconds
  
  Example failure: http://logs.openstack.org/79/633979/21/check/neutron-
  functional-python27/ce7ef07/logs/testr_results.html.gz
  
  Logstash query:
  
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ha_state%20%3D%3D%20'master')%5C%22

** Description changed:

  [Impact]
  Need to get this added to the Ubuntu packages in order to safeguard against 
missed VRRP transitions due to ip -o monitor not running at the time the 
transition occurs. We have seen many cases in the field where neutron routers 
end up as active on multiple l3 agents (via neutron api) which leads to a 
number of problems.
  
  [Test Case]
  * deploy Openstack (any version that supports l3ha)
  * create HA router with max-l3-agents=2
  * check neutron l3-agent-list-hosting-router for master location
  * on both hosts that are running the l3-agent do
  
  pid=`pgrep -f "/usr/bin/neutron-keepalived-state-change 
--router_id=$ROUTER_UUID"`
  ps -f --ppid $pid
  pkill -f "/usr/bin/neutron-keepalived-state-change --router_id=$ROUTER_UUID"
  ps -f --ppid $pid <<< this should return nothing now
  pkill -f "/var/lib/neutron/ha_confs/$ROUTER_UUID/keepalived.conf"
  
  * without this patch you should now see both agents reporting the router as 
"active"
  * with the patch this should not happen (once neutron-keepalived-state-change 
has been restarted)
  
  [Regression Potential]
+ None expected.
  
  
  
  Recently many L3 HA related functional tests are failing.
  The common thing in all those errors is the fact that it fails when waiting for 
l3 ha router to become master.
  
  Example stack trace:
  
  ft2.12: 
neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ha_router_lifecycle_StringException:
 Traceback (most recent call last):
    File "neutron/tests/base.py", line 174, in func
  return f(self, *args, **kwargs)
    File "neutron/tests/base.py", line 174, in func
  return f(self, *args, **kwargs)
    File 

[Yahoo-eng-team] [Bug 1818239] Re: scheduler: build failure high negative weighting

2019-03-20 Thread Edward Hope-Morley
** Changed in: charm-nova-cloud-controller
Milestone: None => 19.04

** Changed in: charm-nova-cloud-controller
   Status: Fix Committed => Confirmed

** Changed in: charm-nova-cloud-controller
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1818239

Title:
  scheduler: build failure high negative weighting

Status in OpenStack nova-cloud-controller charm:
  Fix Released
Status in OpenStack Compute (nova):
  Incomplete
Status in nova package in Ubuntu:
  Triaged

Bug description:
  Whilst debugging a Queens cloud which seems to be landing all new
  instances on 3 out of 9 hypervisors (which resulted in three very
  heavily overloaded servers) I noticed that the weighting of the build
  failure weigher is -100.0 * number of failures:

  https://github.com/openstack/nova/blob/master/nova/conf/scheduler.py#L495

  This means that a server which has any sort of build failure instantly
  drops to the bottom of the weighted list of hypervisors for scheduling
  of instances.
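
  To make the arithmetic concrete, here is a toy version of such a
  weigher (the -100.0 multiplier is the figure quoted above; nova's
  real weigher and weight normalization are more involved):

  def build_failure_weight(failed_builds, multiplier=-100.0):
      return multiplier * failed_builds

  hosts = {'hv1': 0, 'hv2': 0, 'hv3': 1}   # hv3 had one failed build
  ranked = sorted(hosts, key=lambda h: build_failure_weight(hosts[h]),
                  reverse=True)
  print(ranked)  # ['hv1', 'hv2', 'hv3'] -> hv3 sinks below all others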

  Why might an instance fail to build? It could be a timeout due to load,
  or it might be due to a bad image (one that won't actually boot under
  qemu).  This second cause could be triggered by an end user of the
  cloud inadvertently causing all instances to be pushed to a small
  subset of hypervisors (which is what I think happened in our case).

  This feels like quite a dangerous default to have given the potential
  to DOS hypervisors intentionally or otherwise.

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: nova-scheduler 2:17.0.7-0ubuntu1
  ProcVersionSignature: Ubuntu 4.15.0-43.46-generic 4.15.18
  Uname: Linux 4.15.0-43-generic x86_64
  ApportVersion: 2.20.9-0ubuntu7.5
  Architecture: amd64
  Date: Fri Mar  1 13:57:39 2019
  NovaConf: Error: [Errno 13] Permission denied: '/etc/nova/nova.conf'
  PackageArchitecture: all
  ProcEnviron:
   TERM=screen-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=
   LANG=C.UTF-8
   SHELL=/bin/bash
  SourcePackage: nova
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1818239/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1749667] Re: neutron doesn't correctly handle unknown protocols and should whitelist known and handled protocols

2018-11-27 Thread Edward Hope-Morley
** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Changed in: cloud-archive
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1749667

Title:
  neutron doesn't correctly handle unknown protocols and should
  whitelist known and handled protocols

Status in Ubuntu Cloud Archive:
  Fix Released
Status in neutron:
  Fix Released

Bug description:
  We have had problems with openvswitch agent continuously restarting
  and never actually completing setup because of this:

  # Completed by iptables_manager
  ; Stdout: ; Stderr: iptables-restore v1.4.21: multiport only works with TCP, 
UDP, UDPLITE, SCTP and DCCP
  Error occurred at line: 83
  Try `iptables-restore -h' or 'iptables-restore --help' for more information.

  83. -I neutron-openvswi- 69 -s  -p 112 -m multiport --dports 
1:65535 -j RETURN
  ---

  Someone has managed to inject a rule that is, effectively, a DoS.
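
  A minimal sketch of the whitelisting the title asks for, using the
  protocol list from the iptables error above (the function name is
  illustrative, not neutron's actual code):

  # Only these protocols support port matching (--dports/multiport).
  PORT_PROTOCOLS = {'tcp', 'udp', 'udplite', 'sctp', 'dccp'}

  def supports_port_match(protocol):
      return str(protocol).lower() in PORT_PROTOCOLS

  print(supports_port_match('tcp'))  # True
  print(supports_port_match(112))    # False: emit the rule for
                                     # protocol 112 without --dports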

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1749667/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1713499] Re: Cannot delete a neutron network, if the currently configured MTU is lower than the network's MTU

2018-11-22 Thread Edward Hope-Morley
This is now Fix Released for Queens UCA since the patch is in the 12.0.4
release and UCA now has 12.0.5 (from bug 1795424)

** Changed in: cloud-archive/queens
   Status: In Progress => Fix Released

** Changed in: neutron (Ubuntu Bionic)
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1713499

Title:
  Cannot delete a neutron network, if the currently configured MTU is
  lower than the network's MTU

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive pike series:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  Fix Released

Bug description:
  Currently, the neutron API returns an error [1] when trying to delete
  a neutron network which has a higher MTU than the configured
  MTU[2][3].

  This issue has been noticed in Pike.

  [1] Error: http://paste.openstack.org/show/619627/
  [2] neutron.conf: http://paste.openstack.org/show/619629/
  [3] ml2_conf.ini: http://paste.openstack.org/show/619630/

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1713499/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1778771] Re: Backups panel is visible even if enable_backup is False

2018-10-12 Thread Edward Hope-Morley
** Changed in: charm-openstack-dashboard
   Status: In Progress => Invalid

** Changed in: charm-openstack-dashboard
 Assignee: Seyeong Kim (xtrusia) => (unassigned)

** Changed in: charm-openstack-dashboard
Milestone: 18.11 => None

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1778771

Title:
  Backups panel is visible even if enable_backup is False

Status in OpenStack openstack-dashboard charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Triaged
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in OpenStack Dashboard (Horizon):
  Fix Released
Status in horizon package in Ubuntu:
  Fix Released
Status in horizon source package in Bionic:
  Triaged
Status in horizon source package in Cosmic:
  Fix Released

Bug description:
  Hi,

  Volumes - Backup panel is visible even if OPENSTACK_CINDER_FEATURES =
  {'enable_backup': False} in local_settings.py

  Meanwhile, setting enable_backup to False removes the option to create
  a backup of a volume from the volume drop-down menu, but the Backups
  panel itself stays visible to both admins and users.

  As a work-around I use the following customization script:
  import horizon
  from django.conf import settings

  if not getattr(settings, 'OPENSTACK_CINDER_FEATURES',
                 {}).get('enable_backup', False):
      project = horizon.get_dashboard("project")
      backup = project.get_panel("backups")
      project.unregister(backup.__class__)

  And for a permanent fix I suggest the following. In 
openstack_dashboard/dashboards/project/backups/panel.py make the following 
changes:
  ...
  +L16: from django.conf import settings
  ...
  +L21: if not getattr(settings, 'OPENSTACK_CINDER_FEATURES', 
{}).get('enable_backup', False):
  +L22: return False
  ...
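
  For context, the proposed check would sit in the panel class roughly
  as sketched below, assuming Horizon's Panel API (the allowed() method
  and its context argument are assumptions inferred from the snippet
  above, not the exact upstream patch):

  from django.conf import settings

  import horizon

  class Backups(horizon.Panel):
      name = "Backups"
      slug = "backups"

      def allowed(self, context):
          # Hide the panel entirely when cinder backups are disabled.
          if not getattr(settings, 'OPENSTACK_CINDER_FEATURES',
                         {}).get('enable_backup', False):
              return False
          return super(Backups, self).allowed(context)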

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-openstack-dashboard/+bug/1778771/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1713499] Re: Cannot delete a neutron network, if the currently configured MTU is lower than the network's MTU

2018-10-01 Thread Edward Hope-Morley
** Changed in: cloud-archive
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1713499

Title:
  Cannot delete a neutron network, if the currently configured MTU is
  lower than the network's MTU

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive pike series:
  New
Status in Ubuntu Cloud Archive queens series:
  New
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Bionic:
  In Progress

Bug description:
  Currently, the neutron API returns an error [1] when trying to delete
  a neutron network which has a higher MTU than the configured
  MTU[2][3].

  This issue has been noticed in Pike.

  [1] Error: http://paste.openstack.org/show/619627/
  [2] neutron.conf: http://paste.openstack.org/show/619629/
  [3] ml2_conf.ini: http://paste.openstack.org/show/619630/

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1713499/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

