[Yahoo-eng-team] [Bug 2051729] [NEW] issue dhcp cleaning stale devices process when enable action

2024-01-30 Thread Sahid Orentino
Public bug reported:

When the driver's enable() is called, the cleanup_stale_devices function
is invoked to remove stale devices within the namespace.
cleanup_stale_devices examines the ports in the network to prevent the
unintentional removal of legitimate devices.

In a multisegment context, the device created for the first segment might
be deleted during the second iteration. This occurs because the network
variable used in the loop does not refer to one shared object, so its
ports are not updated with the ones created during previous iterations.
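A minimal sketch of the loop described above (class and function names are hypothetical, not the actual driver code): with separate per-segment network snapshots the first device is wrongly dropped, while a single shared network object keeps every device.

```python
# Hypothetical model of the bug: each fetched network snapshot is a
# distinct object, so a port appended during one iteration is invisible
# to the cleanup of the next iteration, and the earlier device looks
# "stale". Reusing one shared object avoids this.

class Network:
    def __init__(self, ports):
        self.ports = list(ports)

def cleanup_stale_devices(network, devices):
    # Remove any device that has no backing port on this network object.
    known = set(network.ports)
    return [d for d in devices if d in known]

def enable_segments(networks):
    devices = []
    for i, net in enumerate(networks):
        dev = f"tap{i}"
        net.ports.append(dev)   # port recorded only on *this* object
        devices.append(dev)
        devices = cleanup_stale_devices(net, devices)
    return devices
```

With two separate snapshots, enable_segments returns only the last device (the bug); passing the same object twice preserves both.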

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2051729

Title:
  issue dhcp cleaning stale devices process when enable action

Status in neutron:
  New

Bug description:
  When the driver's enable() is called, the cleanup_stale_devices
  function is invoked to remove stale devices within the namespace.
  cleanup_stale_devices examines the ports in the network to prevent the
  unintentional removal of legitimate devices.

  In a multisegment context, the device created for the first segment
  might be deleted during the second iteration. This occurs because the
  network variable used in the loop does not refer to one shared object,
  so its ports are not updated with the ones created during previous
  iterations.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2051729/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1955008] Re: [FT] Failure of "test_floatingip_mac_bindings"

2024-01-30 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/906474
Committed: 
https://opendev.org/openstack/neutron/commit/64fddf4f2d18b134b5cc8348049a3c4f10f69a28
Submitter: "Zuul (22348)"
Branch: master

commit 64fddf4f2d18b134b5cc8348049a3c4f10f69a28
Author: Rodolfo Alonso Hernandez 
Date:   Fri Jan 19 11:41:15 2024 +

[OVN][FT] Retry in case of timeout when executing "ovsdb-client".

The shell command "ovsdb-client", in the functional tests, is prone to
timeouts. This patch adds a tenacity decorator and sets the command
timeout to 3 seconds, that should be more than enough to retrieve one
single register.

Closes-Bug: #1955008
Change-Id: I38626835ca809cc3f2894e5f81fab55cf3f40071
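The idea of the patch can be sketched with a stdlib-only retry helper (the real change uses the tenacity library; the decorator and function names here are illustrative, not the actual patch):

```python
# Sketch: retry a flaky shell command a few times, treating subprocess
# timeouts as retryable, in the spirit of the tenacity decorator the
# patch adds around "ovsdb-client".

import functools
import subprocess
import time

def retry_on_timeout(attempts=3, delay=0.1):
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for i in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except subprocess.TimeoutExpired:
                    if i == attempts - 1:
                        raise        # out of retries: propagate
                    time.sleep(delay)
        return wrapper
    return deco

@retry_on_timeout()
def run_ovsdb_client(cmd):
    # timeout=3 mirrors the 3-second command timeout set by the patch
    return subprocess.check_output(cmd, timeout=3)
```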


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1955008

Title:
  [FT] Failure of "test_floatingip_mac_bindings"

Status in neutron:
  Fix Released

Bug description:
  Failure of functional test "test_floatingip_mac_bindings".

  Logs:
  
https://d77e4dc62d62e32415c2-4170471c7bb0f477055c0cecff564bc8.ssl.cf2.rackcdn.com/821271/3/check/neutron-
  functional-with-uwsgi/d2e2877/testr_results.html

  Snippet: https://paste.opendev.org/show/811716/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1955008/+subscriptions




[Yahoo-eng-team] [Bug 2051108] Re: Support for the "bring your own keys" approach for Cinder

2024-01-30 Thread Dan Smith
** Also affects: cinder
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2051108

Title:
  Support for the "bring your own keys" approach for Cinder

Status in Cinder:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  Cinder currently lacks an API to create a volume with a predefined
  (e.g. already stored in Barbican) encryption key. This feature would be
  useful for use cases where end users should be able to store keys that
  are later used to encrypt volumes.

  The work flow would be as follows:
  1. The end user creates a new key and stores it in OpenStack Barbican.
  2. The user requests a new volume with volume type "LUKS" and gives an
  "encryption_reference_key_id" (or just "key_id").
  3. Internally the key is copied (like in
  volume_utils.clone_encryption_key_()) and a new "encryption_key_id" is
  created.
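Step 3 could be sketched as follows; the key store is a stand-in dict and the names (clone_encryption_key, encryption_key_id) illustrate the proposed behavior, not an existing Cinder or Barbican API:

```python
# Hypothetical sketch of cloning a user-provided key so the new volume
# owns an independent secret, instead of sharing the user's original.

import uuid

def clone_encryption_key(key_store, source_key_id):
    """Copy the referenced key and return the new encryption_key_id."""
    secret = key_store[source_key_id]   # read the user-provided key
    new_id = str(uuid.uuid4())          # id for the volume-owned copy
    key_store[new_id] = bytes(secret)   # store an independent copy
    return new_id
```

The clone step matters because deleting the volume (and its key) must not destroy the key the user stored in Barbican.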

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2051108/+subscriptions




[Yahoo-eng-team] [Bug 2051244] Re: Documentation of Ceph auth caps for RBD clients used by Cinder / Glance / Nova is missing or inconsistent

2024-01-30 Thread Christian Rohmann
** Also affects: ceph
   Importance: Undecided
   Status: New

** Also affects: openstack-ansible
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2051244

Title:
  Documentation of Ceph auth caps for RBD clients used by Cinder /
  Glance / Nova is missing or inconsistent

Status in Ceph:
  New
Status in Cinder:
  New
Status in Glance:
  New
Status in glance_store:
  New
Status in OpenStack Compute (nova):
  New
Status in openstack-ansible:
  New

Bug description:
  This bug originates from my post to the openstack-discuss ML - 
https://lists.openstack.org/archives/list/openstack-disc...@lists.openstack.org/thread/E3VYY24HUGBNH7626ALOGZMJRVX5VOSZ/
  which was discussed at a cinder-weekly 
(https://meetings.opendev.org/meetings/cinder/2024/cinder.2024-01-24-14.01.log.html#l-43).

  In short: there seem to be inconsistencies in the correct and required
  Ceph auth permissions for the RBD clients in Cinder, Glance and also Nova.
  While it's nice to have the various deployment tools like openstack-ansible
  ([4]) or the charms ([5]) do it somewhat "properly",
  first and foremost this needs to be properly documented in the source
  documentation of Glance, and also Cinder and Nova for that matter.

  And achieving this is what this bug report is intended to do.
  The proposed steps are ...

   * determine and discuss the correct caps (least privileges, caps via 
profiles where possible, ...)
   * update the documentation / install guides and the devstack code. Those 
should all serve as references for the correct way of doing things.
   * write an upgrade bullet point to release notes for Caracal, to have 
operators check and align their caps
   * spread the word / open bugs for the deployment tools for them to update 
their config / code accordingly
   * send a PR to have Ceph update their docs


  The long story about the various Ceph (RBD) clients and their uses within
  Glance, Cinder and Nova:

  
  1) Glance

  First there was a simple issue reported for Glance [3].

  When Glance is requested to delete an image it will check if this image has 
dependent children, see 
https://opendev.org/openstack/glance_store/src/commit/6f5011d1f05c99894fb8b909d33ad23a20bf83a9/glance_store/_drivers/rbd.py#L459.
  The children of Glance images usually are (Cinder) volumes, which therefore 
live in a different RBD pool, "volumes". If such children do exist, a 500 
error is thrown by the Glance API.

  Manually using the RBD client shows the same error:

  > # rbd -n client.glance -k /etc/ceph/ceph.client.glance.keyring -p images 
children $IMAGE_ID
  >
  > 2023-12-13T16:51:48.131+ 7f198cf4e640 -1 librbd::image::OpenRequest: 
failed to retrieve name: (1) Operation not permitted
  > 2023-12-13T16:51:48.131+ 7f198d74f640 -1 librbd::ImageState: 
0x5639fdd5af60 failed to open image: (1) Operation not permitted
  > rbd: listing children failed: (1) Operation not permitted
  > 2023-12-13T16:51:48.131+ 7f1990c474c0 -1 librbd::api::Image: 
list_descendants: failed to open descendant b7078ed7ace50d from pool 
instances:(1) Operation not permitted

  So it's a permission error. Following either the documentation of Glance [1] 
or Ceph [2] on configuring the ceph auth caps there is no mention of granting 
anything towards the volume pool to Glance.
  So this is what I currently have configured:

  > client.cinder
  > key: REDACTED
  > caps: [mgr] profile rbd pool=volumes, profile rbd-read-only 
pool=images
  > caps: [mon] profile rbd
  > caps: [osd] profile rbd pool=volumes, profile rbd-read-only 
pool=images
  >
  > client.glance
  > key: REDACTED
  > caps: [mgr] profile rbd pool=images
  > caps: [mon] profile rbd
  > caps: [osd] profile rbd pool=images
  >
  > client.nova
  > key: REDACTED
  > caps: [mgr] profile rbd pool=instances, profile rbd pool=images
  > caps: [mon] profile rbd
  > caps: [osd] profile rbd pool=instances, profile rbd pool=images
  >

  When granting the glance client e.g. "rbd-read-only" to the volumes pool via:
  >
  > # ceph auth caps client.glance mon 'profile rbd' osd 'profile rbd 
pool=images, profile rbd-read-only pool=volumes' mgr 'profile rbd pool=images, 
profile rbd-read-only pool=volumes'
  >
  the error is gone.
  This is the wrong approach though, as was established during the
  discussion on the ML:

  
  a) Commit [10] introduced the method "_snapshot_has_external_reference" in
  the Yoga release to fix [11]. The commit message also briefly states:

  NOTE: To check this dependency glance osd needs 'read' access to
  cinder and nova side RBD pool.

  but there is zero mention of this requirement in the release notes for
  Yoga, only for glance_store [13]. Also this (temporary, Yoga only) 
requirement to grant 

[Yahoo-eng-team] [Bug 2045889] Re: [OVN] ML2/OVN mech driver does not set the OVS bridge name in the port VIF details

2024-01-30 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/903494
Committed: 
https://opendev.org/openstack/neutron/commit/baaf240ce3f7802fe1431cc13913b9d93fc7f742
Submitter: "Zuul (22348)"
Branch: master

commit baaf240ce3f7802fe1431cc13913b9d93fc7f742
Author: Rodolfo Alonso Hernandez 
Date:   Sat Aug 19 05:11:55 2023 +

[OVN] Add the bridge name and datapath type to the port VIF details

Same as in ML2/OVS, the ML2/OVN mechanism driver adds to the port
VIF details dictionary the OVS bridge the port is connected to
and the integration bridge datapath type.

Closes-Bug: #2045889
Change-Id: Ifda46c42b9506449a58fbaf312cc71c72d9cf2df
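The fields the patch adds can be illustrated with a small sketch (field names taken from the ML2/OVS example quoted in the bug description; the function itself is hypothetical, not the driver's API):

```python
# Sketch of the port VIF details dictionary the ML2/OVN mech driver now
# populates, matching what ML2/OVS already reported.

def build_vif_details(bridge_name="br-int", datapath_type="system"):
    return {
        "connectivity": "l2",
        "bridge_name": bridge_name,      # previously missing for OVN
        "datapath_type": datapath_type,  # previously missing for OVN
    }
```

Nova's os-vif library reads bridge_name and datapath_type from this dictionary when plugging the port.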


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2045889

Title:
  [OVN] ML2/OVN mech driver does not set the OVS bridge name in the port
  VIF details

Status in neutron:
  Fix Released

Bug description:
  The ML2/OVN mech driver does not set the OVS bridge name nor the datapath 
type in the Neutron DB port VIF details. Example of an ML2/OVS port:
  """
  binding_vif_details: bound_drivers.0='openvswitch', bridge_name='br-int', 
connectivity='l2', datapath_type='system', ovs_hybrid_plug='False', 
port_filter='True'
  """

  Missing parameters:
  * bridge_name
  * datapath_type

  This information is needed by Nova (os-vif library)

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2045889/+subscriptions




[Yahoo-eng-team] [Bug 2051690] [NEW] when removing net for agent dnsmasq constantly tries to restart

2024-01-30 Thread Sahid Orentino
Public bug reported:

When removing a network from the agent, dnsmasq constantly tries to
revive.

This has been observed when using multisegment networks. The external
process monitor is not properly unregistered for that service.

This is because the correct helper to get the process identifier is not
used when unregistering.
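The failure mode can be sketched in a few lines (names are illustrative, not the actual neutron ExternalProcessMonitor API): the monitor revives anything still registered, so an unregister call that derives a different identifier than register did leaves a stale entry behind.

```python
# Hypothetical model: register/unregister must compute the same key,
# otherwise the entry survives and the process keeps being "revived".

class ProcessMonitor:
    def __init__(self):
        self.monitored = {}
        self.revived = 0

    def register(self, uuid, service):
        self.monitored[(uuid, service)] = True

    def unregister(self, uuid, service):
        self.monitored.pop((uuid, service), None)

    def check(self):
        # pretend every still-monitored process died and was restarted
        self.revived += len(self.monitored)

monitor = ProcessMonitor()
monitor.register("net-1", "dnsmasq")
# Bug: unregister uses the wrong helper, producing a different key,
# so the dnsmasq entry is never removed.
monitor.unregister("net-1-segment-0", "dnsmasq")
monitor.check()
```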

** Affects: neutron
 Importance: Undecided
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2051690

Title:
  when removing net for agent dnsmasq constantly tries to restart

Status in neutron:
  In Progress

Bug description:
  When removing a network from the agent, dnsmasq constantly tries to
  revive.

  This has been observed when using multisegment networks. The external
  process monitor is not properly unregistered for that service.

  This is because the correct helper to get the process identifier is
  not used when unregistering.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2051690/+subscriptions




[Yahoo-eng-team] [Bug 2051685] [NEW] After repeat of incomplete migration nova applies wrong (status=error) migration context in update_available_resource periodic job

2024-01-30 Thread Bence Romsics
Public bug reported:

The original problem observed in a downstream deployment was overcommit
on dedicated PCPUs and a CPUPinningInvalid exception breaking the
update_available_resource periodic job.

The following reproduction is not an end-to-end reproduction, but I hope
I can demonstrate where things go wrong.

The environment is a multi-node devstack:
devstack0 - all-in-one
devstack0a - compute

Nova is backed by libvirt/qemu/kvm.

devstack 6b0f055b
nova on devstack0 39f560d673
nova on devstack0a a72f7eaac7
libvirt 8.0.0-1ubuntu7.8
qemu 1:6.2+dfsg-2ubuntu6.16
linux 5.15.0-91-generic

# Clean up if not the first run.
openstack server list -f value -c ID | xargs -r openstack server delete --wait
openstack volume list --status available -f value -c ID | xargs -r openstack volume delete

# Create a server on devstack0.
openstack flavor create cirros256-pinned --public --vcpus 1 --ram 256 --disk 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
openstack server create --flavor cirros256-pinned --image cirros-0.6.2-x86_64-disk --boot-from-volume 1 --nic net-id=private --availability-zone :devstack0 vm0 --wait

# Start a live migration to devstack0a, but simulate a failure. In my
# environment a complete live migration takes around 20 seconds. Using 'sleep 3'
# it usually breaks in the 'preparing' status.
# As far as I understand other kinds of migration (like cold migration) are
# also affected.
openstack server migrate --live-migration vm0 --wait & sleep 2 ; ssh devstack0a sudo systemctl stop devstack@n-cpu

$ openstack server migration list --server vm0 --sort-column 'Created At'
  Id:             33
  UUID:           c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9
  Source Node:    devstack0
  Dest Node:      devstack0a
  Source Compute: devstack0
  Dest Compute:   devstack0a
  Dest Host:      192.168.122.79
  Status:         preparing
  Server UUID:    a2b43180-8ad9-4c12-ad47-12b8dd7a7384
  Old Flavor:     11
  New Flavor:     11
  Type:           live-migration
  Created At:     2024-01-29T12:41:40.00
  Updated At:     2024-01-29T12:41:42.00

# After some timeout (around 60 s) the migration goes to 'error' status.
$ openstack server migration list --server vm0 --sort-column 'Created At'
  Id:             33
  UUID:           c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9
  Source Node:    devstack0
  Dest Node:      devstack0a
  Source Compute: devstack0
  Dest Compute:   devstack0a
  Dest Host:      192.168.122.79
  Status:         error
  Server UUID:    a2b43180-8ad9-4c12-ad47-12b8dd7a7384
  Old Flavor:     11
  New Flavor:     11
  Type:           live-migration
  Created At:     2024-01-29T12:41:40.00
  Updated At:     2024-01-29T12:42:42.00

# Wait before restarting n-cpu on devstack0a. I don't think I fully understand
# the factors determining whether the migration finally ends up in failed or
# in error status. Currently it seems to me that if I restart n-cpu too
# quickly, the migration goes to the failed state right after restart. B