[Yahoo-eng-team] [Bug 2051729] [NEW] issue dhcp cleaning stale devices process when enable action
Public bug reported:

When the driver "enable" action is called, the cleanup_stale_devices function is invoked to remove stale devices within the namespace. cleanup_stale_devices examines the ports of the network to prevent the unintentional removal of legitimate devices. In a multisegment context, the device created in the first iteration may be deleted during the second iteration. This happens because the network variable used in the loop is not a single reference to the same object, so its ports are not updated with the ones created in previous iterations.

** Affects: neutron
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2051729

Title:
  issue dhcp cleaning stale devices process when enable action

Status in neutron:
  New

Bug description:
  When the driver "enable" action is called, the cleanup_stale_devices
  function is invoked to remove stale devices within the namespace.
  cleanup_stale_devices examines the ports of the network to prevent the
  unintentional removal of legitimate devices. In a multisegment context,
  the device created in the first iteration may be deleted during the
  second iteration. This happens because the network variable used in
  the loop is not a single reference to the same object, so its ports
  are not updated with the ones created in previous iterations.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2051729/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
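[Editor's note] A minimal, hypothetical Python sketch of the loop bug described above; the Network class, cleanup_stale_devices helper and "tap" device naming are illustrative stand-ins, not the actual neutron code.

```python
class Network:
    """Illustrative stand-in for the DHCP agent's network object."""
    def __init__(self, ports):
        self.ports = list(ports)

def expected_devices(network):
    # Devices that belong to the network's currently known ports.
    return {"tap" + port for port in network.ports}

def cleanup_stale_devices(network, existing_devices):
    # Keep only devices that match a port of the given network.
    keep = expected_devices(network)
    return [dev for dev in existing_devices if dev in keep]

segments = (["port-seg1"], ["port-seg2"])

# Buggy flow: each segment iteration works on a *fresh* network snapshot,
# so the device plugged for segment 1 looks stale during the segment 2 pass.
devices = []
for segment_ports in segments:
    network = Network(segment_ports)          # new object, stale port list
    devices = cleanup_stale_devices(network, devices)
    devices.append("tap" + segment_ports[0])
buggy_result = devices                        # segment-1 device was removed

# Fixed flow: one shared network object whose port list accumulates.
shared = Network([])
devices = []
for segment_ports in segments:
    shared.ports.extend(segment_ports)
    devices = cleanup_stale_devices(shared, devices)
    devices.append("tap" + segment_ports[0])
fixed_result = devices                        # both devices survive
```

With the fresh-snapshot loop only the last segment's device survives, while sharing one network object preserves devices from earlier iterations.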
[Yahoo-eng-team] [Bug 1955008] Re: [FT] Failure of "test_floatingip_mac_bindings"
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/906474
Committed: https://opendev.org/openstack/neutron/commit/64fddf4f2d18b134b5cc8348049a3c4f10f69a28
Submitter: "Zuul (22348)"
Branch:    master

commit 64fddf4f2d18b134b5cc8348049a3c4f10f69a28
Author: Rodolfo Alonso Hernandez
Date:   Fri Jan 19 11:41:15 2024 +0000

    [OVN][FT] Retry in case of timeout when executing "ovsdb-client".

    The shell command "ovsdb-client", in the functional tests, is prone
    to timeouts. This patch adds a tenacity decorator and sets the
    command timeout to 3 seconds, that should be more than enough to
    retrieve one single register.

    Closes-Bug: #1955008
    Change-Id: I38626835ca809cc3f2894e5f81fab55cf3f40071

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1955008

Title:
  [FT] Failure of "test_floatingip_mac_bindings"

Status in neutron:
  Fix Released

Bug description:
  Failure of functional test "test_floatingip_mac_bindings".

  Logs: https://d77e4dc62d62e32415c2-4170471c7bb0f477055c0cecff564bc8.ssl.cf2.rackcdn.com/821271/3/check/neutron-functional-with-uwsgi/d2e2877/testr_results.html

  Snippet: https://paste.opendev.org/show/811716/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1955008/+subscriptions
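[Editor's note] The fix wraps the "ovsdb-client" call in a tenacity retry decorator. As a rough illustration, here is a stdlib-only stand-in for that retry-on-timeout behaviour; the decorator and the simulated command below are assumptions for demonstration, not the actual neutron test code.

```python
import functools
import time

def retry_on_timeout(attempts=3, wait=0.0):
    """Minimal stand-in for a tenacity-style retry: re-run the wrapped
    callable when it raises TimeoutError, up to `attempts` times."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TimeoutError:
                    if attempt == attempts:
                        raise          # out of attempts: propagate
                    time.sleep(wait)
        return wrapper
    return deco

calls = {"n": 0}

@retry_on_timeout(attempts=3, wait=0.0)
def run_ovsdb_client():
    # Simulated flaky command: times out twice, then returns a register.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("ovsdb-client timed out")
    return "register"

result = run_ovsdb_client()
```

The real patch relies on the tenacity library plus a 3-second command timeout; this sketch only shows the retry shape.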
[Yahoo-eng-team] [Bug 2051108] Re: Support for the "bring your own keys" approach for Cinder
** Also affects: cinder
   Importance: Undecided
       Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2051108

Title:
  Support for the "bring your own keys" approach for Cinder

Status in Cinder:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========

  Cinder currently lacks support for an API to create a volume with a
  predefined (e.g. already stored in Barbican) encryption key. This
  feature would be useful for use cases where end users should be able
  to store the keys later used to encrypt volumes.

  The workflow would be as follows:

  1. The end user creates a new key and stores it in OpenStack Barbican.
  2. The user requests a new volume with volume type "LUKS" and passes an
     "encryption_reference_key_id" (or just "key_id").
  3. Internally the key is copied (like in
     volume_utils.clone_encryption_key_()) and a new "encryption_key_id"
     is assigned to the volume.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2051108/+subscriptions
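[Editor's note] A toy, stdlib-only sketch of the requested workflow. The KeyManager class here is a stand-in for Barbican, and create_volume_with_user_key is a hypothetical helper mirroring the idea attributed to volume_utils.clone_encryption_key_(): the user-supplied key is copied internally, and the volume only ever references the copy, so deleting the volume never touches the user's original secret.

```python
import copy
import uuid

class KeyManager:
    """In-memory stand-in for Barbican's secret store."""
    def __init__(self):
        self._secrets = {}

    def store(self, secret):
        key_id = str(uuid.uuid4())
        self._secrets[key_id] = secret
        return key_id

    def get(self, key_id):
        return self._secrets[key_id]

def create_volume_with_user_key(km, user_key_id):
    # Step 3 of the workflow: clone the referenced key and attach the
    # *new* id to the volume, leaving the user's key untouched.
    cloned_id = km.store(copy.copy(km.get(user_key_id)))
    return {"volume_type": "LUKS", "encryption_key_id": cloned_id}

km = KeyManager()
user_key = km.store(b"user-managed-secret")      # step 1: user stores a key
vol = create_volume_with_user_key(km, user_key)  # steps 2-3
```

The design point of the clone is lifecycle isolation: volume deletion can safely delete its own encryption_key_id without destroying the user-managed original.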
[Yahoo-eng-team] [Bug 2051244] Re: Documentation of Ceph auth caps for RBD clients used by Cinder / Glance / Nova is missing or inconsistent
** Also affects: ceph
   Importance: Undecided
       Status: New

** Also affects: openstack-ansible
   Importance: Undecided
       Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2051244

Title:
  Documentation of Ceph auth caps for RBD clients used by Cinder /
  Glance / Nova is missing or inconsistent

Status in Ceph:
  New
Status in Cinder:
  New
Status in Glance:
  New
Status in glance_store:
  New
Status in OpenStack Compute (nova):
  New
Status in openstack-ansible:
  New

Bug description:
  This bug originates from my post to the openstack-discuss ML -
  https://lists.openstack.org/archives/list/openstack-disc...@lists.openstack.org/thread/E3VYY24HUGBNH7626ALOGZMJRVX5VOSZ/
  - which was discussed at a Cinder weekly meeting
  (https://meetings.opendev.org/meetings/cinder/2024/cinder.2024-01-24-14.01.log.html#l-43).

  In short: there seem to be inconsistencies in the correct and required
  Ceph authx permissions for the RBD clients in Cinder, Glance and also
  Nova. While it's nice to have the various deployment tools like
  openstack-ansible ([4]) or charm ([5]) do it somewhat "properly",
  first and foremost this needs to be properly documented in the source
  documentation of Glance, and also Cinder and Nova for that matter.
  Achieving this is what this bug report is intended to do. The proposed
  steps are:

  * determine and discuss the correct caps (least privileges, caps via
    profiles where possible, ...)
  * update the documentation / install guides and the devstack code;
    those should all serve as references for the correct way of doing
    things.
  * write an upgrade bullet point in the release notes for Caracal, to
    have operators check and align their caps
  * spread the word / open bugs for the deployment tools for them to
    update their config / code accordingly
  * send a PR to have Ceph update their docs

  The long story about the various Ceph (RBD) clients and their uses
  within Glance, Cinder and Nova:

  1) Glance

  First there was a simple issue reported for Glance [3]. When Glance is
  requested to delete an image, it will check whether this image has
  dependent children, see
  https://opendev.org/openstack/glance_store/src/commit/6f5011d1f05c99894fb8b909d33ad23a20bf83a9/glance_store/_drivers/rbd.py#L459.
  The children of Glance images usually are (Cinder) volumes, which
  therefore live in a different RBD pool, "volumes". But if such
  children do exist, a 500 error is thrown by the Glance API. Manually
  using the RBD client shows the same error:

  > # rbd -n client.glance -k /etc/ceph/ceph.client.glance.keyring -p images children $IMAGE_ID
  >
  > 2023-12-13T16:51:48.131+0000 7f198cf4e640 -1 librbd::image::OpenRequest: failed to retrieve name: (1) Operation not permitted
  > 2023-12-13T16:51:48.131+0000 7f198d74f640 -1 librbd::ImageState: 0x5639fdd5af60 failed to open image: (1) Operation not permitted
  > rbd: listing children failed: (1) Operation not permitted
  > 2023-12-13T16:51:48.131+0000 7f1990c474c0 -1 librbd::api::Image: list_descendants: failed to open descendant b7078ed7ace50d from pool instances: (1) Operation not permitted

  So it's a permission error. Following either the documentation of
  Glance [1] or Ceph [2] on configuring the Ceph auth caps, there is no
  mention of granting anything towards the volumes pool to Glance.
  So this is what I currently have configured:

  > client.cinder
  >     key: REDACTED
  >     caps: [mgr] profile rbd pool=volumes, profile rbd-read-only pool=images
  >     caps: [mon] profile rbd
  >     caps: [osd] profile rbd pool=volumes, profile rbd-read-only pool=images
  >
  > client.glance
  >     key: REDACTED
  >     caps: [mgr] profile rbd pool=images
  >     caps: [mon] profile rbd
  >     caps: [osd] profile rbd pool=images
  >
  > client.nova
  >     key: REDACTED
  >     caps: [mgr] profile rbd pool=instances, profile rbd pool=images
  >     caps: [mon] profile rbd
  >     caps: [osd] profile rbd pool=instances, profile rbd pool=images

  When granting the glance client e.g. "rbd-read-only" on the volumes
  pool via:

  > # ceph auth caps client.glance mon 'profile rbd' osd 'profile rbd pool=images, profile rbd-read-only pool=volumes' mgr 'profile rbd pool=images, profile rbd-read-only pool=volumes'

  the error is gone. This is the wrong approach though, which was
  established during the discussion on the ML:

  a) Commit [10] introduced the method "_snapshot_has_external_reference"
  in the Yoga release to fix [11]. The commit message also briefly
  states:

  > NOTE: To check this dependency glance osd needs 'read' access to
  > cinder and nova side RBD pool.

  But there is zero mention of this requirement in the release notes
  for Yoga, only for glance_store [13]. Also this (temporary, Yoga-only)
  requirement to grant
[Yahoo-eng-team] [Bug 2045889] Re: [OVN] ML2/OVN mech driver does not set the OVS bridge name in the port VIF details
Reviewed:  https://review.opendev.org/c/openstack/neutron/+/903494
Committed: https://opendev.org/openstack/neutron/commit/baaf240ce3f7802fe1431cc13913b9d93fc7f742
Submitter: "Zuul (22348)"
Branch:    master

commit baaf240ce3f7802fe1431cc13913b9d93fc7f742
Author: Rodolfo Alonso Hernandez
Date:   Sat Aug 19 05:11:55 2023 +0000

    [OVN] Add the bridge name and datapath type to the port VIF details

    Same as in ML2/OVS, the ML2/OVN mechanism driver adds to the port
    VIF details dictionary the OVS bridge the port is connected to and
    the integration bridge datapath type.

    Closes-Bug: #2045889
    Change-Id: Ifda46c42b9506449a58fbaf312cc71c72d9cf2df

** Changed in: neutron
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2045889

Title:
  [OVN] ML2/OVN mech driver does not set the OVS bridge name in the
  port VIF details

Status in neutron:
  Fix Released

Bug description:
  The ML2/OVN mech driver does not set the OVS bridge name nor the
  datapath type in the Neutron DB port VIF details. Example of an
  ML2/OVS port:

  """
  binding_vif_details: bound_drivers.0='openvswitch', bridge_name='br-int', connectivity='l2', datapath_type='system', ovs_hybrid_plug='False', port_filter='True'
  """

  Missing parameters:
  * bridge_name
  * datapath_type

  This information is needed by Nova (the os-vif library).

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2045889/+subscriptions
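[Editor's note] A small sketch of what the fix contributes to the port VIF details dictionary. build_vif_details is a hypothetical helper, not the actual mech-driver code; the keys and default values mirror the ML2/OVS example quoted in the bug description.

```python
def build_vif_details(base_details, bridge_name="br-int",
                      datapath_type="system"):
    """Return a copy of the VIF details with the two previously
    missing parameters filled in."""
    details = dict(base_details)
    details["bridge_name"] = bridge_name      # OVS bridge the port attaches to
    details["datapath_type"] = datapath_type  # integration bridge datapath type
    return details

# What the ML2/OVN driver produced before the fix (no bridge info):
ovn_details = {"connectivity": "l2", "port_filter": "True"}

fixed = build_vif_details(ovn_details)
```

Nova's os-vif plugging path reads bridge_name and datapath_type from this dictionary, which is why their absence in ML2/OVN port bindings mattered.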
[Yahoo-eng-team] [Bug 2051690] [NEW] when removing net for agent dnsmasq constantly tries to restart
Public bug reported:

When removing a network from the agent, dnsmasq constantly tries to revive. This has been observed when using multisegment networks. The external process monitor is not properly unregistered for that service, because the correct helper to get the process identifier is not used when unregistering.

** Affects: neutron
     Importance: Undecided
         Status: In Progress

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2051690

Title:
  when removing net for agent dnsmasq constantly tries to restart

Status in neutron:
  In Progress

Bug description:
  When removing a network from the agent, dnsmasq constantly tries to
  revive. This has been observed when using multisegment networks. The
  external process monitor is not properly unregistered for that
  service, because the correct helper to get the process identifier is
  not used when unregistering.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2051690/+subscriptions
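[Editor's note] A toy model of the described mismatch: the process is registered under a key built by one helper but unregistered with a differently built key, so the monitor never forgets the entry and keeps reviving dnsmasq. All names here are illustrative, not the actual neutron ProcessMonitor API.

```python
def monitored_key(network_id, service):
    """The one helper that should be used for both register and
    unregister (illustrative)."""
    return f"{network_id}-{service}"

class ProcessMonitor:
    def __init__(self):
        self._monitored = set()   # keys of processes being revived

    def register(self, network_id, service):
        self._monitored.add(monitored_key(network_id, service))

    def unregister_buggy(self, network_id, service):
        # Bug: a different key format, so nothing is ever removed.
        self._monitored.discard(f"{service}:{network_id}")

    def unregister_fixed(self, network_id, service):
        # Fix: reuse the same helper that built the key at registration.
        self._monitored.discard(monitored_key(network_id, service))

pm = ProcessMonitor()
pm.register("net-1", "dnsmasq")

pm.unregister_buggy("net-1", "dnsmasq")
still_monitored = bool(pm._monitored)   # entry remains: dnsmasq is revived

pm.unregister_fixed("net-1", "dnsmasq")
cleaned = not pm._monitored             # entry gone: no more restarts
```

The fix amounts to making unregistration derive the process identifier with the same helper used at registration time.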
[Yahoo-eng-team] [Bug 2051685] [NEW] After repeat of incomplete migration nova applies wrong (status=error) migration context in update_available_resource periodic job
Public bug reported:

The original problem observed in a downstream deployment was an overcommit of dedicated PCPUs and a CPUPinningInvalid exception breaking the update_available_resource periodic job. The following reproduction is not an end-to-end reproduction, but I hope I can demonstrate where things go wrong.

The environment is a multi-node devstack:
  devstack0  - all-in-one
  devstack0a - compute

Nova is backed by libvirt/qemu/kvm.

  devstack 6b0f055b
  nova on devstack0  39f560d673
  nova on devstack0a a72f7eaac7
  libvirt 8.0.0-1ubuntu7.8
  qemu 1:6.2+dfsg-2ubuntu6.16
  linux 5.15.0-91-generic

# Clean up if not the first run.
openstack server list -f value -c ID | xargs -r openstack server delete --wait
openstack volume list --status available -f value -c ID | xargs -r openstack volume delete

# Create a server on devstack0.
openstack flavor create cirros256-pinned --public --vcpus 1 --ram 256 --disk 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
openstack server create --flavor cirros256-pinned --image cirros-0.6.2-x86_64-disk --boot-from-volume 1 --nic net-id=private --availability-zone :devstack0 vm0 --wait

# Start a live migration to devstack0a, but simulate a failure. In my
# environment a complete live migration takes around 20 seconds. Using
# 'sleep 3' it usually breaks in the 'preparing' status.
# As far as I understand, other kinds of migration (like cold migration)
# are also affected.
openstack server migrate --live-migration vm0 --wait &
sleep 2 ; ssh devstack0a sudo systemctl stop devstack@n-cpu

$ openstack server migration list --server vm0 --sort-column 'Created At'
(table reconstructed from wrapped CLI output)
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At             | Updated At             |
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | preparing | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.00 | 2024-01-29T12:41:42.00 |

# After some timeout (around 60 s) the migration goes to 'error' status.
$ openstack server migration list --server vm0 --sort-column 'Created At'
(table reconstructed from wrapped CLI output)
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status | Server UUID                          | Old Flavor | New Flavor | Type           | Created At             | Updated At             |
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error  | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.00 | 2024-01-29T12:42:42.00 |

# Wait before restarting n-cpu on devstack0a. I don't think I fully
# understand the factors of when the migration finally ends up in failed
# or in error status. Currently it seems to me that if I restart n-cpu
# too quickly, the migration goes to the failed state right after
# restart. B