Hi Lei,
On 03/20, Lei Yang wrote:
> Hi Dragos, Si-Wei
>
> 1. I applied [0] [1] [2] to the downstream kernel and then tested
> hotplug/unplug; this bug still exists.
>
> [0] 35025963326e ("vdpa/mlx5: Fix suboptimal range on iotlb iteration")
> [1] 29ce8b8a4fa7 ("vdpa/mlx5: Fix PA offset with unaligned starting iotlb
> map")
> [2] a6097e0a54a5 ("vdpa/mlx5: Fix oversized null mkey longer than 32bit")
>
> 2. Si-Wei mentioned that two patches [1] [2] have been merged into the qemu
> master branch, so based on the test result they do not help fix this
> bug.
> [1] db0d4017f9b9 ("net: parameterize the removing client from nc list")
> [2] e7891c575fb2 ("net: move backend cleanup to NIC cleanup")
>
> 3. I found that the step that triggers the unhealthy report from the
> firmware is simply booting up the guest when using the qemu with the
> current patches. The host dmesg prints the unhealthy info immediately
> after the guest boots.
>
Did you set the locked memory to unlimited beforehand (ulimit -l unlimited)?
This could also be the cause of the FW issue.
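
In case it helps, below is a minimal sketch (not part of this series, just an
illustration) of what that limit means for the process: vhost-vdpa pins guest
pages, which are accounted against RLIMIT_MEMLOCK, so the limit has to be
raised for the QEMU process. The program itself is illustrative only.

/* Hypothetical sketch: the programmatic equivalent of "ulimit -l unlimited".
 * vhost-vdpa pins guest pages, accounted against RLIMIT_MEMLOCK; a too-low
 * limit can make the pinning fail. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };

    /* Raise the locked-memory limit for this process and its children
     * (e.g. a QEMU launched from this wrapper); needs sufficient privileges. */
    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("setrlimit(RLIMIT_MEMLOCK)");
        return 1;
    }
    printf("RLIMIT_MEMLOCK raised to unlimited\n");
    return 0;
}
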
Thanks,
Dragos
> Thanks
> Lei
>
>
> On Wed, Mar 19, 2025 at 8:14 AM Si-Wei Liu <[email protected]> wrote:
> >
> > Hi Lei,
> >
> > On 3/18/2025 7:06 AM, Lei Yang wrote:
> > > On Tue, Mar 18, 2025 at 10:15 AM Jason Wang <[email protected]> wrote:
> > >> On Tue, Mar 18, 2025 at 9:55 AM Lei Yang <[email protected]> wrote:
> > >>> Hi Jonah
> > >>>
> > >>> I tested this series with the vhost_vdpa device based on a Mellanox
> > >>> ConnectX-6 DX NIC and hit a host kernel crash. This problem is
> > >>> easier to reproduce in the device hotplug/unplug scenario.
> > >>> For the core dump messages please see the attachment.
> > >>> FW version:
> > >>> # flint -d 0000:0d:00.0 q |grep Version
> > >>> FW Version: 22.44.1036
> > >>> Product Version: 22.44.1036
> > >> The trace looks more like an mlx5e driver bug rather than vDPA?
> > >>
> > >> [ 3256.256707] Call Trace:
> > >> [ 3256.256708] <IRQ>
> > >> [ 3256.256709] ? show_trace_log_lvl+0x1c4/0x2df
> > >> [ 3256.256714] ? show_trace_log_lvl+0x1c4/0x2df
> > >> [ 3256.256715] ? __build_skb+0x4a/0x60
> > >> [ 3256.256719] ? __die_body.cold+0x8/0xd
> > >> [ 3256.256720] ? die_addr+0x39/0x60
> > >> [ 3256.256725] ? exc_general_protection+0x1ec/0x420
> > >> [ 3256.256729] ? asm_exc_general_protection+0x22/0x30
> > >> [ 3256.256736] ? __build_skb_around+0x8c/0xf0
> > >> [ 3256.256738] __build_skb+0x4a/0x60
> > >> [ 3256.256740] build_skb+0x11/0xa0
> > >> [ 3256.256743] mlx5e_skb_from_cqe_mpwrq_linear+0x156/0x280 [mlx5_core]
> > >> [ 3256.256872] mlx5e_handle_rx_cqe_mpwrq_rep+0xcb/0x1e0 [mlx5_core]
> > >> [ 3256.256964] mlx5e_rx_cq_process_basic_cqe_comp+0x39f/0x3c0
> > >> [mlx5_core]
> > >> [ 3256.257053] mlx5e_poll_rx_cq+0x3a/0xc0 [mlx5_core]
> > >> [ 3256.257139] mlx5e_napi_poll+0xe2/0x710 [mlx5_core]
> > >> [ 3256.257226] __napi_poll+0x29/0x170
> > >> [ 3256.257229] net_rx_action+0x29c/0x370
> > >> [ 3256.257231] handle_softirqs+0xce/0x270
> > >> [ 3256.257236] __irq_exit_rcu+0xa3/0xc0
> > >> [ 3256.257238] common_interrupt+0x80/0xa0
> > >>
> > > Hi Jason
> > >
> > >> Which kernel tree did you use? Can you please try net.git?
> > > I used the latest 9.6 downstream kernel and upstream qemu (with this
> > > series of patches applied) to test this scenario.
> > > First, based on my test results this bug is related to this series of
> > > patches. The conclusion is based on the following test results (all
> > > of them use the above-mentioned NIC driver):
> > > Case 1: downstream kernel + downstream qemu-kvm - pass
> > > Case 2: downstream kernel + upstream qemu (without this
> > > series of patches) - pass
> > > Case 3: downstream kernel + upstream qemu (with this
> > > series of patches) - failed, reproduction ratio 100%
> > Just as Dragos replied earlier, the firmware was already in a bogus
> > state before the panic, and I suspect that has something to do with
> > various bugs in the downstream kernel. You have to apply the 3 patches
> > to the downstream kernel before you kick off the relevant tests
> > again. Please pay special attention to which specific command or step
> > triggers the unhealthy report from the firmware, and let us know if you
> > still run into any of them.
> >
> > In addition, you seem to be testing the device hotplug and unplug
> > use cases, for which the latest qemu should have the related fixes
> > below [1][2]; in case they are somehow missing, that might also end up
> > leaving the firmware in a bad state to some extent. Just FYI.
> >
> > [1] db0d4017f9b9 ("net: parameterize the removing client from nc list")
> > [2] e7891c575fb2 ("net: move backend cleanup to NIC cleanup")
> >
> > Thanks,
> > -Siwei
> > >
> > > Then I also tried to test it with the net.git tree, but the host hits
> > > a kernel panic when rebooting after compiling it. For the
> > > call trace info please review the following messages:
> > > [ 9.902851] No filesystem could mount root, tried:
> > > [ 9.902851]
> > > [ 9.909248] Kernel panic - not syncing: VFS: Unable to mount root
> > > fs on "/dev/mapper/rhel_dell--per760--12-root" or unknown-block(0,0)
> > > [ 9.921335] CPU: 16 UID: 0 PID: 1 Comm: swapper/0 Not tainted
> > > 6.14.0-rc6+ #3
> > > [ 9.928398] Hardware name: Dell Inc. PowerEdge R760/0NH8MJ, BIOS
> > > 1.3.2 03/28/2023
> > > [ 9.935876] Call Trace:
> > > [ 9.938332] <TASK>
> > > [ 9.940436] panic+0x356/0x380
> > > [ 9.943513] mount_root_generic+0x2e7/0x300
> > > [ 9.947717] prepare_namespace+0x65/0x270
> > > [ 9.951731] kernel_init_freeable+0x2e2/0x310
> > > [ 9.956105] ? __pfx_kernel_init+0x10/0x10
> > > [ 9.960221] kernel_init+0x16/0x1d0
> > > [ 9.963715] ret_from_fork+0x2d/0x50
> > > [ 9.967303] ? __pfx_kernel_init+0x10/0x10
> > > [ 9.971404] ret_from_fork_asm+0x1a/0x30
> > > [ 9.975348] </TASK>
> > > [ 9.977555] Kernel Offset: 0xc00000 from 0xffffffff81000000
> > > (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > > [ 10.101881] ---[ end Kernel panic - not syncing: VFS: Unable to
> > > mount root fs on "/dev/mapper/rhel_dell--per760--12-root" or
> > > unknown-block(0,0) ]---
> > >
> > > # git log -1
> > > commit 4003c9e78778e93188a09d6043a74f7154449d43 (HEAD -> main,
> > > origin/main, origin/HEAD)
> > > Merge: 8f7617f45009 2409fa66e29a
> > > Author: Linus Torvalds <[email protected]>
> > > Date: Thu Mar 13 07:58:48 2025 -1000
> > >
> > > Merge tag 'net-6.14-rc7' of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> > >
> > >
> > > Thanks
> > >
> > > Lei
> > >> Thanks
> > >>
> > >>> Best Regards
> > >>> Lei
> > >>>
> > >>> On Fri, Mar 14, 2025 at 9:04 PM Jonah Palmer <[email protected]>
> > >>> wrote:
> > >>>> Current memory operations like pinning may take a lot of time at the
> > >>>> destination. Currently they are done after the source of the
> > >>>> migration is
> > >>>> stopped, and before the workload is resumed at the destination. This
> > >>>> is a
> > >>>> period where neither traffic can flow, nor the VM workload can continue
> > >>>> (downtime).
> > >>>>
> > >>>> We can do better, as we know the memory layout of the guest RAM at the
> > >>>> destination from the moment that all devices are initialized. So
> > >>>> moving that operation earlier allows QEMU to communicate the maps to
> > >>>> the kernel while the workload is still running on the source, so Linux
> > >>>> can start mapping them.
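
(For reference, a rough sketch of the kind of map message QEMU sends to the
vhost-vdpa device for each guest RAM section, so the kernel can start the
mapping/pinning early. This is not the code from this series, just an
illustration built on the vhost uAPI; vdpa_fd, iova, size and vaddr are
placeholders supplied by the caller.)

/* Rough sketch (not this series' code): describe one guest RAM mapping to
 * the vhost-vdpa kernel side so it can translate and pin it. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/vhost_types.h>

static int vdpa_map_region(int vdpa_fd, uint64_t iova, uint64_t size, void *vaddr)
{
    struct vhost_msg_v2 msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.iova  = iova;                       /* device-visible address */
    msg.iotlb.size  = size;
    msg.iotlb.uaddr = (uint64_t)(uintptr_t)vaddr; /* QEMU virtual address */
    msg.iotlb.perm  = VHOST_ACCESS_RW;
    msg.iotlb.type  = VHOST_IOTLB_UPDATE;

    /* The kernel pins the backing pages while handling this write(). */
    return write(vdpa_fd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg) ? 0 : -1;
}
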
> > >>>>
> > >>>> As a small drawback, there is a period during initialization where QEMU
> > >>>> cannot respond to QMP etc. In some testing, this time is about
> > >>>> 0.2 seconds. This may be further reduced (or increased) depending on the
> > >>>> vdpa driver and the platform hardware, and it is dominated by the cost
> > >>>> of memory pinning.
> > >>>>
> > >>>> This matches the time that we move out of the so-called downtime window.
> > >>>> The downtime is measured by checking the trace timestamps from the moment
> > >>>> the source suspends the device to the moment the destination starts the
> > >>>> eighth and last virtqueue pair. For a 39G guest, it goes from ~2.2526
> > >>>> secs to 2.0949.
> > >>>>
> > >>>> Future directions on top of this series may include moving more things
> > >>>> ahead of the migration time, like setting DRIVER_OK or performing actual
> > >>>> iterative migration of virtio-net devices.
> > >>>>
> > >>>> Comments are welcome.
> > >>>>
> > >>>> This series is a different approach from series [1]. As the title does
> > >>>> not reflect the changes anymore, please refer to the previous one for
> > >>>> the series history.
> > >>>>
> > >>>> This series is based on [2] and must be applied on top of it.
> > >>>>
> > >>>> [Jonah Palmer]
> > >>>> This series was rebased after [3] was pulled in, as [3] was a
> > >>>> prerequisite
> > >>>> fix for this series.
> > >>>>
> > >>>> v3:
> > >>>> ---
> > >>>> * Rebase
> > >>>>
> > >>>> v2:
> > >>>> ---
> > >>>> * Move the memory listener registration to vhost_vdpa_set_owner
> > >>>> function.
> > >>>> * Move the iova_tree allocation to net_vhost_vdpa_init.
> > >>>>
> > >>>> v1 at
> > >>>> https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.
> > >>>>
> > >>>> [1]
> > >>>> https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
> > >>>> [2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
> > >>>> [3]
> > >>>> https://lore.kernel.org/qemu-devel/[email protected]/
> > >>>>
> > >>>> Eugenio Pérez (7):
> > >>>> vdpa: check for iova tree initialized at net_client_start
> > >>>> vdpa: reorder vhost_vdpa_set_backend_cap
> > >>>> vdpa: set backend capabilities at vhost_vdpa_init
> > >>>> vdpa: add listener_registered
> > >>>> vdpa: reorder listener assignment
> > >>>> vdpa: move iova_tree allocation to net_vhost_vdpa_init
> > >>>> vdpa: move memory listener register to vhost_vdpa_init
> > >>>>
> > >>>> hw/virtio/vhost-vdpa.c         | 98 ++++++++++++++++++++++------------
> > >>>> include/hw/virtio/vhost-vdpa.h | 22 +++++++-
> > >>>> net/vhost-vdpa.c               | 34 ++----------
> > >>>> 3 files changed, 88 insertions(+), 66 deletions(-)
> > >>>>
> > >>>> --
> > >>>> 2.43.5
> > >>>>
> > >>>>
> >
>