Hi Lei,

On 3/18/2025 7:06 AM, Lei Yang wrote:
On Tue, Mar 18, 2025 at 10:15 AM Jason Wang <[email protected]> wrote:
On Tue, Mar 18, 2025 at 9:55 AM Lei Yang <[email protected]> wrote:
Hi Jonah

I tested this series with the vhost_vdpa device based on a Mellanox
ConnectX-6 Dx NIC and hit a host kernel crash. The problem is easier to
reproduce in the device hotplug/unplug scenario.
For the core dump messages, please see the attachment.
FW version:
#  flint -d 0000:0d:00.0 q |grep Version
FW Version:            22.44.1036
Product Version:       22.44.1036
The trace looks more like an mlx5e driver bug rather than a vDPA one?

[ 3256.256707] Call Trace:
[ 3256.256708]  <IRQ>
[ 3256.256709]  ? show_trace_log_lvl+0x1c4/0x2df
[ 3256.256714]  ? show_trace_log_lvl+0x1c4/0x2df
[ 3256.256715]  ? __build_skb+0x4a/0x60
[ 3256.256719]  ? __die_body.cold+0x8/0xd
[ 3256.256720]  ? die_addr+0x39/0x60
[ 3256.256725]  ? exc_general_protection+0x1ec/0x420
[ 3256.256729]  ? asm_exc_general_protection+0x22/0x30
[ 3256.256736]  ? __build_skb_around+0x8c/0xf0
[ 3256.256738]  __build_skb+0x4a/0x60
[ 3256.256740]  build_skb+0x11/0xa0
[ 3256.256743]  mlx5e_skb_from_cqe_mpwrq_linear+0x156/0x280 [mlx5_core]
[ 3256.256872]  mlx5e_handle_rx_cqe_mpwrq_rep+0xcb/0x1e0 [mlx5_core]
[ 3256.256964]  mlx5e_rx_cq_process_basic_cqe_comp+0x39f/0x3c0 [mlx5_core]
[ 3256.257053]  mlx5e_poll_rx_cq+0x3a/0xc0 [mlx5_core]
[ 3256.257139]  mlx5e_napi_poll+0xe2/0x710 [mlx5_core]
[ 3256.257226]  __napi_poll+0x29/0x170
[ 3256.257229]  net_rx_action+0x29c/0x370
[ 3256.257231]  handle_softirqs+0xce/0x270
[ 3256.257236]  __irq_exit_rcu+0xa3/0xc0
[ 3256.257238]  common_interrupt+0x80/0xa0

Hi Jason

Which kernel tree did you use? Can you please try net.git?
I used the latest 9.6 downstream kernel and upstream qemu (with this
series of patches applied) to test this scenario.
First, based on my test results this bug is related to this series of
patches; the conclusion is based on the following results (all tests
use the above-mentioned NIC driver):
Case 1: downstream kernel + downstream qemu-kvm  -  pass
Case 2: downstream kernel + upstream qemu (without this series of
patches)  -  pass
Case 3: downstream kernel + upstream qemu (with this series of
patches)  -  failed, reproduce rate 100%
As Dragos replied earlier, the firmware was already in a bogus state before the panic, which I suspect also has something to do with various bugs in the downstream kernel. You will have to apply the 3 patches to the downstream kernel before you kick off the relevant tests again. Please pay special attention to which specific command or step triggers the unhealthy report from the firmware, and let us know if you still run into any of them.

In addition, since you seem to be testing the device hot plug and unplug use cases, the latest qemu should have the related fixes below [1][2]; in case they are somehow missing, that could also leave the firmware in a bad state to some extent. Just FYI.

[1] db0d4017f9b9 ("net: parameterize the removing client from nc list")
[2] e7891c575fb2 ("net: move backend cleanup to NIC cleanup")

Thanks,
-Siwei

Then I also tried to test it with the net.git tree, but the host hits a
kernel panic when rebooting after compiling it. For the call trace info,
please see the following messages:
[    9.902851] No filesystem could mount root, tried:
[    9.902851]
[    9.909248] Kernel panic - not syncing: VFS: Unable to mount root
fs on "/dev/mapper/rhel_dell--per760--12-root" or unknown-block(0,0)
[    9.921335] CPU: 16 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc6+ #3
[    9.928398] Hardware name: Dell Inc. PowerEdge R760/0NH8MJ, BIOS
1.3.2 03/28/2023
[    9.935876] Call Trace:
[    9.938332]  <TASK>
[    9.940436]  panic+0x356/0x380
[    9.943513]  mount_root_generic+0x2e7/0x300
[    9.947717]  prepare_namespace+0x65/0x270
[    9.951731]  kernel_init_freeable+0x2e2/0x310
[    9.956105]  ? __pfx_kernel_init+0x10/0x10
[    9.960221]  kernel_init+0x16/0x1d0
[    9.963715]  ret_from_fork+0x2d/0x50
[    9.967303]  ? __pfx_kernel_init+0x10/0x10
[    9.971404]  ret_from_fork_asm+0x1a/0x30
[    9.975348]  </TASK>
[    9.977555] Kernel Offset: 0xc00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   10.101881] ---[ end Kernel panic - not syncing: VFS: Unable to
mount root fs on "/dev/mapper/rhel_dell--per760--12-root" or
unknown-block(0,0) ]---

# git log -1
commit 4003c9e78778e93188a09d6043a74f7154449d43 (HEAD -> main,
origin/main, origin/HEAD)
Merge: 8f7617f45009 2409fa66e29a
Author: Linus Torvalds <[email protected]>
Date:   Thu Mar 13 07:58:48 2025 -1000

     Merge tag 'net-6.14-rc7' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net


Thanks

Lei
Thanks

Best Regards
Lei

On Fri, Mar 14, 2025 at 9:04 PM Jonah Palmer <[email protected]> wrote:
Memory operations like pinning may take a lot of time at the
destination.  Currently they are done after the source of the migration is
stopped and before the workload is resumed at the destination.  This is a
period where neither traffic can flow nor the VM workload can continue
(downtime).

We can do better, as we know the memory layout of the guest RAM at the
destination from the moment all devices are initialized.  Moving that
operation earlier allows QEMU to communicate the maps to the kernel
while the workload is still running on the source, so Linux can start
mapping them.
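
To illustrate the idea, here is a rough sketch only (not the actual code
from this series); the listener_registered flag and the helper name are
assumptions based on the patch titles below:

    /* Sketch, as it could look in hw/virtio/vhost-vdpa.c: register the
     * memory listener at init time instead of at device start.  The
     * region_add callbacks then issue the IOTLB maps (and therefore the
     * memory pinning) while the source is still running, i.e. outside
     * the downtime window. */
    static void vhost_vdpa_maybe_register_listener(struct vhost_vdpa *v)
    {
        if (v->listener_registered) {
            return;                 /* guest RAM is already being mapped */
        }
        memory_listener_register(&v->listener, &address_space_memory);
        v->listener_registered = true;
    }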

As a small drawback, there is a period during initialization where QEMU
cannot respond to QMP etc.  In our testing, this time is about
0.2 seconds.  It may be further reduced (or increased) depending on the
vdpa driver and the platform hardware, and it is dominated by the cost
of memory pinning.

This roughly matches the time that we move out of the so-called downtime
window.  The downtime is measured by checking the trace timestamps from the
moment the source suspends the device to the moment the destination starts
the eighth and last virtqueue pair.  For a 39G guest, it goes from ~2.2526
secs to ~2.0949 secs, a reduction of roughly 0.16 secs.

Future directions on top of this series may include moving more things ahead
of the migration time, like setting DRIVER_OK or performing actual iterative
migration of virtio-net devices.

Comments are welcome.

This series takes a different approach from series [1]. As the title no
longer reflects the changes, please refer to the previous one for the
series history.

This series is based on [2] and must be applied on top of it.

[Jonah Palmer]
This series was rebased after [3] was pulled in, as [3] was a prerequisite
fix for this series.

v3:
---
* Rebase

v2:
---
* Move the memory listener registration to the vhost_vdpa_set_owner function.
* Move the iova_tree allocation to net_vhost_vdpa_init.

v1 at https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02136.html.

[1] 
https://patchwork.kernel.org/project/qemu-devel/cover/[email protected]/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg05910.html
[3] 
https://lore.kernel.org/qemu-devel/[email protected]/

Eugenio Pérez (7):
   vdpa: check for iova tree initialized at net_client_start
   vdpa: reorder vhost_vdpa_set_backend_cap
   vdpa: set backend capabilities at vhost_vdpa_init
   vdpa: add listener_registered
   vdpa: reorder listener assignment
   vdpa: move iova_tree allocation to net_vhost_vdpa_init
   vdpa: move memory listener register to vhost_vdpa_init

  hw/virtio/vhost-vdpa.c         | 98 ++++++++++++++++++++++------------
  include/hw/virtio/vhost-vdpa.h | 22 +++++++-
  net/vhost-vdpa.c               | 34 ++----------
  3 files changed, 88 insertions(+), 66 deletions(-)

--
2.43.5



