> From: Jan Kiszka <[email protected]>
>
> This resolves the following splat and lock-up when running with PREEMPT_RT
> enabled on Hyper-V:

Hi Jan,

I'd be interested to know the use case for running an RT kernel on Hyper-V.
Can you give an example?

As far as I know, Hyper-V makes no RT guarantees about scheduling virtual
processors (VPs) for a VM.

Thanks,
Long

>
> [ 415.140818] BUG: scheduling while atomic: stress-ng-iomix/1048/0x00000002
> [ 415.140822] INFO: lockdep is turned off.
> [ 415.140823] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec ghash_clmulni_intel aesni_intel rapl binfmt_misc nls_ascii nls_cp437 vfat fat snd_pcm hyperv_drm snd_timer drm_client_lib drm_shmem_helper snd sg soundcore drm_kms_helper pcspkr hv_balloon hv_utils evdev joydev drm configfs efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vsock vmw_vmci efivarfs autofs4 ext4 crc16 mbcache jbd2 sr_mod sd_mod cdrom hv_storvsc serio_raw hid_generic scsi_transport_fc hid_hyperv scsi_mod hid hv_netvsc hyperv_keyboard scsi_common
> [ 415.140846] Preemption disabled at:
> [ 415.140847] [<ffffffffc0656171>] storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
> [ 415.140854] CPU: 8 UID: 0 PID: 1048 Comm: stress-ng-iomix Not tainted 6.19.0-rc7 #30 PREEMPT_{RT,(full)}
> [ 415.140856] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/04/2024
> [ 415.140857] Call Trace:
> [ 415.140861] <TASK>
> [ 415.140861] ? storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
> [ 415.140863] dump_stack_lvl+0x91/0xb0
> [ 415.140870] __schedule_bug+0x9c/0xc0
> [ 415.140875] __schedule+0xdf6/0x1300
> [ 415.140877] ? rtlock_slowlock_locked+0x56c/0x1980
> [ 415.140879] ? rcu_is_watching+0x12/0x60
> [ 415.140883] schedule_rtlock+0x21/0x40
> [ 415.140885] rtlock_slowlock_locked+0x502/0x1980
> [ 415.140891] rt_spin_lock+0x89/0x1e0
> [ 415.140893] hv_ringbuffer_write+0x87/0x2a0
> [ 415.140899] vmbus_sendpacket_mpb_desc+0xb6/0xe0
> [ 415.140900] ? rcu_is_watching+0x12/0x60
> [ 415.140902] storvsc_queuecommand+0x669/0xbe0 [hv_storvsc]
> [ 415.140904] ? HARDIRQ_verbose+0x10/0x10
> [ 415.140908] ? __rq_qos_issue+0x28/0x40
> [ 415.140911] scsi_queue_rq+0x760/0xd80 [scsi_mod]
> [ 415.140926] __blk_mq_issue_directly+0x4a/0xc0
> [ 415.140928] blk_mq_issue_direct+0x87/0x2b0
> [ 415.140931] blk_mq_dispatch_queue_requests+0x120/0x440
> [ 415.140933] blk_mq_flush_plug_list+0x7a/0x1a0
> [ 415.140935] __blk_flush_plug+0xf4/0x150
> [ 415.140940] __submit_bio+0x2b2/0x5c0
> [ 415.140944] ? submit_bio_noacct_nocheck+0x272/0x360
> [ 415.140946] submit_bio_noacct_nocheck+0x272/0x360
> [ 415.140951] ext4_read_bh_lock+0x3e/0x60 [ext4]
> [ 415.140995] ext4_block_write_begin+0x396/0x650 [ext4]
> [ 415.141018] ? __pfx_ext4_da_get_block_prep+0x10/0x10 [ext4]
> [ 415.141038] ext4_da_write_begin+0x1c4/0x350 [ext4]
> [ 415.141060] generic_perform_write+0x14e/0x2c0
> [ 415.141065] ext4_buffered_write_iter+0x6b/0x120 [ext4]
> [ 415.141083] vfs_write+0x2ca/0x570
> [ 415.141087] ksys_write+0x76/0xf0
> [ 415.141089] do_syscall_64+0x99/0x1490
> [ 415.141093] ? rcu_is_watching+0x12/0x60
> [ 415.141095] ? finish_task_switch.isra.0+0xdf/0x3d0
> [ 415.141097] ? rcu_is_watching+0x12/0x60
> [ 415.141098] ? lock_release+0x1f0/0x2a0
> [ 415.141100] ? rcu_is_watching+0x12/0x60
> [ 415.141101] ? finish_task_switch.isra.0+0xe4/0x3d0
> [ 415.141103] ? rcu_is_watching+0x12/0x60
> [ 415.141104] ? __schedule+0xb34/0x1300
> [ 415.141106] ? hrtimer_try_to_cancel+0x1d/0x170
> [ 415.141109] ? do_nanosleep+0x8b/0x160
> [ 415.141111] ? hrtimer_nanosleep+0x89/0x100
> [ 415.141114] ? __pfx_hrtimer_wakeup+0x10/0x10
> [ 415.141116] ? xfd_validate_state+0x26/0x90
> [ 415.141118] ? rcu_is_watching+0x12/0x60
> [ 415.141120] ? do_syscall_64+0x1e0/0x1490
> [ 415.141121] ? do_syscall_64+0x1e0/0x1490
> [ 415.141123] ? rcu_is_watching+0x12/0x60
> [ 415.141124] ? do_syscall_64+0x1e0/0x1490
> [ 415.141125] ? do_syscall_64+0x1e0/0x1490
> [ 415.141127] ? irqentry_exit+0x140/0x7e0
> [ 415.141129] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> get_cpu() disables preemption, while the spinlock that hv_ringbuffer_write()
> takes is converted to an rt-mutex under PREEMPT_RT.
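
To check my reading of the description above, here is a toy user-space model
of the two call patterns. It is only a sketch and not kernel code: every
model_* name, the two counters, and old_path()/new_path() are made up for
illustration, and the "lock" only checks whether sleeping would be legal.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these counters stand in for per-task kernel state. */
static int preempt_count;     /* > 0: preemption disabled, sleeping is a bug */
static int migrate_disabled;  /* > 0: task pinned to its CPU, may still sleep */
static int current_cpu = 8;   /* arbitrary CPU number for the model */

/* Models get_cpu()/put_cpu(): pin to the CPU by disabling preemption. */
static int model_get_cpu(void) { preempt_count++; return current_cpu; }
static void model_put_cpu(void) { preempt_count--; }

/* Models migrate_disable()/migrate_enable(): pin without going atomic. */
static void model_migrate_disable(void) { migrate_disabled++; }
static void model_migrate_enable(void) { migrate_disabled--; }

/*
 * Models taking a spinlock on PREEMPT_RT, where it is a sleeping lock:
 * returns true only if the caller is allowed to sleep.
 */
static bool model_rt_spin_lock_ok(void) { return preempt_count == 0; }

/* Old pattern: the lock is taken between get_cpu() and put_cpu(). */
static bool old_path(void)
{
	int cpu = model_get_cpu();
	bool ok = model_rt_spin_lock_ok();

	(void)cpu;
	model_put_cpu();
	return ok;
}

/* New pattern: the lock is taken with only migration disabled. */
static bool new_path(void)
{
	int cpu;
	bool ok;

	model_migrate_disable();
	cpu = current_cpu;	/* stays valid while the task is pinned */
	ok = model_rt_spin_lock_ok();
	(void)cpu;
	model_migrate_enable();
	return ok;
}
```

In this model old_path() reports that the lock would be taken in atomic
context, which matches the "scheduling while atomic" splat, while new_path()
does not. The CPU number stays stable in both cases.
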
>
> Signed-off-by: Jan Kiszka <[email protected]>
> ---
>
> This is likely just the tip of an iceberg, see specifically [1], but if you
> never start addressing it, it will continue to crash ships, even if those
> are only on test cruises (we are fully aware that Hyper-V provides no RT
> guarantees for guests). A pragmatic alternative to that would be a simple
>
>     config HYPERV
>         depends on !PREEMPT_RT
>
> Please share your thoughts on whether this fix is worth it, or whether we
> should rather stop looking at the next splats that show up after it. We are
> currently considering threading some of the hv platform IRQs under
> PREEMPT_RT as a potential next step.
>
> TIA!
>
> [1] https://lore.kernel.org/all/20230809-b4-rt_preempt-fix-v1-0-7283bbdc8b14@gmail.com/
>
> drivers/scsi/storvsc_drv.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index b43d876747b7..68c837146b9e 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1855,8 +1855,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>  	cmd_request->payload_sz = payload_sz;
>
>  	/* Invokes the vsc to start an IO */
> -	ret = storvsc_do_io(dev, cmd_request, get_cpu());
> -	put_cpu();
> +	migrate_disable();
> +	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
> +	migrate_enable();
>
>  	if (ret)
>  		scsi_dma_unmap(scmnd);
> --
> 2.51.0