Re: [PATCH] target/i386: Add more features enumerated by CPUID.7.2.EDX
>> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
>> index 85ef7452c0..18ba958f46 100644
>> --- a/target/i386/cpu.c
>> +++ b/target/i386/cpu.c
>> @@ -1148,8 +1148,8 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
>>      [FEAT_7_2_EDX] = {
>>          .type = CPUID_FEATURE_WORD,
>>          .feat_names = {
>> -            NULL, NULL, NULL, NULL,
>> -            NULL, "mcdt-no", NULL, NULL,
>> +            "intel-psfd", "ipred-ctrl", "rrsba-ctrl", "ddpd-u",
>> +            "bhi-ctrl", "mcdt-no", NULL, NULL,
>
>IIUC, these bits depend on "spec-ctrl", which indicates the presence of
>IA32_SPEC_CTRL.
>
>Then I think we'd better add dependencies in feature_dependencies[].

(+ kvm mailing list)

Thanks for pointing that out.

It seems that any of these bits implies the presence of IA32_SPEC_CTRL.
According to SDM vol4, chapter 2, table 2.2, the 'Comment' column for the
IA32_SPEC_CTRL MSR states:

	If any one of the enumeration conditions for defined bit field
	positions holds.

So, it might be more appropriate to fix KVM's handling of the IA32_SPEC_CTRL
MSR (i.e., guest_has_spec_ctrl_msr()). What do you think?

>
>-Zhao
>
>>              NULL, NULL, NULL, NULL,
>>              NULL, NULL, NULL, NULL,
>>              NULL, NULL, NULL, NULL,
>> --
>> 2.46.1
>>
>>
[PATCH] target/i386: Add more features enumerated by CPUID.7.2.EDX
The following 5 bits in CPUID.7.2.EDX are supported by KVM. Add support for
them in QEMU. Each of them indicates that certain bits of IA32_SPEC_CTRL are
supported. Those bits can control CPU speculation behavior, which can be used
to defend against side-channel attacks.

bit0: intel-psfd
  If 1, indicates bit 7 of the IA32_SPEC_CTRL MSR is supported. Bit 7 of
  this MSR disables the Fast Store Forwarding Predictor without disabling
  Speculative Store Bypass.

bit1: ipred-ctrl
  If 1, indicates bits 3 and 4 of the IA32_SPEC_CTRL MSR are supported.
  Bit 3 of this MSR enables IPRED_DIS control for CPL3. Bit 4 of this MSR
  enables IPRED_DIS control for CPL0/1/2.

bit2: rrsba-ctrl
  If 1, indicates bits 5 and 6 of the IA32_SPEC_CTRL MSR are supported.
  Bit 5 of this MSR disables RRSBA behavior for CPL3. Bit 6 of this MSR
  disables RRSBA behavior for CPL0/1/2.

bit3: ddpd-u
  If 1, indicates bit 8 of the IA32_SPEC_CTRL MSR is supported. Bit 8 of
  this MSR disables the Data Dependent Prefetcher.

bit4: bhi-ctrl
  If 1, indicates bit 10 of the IA32_SPEC_CTRL MSR is supported. Bit 10 of
  this MSR enables BHI_DIS_S behavior.

Signed-off-by: Chao Gao
---
 target/i386/cpu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 85ef7452c0..18ba958f46 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1148,8 +1148,8 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
     [FEAT_7_2_EDX] = {
         .type = CPUID_FEATURE_WORD,
         .feat_names = {
-            NULL, NULL, NULL, NULL,
-            NULL, "mcdt-no", NULL, NULL,
+            "intel-psfd", "ipred-ctrl", "rrsba-ctrl", "ddpd-u",
+            "bhi-ctrl", "mcdt-no", NULL, NULL,
             NULL, NULL, NULL, NULL,
             NULL, NULL, NULL, NULL,
             NULL, NULL, NULL, NULL,
--
2.46.1
[PATCH] util/aio: Defer disabling poll mode as long as possible
When we measure FIO read performance (cache=writethrough, bs=4k, iodepth=64)
in VMs, ~80K/s notifications (e.g., EPT_MISCONFIG) are observed from guest to
qemu. It turns out those frequent notifications are caused by interference
from worker threads.

Worker threads queue bottom halves after completing IO requests. Pending
bottom halves may cause either aio_compute_timeout() to zero the timeout and
pass it to try_poll_mode(), or run_poll_handlers() to return no progress after
noticing pending aio_notify() events. Both cause run_poll_handlers() to call
poll_set_started(false) to disable poll mode. However, in both cases, as the
timeout is already zeroed, the event loop (i.e., aio_poll()) just processes
bottom halves and then starts the next event loop iteration. So, disabling
poll mode has no value but leads to unnecessary notifications from the guest.

To minimize unnecessary notifications from the guest, defer disabling poll
mode to when the event loop is about to be blocked.

With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
cache=writethrough) in VMs increases from 330K/s to 413K/s IOPS.

Suggested-by: Stefan Hajnoczi
Signed-off-by: Chao Gao
---
 util/aio-posix.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..6cc6256d53 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -585,18 +585,16 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
     max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
 
     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
+        /*
+         * Enable poll mode. It pairs with the poll_set_started() in
+         * aio_poll() which disables poll mode.
+         */
         poll_set_started(ctx, ready_list, true);
 
         if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
             return true;
         }
     }
-
-    if (poll_set_started(ctx, ready_list, false)) {
-        *timeout = 0;
-        return true;
-    }
-
     return false;
 }
 
@@ -657,6 +655,17 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * system call---a single round of run_poll_handlers_once suffices.
      */
     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
+        /*
+         * Disable poll mode. Poll mode should be disabled before the call
+         * of ctx->fdmon_ops->wait() so that the guest's notification can
+         * wake up IO threads when some work becomes pending. It is
+         * essential to avoid hangs or unnecessary latency.
+         */
+        if (poll_set_started(ctx, &ready_list, false)) {
+            timeout = 0;
+            progress = true;
+        }
+
         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
     }
 
-- 
2.25.1
Re: [RFC v1] util/aio: Keep notification disabled as much as possible
On Thu, Jul 07, 2022 at 10:04:23AM +0100, Stefan Hajnoczi wrote:
>
>Does this patch solve the issue? The idea is to defer
>poll_set_started(false) for as long as possible.

Good idea! It is straightforward.

>
>diff --git a/util/aio-posix.c b/util/aio-posix.c
>index 731f3826c0..536f8b2534 100644
>--- a/util/aio-posix.c
>+++ b/util/aio-posix.c
>@@ -591,12 +591,6 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>             return true;
>         }
>     }
>-
>-    if (poll_set_started(ctx, ready_list, false)) {
>-        *timeout = 0;
>-        return true;
>-    }
>-
>     return false;
> }
>
>@@ -657,6 +651,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      * system call---a single round of run_poll_handlers_once suffices.
>      */
>     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
>+        if (poll_set_started(ctx, &ready_list, false)) {
>+            timeout = 0;
>+            progress = true;

In this case, is it ok to skip the call of ->wait() below? If yes, maybe put
the call in the "else" path.

>+        }
>+
>         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
>     }
>

Anyway,

Reviewed-by: Chao Gao

And my tests show your change works well. The number of notifications is
significantly reduced from ~80K/s to tens.

Tested-by: Chao Gao
Re: [RFC v1] util/aio: Keep notification disabled as much as possible
On Wed, Jul 06, 2022 at 12:59:29PM +0100, Stefan Hajnoczi wrote:
>On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
>> When measuring FIO read performance (cache=writethrough, bs=4k,
>> iodepth=64) in VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG)
>> from guest to qemu.
>
>It's not clear to me what caused the frequent poll_set_started(ctx,
>false) calls and whether this patch is correct. I have posted some

Me neither. That's why the patch was marked RFC.

>questions below about the nature of this issue.
>
>If ctx->fdmon_ops->wait() is called while polling is still enabled then
>hangs or unnecessary latency can occur. For example, consider an fd
>handler that temporarily suppresses fd activity between poll start/end.
>The thread would be blocked in ->wait() and the fd will never become
>readable. Even if a hang doesn't occur because there is a timeout, there
>would be extra latency because the fd doesn't become readable and we
>have to wait for the timeout to expire so we can poll again. So we must
>be sure it's safe to leave polling enabled across the event loop and I'm
>not sure if this patch is okay.

Thanks for the explanation.

In aio_poll():

    if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
        ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
    }

if @timeout is zero, then ctx->fdmon_ops->wait() won't be called. In case #2
and #3 below, it is guaranteed that @timeout is zero after try_poll_mode()
returns. So, I think it is safe to keep polling enabled for these two cases.

>
>>
>> Currently, poll_set_started(ctx, false) is called in try_poll_mode() to
>> enable virtqueue notification in below 4 cases:
>>
>> 1. ctx->poll_ns is 0
>> 2. a zero timeout is passed to try_poll_mode()
>> 3. polling succeeded but reported as no progress
>> 4. polling failed and reported as no progress
>>
>> To minimize unnecessary guest notifications, keep notification disabled
>> when it is possible, i.e., polling is enabled and last polling doesn't
>> fail.
>
>What is the exact definition of polling success/failure?

Polling success: found some events pending.
Polling failure: timeout.

"success/failure" are used because I saw the below comment in
run_poll_handlers_once(), so I thought they were well-known terms.

    /*
     * Polling was successful, exit try_poll_mode immediately
     * to adjust the next polling time.
     */

>
>>
>> Keep notification disabled for case #2 and #3; handle case #2 simply by
>> a call
>
>Did you see case #2 happening often? What was the cause?

I think so. I can add some tracepoints and collect statistics.

IMO (of course, I can be totally wrong), the cause is: when a worker thread
in the thread pool completes an IO request, a bottom half is queued by
worker_thread()->qemu_bh_schedule(). Pending bottom halves lead to
aio_compute_timeout() setting the timeout to zero, and then a zero timeout
is passed to try_poll_mode().

>
>> of run_poll_handlers_once() and for case #3, differentiate
>> successful/failed polling and skip the call of poll_set_started(ctx,
>> false) for successful ones.
>
>This is probably the most interesting case. When polling detects an
>event, that's considered "progress", except for aio_notify() events.
>aio_notify() is an internal event for restarting the event loop when
>something has changed (e.g. fd handlers have been added/removed). I
>wouldn't expect it to interfere with polling frequently since that
>requires another QEMU thread doing something to the AioContext, which
>should be rare.
>
>Was aio_notify() interfering with polling in your case? Do you know why?

Yes, it was. The reason is the same: after finishing IO requests, worker
threads queue bottom halves, during which aio_notify() is called.
[RFC v1] util/aio: Keep notification disabled as much as possible
When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64)
in VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to
qemu.

Currently, poll_set_started(ctx, false) is called in try_poll_mode() to
enable virtqueue notification in below 4 cases:

1. ctx->poll_ns is 0
2. a zero timeout is passed to try_poll_mode()
3. polling succeeded but reported as no progress
4. polling failed and reported as no progress

To minimize unnecessary guest notifications, keep notification disabled when
it is possible, i.e., polling is enabled and last polling doesn't fail.

Keep notification disabled for case #2 and #3; handle case #2 simply by a
call of run_poll_handlers_once() and for case #3, differentiate
successful/failed polling and skip the call of poll_set_started(ctx, false)
for successful ones.

With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
cache=writethrough) in VMs increases from 330K/s to 413K/s IOPS.

Below are changes in VM-exit statistics:

w/o this patch (duration: 10s):

           VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time  Avg time
     EPT_MISCONFIG   797440    99.34%  98.58%    0.39us   57.92us  0.60us ( +-  0.05% )
         MSR_WRITE     3672     0.46%   1.15%    0.88us    4.97us  1.52us ( +-  0.53% )
EXTERNAL_INTERRUPT     1638     0.20%   0.27%    0.59us   11.04us  0.81us ( +-  1.71% )

w/ this patch (duration: 10s):

           VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time  Avg time
         MSR_WRITE     3524    84.11%  87.17%    0.86us    4.43us  1.74us ( +-  0.60% )
EXTERNAL_INTERRUPT      515    12.29%  10.05%    0.64us    8.95us  1.37us ( +-  1.54% )
     EPT_MISCONFIG      151     3.60%   2.79%    0.40us   52.07us  1.30us ( +- 31.44% )

Signed-off-by: Chao Gao
---
 util/aio-posix.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..bd2076145b 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -519,7 +519,7 @@ static bool remove_idle_poll_handlers(AioContext *ctx,
  * Returns: true if progress was made, false otherwise
  */
 static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
-                              int64_t max_ns, int64_t *timeout)
+                              int64_t max_ns, int64_t *timeout, bool *success)
 {
     bool progress;
     int64_t start_time, elapsed_time;
@@ -553,6 +553,8 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
         progress = true;
     }
 
+    *success = !*timeout;
+
     /* If time has passed with no successful polling, adjust *timeout to
      * keep the same ending time.
      */
@@ -577,22 +579,28 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
 static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
                           int64_t *timeout)
 {
-    int64_t max_ns;
+    int64_t max_ns, start_time;
+    bool success = false;
 
     if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
         return false;
     }
 
+    if (!*timeout) {
+        start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+        return run_poll_handlers_once(ctx, ready_list, start_time, timeout);
+    }
+
     max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         poll_set_started(ctx, ready_list, true);
 
-        if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
+        if (run_poll_handlers(ctx, ready_list, max_ns, timeout, &success)) {
             return true;
         }
     }
 
-    if (poll_set_started(ctx, ready_list, false)) {
+    if (!success && poll_set_started(ctx, ready_list, false)) {
         *timeout = 0;
         return true;
     }
-- 
2.25.1
Re: [QEMU PATCH] x86: Set maximum APIC ID to KVM prior to vCPU creation
On Fri, May 20, 2022 at 02:39:28PM +0800, Zeng Guang wrote:
>Specify the maximum possible APIC ID assigned for the current VM session
>prior to the creation of vCPUs. KVM needs to set up a VM-scoped data
>structure indexed by the APIC ID, e.g. the Posted-Interrupt Descriptor
>table, to support Intel IPI virtualization.
>
>It can be achieved by calling KVM_ENABLE_CAP for the KVM_CAP_MAX_VCPU_ID
>capability once KVM has already enabled it. Otherwise, simply prompt
>that KVM doesn't support this capability yet.
>
>Signed-off-by: Zeng Guang
>---
> hw/i386/x86.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
>diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>index 4cf107baea..ff74492325 100644
>--- a/hw/i386/x86.c
>+++ b/hw/i386/x86.c
>@@ -106,7 +106,7 @@ out:
> 
> void x86_cpus_init(X86MachineState *x86ms, int default_cpu_version)
> {
>-    int i;
>+    int i, ret;
>     const CPUArchIdList *possible_cpus;
>     MachineState *ms = MACHINE(x86ms);
>     MachineClass *mc = MACHINE_GET_CLASS(x86ms);
>@@ -123,6 +123,13 @@ void x86_cpus_init(X86MachineState *x86ms, int default_cpu_version)
>      */
>     x86ms->apic_id_limit = x86_cpu_apic_id_from_index(x86ms,
>                                                       ms->smp.max_cpus - 1) + 1;
>+
>+    ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPU_ID,
>+                            0, x86ms->apic_id_limit);
>+    if (ret < 0) {
>+        error_report("kvm: Set max vcpu id not supported: %s",
>+                     strerror(-ret));
>+    }

This piece of code is specific to KVM. Please move it to kvm-all.c and
invoke a wrapper function here.

As the kvm accelerator isn't necessarily enabled, the function call should
be guarded by kvm_enabled(). And I think the error message can be omitted
because the failure doesn't impact functionality; just a few more pages
will be allocated by KVM.
Re: [RFC] KVM / QEMU: Introduce Interface for Querying APICv Info
On Fri, May 20, 2022 at 10:30:40AM +0700, Suthikulpanit, Suravee wrote:
>Hi All,
>
>Currently, we don't have a good way to check whether APICv is active on a
>VM. Normally, for AMD SVM AVIC, users either have to check for trace
>points, or use "perf kvm stat live" to catch AVIC-related #VMEXITs.
>
>For KVM, I would like to propose introducing a new IOCTL interface (i.e.
>KVM_GET_APICV_INFO), where user-space tools (e.g. QEMU monitor) can query
>run-time information of APICv for the VM and vCPUs, such as APICv inhibit
>reason flags.
>
>For QEMU, we can leverage the "info lapic" command, and append the APICv
>information after all the LAPIC register information.
>
>For example:
>
>--- Begin Snippet ---
>(qemu) info lapic 0
>dumping local APIC state for CPU 0
>
>LVT0     0x00010700 active-hi edge  masked        ExtINT (vec 0)
>LVT1     0x0400     active-hi edge                NMI
>LVTPC    0x0001     active-hi edge  masked        Fixed  (vec 0)
>LVTERR   0x00fe     active-hi edge                Fixed  (vec 254)
>LVTTHMR  0x0001     active-hi edge  masked        Fixed  (vec 0)
>LVTT     0x000400ee active-hi edge  tsc-deadline  Fixed  (vec 238)
>Timer    DCR=0x0 (divide by 2) initial_count = 0 current_count = 0
>SPIV     0x01ff APIC enabled, focus=off, spurious vec 255
>ICR      0x00fd physical edge de-assert no-shorthand
>ICR2     0x0005 cpu 5 (X2APIC ID)
>ESR      0x
>ISR      (none)
>IRR      (none)
>
>APR 0x00 TPR 0x00 DFR 0x0f LDR 0x00 PPR 0x00
>
>APICV vm inhibit:   0x10    <-- HERE
>APICV vcpu inhibit: 0       <-- HERE
>
>--- End Snippet ---
>
>Otherwise, we can have an APICv-specific info command (e.g. info apicv).

I think this information can be added to KVM's per-vm/vcpu debugfs. Then no
qemu change is needed.
Re: [PATCH 0/3] Disable vhost device IOTLB if IOMMU is not enabled
On Wed, Aug 04, 2021 at 11:48:00AM +0800, Jason Wang wrote:
>Hi:
>
>We currently try to enable device IOTLB when iommu_platform is
>set. This may lead to unnecessary transactions between qemu and vhost
>when vIOMMU is not used (which is the typical case for the encrypted
>VM).
>
>So this patch tries to use a transport-specific method to detect the
>enabling of vIOMMU and enable the device IOTLB only if vIOMMU is enabled.
>
>Please review.

Tested-by: Chao Gao

Tested with TDX; this series fixes the performance issue we saw in a TD
when vhost was enabled.

Thanks
Chao

>
>Thanks
>
>Jason Wang (3):
>  virtio-bus: introduce iommu_enabled()
>  virtio-pci: implement iommu_enabled()
>  vhost: correctly detect the enabling IOMMU
>
> hw/virtio/vhost.c              |  2 +-
> hw/virtio/virtio-bus.c         | 14 ++++++++++++++
> hw/virtio/virtio-pci.c         | 14 ++++++++++++++
> include/hw/virtio/virtio-bus.h |  4 +++-
> 4 files changed, 32 insertions(+), 2 deletions(-)
>
>-- 
>2.25.1
>
Re: [PATCH] vhost: use large iotlb entry if no IOMMU translation is needed
On Tue, Aug 03, 2021 at 12:43:58PM +0800, Jason Wang wrote:
>
>On 2021/8/3 12:29 PM, Chao Gao wrote:
>> Ping. Could someone help to review this patch?
>>
>> Thanks
>> Chao
>>
>> On Wed, Jul 21, 2021 at 03:54:02PM +0800, Chao Gao wrote:
>> > If guest enables IOMMU_PLATFORM for virtio-net, severe network
>> > performance drop is observed even if there is no IOMMU.
>
>We see such reports internally and we're testing a patch series to
>disable vhost IOTLB in this case.
>
>Will post a patch soon.

OK. Put me in the CC list. I would like to test with TDX to ensure your
patch fixes the performance issue I am facing.

>
>> > And disabling
>> > vhost can mitigate the perf issue. Finally, we found the culprit is
>> > frequent iotlb misses: kernel vhost-net has 2048 entries and each
>> > entry is 4K (qemu uses 4K for i386 if no IOMMU); vhost-net can cache
>> > translations for up to 8M (i.e. 4K*2048) IOVAs. If guest uses >8M
>> > memory for DMA, there are some iotlb misses.
>> >
>> > If there is no IOMMU or IOMMU is disabled or IOMMU works in pass-thru
>> > mode, we can optimistically use large, unaligned iotlb entries instead
>> > of 4K-aligned entries to reduce iotlb pressure.
>
>Instead of introducing new general facilities like the unaligned IOTLB
>entry, I wonder if we could optimize vtd_iommu_translate() to use e.g.
>1G instead?

Using a 1G iotlb entry looks feasible.

>    } else {
>        /* DMAR disabled, passthrough, use 4k-page*/
>        iotlb.iova = addr & VTD_PAGE_MASK_4K;
>        iotlb.translated_addr = addr & VTD_PAGE_MASK_4K;
>        iotlb.addr_mask = ~VTD_PAGE_MASK_4K;
>        iotlb.perm = IOMMU_RW;
>        success = true;
>    }
>
>
>> > Actually, vhost-net
>> > in kernel supports unaligned iotlb entry. The alignment requirement is
>> > imposed by address_space_get_iotlb_entry() and flatview_do_translate().
>
>For the passthrough case, is there any way to detect it and then disable
>the device IOTLB in that case?

Yes, I guess so; qemu knows the presence and status of the iommu.
Currently, in flatview_do_translate(), memory_region_get_iommu() tells
whether a memory region is behind an iommu.

Thanks
Chao
Re: [PATCH] vhost: use large iotlb entry if no IOMMU translation is needed
Ping. Could someone help to review this patch?

Thanks
Chao

On Wed, Jul 21, 2021 at 03:54:02PM +0800, Chao Gao wrote:
>If guest enables IOMMU_PLATFORM for virtio-net, severe network
>performance drop is observed even if there is no IOMMU. And disabling
>vhost can mitigate the perf issue. Finally, we found the culprit is
>frequent iotlb misses: kernel vhost-net has 2048 entries and each
>entry is 4K (qemu uses 4K for i386 if no IOMMU); vhost-net can cache
>translations for up to 8M (i.e. 4K*2048) IOVAs. If guest uses >8M
>memory for DMA, there are some iotlb misses.
>
>If there is no IOMMU or IOMMU is disabled or IOMMU works in pass-thru
>mode, we can optimistically use large, unaligned iotlb entries instead
>of 4K-aligned entries to reduce iotlb pressure. Actually, vhost-net
>in kernel supports unaligned iotlb entry. The alignment requirement is
>imposed by address_space_get_iotlb_entry() and flatview_do_translate().
>
>Introduce IOMMUTLBEntryUnaligned which has a @len field to specify the
>iotlb size to abstract a generic iotlb entry: aligned (original
>IOMMUTLBEntry) and unaligned entry. flatview_do_translate() now
>returns a magic value in @page_mask_out if no IOMMU translation is
>needed. Then, address_space_get_iotlb_entry() can accordingly return a
>page-aligned iotlb entry or the whole memory region section where the
>iova resides as a large iotlb entry.
> >Signed-off-by: Chao Gao >--- > hw/virtio/vhost.c | 6 +++--- > include/exec/memory.h | 16 ++-- > softmmu/physmem.c | 37 + > 3 files changed, 46 insertions(+), 13 deletions(-) > >diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c >index e8f85a5d2d..6745caa129 100644 >--- a/hw/virtio/vhost.c >+++ b/hw/virtio/vhost.c >@@ -1010,7 +1010,7 @@ static int vhost_memory_region_lookup(struct vhost_dev >*hdev, > > int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write) > { >-IOMMUTLBEntry iotlb; >+IOMMUTLBEntryUnaligned iotlb; > uint64_t uaddr, len; > int ret = -EFAULT; > >@@ -1031,8 +1031,8 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, >uint64_t iova, int write) > goto out; > } > >-len = MIN(iotlb.addr_mask + 1, len); >-iova = iova & ~iotlb.addr_mask; >+len = MIN(iotlb.len, len); >+iova = iotlb.iova; > > ret = vhost_backend_update_device_iotlb(dev, iova, uaddr, > len, iotlb.perm); >diff --git a/include/exec/memory.h b/include/exec/memory.h >index c3d417d317..3f04e8fe88 100644 >--- a/include/exec/memory.h >+++ b/include/exec/memory.h >@@ -94,6 +94,7 @@ struct MemoryRegionSection { > }; > > typedef struct IOMMUTLBEntry IOMMUTLBEntry; >+typedef struct IOMMUTLBEntryUnaligned IOMMUTLBEntryUnaligned; > > /* See address_space_translate: bit 0 is read, bit 1 is write. */ > typedef enum { >@@ -113,6 +114,15 @@ struct IOMMUTLBEntry { > IOMMUAccessFlags perm; > }; > >+/* IOMMUTLBEntryUnaligned may be not page-aligned */ >+struct IOMMUTLBEntryUnaligned { >+AddressSpace*target_as; >+hwaddr iova; >+hwaddr translated_addr; >+hwaddr len; >+IOMMUAccessFlags perm; >+}; >+ > /* > * Bitmap for different IOMMUNotifier capabilities. Each notifier can > * register with one or multiple IOMMU Notifier capability bit(s). >@@ -2653,8 +2663,10 @@ void address_space_cache_destroy(MemoryRegionCache >*cache); > /* address_space_get_iotlb_entry: translate an address into an IOTLB > * entry. Should be called from an RCU critical section. 
> */ >-IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr, >-bool is_write, MemTxAttrs attrs); >+IOMMUTLBEntryUnaligned address_space_get_iotlb_entry(AddressSpace *as, >+ hwaddr addr, >+ bool is_write, >+ MemTxAttrs attrs); > > /* address_space_translate: translate an address range into an address space > * into a MemoryRegion and an address range into that section. Should be >diff --git a/softmmu/physmem.c b/softmmu/physmem.c >index 3c1912a1a0..469963f754 100644 >--- a/softmmu/physmem.c >+++ b/softmmu/physmem.c >@@ -143,6 +143,8 @@ typedef struct subpage_t { > > #define PHYS_SECTION_UNASSIGNED 0 > >+#define PAGE_MASK_NOT_BEHIND_IOMMU ((hwaddr)-1) >+ > static void io_mem_init(void); > static void memory_map_init(void); > static void tcg_log_global_after_sync(MemoryListener *listener); >@@ -470,7 +472,9 @@ unassigned:
[PATCH] vhost: use large iotlb entry if no IOMMU translation is needed
If guest enables IOMMU_PLATFORM for virtio-net, severe network
performance drop is observed even if there is no IOMMU. And disabling
vhost can mitigate the perf issue. Finally, we found the culprit is
frequent iotlb misses: kernel vhost-net has 2048 entries and each
entry is 4K (qemu uses 4K for i386 if no IOMMU); vhost-net can cache
translations for up to 8M (i.e. 4K*2048) IOVAs. If guest uses >8M
memory for DMA, there are some iotlb misses.

If there is no IOMMU or IOMMU is disabled or IOMMU works in pass-thru
mode, we can optimistically use large, unaligned iotlb entries instead
of 4K-aligned entries to reduce iotlb pressure. Actually, vhost-net
in kernel supports unaligned iotlb entry. The alignment requirement is
imposed by address_space_get_iotlb_entry() and flatview_do_translate().

Introduce IOMMUTLBEntryUnaligned which has a @len field to specify the
iotlb size to abstract a generic iotlb entry: aligned (original
IOMMUTLBEntry) and unaligned entry. flatview_do_translate() now
returns a magic value in @page_mask_out if no IOMMU translation is
needed. Then, address_space_get_iotlb_entry() can accordingly return a
page-aligned iotlb entry or the whole memory region section where the
iova resides as a large iotlb entry.
Signed-off-by: Chao Gao
---
 hw/virtio/vhost.c     |  6 +++---
 include/exec/memory.h | 16 ++++++++++++++--
 softmmu/physmem.c     | 37 +++++++++++++++++++++++++++--------
 3 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e8f85a5d2d..6745caa129 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1010,7 +1010,7 @@ static int vhost_memory_region_lookup(struct vhost_dev *hdev,
 
 int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
 {
-    IOMMUTLBEntry iotlb;
+    IOMMUTLBEntryUnaligned iotlb;
     uint64_t uaddr, len;
     int ret = -EFAULT;
 
@@ -1031,8 +1031,8 @@ int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
         goto out;
     }
 
-    len = MIN(iotlb.addr_mask + 1, len);
-    iova = iova & ~iotlb.addr_mask;
+    len = MIN(iotlb.len, len);
+    iova = iotlb.iova;
 
     ret = vhost_backend_update_device_iotlb(dev, iova, uaddr,
                                             len, iotlb.perm);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3d417d317..3f04e8fe88 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -94,6 +94,7 @@ struct MemoryRegionSection {
 };
 
 typedef struct IOMMUTLBEntry IOMMUTLBEntry;
+typedef struct IOMMUTLBEntryUnaligned IOMMUTLBEntryUnaligned;
 
 /* See address_space_translate: bit 0 is read, bit 1 is write. */
 typedef enum {
@@ -113,6 +114,15 @@ struct IOMMUTLBEntry {
     IOMMUAccessFlags perm;
 };
 
+/* IOMMUTLBEntryUnaligned may be not page-aligned */
+struct IOMMUTLBEntryUnaligned {
+    AddressSpace    *target_as;
+    hwaddr           iova;
+    hwaddr           translated_addr;
+    hwaddr           len;
+    IOMMUAccessFlags perm;
+};
+
 /*
  * Bitmap for different IOMMUNotifier capabilities. Each notifier can
  * register with one or multiple IOMMU Notifier capability bit(s).
@@ -2653,8 +2663,10 @@ void address_space_cache_destroy(MemoryRegionCache *cache);
 /* address_space_get_iotlb_entry: translate an address into an IOTLB
  * entry. Should be called from an RCU critical section.
  */
-IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
-                                            bool is_write, MemTxAttrs attrs);
+IOMMUTLBEntryUnaligned address_space_get_iotlb_entry(AddressSpace *as,
+                                                     hwaddr addr,
+                                                     bool is_write,
+                                                     MemTxAttrs attrs);
 
 /* address_space_translate: translate an address range into an address space
  * into a MemoryRegion and an address range into that section. Should be
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3c1912a1a0..469963f754 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -143,6 +143,8 @@ typedef struct subpage_t {
 
 #define PHYS_SECTION_UNASSIGNED 0
 
+#define PAGE_MASK_NOT_BEHIND_IOMMU ((hwaddr)-1)
+
 static void io_mem_init(void);
 static void memory_map_init(void);
 static void tcg_log_global_after_sync(MemoryListener *listener);
@@ -470,7 +472,9 @@ unassigned:
  * @page_mask_out: page mask for the translated address. This
  *                 should only be meaningful for IOMMU translated
  *                 addresses, since there may be huge pages that this bit
- *                 would tell. It can be @NULL if we don't care about it.
+ *                 would tell. If the returned memory region section isn't
+ *                 behind an IOMMU, PAGE_MASK_NOT_BEHIND_IOMMU is returned.
+ *                 It can be @NULL if we don't care about it.
 * @is_write: whether the translation operation is for write
 * @is_mmio: whether
Re: [Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability
On Thu, Aug 29, 2019 at 12:21:11PM +0200, Roger Pau Monné wrote:
>On Thu, Aug 29, 2019 at 05:02:27PM +0800, Chao Gao wrote:
>> Currently, for a HVM on Xen, no reset method is virtualized. So in a VM's
>> perspective, assigned devices cannot be reset. But some devices rely on PCI
>> reset to recover from hardware hangs. When being assigned to a VM, those
>> devices cannot be reset and won't work any longer if a hardware hang occurs.
>> We have to reboot VM to trigger PCI reset on host to recover the device.
>>
>> This patch exposes FLR capability to VMs if the assigned device can be
>> reset on host. When VM initiates an FLR to a device, qemu cleans up the
>> device state (including disabling of intx and/or MSI and unmapping BARs
>> from guest, deleting emulated registers), then initiates PCI reset through
>> the 'reset' knob under the device's sysfs, and finally initializes the
>> device again.
>
>I think you likely need to deassign the device from the VM, perform
>the reset, and then assign the device again, so that there's no Xen
>internal state carried over prior to the reset?

Yes. It is the safest way. But here I want to present the feature as FLR
(such that the device driver in the guest can issue PCI reset whenever
needed and no change is needed to the device driver).

Current device deassignment notifies the guest that the device is going to
be removed. It is not the standard PCI reset. Is it possible to make the
guest unaware of the device deassignment to emulate a standard PCI reset?
In my mind, we can expose do_pci_remove/add to qemu or rewrite them in qemu
(but don't remove the device from the guest's PCI hierarchy).

Do you think it is the right direction?

Thanks
Chao
Re: [Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability
On Thu, Aug 29, 2019 at 12:03:44PM +0200, Jan Beulich wrote:
>On 29.08.2019 11:02, Chao Gao wrote:
>> Currently, for a HVM on Xen, no reset method is virtualized. So in a VM's
>> perspective, assigned devices cannot be reset. But some devices rely on PCI
>> reset to recover from hardware hangs. When being assigned to a VM, those
>> devices cannot be reset and won't work any longer if a hardware hang occurs.
>> We have to reboot VM to trigger PCI reset on host to recover the device.
>
>Did you consider a hot-unplug, reset (by host), hot-plug cycle instead?

Yes, I considered this means. But it needs the host to initiate this
action. However, when a device needs reset is determined by the device
driver in the VM. So in practice, the VM still needs a way to notify the
host to do unplug/reset/plug. As the standard FLR capability can meet the
requirement, I don't try to invent one.

>
>> +static int xen_pt_devctl_reg_write(XenPCIPassthroughState *s,
>> +                                   XenPTReg *cfg_entry, uint16_t *val,
>> +                                   uint16_t dev_value, uint16_t valid_mask)
>> +{
>> +    if (s->real_device.is_resetable && (*val & PCI_EXP_DEVCTL_BCR_FLR)) {
>> +        xen_pt_reset(s);
>> +    }
>> +    return xen_pt_word_reg_write(s, cfg_entry, val, dev_value, valid_mask);
>
>I think you also need to clear the bit before handing on the request,
>such that reads will always observe it clear.

Will do.

Thanks
Chao
[Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability
Currently, for an HVM guest on Xen, no reset method is virtualized, so from the VM's perspective assigned devices cannot be reset. But some devices rely on a PCI reset to recover from hardware hangs. When assigned to a VM, such devices cannot be reset and won't work any longer once a hardware hang occurs; we have to reboot the VM to trigger a PCI reset on the host to recover the device. This patch exposes the FLR capability to VMs if the assigned device can be reset on the host. When a VM initiates an FLR to a device, qemu cleans up the device state (including disabling INTx and/or MSI, unmapping BARs from the guest, and deleting emulated registers), then initiates a PCI reset through the 'reset' knob under the device's sysfs directory, and finally initializes the device again. Signed-off-by: Chao Gao --- Do we need to introduce an attribute, like "permissive", to explicitly enable FLR capability emulation? During the PCI reset, interrupts and BARs are unmapped from the guest. It seems the guest cannot interact with the device directly, except for accesses to the device's configuration space, which is emulated by qemu. If a proper method is used to prevent qemu from accessing the physical device, the FLR emulation introduces no new security hole. The VM's FLR may be backed by any host-side reset of the physical device, for example FLR, D3 soft reset, or secondary bus reset. I'm not sure it is fine to mix them; given that the Linux kernel just uses a unified API to reset a device and the caller cannot choose a specific method, it might be OK. 
--- hw/xen/xen-host-pci-device.c | 30 ++ hw/xen/xen-host-pci-device.h | 3 +++ hw/xen/xen_pt.c | 9 + hw/xen/xen_pt.h | 1 + hw/xen/xen_pt_config_init.c | 30 +++--- 5 files changed, 70 insertions(+), 3 deletions(-) diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c index 1b44dcafaf..d549656f42 100644 --- a/hw/xen/xen-host-pci-device.c +++ b/hw/xen/xen-host-pci-device.c @@ -198,6 +198,35 @@ static bool xen_host_pci_dev_is_virtfn(XenHostPCIDevice *d) return !stat(path, &buf); } +static bool xen_host_pci_resetable(XenHostPCIDevice *d) +{ +char path[PATH_MAX]; + +xen_host_pci_sysfs_path(d, "reset", path, sizeof(path)); + +return !access(path, W_OK); +} + +void xen_host_pci_reset(XenHostPCIDevice *d) +{ +char path[PATH_MAX]; +int fd; + +xen_host_pci_sysfs_path(d, "reset", path, sizeof(path)); + +fd = open(path, O_WRONLY); +if (fd == -1) { +XEN_HOST_PCI_LOG("Xen host pci reset: open error\n"); +return; +} + +if (write(fd, "1", 1) != 1) { +XEN_HOST_PCI_LOG("Xen host pci reset: write error\n"); +} + +return; +} + static void xen_host_pci_config_open(XenHostPCIDevice *d, Error **errp) { char path[PATH_MAX]; @@ -377,6 +406,7 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t domain, d->class_code = v; d->is_virtfn = xen_host_pci_dev_is_virtfn(d); +d->is_resetable = xen_host_pci_resetable(d); return; diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h index 4d8d34ecb0..cacf9b3df8 100644 --- a/hw/xen/xen-host-pci-device.h +++ b/hw/xen/xen-host-pci-device.h @@ -32,6 +32,7 @@ typedef struct XenHostPCIDevice { XenHostPCIIORegion rom; bool is_virtfn; +bool is_resetable; int config_fd; } XenHostPCIDevice; @@ -55,4 +56,6 @@ int xen_host_pci_set_block(XenHostPCIDevice *d, int pos, uint8_t *buf, int xen_host_pci_find_ext_cap_offset(XenHostPCIDevice *s, uint32_t cap); +void xen_host_pci_reset(XenHostPCIDevice *d); + #endif /* XEN_HOST_PCI_DEVICE_H */ diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c index 8fbaf2eae9..d750367c0a 100644 --- 
a/hw/xen/xen_pt.c +++ b/hw/xen/xen_pt.c @@ -938,6 +938,15 @@ static void xen_pt_unregister_device(PCIDevice *d) xen_pt_destroy(d); } +void xen_pt_reset(XenPCIPassthroughState *s) +{ +PCIDevice *d = PCI_DEVICE(s); + +xen_pt_unregister_device(d); +xen_host_pci_reset(&s->real_device); +xen_pt_realize(d, NULL); +} + static Property xen_pci_passthrough_properties[] = { DEFINE_PROP_PCI_HOST_DEVADDR("hostaddr", XenPCIPassthroughState, hostaddr), DEFINE_PROP_BOOL("permissive", XenPCIPassthroughState, permissive, false), diff --git a/hw/xen/xen_pt.h b/hw/xen/xen_pt.h index 9167bbaf6d..ed05bc0d39 100644 --- a/hw/xen/xen_pt.h +++ b/hw/xen/xen_pt.h @@ -332,4 +332,5 @@ int xen_pt_register_vga_regions(XenHostPCIDevice *dev); int xen_pt_unregister_vga_regions(XenHostPCIDevice *dev); void xen_pt_setup_vga(XenPCIPassthroughState *s, XenHostPCIDevice *dev, Error **errp); +void xen_pt_reset(XenPCIPassthroughState *s); #endif /* XEN_PT_H */ diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c index 31ec5add1d..435abd7286 100644 --- a/hw/xen/xen_pt_config_init
Re: [Qemu-devel] [PATCH v3 3/3] msi: Handle remappable format interrupt request
On Mon, Dec 11, 2017 at 06:07:48PM +, Anthony PERARD wrote: >On Fri, Nov 17, 2017 at 02:24:25PM +0800, Chao Gao wrote: >> According to VT-d spec Interrupt Remapping and Interrupt Posting -> >> Interrupt Remapping -> Interrupt Request Formats On Intel 64 >> Platforms, fields of MSI data register have changed. This patch >> avoids wrongly regarding a remappable format interrupt request as >> an interrupt binded with a pirq. >> >> Signed-off-by: Chao Gao >> Signed-off-by: Lan Tianyu >> --- >> v3: >> - clarify the interrupt format bit is Intel-specific, then it is >> improper to define MSI_ADDR_IF_MASK in a common header. >> --- >> hw/i386/xen/xen-hvm.c | 10 +- >> hw/pci/msi.c | 5 +++-- >> hw/pci/msix.c | 4 +++- >> hw/xen/xen_pt_msi.c | 2 +- >> include/hw/xen/xen.h | 2 +- >> stubs/xen-hvm.c | 2 +- >> 6 files changed, 18 insertions(+), 7 deletions(-) >> >> diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c >> index 8028bed..52dc8af 100644 >> --- a/hw/i386/xen/xen-hvm.c >> +++ b/hw/i386/xen/xen-hvm.c >> @@ -145,8 +145,16 @@ void xen_piix_pci_write_config_client(uint32_t address, >> uint32_t val, int len) >> } >> } >> >> -int xen_is_pirq_msi(uint32_t msi_data) >> +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data) >> { >> +/* If the MSI address is configured in remapping format, the MSI will >> not >> + * be remapped into a pirq. This 'if' test excludes Intel-specific >> + * remappable msi. >> + */ >> +#define MSI_ADDR_IF_MASK 0x0010 > >I don't think that is the right place for a define, they also exist >outside of the context of the function. yes. >That define would be better at the top of this file, I think. (There is >probably a better place in the common headers, but I'm not sure were.) will do. Thanks Chao
Re: [Qemu-devel] [PATCH v3 2/3] xen/pt: Pass the whole msi addr/data to Xen
On Mon, Dec 11, 2017 at 05:59:08PM +, Anthony PERARD wrote: >On Fri, Nov 17, 2017 at 02:24:24PM +0800, Chao Gao wrote: >> Previously, some fields (reserved or unalterable) are filtered by >> Qemu. This fields are useless for the legacy interrupt format. >> However, these fields are may meaningful (for intel platform) >> for the interrupt of remapping format. It is better to pass the whole >> msi addr/data to Xen without any filtering. >> >> The main reason why we want this is QEMU doesn't have the knowledge >> to decide the interrupt format after we introduce vIOMMU inside Xen. >> Passing the whole msi message down and let arch-specific vIOMMU to >> decide the interrupt format. >> >> Signed-off-by: Chao Gao >> Signed-off-by: Lan Tianyu >> --- >> v3: >> - new >> --- >> hw/xen/xen_pt_msi.c | 47 --- >> 1 file changed, 12 insertions(+), 35 deletions(-) >> >> diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c >> index 6d1e3bd..f7d6e76 100644 >> --- a/hw/xen/xen_pt_msi.c >> +++ b/hw/xen/xen_pt_msi.c >> @@ -47,25 +47,6 @@ static inline uint32_t msi_ext_dest_id(uint32_t addr_hi) >> return addr_hi & 0xff00; >> } >> >> -static uint32_t msi_gflags(uint32_t data, uint64_t addr) >> -{ >> -uint32_t result = 0; >> -int rh, dm, dest_id, deliv_mode, trig_mode; >> - >> -rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1; >> -dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1; >> -dest_id = msi_dest_id(addr); >> -deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7; >> -trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1; >> - >> -result = dest_id | (rh << XEN_PT_GFLAGS_SHIFT_RH) >> -| (dm << XEN_PT_GFLAGS_SHIFT_DM) >> -| (deliv_mode << XEN_PT_GFLAGSSHIFT_DELIV_MODE) >> -| (trig_mode << XEN_PT_GFLAGSSHIFT_TRG_MODE); >> - >> -return result; >> -} >> - >> static inline uint64_t msi_addr64(XenPTMSI *msi) >> { >> return (uint64_t)msi->addr_hi << 32 | msi->addr_lo; >> @@ -160,23 +141,20 @@ static int msi_msix_update(XenPCIPassthroughState *s, >> bool masked) >> { >> PCIDevice *d = 
&s->dev; >> -uint8_t gvec = msi_vector(data); >> -uint32_t gflags = msi_gflags(data, addr); >> +uint32_t gflags = masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED); >> int rc = 0; >> uint64_t table_addr = 0; >> >> -XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x" >> - " (entry: %#x)\n", >> - is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry); >> +XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x addr %"PRIx64 >> + " data %#x gflags %#x (entry: %#x)\n", >> + is_msix ? "-X" : "", pirq, addr, data, gflags, msix_entry); >> >> if (is_msix) { >> table_addr = s->msix->mmio_base_addr; >> } >> >> -gflags |= masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED); >> - >> -rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, >> - pirq, gflags, table_addr); >> +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, pirq, addr, >> + data, gflags, table_addr); > >Are you trying to modifie an existing API? That is not going to work. We >want to be able to build QEMU against older version of Xen, and it >should work as well. Yes. I thought it didn't matter; clearly, I was wrong. I will keep compatibility by introducing a new API. A wrapper function, which calls the old or new API according to the Xen version, would be used here. Thanks Chao
Re: [Qemu-devel] [PATCH v3 1/3] i386/msi: Correct mask of destination ID in MSI address
On Fri, Nov 17, 2017 at 03:24:21PM +0200, Michael S. Tsirkin wrote: >On Fri, Nov 17, 2017 at 02:24:23PM +0800, Chao Gao wrote: >> According to SDM 10.11.1, only [19:12] bits of MSI address are >> Destination ID, change the mask to avoid ambiguity for VT-d spec >> has used the bit 4 to indicate a remappable interrupt request. >> >> Signed-off-by: Chao Gao >> Signed-off-by: Lan Tianyu >> Reviewed-by: Anthony PERARD >> Reviewed-by: Peter Xu > >Reviewed-by: Michael S. Tsirkin > >Do you want me to merge this? Of course. It is a simple patch and has no dependency with the other two in this series. Thanks Chao
[Qemu-devel] [PATCH v3 3/3] msi: Handle remappable format interrupt request
According to VT-d spec Interrupt Remapping and Interrupt Posting -> Interrupt Remapping -> Interrupt Request Formats On Intel 64 Platforms, fields of MSI data register have changed. This patch avoids wrongly regarding a remappable format interrupt request as an interrupt binded with a pirq. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- v3: - clarify the interrupt format bit is Intel-specific, then it is improper to define MSI_ADDR_IF_MASK in a common header. --- hw/i386/xen/xen-hvm.c | 10 +- hw/pci/msi.c | 5 +++-- hw/pci/msix.c | 4 +++- hw/xen/xen_pt_msi.c | 2 +- include/hw/xen/xen.h | 2 +- stubs/xen-hvm.c | 2 +- 6 files changed, 18 insertions(+), 7 deletions(-) diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c index 8028bed..52dc8af 100644 --- a/hw/i386/xen/xen-hvm.c +++ b/hw/i386/xen/xen-hvm.c @@ -145,8 +145,16 @@ void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len) } } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data) { +/* If the MSI address is configured in remapping format, the MSI will not + * be remapped into a pirq. This 'if' test excludes Intel-specific + * remappable msi. + */ +#define MSI_ADDR_IF_MASK 0x0010 +if (msi_addr_lo & MSI_ADDR_IF_MASK) { +return 0; +} /* If vector is 0, the msi is remapped into a pirq, passed as * dest_id. 
*/ diff --git a/hw/pci/msi.c b/hw/pci/msi.c index 5e05ce5..d05c876 100644 --- a/hw/pci/msi.c +++ b/hw/pci/msi.c @@ -289,7 +289,7 @@ void msi_reset(PCIDevice *dev) static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) { uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev)); -uint32_t mask, data; +uint32_t mask, data, addr_lo; bool msi64bit = flags & PCI_MSI_FLAGS_64BIT; assert(vector < PCI_MSI_VECTORS_MAX); @@ -298,7 +298,8 @@ static bool msi_is_masked(const PCIDevice *dev, unsigned int vector) } data = pci_get_word(dev->config + msi_data_off(dev, msi64bit)); -if (xen_is_pirq_msi(data)) { +addr_lo = pci_get_long(dev->config + msi_address_lo_off(dev)); +if (xen_is_pirq_msi(addr_lo, data)) { return false; } diff --git a/hw/pci/msix.c b/hw/pci/msix.c index c944c02..4cb01db 100644 --- a/hw/pci/msix.c +++ b/hw/pci/msix.c @@ -83,9 +83,11 @@ static bool msix_vector_masked(PCIDevice *dev, unsigned int vector, bool fmask) { unsigned offset = vector * PCI_MSIX_ENTRY_SIZE; uint8_t *data = &dev->msix_table[offset + PCI_MSIX_ENTRY_DATA]; +uint8_t *addr_lo = &dev->msix_table[offset + PCI_MSIX_ENTRY_LOWER_ADDR]; /* MSIs on Xen can be remapped into pirqs. In those cases, masking * and unmasking go through the PV evtchn path. 
*/ -if (xen_enabled() && xen_is_pirq_msi(pci_get_long(data))) { +if (xen_enabled() && xen_is_pirq_msi(pci_get_long(addr_lo), + pci_get_long(data))) { return false; } return fmask || dev->msix_table[offset + PCI_MSIX_ENTRY_VECTOR_CTRL] & diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c index f7d6e76..0e5bf83 100644 --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -96,7 +96,7 @@ static int msi_msix_setup(XenPCIPassthroughState *s, assert((!is_msix && msix_entry == 0) || is_msix); -if (xen_is_pirq_msi(data)) { +if (xen_is_pirq_msi(addr, data)) { *ppirq = msi_ext_dest_id(addr >> 32) | msi_dest_id(addr); if (!*ppirq) { /* this probably identifies an misconfiguration of the guest, diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h index 7efcdaa..0d6c83e 100644 --- a/include/hw/xen/xen.h +++ b/include/hw/xen/xen.h @@ -34,7 +34,7 @@ int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); void xen_piix3_set_irq(void *opaque, int irq_num, int level); void xen_piix_pci_write_config_client(uint32_t address, uint32_t val, int len); void xen_hvm_inject_msi(uint64_t addr, uint32_t data); -int xen_is_pirq_msi(uint32_t msi_data); +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data); qemu_irq *xen_interrupt_controller_init(void); diff --git a/stubs/xen-hvm.c b/stubs/xen-hvm.c index 3ca6c51..aeb1592 100644 --- a/stubs/xen-hvm.c +++ b/stubs/xen-hvm.c @@ -31,7 +31,7 @@ void xen_hvm_inject_msi(uint64_t addr, uint32_t data) { } -int xen_is_pirq_msi(uint32_t msi_data) +int xen_is_pirq_msi(uint32_t msi_addr_lo, uint32_t msi_data) { return 0; } -- 1.8.3.1
[Qemu-devel] [PATCH v3 0/3] Qemu: add Xen vIOMMU interrupt remapping function support
This patchset deals with MSI interrupt remapping requests when the guest updates MSI registers. In case of conflicts, this series can also be found in my personal github: Xen: https://github.com/gc1008/viommu_xen.git vIOMMU4 Qemu: https://github.com/gc1008/viommu_qemu.git vIOMMU3 Any comments would be highly appreciated. Below is the change history. Changes from v2: In the last version, a new interface was used for binding a guest remappable MSI with a physical interrupt, while the old interface was used for binding non-remappable MSIs. But for AMD, the interrupt format cannot be inferred from the MSI message alone. To address this, we decided to pass the whole guest MSI message to Xen and let the vIOMMUs in Xen determine whether a given interrupt is remappable or not. So the following changes are made: - Instead of introducing a new interface for binding remappable-format MSIs, the existing interface is modified to support MSIs of both formats. - In patch 3, define MSI_ADDR_IF_MASK inside a function because it is Intel-specific; it is improper to define it in a common header. Chao Gao (3): i386/msi: Correct mask of destination ID in MSI address xen/pt: Pass the whole msi addr/data to Xen msi: Handle remappable format interrupt request hw/i386/xen/xen-hvm.c | 10 - hw/pci/msi.c | 5 +++-- hw/pci/msix.c | 4 +++- hw/xen/xen_pt_msi.c | 49 --- include/hw/i386/apic-msidef.h | 2 +- include/hw/xen/xen.h | 2 +- stubs/xen-hvm.c | 2 +- 7 files changed, 31 insertions(+), 43 deletions(-) -- 1.8.3.1
[Qemu-devel] [PATCH v3 2/3] xen/pt: Pass the whole msi addr/data to Xen
Previously, some fields (reserved or unalterable) are filtered by Qemu. This fields are useless for the legacy interrupt format. However, these fields are may meaningful (for intel platform) for the interrupt of remapping format. It is better to pass the whole msi addr/data to Xen without any filtering. The main reason why we want this is QEMU doesn't have the knowledge to decide the interrupt format after we introduce vIOMMU inside Xen. Passing the whole msi message down and let arch-specific vIOMMU to decide the interrupt format. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu --- v3: - new --- hw/xen/xen_pt_msi.c | 47 --- 1 file changed, 12 insertions(+), 35 deletions(-) diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c index 6d1e3bd..f7d6e76 100644 --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -47,25 +47,6 @@ static inline uint32_t msi_ext_dest_id(uint32_t addr_hi) return addr_hi & 0xff00; } -static uint32_t msi_gflags(uint32_t data, uint64_t addr) -{ -uint32_t result = 0; -int rh, dm, dest_id, deliv_mode, trig_mode; - -rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1; -dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1; -dest_id = msi_dest_id(addr); -deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7; -trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1; - -result = dest_id | (rh << XEN_PT_GFLAGS_SHIFT_RH) -| (dm << XEN_PT_GFLAGS_SHIFT_DM) -| (deliv_mode << XEN_PT_GFLAGSSHIFT_DELIV_MODE) -| (trig_mode << XEN_PT_GFLAGSSHIFT_TRG_MODE); - -return result; -} - static inline uint64_t msi_addr64(XenPTMSI *msi) { return (uint64_t)msi->addr_hi << 32 | msi->addr_lo; @@ -160,23 +141,20 @@ static int msi_msix_update(XenPCIPassthroughState *s, bool masked) { PCIDevice *d = &s->dev; -uint8_t gvec = msi_vector(data); -uint32_t gflags = msi_gflags(data, addr); +uint32_t gflags = masked ? 
0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED); int rc = 0; uint64_t table_addr = 0; -XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x" - " (entry: %#x)\n", - is_msix ? "-X" : "", pirq, gvec, gflags, msix_entry); +XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x addr %"PRIx64 + " data %#x gflags %#x (entry: %#x)\n", + is_msix ? "-X" : "", pirq, addr, data, gflags, msix_entry); if (is_msix) { table_addr = s->msix->mmio_base_addr; } -gflags |= masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED); - -rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, - pirq, gflags, table_addr); +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, pirq, addr, + data, gflags, table_addr); if (rc) { XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n", @@ -199,8 +177,6 @@ static int msi_msix_disable(XenPCIPassthroughState *s, bool is_binded) { PCIDevice *d = &s->dev; -uint8_t gvec = msi_vector(data); -uint32_t gflags = msi_gflags(data, addr); int rc = 0; if (pirq == XEN_PT_UNASSIGNED_PIRQ) { @@ -208,12 +184,13 @@ static int msi_msix_disable(XenPCIPassthroughState *s, } if (is_binded) { -XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", - is_msix ? "-X" : "", pirq, gvec); -rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, gflags); +XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, addr %"PRIx64", data %#x\n", + is_msix ? "-X" : "", pirq, addr, data); +rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, pirq, addr, data); if (rc) { -XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, gvec: %#x)\n", - is_msix ? "-X" : "", errno, pirq, gvec); +XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, " + "addr: %"PRIx64", data: %#x)\n", + is_msix ? "-X" : "", errno, pirq, addr, data); return rc; } } -- 1.8.3.1
[Qemu-devel] [PATCH v3 1/3] i386/msi: Correct mask of destination ID in MSI address
According to SDM 10.11.1, only [19:12] bits of MSI address are Destination ID, change the mask to avoid ambiguity for VT-d spec has used the bit 4 to indicate a remappable interrupt request. Signed-off-by: Chao Gao Signed-off-by: Lan Tianyu Reviewed-by: Anthony PERARD Reviewed-by: Peter Xu --- include/hw/i386/apic-msidef.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h index 8b4d4cc..420b411 100644 --- a/include/hw/i386/apic-msidef.h +++ b/include/hw/i386/apic-msidef.h @@ -26,6 +26,6 @@ #define MSI_ADDR_DEST_ID_SHIFT 12 #define MSI_ADDR_DEST_IDX_SHIFT 4 -#define MSI_ADDR_DEST_ID_MASK 0x000 +#define MSI_ADDR_DEST_ID_MASK 0x000ff000 #endif /* HW_APIC_MSIDEF_H */ -- 1.8.3.1
Re: [Qemu-devel] [RFC PATCH 4/4] msi: taking interrupt format into consideration during judging a pirq is binded with a event channel
On Thu, Mar 30, 2017 at 06:29:29PM +0100, Anthony PERARD wrote: >On Fri, Mar 17, 2017 at 07:29:17PM +0800, Lan Tianyu wrote: >> From: Chao Gao >> Subject: msi: taking interrupt format into consideration during >> judging a pirq is binded with a event channel > >This is quite a long title, I think we can make it shorter. Maybe "msi: >Handle MSI remapping format. OK. > >> data = pci_get_word(dev->config + msi_data_off(dev, msi64bit)); >> -if (xen_is_pirq_msi(data)) { >> +addr_lo = pci_get_word(dev->config + msi_address_lo_off(dev)); > >This could be get_long, so addr_lo will actually have the all low bits >of the addr. yes. > >> +int xen_is_pirq_msi(uint32_t msi_data, uint32_t msi_addr_lo) >> { >> +/* If msi address is configurate to remapping format, the msi will not >> + * remapped into a pirq. >> + */ >> +if ( msi_addr_lo & 0x10 ) > >That's a magic number, is 0x10 MSI_ADDR_IM_MASK? Yes. Really thanks for your comments. > >Thanks, > >-- >Anthony PERARD
Re: [Qemu-devel] [RFC PATCH 3/4] xen-pt: bind/unbind interrupt remapping format MSI
On Thu, Mar 30, 2017 at 05:51:45PM +0100, Anthony PERARD wrote: >On Fri, Mar 17, 2017 at 07:29:16PM +0800, Lan Tianyu wrote: >> From: Chao Gao >> >> If a vIOMMU is exposed to guest, guest will configure the msi to remapping >> format. The original code isn't suitable to the new format. A new pair >> bind/unbind interfaces are added for this usage. This patch recognizes >> this case and use new interfaces to bind/unbind msi. >> >> Signed-off-by: Chao Gao >> Signed-off-by: Lan Tianyu >> --- >> hw/xen/xen_pt_msi.c | 36 >> include/hw/i386/apic-msidef.h | 1 + >> 2 files changed, 29 insertions(+), 8 deletions(-) >> >> diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c >> index 62add06..8b0d7fc 100644 >> --- a/hw/xen/xen_pt_msi.c >> +++ b/hw/xen/xen_pt_msi.c >> @@ -161,6 +161,7 @@ static int msi_msix_update(XenPCIPassthroughState *s, >> uint8_t gvec = msi_vector(data); >> uint32_t gflags = msi_gflags(data, addr); >> int rc = 0; >> +bool ir = !!(addr & MSI_ADDR_IM_MASK); >> uint64_t table_addr = 0; >> >> XEN_PT_LOG(d, "Updating MSI%s with pirq %d gvec %#x gflags %#x" >> @@ -171,8 +172,14 @@ static int msi_msix_update(XenPCIPassthroughState *s, >> table_addr = s->msix->mmio_base_addr; >> } >> >> -rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, >> +if (ir) { > >You could maybe use add&MSI_ADDR_IM_MASK instead of going through a >variable. > >> +rc = xc_domain_update_msi_irq_remapping(xen_xc, xen_domid, pirq, >> +d->devfn, data, addr, table_addr); > >Do you also want to update the XEN_PT_LOG above? Since it does not >always reflect the update_msi call anymore. Yes. I adjust the output. > >> +} >> +else { >> +rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, >>pirq, gflags, table_addr); >> +} >> >> if (rc) { >> XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n", >> @@ -204,13 +211,26 @@ static int msi_msix_disable(XenPCIPassthroughState *s, >> } >> >> if (is_binded) { >> -XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", >> - is_msix ? 
"-X" : "", pirq, gvec); >> -rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, >> gflags); >> -if (rc) { >> -XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: %d, >> gvec: %#x)\n", >> - is_msix ? "-X" : "", errno, pirq, gvec); >> -return rc; >> +if ( addr & MSI_ADDR_IM_MASK ) { >> +XEN_PT_LOG(d, "Unbinding of MSI%s . ( pirq: %d, data: %x, addr: >> %lx)\n", > >For addr, it should be PRIx64 instead of %lx. > >> + is_msix ? "-X" : "", pirq, data, addr); >> +rc = xc_domain_unbind_msi_irq_remapping(xen_xc, xen_domid, pirq, >> +d->devfn, data, addr); >> +if (rc) { >> +XEN_PT_ERR(d, "Unbinding of MSI%s . (error: %d, pirq: %d, >> data: %x, addr: %lx)\n", >> + is_msix ? "-X" : "", rc, pirq, data, addr); >> +return rc; >> +} >> + >> +} else { >> +XEN_PT_LOG(d, "Unbind MSI%s with pirq %d, gvec %#x\n", >> + is_msix ? "-X" : "", pirq, gvec); >> +rc = xc_domain_unbind_msi_irq(xen_xc, xen_domid, gvec, pirq, >> gflags); >> +if (rc) { >> +XEN_PT_ERR(d, "Unbinding of MSI%s failed. (err: %d, pirq: >> %d, gvec: %#x)\n", >> + is_msix ? "-X" : "", errno, pirq, gvec); >> +return rc; >> +} >> } >> } >> >> diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h >> index 8b4d4cc..08b584f 100644 >> --- a/include/hw/i386/apic-msidef.h >> +++ b/include/hw/i386/apic-msidef.h >> @@ -27,5 +27,6 @@ >> #define MSI_ADDR_DEST_ID_SHIFT 12 >> #define MSI_ADDR_DEST_IDX_SHIFT 4 >> #define MSI_ADDR_DEST_ID_MASK 0x000 > >Could you add a 0 to dest_id here? So their will be 8 digit and it those >not look weird when compared to the next define. > Will do. >> +#define MSI_ADDR_IM_MASK 0x0010 > >Is the definition of MSI_ADDR_IM_MASK available somewhere? In the Intel >SDM I've only found this bit to be reserved. Yes, it is defined in VT-d spec 5.1.5.2 MSI and MSI-X Register Programming. I made a mistake here. I should use MSI_ADDR_IF_MASK. Thanks Chao > >Thanks, > >-- >Anthony PERARD
Re: [Qemu-devel] [RFC PATCH 2/4] Xen: add a dummy vIOMMU to create/destroy vIOMMU in Xen
On Thu, Mar 30, 2017 at 05:24:52PM +0100, Anthony PERARD wrote: >Hi, > >On Fri, Mar 17, 2017 at 07:29:15PM +0800, Lan Tianyu wrote: >> From: Chao Gao >> >> Since adding a dynamic sysbus device is forbidden, so choose TYPE_DEVICE >> as parent class. >> >> Signed-off-by: Chao Gao >> Signed-off-by: Lan Tianyu >> --- >> hw/xen/Makefile.objs | 1 + >> hw/xen/xen_viommu.c | 116 >> +++ >> 2 files changed, 117 insertions(+) >> create mode 100644 hw/xen/xen_viommu.c >> >> +static void xen_viommu_realize(DeviceState *dev, Error **errp) >> +{ >> +int rc; >> +uint64_t cap; >> +char *dom; >> +char viommu_path[1024]; >> +XenVIOMMUState *s = XEN_VIOMMU_DEVICE(dev); >> + >> +s->id = -1; >> + >> +/* Read vIOMMU attributes from Xenstore. */ >> +dom = xs_get_domain_path(xenstore, xen_domid); >> +snprintf(viommu_path, sizeof(viommu_path), "%s/viommu", dom); >> +rc = xenstore_read_uint64(viommu_path, "base_addr", &s->base_addr); > >Could these informations (base_addr and cap) be read from the command >line instead of via xenstore? >Any reason for these informations to be on xenstore? Actually, we passed both via command line at first. We just concerned whether it was ok to pass a address through command line since we find no device does the similar thing. > >> +if (rc) { >> +error_report("Can't get base address of vIOMMU"); > >I think error_setg should be used instead of error_report. > >> +exit(1); > >Also, exit should be remove, and return instead. error_setg would be >enough to signal that the device can not work. > >> +} >> + >> +rc = xenstore_read_uint64(viommu_path, "cap", &s->cap); >> +if (rc) { >> +error_report("Can't get capabilities of vIOMMU"); >> +exit(1); >> +} >> + >> +rc = xc_viommu_query_cap(xen_xc, xen_domid, &cap); > >Since xc_viommu_* seems to be new, you should use >xendevicemodel_viommu_* instead, also you will need to define a stub for >them to be able to compile QEMU against older version of Xen. Will follow your suggestions above. 
Thanks Chao > > >The patch looks good otherwise. > >Thanks, > >-- >Anthony PERARD
Re: [Qemu-devel] [PATCH v5 0/3] IOMMU: intel_iommu support map and unmap notifications
On Fri, Oct 21, 2016 at 12:07:18AM +0300, Aviv B.D wrote: >From: "Aviv Ben-David" > >* Advertize Cache Mode capability in iommu cap register. > This capability is controlled by "cache-mode" property of intel-iommu device. > To enable this option call QEMU with "-device intel-iommu,cache-mode=true". > >* On page cache invalidation in intel vIOMMU, check if the domain belong to > registered notifier, and notify accordingly. > >Currently this patch still doesn't enabling VFIO devices support with vIOMMU >present. Current problems: >* vfio_iommu_map_notify is not aware about memory range belong to specific > VFIOGuestIOMMU. >* memory_region_iommu_replay hangs QEMU on start up while it itterate over > 64bit address space. Commenting out the call to this function enables > workable VFIO device while vIOMMU present. > After applying the patch series based on commit ede0cbeb, I ran the following commands: modprobe vfio-pci echo 8086 1528 > /sys/bus/pci/drivers/vfio-pci/new_id echo ":03:00.1" > /sys/bus/pci/devices/\:03\:00.1/driver/unbind echo 8086 1528 > /sys/bus/pci/drivers/vfio-pci/new_id to prepare for assigning the NIC device. After that, I tried to boot a guest with: qemu-system-x86_64 -boot c -m 4096 -monitor pty -serial stdio --enable-kvm \ -M kernel-irqchip=split -bios bios.bin -smp cpus=288 -hda vdisk.img \ -machine q35 -device intel-iommu,intremap=on,eim=on,cache-mode=true \ -net none -device vfio-pci,host=03:00.1 However, I encountered these errors: qemu-system-x86_64: -device vfio-pci,host=03:00.1: iommu map to non memory area e258e000 qemu-system-x86_64: -device vfio-pci,host=03:00.1: iommu map to non memory area e258f000 qemu-system-x86_64: -device vfio-pci,host=03:00.1: iommu map to non memory area e259 Did I make a mistake in this test? How can I correct it? 
By the way, there is a build error: qemu/hw/pci-host/apb.c:326:5: error: initialization from incompatible pointer type [-Werror] .translate = pbm_translate_iommu, ^ qemu/hw/pci-host/apb.c:326:5: error: (near initialization for ‘pbm_iommu_ops.translate’) [-Werror] cc1: all warnings being treated as errors make: *** [hw/pci-host/apb.o] Error 1 > >-- >1.9.1 > >
Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode
Hi, we have 3 problems left here: 1. IRQ remapping can't work with x2apic cluster mode. 2. CPUs with apic_id > 255 can't receive device interrupts. 3. Windows crashes when the IRQ remapping capability is presented to it. I tested with the latest kernel, v4.8-rc7, and found all of them still unsolved. Also, in the qemu and kernel code bases, I couldn't find any patches that solve them. Will you fix these problems or leave them to the communities? Thanks, -Chao On Tue, Aug 09, 2016 at 02:51:16PM +0200, Radim Krčmář wrote: >2016-08-09 16:19+0800, Chao Gao: >> On Tue, Aug 09, 2016 at 02:18:15PM +0800, Peter Xu wrote: >>>I think the problem is with x2apic. Currently, x2apic is enabled in >>>vIOMMU when kernel irqchip is used. This is problematic, since >>>actually we are throughing away dest_id[31:8] without Radim's patches, >>>meanwhile I see that by default x2apic is using cluster mode. >>> >>>In cluster mode, 8 bits will possibly not suffice (I think the reason >>>why >17 vcpus will bring trouble is that each cluster has 16 vcpus, >>>we'll have trouble if we have more than one cluster). >>> >>>To temporarily solve your issue, you should not only need "-global >>>ioapic.version=0x20" in QEMU command line, but also add "x2apic_phys" >>>to you guest kernel boot parameter, to force guest boot with x2apic >>>physical mode (not cluster mode). Though this can only work for <255 >>>vcpus. IMHO we may still need to wait for Radim's patches to test >255 >>>case. >> >> ok, after adding "x2apic_phys" to guest kernel boot parameter, I >> boot up a 288(yes, 288) vcpus guest successfully with command >> qemu-system-x86_64 -boot c -m 4096 -sdl --enable-kvm \ >> -M kernel-irqchip=split -bios bios.bin -smp cpus=288 -hda vdisk.img \ >> -device intel-iommu,intremap=on -machine q35 >> Also, I can see interrupts on those cpu with inital apicid>255 from >> /proc/cpuinfo and /proc/interrupts. > >Great, thanks for testing! >Only IPIs will be correctly delivered to apic_id > 255 without few more >patches on the QEMU side, though. >
Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode
On Tue, Aug 09, 2016 at 02:18:15PM +0800, Peter Xu wrote:
>On Tue, Aug 09, 2016 at 12:33:17PM +0800, Chao Gao wrote:
>> On Mon, Aug 08, 2016 at 04:57:14PM +0800, Peter Xu wrote:
>> >On Mon, Aug 08, 2016 at 03:41:23PM +0800, Chao Gao wrote:
>> >> Hi, everyone.
>> >>
>> >> We have done some tests after merging this patch set into the latest
>> >> qemu master. On the KVM side, we use the latest kvm linux-next branch.
>> >> Here are some problems we have met.
>> >>
>> >> 1. We can't boot up a 288 vcpu linux guest with the CLI:
>> >> qemu-system-x86_64 -boot c -m 4096 -sdl -monitor pty --enable-kvm \
>> >> -M kernel-irqchip=split -serial stdio -bios bios.bin -smp cpus=288 \
>> >> -hda vdisk.img -device intel-iommu,intremap=on -machine q35
>> >> The problem persists even when we assign only 32 vcpus to the guest.
>> >> Maybe the output "do_IRQ: 146.113 No irq handler for vector (irq -1)"
>> >> is a clue.
>> >> The output of qemu and the kernel is in the attachments. Do you have
>> >> any idea about the problem and how to solve it?
>> >
>> >IIUC, we need to wait for Radim's QEMU patches to finally enable 288
>> >vcpus?
>> >
>> >Btw, could you please try adding this to the QEMU cmdline when testing
>> >with 32 vcpus:
>> >
>> >  -global ioapic.version=0x20
>> >
>> >I see that you were running a RHEL 7.2 guest with a default e1000. In
>> >that case, we may need to boost the ioapic version to 0x20.
>>
>> It doesn't work. My host machine has 16 cpus. When I assign 4 or 8 vcpus
>> to the guest, or 255 vcpus but with "kernel-irqchip=off", the guest works
>> well. Maybe when the irqchip is in the kernel, intremap can only handle
>> the case where the vcpu count is less than the physical cpu count. Do
>> you think that's right?
>
>I don't think so. Vcpu number should not be related to host cpu
>numbers.
>
>I think the problem is with x2apic. Currently, x2apic is enabled in
>vIOMMU when kernel irqchip is used. This is problematic, since
>actually we are throwing away dest_id[31:8] without Radim's patches;
>meanwhile I see that by default x2apic is using cluster mode.
>
>In cluster mode, 8 bits will possibly not suffice (I think the reason
>why >17 vcpus bring trouble is that each cluster has 16 vcpus, so
>we'll have trouble if we have more than one cluster).
>
>To temporarily solve your issue, you not only need "-global
>ioapic.version=0x20" on the QEMU command line, but also need to add
>"x2apic_phys" to your guest kernel boot parameters, to force the guest
>to boot with x2apic physical mode (not cluster mode). Though this can
>only work for <255 vcpus. IMHO we may still need to wait for Radim's
>patches to test the >255 case.

OK, after adding "x2apic_phys" to the guest kernel boot parameters, I
booted up a 288 (yes, 288) vcpu guest successfully with the command:

qemu-system-x86_64 -boot c -m 4096 -sdl --enable-kvm \
-M kernel-irqchip=split -bios bios.bin -smp cpus=288 -hda vdisk.img \
-device intel-iommu,intremap=on -machine q35

Also, I can see interrupts on the cpus with initial apicid > 255 in
/proc/cpuinfo and /proc/interrupts.

Thanks.
-Chao
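[Editor's note] For readers following along: the cluster mode Peter mentions packs the cluster ID into the high half of the 32-bit logical destination, which is why truncating dest_id to 8 bits breaks delivery as soon as a second cluster exists. A minimal sketch of the SDM's logical ID derivation (the helper name is mine):

```python
def x2apic_cluster_logical_id(apic_id: int) -> int:
    """Logical ID a CPU derives from its x2APIC ID in cluster mode.

    Bits 31:16 hold the cluster ID (16 CPUs per cluster); bits 15:0
    hold a one-hot mask selecting the CPU within the cluster.
    """
    cluster = apic_id >> 4
    position = apic_id & 0xF
    return (cluster << 16) | (1 << position)

# CPU 16 is the first CPU of the second cluster: its logical ID is
# 0x10001, so a destination truncated to 8 bits can no longer reach it.
print(hex(x2apic_cluster_logical_id(16)))  # 0x10001
```

So with >16 vcpus in cluster mode, some CPUs can only be named by dest_id bits above bit 7, exactly the bits being thrown away without Radim's patches.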
Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode
On Mon, Aug 08, 2016 at 04:57:14PM +0800, Peter Xu wrote:
>On Mon, Aug 08, 2016 at 03:41:23PM +0800, Chao Gao wrote:
>> Hi, everyone.
>>
>> We have done some tests after merging this patch set into the latest
>> qemu master. On the KVM side, we use the latest kvm linux-next branch.
>> Here are some problems we have met.
>>
>> 1. We can't boot up a 288 vcpu linux guest with the CLI:
>> qemu-system-x86_64 -boot c -m 4096 -sdl -monitor pty --enable-kvm \
>> -M kernel-irqchip=split -serial stdio -bios bios.bin -smp cpus=288 \
>> -hda vdisk.img -device intel-iommu,intremap=on -machine q35
>> The problem persists even when we assign only 32 vcpus to the guest.
>> Maybe the output "do_IRQ: 146.113 No irq handler for vector (irq -1)" is
>> a clue.
>> The output of qemu and the kernel is in the attachments. Do you have any
>> idea about the problem and how to solve it?
>
>IIUC, we need to wait for Radim's QEMU patches to finally enable 288
>vcpus?
>
>Btw, could you please try adding this to the QEMU cmdline when testing
>with 32 vcpus:
>
>  -global ioapic.version=0x20
>
>I see that you were running a RHEL 7.2 guest with a default e1000. In
>that case, we may need to boost the ioapic version to 0x20.

It doesn't work. My host machine has 16 cpus. When I assign 4 or 8 vcpus to the guest, or 255 vcpus but with "kernel-irqchip=off", the guest works well. Maybe when the irqchip is in the kernel, intremap can only handle the case where the vcpu count is less than the physical cpu count. Do you think that's right?

Thanks,
-Chao
Re: [Qemu-devel] [PATCH for-2.8 00/18] pc: q35: x2APIC support in kvm_apic mode
On Mon, Aug 08, 2016 at 11:18:20AM +0200, Igor Mammedov wrote:
>On Mon, 8 Aug 2016 15:41:23 +0800
>Chao Gao wrote:
>
>> Hi, everyone.
>>
>> We have done some tests after merging this patch set into the latest
>> qemu master. On the KVM side, we use the latest kvm linux-next branch.
>> Here are some problems we have met.
>>
>> 1. We can't boot up a 288 vcpu linux guest with the CLI:
>> qemu-system-x86_64 -boot c -m 4096 -sdl -monitor pty --enable-kvm \
>> -M kernel-irqchip=split -serial stdio -bios bios.bin -smp cpus=288 \
>> -hda vdisk.img -device intel-iommu,intremap=on -machine q35
>> The problem persists even when we assign only 32 vcpus to the guest.
>> Maybe the output "do_IRQ: 146.113 No irq handler for vector (irq -1)" is
>> a clue.
>> The output of qemu and the kernel is in the attachments. Do you have any
>> idea about the problem and how to solve it?
>
>I don't think we ever looked at "kernel-irqchip=split"; only the in-kernel
>variant has been targeted so far.
>Radim probably knows better whether it should work or not.
>
>Have you tried with a smaller number of CPUs but with APIC IDs above 254,
>like in the test below?
>
>[...]
>
>> >Tested with following CLI:
>> > QEMU -M q35 -enable-kvm -smp 1,sockets=9,cores=32,threads=1,maxcpus=288 \
>> > -device qemu64-x86_64-cpu,socket-id=8,core-id=30,thread-id=0 \
>> > -bios x2apic_bios.bin

I tested with the CLI:

qemu-system-x86_64 -M q35 \
-enable-kvm -smp 1,sockets=9,cores=32,threads=1,maxcpus=288 \
-device qemu64-x86_64-cpu,socket-id=8,core-id=30,thread-id=0 \
-bios bios.bin -hda vdisk.img -serial stdio -m 4096 2>>qemu_and_guest.log >&2

But I think there should be a cpu with initial apicid > 255 in /proc/cpuinfo. The log (in the attachments) shows that the guest kernel treats the other cpu as a bad one. What do you think causes the problem?
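[Editor's note] A quick back-of-the-envelope check (the helper below is mine, and relies on cores and threads both being powers of two here, so QEMU's bit-field packing of the APIC ID coincides with plain multiplication) shows why the hotplugged CPU in this command line should indeed report an initial APIC ID above 255:

```python
def apic_id(socket_id, core_id, thread_id, cores=32, threads=1):
    # With a power-of-two topology (cores=32, threads=1) the packed
    # APIC ID reduces to ((socket * cores) + core) * threads + thread.
    return (socket_id * cores + core_id) * threads + thread_id

# socket-id=8, core-id=30, thread-id=0 from the command line above:
print(apic_id(8, 30, 0))  # 286, i.e. above 255
```

So if the guest kernel brought the CPU up correctly, /proc/cpuinfo should show "initial apicid : 286" for it.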
localhost login:
# cat /proc/interrupts
           CPU0
  0:    125   IO-APIC-edge      timer
  1:    117   IO-APIC-edge      i8042
  4:    382   IO-APIC-edge      serial
  7:      0   IO-APIC-edge      parport0
  8:      1   IO-APIC-edge      rtc0
  9:      0   IO-APIC-fasteoi   acpi
 12:   1661   IO-APIC-edge      i8042
 16:      0   IO-APIC-fasteoi   i801_smbus
 22:     27   IO-APIC-fasteoi   enp0s2
 24:   7310   PCI-MSI-edge      :00:1f.2
NMI:      0   Non-maskable interrupts
LOC:   6401   Local timer interrupts
SPU:      0   Spurious interrupts
PMI:      0   Performance monitoring interrupts
IWI:   3870   IRQ work interrupts
RTR:      0   APIC ICR read retries
RES:      0   Rescheduling interrupts
CAL:      0   Function call interrupts
TLB:      0   TLB shootdowns
TRM:      0   Thermal event interrupts
THR:      0   Threshold APIC interrupts
MCE:      0   Machine check exceptions
MCP:      1   Machine check polls
ERR:      0
MIS:      0

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 6
model name      : QEMU Virtual CPU version 2.5+
stepping        : 3
microcode       : 0x1
cpu MHz         : 3591.682
cache size      : 4096 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl xtopology pni cx16 x2apic hypervisor lahf_lm
bogomips        : 7183.36
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Warning: Number of hotpluggable cpus requested (288) exceeds the recommended cpus supported by KVM (240)
Changing serial settings was 0/0 now 3/0
SeaBIOS (version ?-20160808_104017-g.c)
BUILD: gcc: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4) binutils: version 2.23.52.0.1-55.el7 20130226
No Xen hypervisor found.
vendor 8086 device 29c0
Running on QEMU (q35)
Running on KVM
RamSize: 0x8000 [cmos]
Relocating init from 0x000da4e0 to 0x7ffac9d0 (size 79264)
Found QEMU fw_cfg
QEMU fw_cfg DMA interface supported
RamBlock: addr 0x len 0x8000 [e820]
RamBlock: addr 0x0001 len 0x8000 [e820]
Moving pm_base to 0x600
=== PCI bus & bridge init ===
PCI: pci_bios_init_bus_rec bus = 0x0
=== PCI device probing ===
Found 6 PCI devices (max PCI bus is 00)
=== PCI new allocation pass #1 ===
PCI: check devices
=== PCI new allocation pass #2 ===
PCI: IO: c000 - c09f
PCI: 32: c000 - fec0
PCI: map device bdf=00:02.0 bar 1, addr c000, size 0040 [io]
PCI: map device bdf=00:1f.3 bar 4, addr c040, size 00