Re: mmapping physical memory
Hi Anatoly,

On Mon, Aug 26, 2013 at 12:58:25PM +0100, Anatoly Burakov wrote:
> Hi all
>
> I am using IVSHMEM to mmap /dev/mem into guest. The mmap works fine on
> QEMU without KVM support enabled, but with KVM i get kernel errors:
>
> * (with EPT enabled)
>
> [ 746.940720] [ cut here ]
> [ 746.948612] kernel BUG at arch/x86/kvm/../../../virt/kvm/kvm_main.c:1257!

So the problem is that KVM must not do put_page on a pfn coming from a /dev/mem mapping, yet it cannot handle VM_PFNMAP mappings whose pages do not have PageReserved set. During kvm_release_page_* KVM only has the pfn number of the page, and it has to decide whether the page is refcounted or not based solely on that pfn number. So if the page is not marked as reserved, KVM cannot allow a mapping to be established; otherwise put_page would later run on the /dev/mem memory during spte teardown, leading to memory corruption.

The above BUG_ON is not a false positive: it exposes a limitation in the ability of the KVM page fault path to map any kind of memory coming from the host (including /dev/mem mappings).

So I suggest dropping FOLL_GET in the page fault, dropping kvm_release_page_* after the spte establishment, and relying entirely on the mmu notifier and kvm->mmu_lock. This means adding a vcpu->in_progress_fault_addr that is set before calling gup in hva_to_pfn, cleared by the mmu notifier code under kvm->mmu_lock, and checked under kvm->mmu_lock during spte establishment to know whether the page pointer has become stale, in which case we bail out and repeat the fault.

We'll still need FOLL_GET and set_page_dirty in some cases, such as after modifying the page in places like emulator_cmpxchg_emulated. Those places cannot depend on the mmu notifier, and the dirty bit set in the pte isn't enough, because the page can be swapped out to disk and marked clean before kmap_atomic runs. But 99% of the hva_to_pfn calls come from the KVM secondary MMU page faults; those are protected by the mmu notifier and can skip the refcounting completely, including FOLL_GET.
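To make the ordering of the proposed scheme easier to follow, here is a minimal single-threaded model of it in plain C. This is only a sketch under stated assumptions, not a patch: every toy_* name is hypothetical, the "lock" is a flag standing in for kvm->mmu_lock, and the "spte" is a single slot. Only in_progress_fault_addr is taken from the proposal above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NO_FAULT UINT64_MAX  /* hypothetical sentinel: no fault in flight */

/* Stand-in for kvm->mmu_lock; a flag is enough for a single-threaded model. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

struct toy_vcpu {
    struct toy_lock *mmu_lock;        /* shared per-VM lock */
    uint64_t in_progress_fault_addr;  /* hva being faulted, or TOY_NO_FAULT */
    uint64_t spte_pfn;                /* pfn held by the toy "spte" */
    bool spte_present;
};

/* Step 1: record the address before the pfn lookup. No lock, no refcount. */
static void toy_fault_begin(struct toy_vcpu *v, uint64_t hva)
{
    assert(hva != TOY_NO_FAULT);
    v->in_progress_fault_addr = hva;
}

/* Step 2: an mmu-notifier-style invalidation of [start, end) cancels any
 * in-flight fault in that range, under the lock. */
static void toy_notifier_invalidate(struct toy_vcpu *v, uint64_t start, uint64_t end)
{
    toy_lock_acquire(v->mmu_lock);
    if (v->in_progress_fault_addr >= start && v->in_progress_fault_addr < end)
        v->in_progress_fault_addr = TOY_NO_FAULT;
    toy_lock_release(v->mmu_lock);
}

/* Step 3: establish the spte only if no invalidation raced with the lookup.
 * Returns false when the pfn may be stale and the fault must be repeated. */
static bool toy_fault_commit(struct toy_vcpu *v, uint64_t hva, uint64_t pfn)
{
    bool ok;

    toy_lock_acquire(v->mmu_lock);
    ok = (v->in_progress_fault_addr == hva);
    if (ok) {
        v->spte_pfn = pfn;
        v->spte_present = true;
    }
    v->in_progress_fault_addr = TOY_NO_FAULT;
    toy_lock_release(v->mmu_lock);
    return ok;
}
```

In this model a fault that runs begin then commit with no invalidation in between installs the pfn, while an invalidation covering the address between the two steps makes commit return false, so the caller would redo the lookup. No reference on the page is taken at any point.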
Because we would then not have to run put_page at all anymore, the above BUG will disappear too.

In terms of performance, I estimate the only cost will be one "ATOMIC_ONCE(vcpu->in_progress_fault_addr) = addr" store, a lockless initialization local to a per-thread cacheline, before calling gup in hva_to_pfn. The gain will be the removal of all the refcounting atomic_inc/dec and set_page_dirty from all the KVM page faults.
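As a recap of why the release path is the problem today, the pfn-only decision described earlier can be modelled roughly as below. This is a simplified illustration, not the KVM source: struct toy_page and the toy_* functions are hypothetical, with `reserved` standing in for PageReserved and `refcount` for the page reference count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page descriptor; the fields are illustrative, not struct page. */
struct toy_page {
    bool reserved;  /* models PageReserved */
    int refcount;   /* models the page reference count */
};

/* With only the pfn at hand, the reserved flag is the sole hint:
 * reserved pages are treated as not refcounted. */
static bool toy_pfn_is_refcounted(const struct toy_page *p)
{
    return !p->reserved;
}

/* A PFNMAP-style lookup takes no reference, so the mapping is only safe
 * when the release path will also skip the reference drop. */
static bool toy_pfnmap_mapping_allowed(const struct toy_page *p)
{
    return !toy_pfn_is_refcounted(p);
}

/* Models the release path: drop a reference when the pfn looks refcounted. */
static void toy_release_pfn(struct toy_page *p)
{
    if (!toy_pfn_is_refcounted(p))
        return;
    assert(p->refcount > 0);
    p->refcount--;  /* models put_page */
}
```

For a non-reserved RAM page reached through /dev/mem, the lookup takes no reference, but the release would still drop one that belongs to the page's real owner. Refusing the mapping is the only safe answer in this model, and removing the release step altogether makes the question moot.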
mmapping physical memory
Hi all

I am using IVSHMEM to mmap /dev/mem into guest. The mmap works fine on QEMU without KVM support enabled, but with KVM i get kernel errors:

* (with EPT enabled)

[ 746.940720] [ cut here ]
[ 746.948612] kernel BUG at arch/x86/kvm/../../../virt/kvm/kvm_main.c:1257!
[ 746.949067] invalid opcode: [#1] SMP
[ 746.949393] Modules linked in: rte_kni(OF) igb_uio(OF) ebtable_nat(F) xt_CHECKSUM(F) bridge(F) stp(F) llc(F) nf_conntrack_netbios_ns(F) nf_conntrack_broadcast(F) ipt_MASQUERADE(F) ip6table_mangle(F) ip6t_REJECT(F) nf_conntrack_ipv6(F) nf_defrag_ipv6(F) bnep(F) bluetooth(F) rfkill(F) iptable_nat(F) nf_nat_ipv4(F) nf_nat(F) iptable_mangle(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F) ebtable_filter(F) ebtables(F) ip6table_filter(F) ip6_tables(F) be2iscsi(F) iscsi_boot_sysfs(F) bnx2i(F) cnic(F) uio(F) cxgb4i(F) cxgb4(F) cxgb3i(F) cxgb3(F) libcxgbi(F) ib_iser(F) rdma_cm(F) ib_addr(F) iw_cm(F) ib_cm(F) ib_sa(F) ib_mad(F) ib_core(F) iscsi_tcp(F) libiscsi_tcp(F) libiscsi(F) scsi_transport_iscsi(F) iTCO_wdt(F) iTCO_vendor_support(F) acpi_cpufreq(F) mperf(F) coretemp(F) shpchp(F)
[ 747.014963] lpc_ich(F) mfd_core(F) i2c_i801(F) ioatdma(F) microcode(F) joydev(F) i7core_edac(F) edac_core(F) vhost_net(F) tun(F) macvtap(F) macvlan(F) kvm_intel(F) kvm(F) uinput(F) crc32_pclmul(F) crc32c_intel(F) ghash_clmulni_intel(F) ast(F) ixgbe(F) igb(F) drm_kms_helper(F) e1000e(F) dca(F) ttm(F) ptp(F) drm(F) i2c_algo_bit(F) pps_core(F) mdio(F) i2c_core(F) sunrpc(F) [last unloaded: rte_kni]
[ 747.136764] CPU 8
[ 747.136909] Pid: 2501, comm: qemu-system-x86 Tainted: GF O 3.9.11-200.no_strict_dev_mem.fc18.x86_64 #1 Intel Corporation S5520HC/S5520HC
[ 747.228668] RIP: 0010:[] [] __gfn_to_pfn_memslot+0x36a/0x3e0 [kvm]
[ 747.259705] RSP: 0018:880130d39ae8 EFLAGS: 00010246
[ 747.291580] RAX: RBX: RCX: 8801effeb000
[ 747.322598] RDX: 001c3c00 RSI: 7fd11f00 RDI: ea00070f
[ 747.354242] RBP: 880130d39b58 R08: 0126 R09: 880130d39c2f
[ 747.385123] R10: R11: 7fd14000 R12: 7fd11f01
[ 747.415981] R13: 880130d39ba7 R14: 8801c3bcb4f0 R15: 8802b4538001
[ 747.447877] FS: 7fd35c1e9700() GS:8801e9c8() knlGS:
[ 747.479010] CS: 0010 DS: ES: CR0: 8005003b
[ 747.510220] CR2: 7fe2ffc0 CR3: 0001e66c4000 CR4: 27e0
[ 747.542410] DR0: DR1: DR2:
[ 747.573780] DR3: DR6: 0ff0 DR7: 0400
[ 747.604759] Process qemu-system-x86 (pid: 2501, threadinfo 880130d38000, task 8801c3bcb4f0)
[ 747.637044] Stack:
[ 747.668362] 880130d39af8 81083798 880130d39b48 7fd11f00
[ 747.700654] 001c3c00 00ff8802b3272a90 0380 8802b3272a80
[ 747.731895] 0380 000fc000 880130d39c38 880365fe8000
[ 747.763068] Call Trace:
[ 747.793746] [] ? hrtimer_start+0x18/0x20
[ 747.824435] [] __gfn_to_pfn+0x60/0x70 [kvm]
[ 747.855267] [] gfn_to_pfn_async+0x1a/0x20 [kvm]
[ 747.884586] [] try_async_pf+0x4a/0x1d0 [kvm]
[ 747.914146] [] tdp_page_fault+0xfa/0x210 [kvm]
[ 747.943000] [] kvm_mmu_page_fault+0x31/0x100 [kvm]
[ 747.972271] [] handle_ept_violation+0x5e/0x100 [kvm_intel]
[ 748.000620] [] vmx_handle_exit+0xf6/0x7c0 [kvm_intel]
[ 748.029860] [] ? kvm_apic_has_interrupt+0x28/0xe0 [kvm]
[ 748.058214] [] ? vmx_invpcid_supported+0x20/0x20 [kvm_intel]
[ 748.086496] [] kvm_arch_vcpu_ioctl_run+0x2fb/0x11a0 [kvm]
[ 748.114711] [] ? kvm_arch_vcpu_load+0x57/0x1e0 [kvm]
[ 748.142788] [] kvm_vcpu_ioctl+0x26e/0x5f0 [kvm]
[ 748.170647] [] ? do_futex+0x100/0xad0
[ 748.198558] [] ? perf_event_context_sched_in+0x94/0xc0
[ 748.226194] [] do_vfs_ioctl+0x97/0x580
[ 748.253809] [] ? file_has_perm+0x97/0xb0
[ 748.281110] [] sys_ioctl+0x91/0xb0
[ 748.307911] [] system_call_fastpath+0x16/0x1b
[ 748.88] Code: ff ff 49 29 d2 4c 89 d2 48 c1 ea 0c 48 03 90 98 00 00 00 48 89 d7 48 89 55 b0 e8 92 d6 ff ff 84 c0 48 8b 55 b0 0f 85 bf fe ff ff <0f> 0b 0f 1f 40 00 48 ba 00 00 00 00 00 00 f0 7f e9 aa fe ff ff
[ 748.392724] RIP [] __gfn_to_pfn_memslot+0x36a/0x3e0 [kvm]
[ 748.419435] RSP
[ 748.524222] ---[ end trace 854a37c471141217 ]---

*** (with EPT disabled)

[ 559.581338] [ cut here ]
[ 559.581701] kernel BUG at arch/x86/kvm/../../../virt/kvm/kvm_main.c:1257!
[ 559.582169] invalid opcode: [#1] SMP
[ 559.582499] Modules linked in: kvm_intel rte_kni(OF) igb_uio(OF) ebtable_nat xt_CHECKSUM bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle be2iscsi iscsi_boot_sysfs nf_conntrack_ipv4 bn