On Wed, Dec 24, 2008 at 04:28:44PM +0100, Andrea Arcangeli wrote:
> On Wed, Dec 24, 2008 at 02:50:57PM +0200, Avi Kivity wrote:
> > Marcelo Tosatti wrote:
> >> The destructor for huge pages uses the backing inode for adjusting
> >> hugetlbfs accounting.
> >>
> >> Hugepage mappings are destroyed by exit_mmap, after
> >> mmu_notifier_release, so there are no notifications through
> >> unmap_hugepage_range at this point.
> >>
> >> The hugetlbfs inode can be freed with pages backed by it referenced
> >> by the shadow. When the shadow releases its reference, the huge page
> >> destructor will access a now freed inode.
> >>
> >> Implement the release operation for kvm mmu notifiers to release page
> >> refs before the hugetlbfs inode is gone.
> >
> > I see, this isn't it. Andrea, comments?
>
> Yeah, the patch looks good; I talked a bit with Marcelo about this by
> PM. The issue is that it's not as straightforward as it seems.
> Basically, when I implemented the ->release handlers and had spte
> teardown running before the files were closed (instead of waiting for
> the kvm anon inode release handler to fire), I was getting bugchecks
> from debug options, including preempt=y (certain debug checks only
> become functional with preempt enabled, unfortunately). So eventually
> I removed ->release, because for kvm ->release wasn't useful (no
> guest mode can run any more by the time mmu notifier ->release is
> invoked), and that avoided the issues with the bugchecks.
>
> We'll be using the mmu notifier ->release because it's always called
> just before the filehandles are destroyed; it's not really about
> guest mode or the secondary mmu, but just an ordering issue with
> hugetlbfs internals.
>
> So in short, if no bugcheck triggers, this is fine (at least until
> hugetlbfs provides a way to register some callback to invoke at the
> start of the hugetlbfs ->release handler).
The only bugcheck I see, which triggers on vanilla kvm upstream with
CONFIG_PREEMPT_DEBUG=y and CONFIG_PREEMPT_RCU=y, is:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
<4>ttyS1: 1 input overrun(s)
last sysfs file: /sys/class/net/tap0/address
CPU 0
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 4768, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #164
RIP: 0010:[<ffffffff8028a5b6>]  [<ffffffff8028a5b6>] __purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88021e1f9a38  EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88021e1f9a78 R08: 0000000000000286 R09: ffffffff80a1bf50
R10: ffff880119c270f8 R11: ffff88021e1f99b8 R12: ffff88021e1f9a38
R13: ffff88021e1f9a90 R14: ffff88021e1f9a98 R15: 000000000000813a
FS:  0000000000000000(0000) GS:ffffffff8080d900(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000008d9828 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4768, threadinfo ffff88021e1f8000, task ffff880119c270f8)
Stack:
 ffff88022bdfd840 ffff880119da11b8 ffffc20011c30000 000000000000813a
 0000000000000000 0000000000000001 ffff88022ec11c18 ffff88022f061838
 ffff88021e1f9aa8 ffffffff8028ab1d ffff88021e1f9aa8 ffffc20021976000
Call Trace:
 [<ffffffff8028ab1d>] free_unmap_vmap_area_noflush+0x69/0x70
 [<ffffffff8028ab49>] remove_vm_area+0x25/0x71
 [<ffffffff8028ac54>] __vunmap+0x3a/0xca
 [<ffffffff8028ad35>] vfree+0x29/0x2b
 [<ffffffffa00f98a3>] kvm_free_physmem_slot+0x25/0x7c [kvm]
 [<ffffffffa00f9d75>] kvm_free_physmem+0x27/0x36 [kvm]
 [<ffffffffa00fccb4>] kvm_arch_destroy_vm+0xa6/0xda [kvm]
 [<ffffffffa00f9e11>] kvm_put_kvm+0x8d/0xa7 [kvm]
 [<ffffffffa00fa0e2>] kvm_vcpu_release+0x13/0x17 [kvm]
 [<ffffffff802a1c07>] __fput+0xeb/0x1a3
 [<ffffffff802a1cd4>] fput+0x15/0x17
 [<ffffffff8029f26c>] filp_close+0x67/0x72
 [<ffffffff802378a8>] put_files_struct+0x74/0xc8
 [<ffffffff80237943>] exit_files+0x47/0x4f
 [<ffffffff80238fe5>] do_exit+0x1eb/0x7a7
 [<ffffffff80587edf>] ? _spin_unlock_irq+0x2b/0x51
 [<ffffffff80239614>] do_group_exit+0x73/0xa0
 [<ffffffff80242b10>] get_signal_to_deliver+0x30c/0x32c
 [<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
 [<ffffffff8020a80f>] do_notify_resume+0x8c/0x851
 [<ffffffff8025b811>] ? do_futex+0x90/0x92a
 [<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
 [<ffffffff80587f51>] ? _spin_unlock_irqrestore+0x4c/0x68
 [<ffffffff8026be5c>] ? __rcu_read_unlock+0x92/0x9e
 [<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
 [<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8024f300>] ? getnstimeofday+0x3a/0x96
 [<ffffffff8024c4f0>] ? ktime_get_ts+0x49/0x4e
 [<ffffffff8020b4c1>] ? sysret_signal+0x5/0x29
 [<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
 [<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
 [<ffffffff8020b7b7>] ptregscall_common+0x67/0xb0
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 0c db 2f 00 48 8b 45 c0 48 8d 58 c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08 48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP  [<ffffffff8028a5b6>] __purge_vmap_area_lazy+0x12c/0x163
 RSP <ffff88021e1f9a38>
---[ end trace fde3e64ebe4bbca2 ]---
Fixing recursive fault but reboot is needed!
BUG: scheduling while atomic: qemu-system-x86/4768/0x00000003
Modules linked in: tun ipt_MASQUERADE iptable_nat nf_nat bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables dm_multipath kvm_intel kvm scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 4768, comm: qemu-system-x86 Tainted: G      D 2.6.28-00165-g4f27e3e-dirty #164
Call Trace:
 [<ffffffff8025585e>] ? __debug_show_held_locks+0x1b/0x24
 [<ffffffff8023187b>] __schedule_bug+0x8c/0x95
 [<ffffffff805851e1>] schedule+0xd3/0x902
 [<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8037a938>] ? put_io_context+0x67/0x72
 [<ffffffff80238ed4>] do_exit+0xda/0x7a7
 [<ffffffff805892c9>] oops_begin+0x0/0x90
 [<ffffffff8020e3c9>] die+0x5d/0x66
 [<ffffffff80588ff7>] do_general_protection+0x128/0x130
 [<ffffffff80588ecf>] ? do_general_protection+0x0/0x130
 [<ffffffff80588702>] error_exit+0x0/0xa9
 [<ffffffff8028a5b6>] ? __purge_vmap_area_lazy+0x12c/0x163
 [<ffffffff8028a5ae>] ? __purge_vmap_area_lazy+0x124/0x163
 [<ffffffff8028ab1d>] free_unmap_vmap_area_noflush+0x69/0x70
 [<ffffffff8028ab49>] remove_vm_area+0x25/0x71
 [<ffffffff8028ac54>] __vunmap+0x3a/0xca
 [<ffffffff8028ad35>] vfree+0x29/0x2b
 [<ffffffffa00f98a3>] kvm_free_physmem_slot+0x25/0x7c [kvm]
 [<ffffffffa00f9d75>] kvm_free_physmem+0x27/0x36 [kvm]
 [<ffffffffa00fccb4>] kvm_arch_destroy_vm+0xa6/0xda [kvm]
 [<ffffffffa00f9e11>] kvm_put_kvm+0x8d/0xa7 [kvm]
 [<ffffffffa00fa0e2>] kvm_vcpu_release+0x13/0x17 [kvm]
 [<ffffffff802a1c07>] __fput+0xeb/0x1a3
 [<ffffffff802a1cd4>] fput+0x15/0x17
 [<ffffffff8029f26c>] filp_close+0x67/0x72
 [<ffffffff802378a8>] put_files_struct+0x74/0xc8
 [<ffffffff80237943>] exit_files+0x47/0x4f
 [<ffffffff80238fe5>] do_exit+0x1eb/0x7a7
 [<ffffffff80587edf>] ? _spin_unlock_irq+0x2b/0x51
 [<ffffffff80239614>] do_group_exit+0x73/0xa0
 [<ffffffff80242b10>] get_signal_to_deliver+0x30c/0x32c
 [<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
 [<ffffffff8020a80f>] do_notify_resume+0x8c/0x851
 [<ffffffff8025b811>] ? do_futex+0x90/0x92a
 [<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
 [<ffffffff80587f51>] ? _spin_unlock_irqrestore+0x4c/0x68
 [<ffffffff8026be5c>] ? __rcu_read_unlock+0x92/0x9e
 [<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
 [<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8024f300>] ? getnstimeofday+0x3a/0x96
 [<ffffffff8024c4f0>] ? ktime_get_ts+0x49/0x4e
 [<ffffffff8020b4c1>] ? sysret_signal+0x5/0x29
 [<ffffffff80256bd7>] ? trace_hardirqs_on_caller+0xf0/0x114
 [<ffffffff8020b4d5>] ? sysret_signal+0x19/0x29
 [<ffffffff8020b7b7>] ptregscall_common+0x67/0xb0
ttyS1: 26 input overrun(s)

And it's not specific to the VM shutdown path. Another instance:

general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/class/net/tap0/address
CPU 5
Modules linked in: ipt_REJECT xt_state xt_tcpudp iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables tun kvm_intel kvm bridge stp llc dm_multipath scsi_wait_scan ata_piix libata dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod shpchp pci_hotplug mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd [last unloaded: x_tables]
Pid: 4440, comm: qemu-system-x86 Not tainted 2.6.28-00165-g4f27e3e-dirty #163
RIP: 0010:[<ffffffff8028a5b6>]  [<ffffffff8028a5b6>] __purge_vmap_area_lazy+0x12c/0x163
RSP: 0018:ffff88011f4c7be8  EFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000003
RDX: ffffffff80a1dae0 RSI: ffff880028083980 RDI: 0000000000000001
RBP: ffff88011f4c7c28 R08: 0000000000000282 R09: ffffffff80a1bf50
R10: ffff88022e9dc0f8 R11: ffff88011f4c7b68 R12: ffff88011f4c7be8
R13: ffff88011f4c7c40 R14: ffff88011f4c7c48 R15: 0000000000008001
FS:  0000000040abf950(0063) GS:ffff88022f25ed18(0000) knlGS:0000000000000000
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000229d34000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4440, threadinfo ffff88011f4c6000, task ffff88022e9dc0f8)
Stack:
 ffff8802291a14b0 ffff880229003d58 ffffc20021000000 0000000000008001
 ffff880229526000 0000000000000000 ffff88022d073000 ffff88011f58c0c0
 ffff88011f4c7c58 ffffffff8028ab1d ffff88011f4c7c58 ffffffffa015c000
Call Trace:
 [<ffffffff8028ab1d>] free_unmap_vmap_area_noflush+0x69/0x70
 [<ffffffff8028ab49>] remove_vm_area+0x25/0x71
 [<ffffffff8028ac54>] __vunmap+0x3a/0xca
 [<ffffffff8028ad0a>] vunmap+0x26/0x28
 [<ffffffffa01be092>] pio_copy_data+0xcf/0x113 [kvm]
 [<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffffa01be16f>] complete_pio+0x99/0x1ef [kvm]
 [<ffffffff8023fcd2>] ? sigprocmask+0xc6/0xd0
 [<ffffffffa01c0295>] kvm_arch_vcpu_ioctl_run+0x9a/0x889 [kvm]
 [<ffffffffa01b84f4>] kvm_vcpu_ioctl+0xfc/0x48b [kvm]
 [<ffffffff802ac760>] vfs_ioctl+0x2a/0x78
 [<ffffffff8026be5c>] ? __rcu_read_unlock+0x92/0x9e
 [<ffffffff802acb46>] do_vfs_ioctl+0x398/0x3c6
 [<ffffffff80256c08>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff802acbb6>] sys_ioctl+0x42/0x65
 [<ffffffff8020b43b>] system_call_fastpath+0x16/0x1b
Code: 46 48 c7 c7 c0 d1 74 80 4c 8d 65 c0 e8 8c da 2f 00 48 8b 45 c0 48 8d 58 c0 eb 10 48 89 df e8 74 fe ff ff 48 8b 43 40 48 8d 58 c0 <48> 8b 43 40 0f 18 08 48 8d 43 40 4c 39 e0 75 e0 48 c7 c7 c0 d1
RIP  [<ffffffff8028a5b6>] __purge_vmap_area_lazy+0x12c/0x163
 RSP <ffff88011f4c7be8>
---[ end trace 31811279a2e983e8 ]---
note: qemu-system-x86[4440] exited with preempt_count 2

(gdb) l *(__purge_vmap_area_lazy + 0x12c)
0xffffffff80289ca2 is in __purge_vmap_area_lazy (mm/vmalloc.c:516).
511                 if (nr || force_flush)
512                         flush_tlb_kernel_range(*start, *end);
513
514                 if (nr) {
515                         spin_lock(&vmap_area_lock);
516                         list_for_each_entry(va, &valist, purge_list)
517                                 __free_vmap_area(va);
518                         spin_unlock(&vmap_area_lock);
519                 }
520                 spin_unlock(&purge_lock);

0xffffffff80289c9a <__purge_vmap_area_lazy+292>: mov    0x40(%rbx),%rax
0xffffffff80289c9e <__purge_vmap_area_lazy+296>: lea    -0x40(%rax),%rbx
0xffffffff80289ca2 <__purge_vmap_area_lazy+300>: mov    0x40(%rbx),%rax
                                                 ^^^^^^^^^^^^^^^^^^^
0xffffffff80289ca6 <__purge_vmap_area_lazy+304>: prefetcht0 (%rax)

Which vanishes once PREEMPT_RCU is disabled. Nick? KVM does not make
direct use of RCU.

The same issue happens if the entire __purge_vmap_area_lazy runs with
vmap_area_lock held.