Re: EPT: Misconfiguration
On Sun, Feb 27, 2011 at 11:46, Avi Kivity wrote:
>
> Copying netdev: looks like memory corruption in the networking stack.
>
> Archive link: http://www.spinics.net/lists/kvm/msg50651.html (for the
> attachment).

There's now only a single guest running on this host (Ubuntu Maverick).
I've also upgraded the host kernel to 2.6.38-rc6, and this just happened (after a day or so):

2011-03-05T19:41:58.328866+01:00 phy005 kernel: [85271.656862] BUG kmalloc-2048 (Not tainted): Object padding overwritten
2011-03-05T19:41:58.328870+01:00 phy005 kernel: [85271.656864] -
2011-03-05T19:41:58.328875+01:00 phy005 kernel: [85271.656866]
2011-03-05T19:41:58.328885+01:00 phy005 kernel: [85271.656870] INFO: 0x880c0d52a960-0x880c0d52a967. First byte 0x0 instead of 0x5a
2011-03-05T19:41:58.328890+01:00 phy005 kernel: [85271.656880] INFO: Allocated in __netdev_alloc_skb+0x1f/0x3b age=16039 cpu=5 pid=0
2011-03-05T19:41:58.328894+01:00 phy005 kernel: [85271.656886] INFO: Freed in skb_release_data+0xa5/0xaa age=0 cpu=5 pid=1766
2011-03-05T19:41:58.328898+01:00 phy005 kernel: [85271.656890] INFO: Slab 0xea002a2ea0c0 objects=15 used=13 fp=0x880c0d52a120 flags=0xc040c1
2011-03-05T19:41:58.328902+01:00 phy005 kernel: [85271.656894] INFO: Object 0x880c0d52a120 @offset=8480 fp=0x880c0d52d2d0
2011-03-05T19:41:58.328905+01:00 phy005 kernel: [85271.656895]
2011-03-05T19:41:58.328909+01:00 phy005 kernel: [85271.656897] Bytes b4 0x880c0d52a110: 14 89 12 05 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a
2011-03-05T19:41:58.328913+01:00 phy005 kernel: [85271.656909] Object 0x880c0d52a120: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

We have a quite complex network stack, two interfaces (igb) attached to bond0, with on top two bridges and on that two vlans. The guest is running a vpn and an IPv6 tunnel.

Let me know if more info is needed.
Kind regards, Ruben -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
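A note for readers of the slub_debug report above: with poisoning enabled, SLUB fills padding with 0x5a and freed object bytes with 0x6b (POISON_INUSE and POISON_FREE in include/linux/poison.h), which is why the report says "First byte 0x0 instead of 0x5a". The following is only a minimal Python sketch for checking hex bytes copied out of such a report; the helper name first_mismatch is hypothetical and is not part of any kernel tooling.

```python
# Poison values as defined in include/linux/poison.h.
POISON_INUSE = 0x5A  # fill for padding / not-yet-initialised bytes
POISON_FREE = 0x6B   # fill for freed object bytes


def first_mismatch(hex_dump, expected):
    """Return (offset, value) of the first byte that is not `expected`, else None.

    `hex_dump` is a whitespace-separated string of hex bytes, as printed
    in the "Bytes b4" and "Object" lines of the report.
    """
    for offset, token in enumerate(hex_dump.split()):
        value = int(token, 16)
        if value != expected:
            return offset, value
    return None


# Bytes copied from the report in this message.
object_bytes = "6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b"
bytes_before = "14 89 12 05 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a"

# The last eight bytes shown before the object are the 0x5a run.
poison_run = " ".join(bytes_before.split()[8:])

# Length of the overwritten range 0x880c0d52a960-0x880c0d52a967 (inclusive).
overwritten_len = 0x880C0D52A967 - 0x880C0D52A960 + 1

print(first_mismatch(object_bytes, POISON_FREE))
print(first_mismatch(poison_run, POISON_INUSE))
print(overwritten_len)
```

The visible part of the freed object is still fully poisoned, so the damage in this report is confined to the reported padding range rather than the first bytes of the object itself.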
Re: EPT: Misconfiguration
Copying netdev: looks like memory corruption in the networking stack.

Archive link: http://www.spinics.net/lists/kvm/msg50651.html (for the attachment).

On 02/24/2011 11:15 PM, Ruben Kerkhof wrote:
>
> On Tue, Feb 15, 2011 at 18:16, Marcelo Tosatti wrote:
>> This and the others reported. So yes, it looks something is corrupting
>> memory. Ruben, you can try to boot with slub_debug=ZFPU kernel option.

Ok, there are now only 6 vms left on this host, and I've booted it with the slub_debug=ZFPU option. After a few hours, I got the following result:

2011-02-24T21:41:30.818496+01:00 phy005 kernel: =
2011-02-24T21:41:30.818517+01:00 phy005 kernel: BUG kmalloc-2048 (Not tainted): Object padding overwritten
2011-02-24T21:41:30.818523+01:00 phy005 kernel: -
2011-02-24T21:41:30.818526+01:00 phy005 kernel:
2011-02-24T21:41:30.818530+01:00 phy005 kernel: INFO: 0x8806230752ca-0x8806230752cf. First byte 0x0 instead of 0x5a
2011-02-24T21:41:30.818534+01:00 phy005 kernel: INFO: Allocated in __netdev_alloc_skb+0x34/0x51 age=2231 cpu=8 pid=0
2011-02-24T21:41:30.818537+01:00 phy005 kernel: INFO: Freed in skb_release_data+0xc9/0xce age=2368 cpu=8 pid=2159
2011-02-24T21:41:30.818541+01:00 phy005 kernel: INFO: Slab 0xea00157a9880 objects=15 used=13 fp=0x8806230752d0 flags=0x404083
2011-02-24T21:41:30.818545+01:00 phy005 kernel: INFO: Object 0x880623074a88 @offset=19080 fp=0x8806230752d0

The rest of the output is attached since it's quite large.

Kind regards,
Ruben

--
error compiling committee.c: too many arguments to function
Re: EPT: Misconfiguration
Hi Marcelo,

On Tue, Feb 15, 2011 at 18:16, Marcelo Tosatti wrote:
> On Sun, Feb 13, 2011 at 03:03:40PM +0200, Avi Kivity wrote:
>> On 02/13/2011 04:07 AM, Ruben Kerkhof wrote:
>> >And tonight we had another one of those errors we had a few weeks ago:
>> >
>> >2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration.
>> >2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000
>>
>> This GPA indexes into the 511th entry of the spte. Marcelo, does
>> this remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052
>> by any chance?
>
> This and the others reported. So yes, it looks something is corrupting
> memory. Ruben, you can try to boot with slub_debug=ZFPU kernel option.

Sure, but not for a while, I'm first moving all my customers off this machine. We've had to reboot it like 5 or 6 times in the last couple of weeks. As soon as that's done I'm going to test the hell out of it.

Now that we moved a few of the vm's we don't see any oopses, so it could either be that it only triggers under load, or there's a specific guest which is triggering it.

> Is there any reason for not upgrading to FC14?

I haven't had a reason to upgrade yet, all our other machines are running fine, using the same kernel. Plus I'm still finding lots of issues unrelated to kvm on F14, broken ssh in combination with openldap, ipmi bugs, selinux policy etc. Next to that it takes a lot of time to test all our images etc.

I'll probably skip the F14 kernel and go straight to 2.6.38, since that should bring significant improvements like THP, async pagefaults etc.

Kind regards,
Ruben
Re: EPT: Misconfiguration
On Sun, Feb 13, 2011 at 03:03:40PM +0200, Avi Kivity wrote:
> On 02/13/2011 04:07 AM, Ruben Kerkhof wrote:
> >And tonight we had another one of those errors we had a few weeks ago:
> >
> >2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration.
> >2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000
>
> This GPA indexes into the 511th entry of the spte. Marcelo, does
> this remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052
> by any chance?

This and the others reported. So yes, it looks something is corrupting
memory. Ruben, you can try to boot with slub_debug=ZFPU kernel option.

Is there any reason for not upgrading to FC14?
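The remark that the GPA "indexes into the 511th entry" can be checked with a little arithmetic. The sketch below assumes 4 KiB pages and 9 index bits per paging level (the usual x86-64 and EPT 4-level layout); the function name table_index is hypothetical. It only reproduces the index calculation and says nothing about the cause of the corruption.

```python
PAGE_SHIFT = 12       # 4 KiB pages (assumption stated above)
BITS_PER_LEVEL = 9    # 512 entries per table
INDEX_MASK = (1 << BITS_PER_LEVEL) - 1


def table_index(addr, level):
    """Index into the level-`level` table for `addr` (level 1 = last level)."""
    shift = PAGE_SHIFT + BITS_PER_LEVEL * (level - 1)
    return (addr >> shift) & INDEX_MASK


# GPAs reported in the two "EPT: Misconfiguration" messages in this thread.
for gpa in (0x2EDFF000, 0x2ABFF038):
    indexes = [table_index(gpa, level) for level in (4, 3, 2, 1)]
    print(hex(gpa), indexes)
```

Both reported GPAs land on index 511 at level 1, the last slot of a 512-entry table, which matches the level 1 spte being the bad one in both traces.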
Re: EPT: Misconfiguration
On Sun, Feb 13, 2011 at 14:03, Avi Kivity wrote: > On 02/13/2011 04:07 AM, Ruben Kerkhof wrote: >> >> And tonight we had another one of those errors we had a few weeks ago: >> >> 2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration. >> 2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000 > > This GPA indexes into the 511th entry of the spte. Marcelo, does this > remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052 by any > chance? > >> 2011-02-13T02:56:28.694914+01:00 phy005 kernel: >> ept_misconfig_inspect_spte: spte 0x25602d007 level 4 >> 2011-02-13T02:56:28.694916+01:00 phy005 kernel: >> ept_misconfig_inspect_spte: spte 0x3df3e2007 level 3 >> 2011-02-13T02:56:28.694919+01:00 phy005 kernel: >> ept_misconfig_inspect_spte: spte 0x5e90c7007 level 2 >> 2011-02-13T02:56:28.694925+01:00 phy005 kernel: >> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1 > > Magic 1603a073 pte. > >> 2011-02-13T02:56:28.694928+01:00 phy005 kernel: >> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000 >> 2011-02-13T02:56:28.694930+01:00 phy005 kernel: [ cut here >> ] >> 2011-02-13T02:56:28.694933+01:00 phy005 kernel: WARNING: at >> arch/x86/kvm/vmx.c:3425 handle_ept_misconfig+0x152/0x1d8 [kvm_intel]() >> 2011-02-13T02:56:28.694936+01:00 phy005 kernel: Hardware name: X8DTU >> 2011-02-13T02:56:28.694941+01:00 phy005 kernel: Modules linked in: tun >> ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding >> xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter >> ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt igb ioatdma >> dca iTCO_vendor_support joydev serio_raw microcode 3w_9xxx [last >> unloaded: scsi_wait_scan] >> 2011-02-13T02:56:28.695004+01:00 phy005 kernel: Pid: 4756, comm: >> qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 >> 2011-02-13T02:56:28.695008+01:00 phy005 kernel: Call Trace: >> 2011-02-13T02:56:28.695013+01:00 phy005 kernel: [] >> warn_slowpath_common+0x7c/0x94 >> 
2011-02-13T02:56:28.695020+01:00 phy005 kernel: []
>> warn_slowpath_null+0x14/0x16
>> 2011-02-13T02:56:28.695024+01:00 phy005 kernel: []
>> handle_ept_misconfig+0x152/0x1d8 [kvm_intel]
>> 2011-02-13T02:56:28.695028+01:00 phy005 kernel: []
>> vmx_handle_exit+0x204/0x23a [kvm_intel]
>> 2011-02-13T02:56:28.695033+01:00 phy005 kernel: []
>> kvm_arch_vcpu_ioctl_run+0x7cd/0xa74 [kvm]
>> 2011-02-13T02:56:28.695037+01:00 phy005 kernel: []
>> kvm_vcpu_ioctl+0xfd/0x56e [kvm]
>> 2011-02-13T02:56:28.695042+01:00 phy005 kernel: [] ?
>> virt_to_head_page+0xe/0x2f
>> 2011-02-13T02:56:28.695046+01:00 phy005 kernel: [] ?
>> mempool_kfree+0xe/0x10
>> 2011-02-13T02:56:28.695051+01:00 phy005 kernel: [] ?
>> mempool_free+0x76/0x7b
>> 2011-02-13T02:56:28.695055+01:00 phy005 kernel: []
>> vfs_ioctl+0x32/0xa6
>> 2011-02-13T02:56:28.695060+01:00 phy005 kernel: []
>> do_vfs_ioctl+0x483/0x4c9
>> 2011-02-13T02:56:28.695065+01:00 phy005 kernel: []
>> sys_ioctl+0x56/0x79
>> 2011-02-13T02:56:28.695070+01:00 phy005 kernel: []
>> system_call_fastpath+0x16/0x1b
>> 2011-02-13T02:56:28.695074+01:00 phy005 kernel: ---[ end trace
>> d95032626ea304ca ]---
>>
>> Any help would be much appreciated. It seems very strange that I'm the
>> first one who runs into this.
>> I've found two bugreports which report the same, the first one at
>>
>> https://partner-bugzilla.redhat.com/show_bug.cgi?format=multiple&id=613691,
>> but that's a duplicate of
>> https://partner-bugzilla.redhat.com/show_bug.cgi?id=606131 which I'm
>> not authorized to see...
>
> These don't appear to be related. Are you running ksm, btw?

No.

Kind regards,
Ruben
Re: EPT: Misconfiguration
Hi Avi, On Sun, Feb 13, 2011 at 13:58, Avi Kivity wrote: > On 02/10/2011 05:23 PM, Ruben Kerkhof wrote: >> >> This machine has been running for a week without problems, but then we >> started to get the following oopses again: >> >> 2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle >> kernel paging request at ea71929180e0 >> 2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP: >> [] gup_pte_range+0x94/0xd3 >> 2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0 >> 2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops: [#1] SMP >> 2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file: >> /sys/devices/system/cpu/cpu15/topology/thread_siblings >> 2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4 >> 2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun >> ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding >> xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter >> ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb >> iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded: >> scsi_wait_scan] >> 2011-02-06T19:45:35.31+01:00 phy005 kernel: >> 2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm: >> qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU >> 2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP: >> 0010:[] [] >> gup_pte_range+0x94/0xd3 >> 2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP: >> 0018:88060b9bda78 EFLAGS: 00010082 >> 2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0 >> RBX: 3000 RCX: 0005 >> 2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40 >> RSI: 7fe54e3ff000 RDI: 1603a07305004067 >> 2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98 >> R08: 880b94384560 R09: 88060b9bdb44 >> 2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8 >> R11: ea00 R12: 0205 >> 2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff >> R14: 0005 R15: >> 2011-02-06T19:45:35.55+01:00 phy005 kernel: FS: >> 
7fe64cb0e700() GS:88065540() >> knlGS: >> 2011-02-06T19:45:35.59+01:00 phy005 kernel: CS: 0010 DS: 002b ES: >> 002b CR0: 80050033 >> 2011-02-06T19:45:35.63+01:00 phy005 kernel: CR2: ea71929180e0 >> CR3: 000bff06d000 CR4: 26e0 >> 2011-02-06T19:45:35.67+01:00 phy005 kernel: DR0: >> DR1: DR2: >> 2011-02-06T19:45:35.71+01:00 phy005 kernel: DR3: >> DR6: 0ff0 DR7: 0400 >> 2011-02-06T19:45:35.74+01:00 phy005 kernel: Process qemu-kvm (pid: >> 3650, threadinfo 88060b9bc000, task 880623ed2ee0) >> 2011-02-06T19:45:35.78+01:00 phy005 kernel: Stack: >> 2011-02-06T19:45:35.81+01:00 phy005 kernel: 7fe54e40 >> 7fe54e40 7fe54e40 88053a0d2388 >> 2011-02-06T19:45:35.85+01:00 phy005 kernel:<0> 88060b9bdaf8 >> 81034a15 7fe54e3f 7fe54e3f >> 2011-02-06T19:45:35.89+01:00 phy005 kernel:<0> 88060b9bdb44 >> 880b94384560 880bff06eca8 880bff06d7f8 >> 2011-02-06T19:45:35.92+01:00 phy005 kernel: Call Trace: >> 2011-02-06T19:45:35.96+01:00 phy005 kernel: [] >> gup_pud_range+0x156/0x192 >> 2011-02-06T19:45:35.222300+01:00 phy005 kernel: [] >> get_user_pages_fast+0xc4/0x172 >> 2011-02-06T19:45:35.222304+01:00 phy005 kernel: [] ? >> bio_add_page+0x36/0x38 >> 2011-02-06T19:45:35.222308+01:00 phy005 kernel: [] >> dio_get_page+0x54/0x127 >> 2011-02-06T19:45:35.222312+01:00 phy005 kernel: [] >> __blockdev_direct_IO+0x41d/0xa36 >> 2011-02-06T19:45:35.222316+01:00 phy005 kernel: [] ? >> x86_emulate_insn+0x1ff8/0x2d61 [kvm] >> 2011-02-06T19:45:35.222320+01:00 phy005 kernel: [] >> blkdev_direct_IO+0x4e/0x50 >> 2011-02-06T19:45:35.222324+01:00 phy005 kernel: [] ? >> blkdev_get_blocks+0x0/0x8d >> 2011-02-06T19:45:35.222328+01:00 phy005 kernel: [] >> generic_file_direct_write+0xed/0x16d >> 2011-02-06T19:45:35.222331+01:00 phy005 kernel: [] >> __generic_file_aio_write+0x196/0x281 >> 2011-02-06T19:45:35.222335+01:00 phy005 kernel: [] ? >> file_has_perm+0xa4/0xc6 >> 2011-02-06T19:45:35.222339+01:00 phy005 kernel: [] ? 
>> blkdev_aio_write+0x0/0x69 >> 2011-02-06T19:45:35.222343+01:00 phy005 kernel: [] >> blkdev_aio_write+0x2a/0x69 >> 2011-02-06T19:45:35.222347+01:00 phy005 kernel: [] ? >> blkdev_aio_write+0x0/0x69 >> 2011-02-06T19:45:35.222351+01:00 phy005 kernel: [] >> aio_rw_vect_retry+0x85/0x18e >> 2011-02-06T19:45:35.222355+01:00 phy005 kernel: [] >> aio_run_iocb+0x77/0x10f >> 2011-02-06T19:45:35.222359+01:00 phy005 kernel: [] >> do_io_submit+0x558/0x7ce >> 2011-02-06T19:45:35.222363+01:00 phy005 kernel: [] >> sys_io_submit+0x10/0x12 >> 2011-02-06T19:45:35.222366+01:00 phy005 kernel: [] >> system_call_fastpath+0x16/0x1b >> 2011-02-06T19:
Re: EPT: Misconfiguration
On 02/13/2011 04:07 AM, Ruben Kerkhof wrote: And tonight we had another one of those errors we had a few weeks ago: 2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration. 2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000 This GPA indexes into the 511th entry of the spte. Marcelo, does this remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052 by any chance? 2011-02-13T02:56:28.694914+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x25602d007 level 4 2011-02-13T02:56:28.694916+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x3df3e2007 level 3 2011-02-13T02:56:28.694919+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x5e90c7007 level 2 2011-02-13T02:56:28.694925+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1 Magic 1603a073 pte. 2011-02-13T02:56:28.694928+01:00 phy005 kernel: ept_misconfig_inspect_spte: rsvd_bits = 0x3a000 2011-02-13T02:56:28.694930+01:00 phy005 kernel: [ cut here ] 2011-02-13T02:56:28.694933+01:00 phy005 kernel: WARNING: at arch/x86/kvm/vmx.c:3425 handle_ept_misconfig+0x152/0x1d8 [kvm_intel]() 2011-02-13T02:56:28.694936+01:00 phy005 kernel: Hardware name: X8DTU 2011-02-13T02:56:28.694941+01:00 phy005 kernel: Modules linked in: tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt igb ioatdma dca iTCO_vendor_support joydev serio_raw microcode 3w_9xxx [last unloaded: scsi_wait_scan] 2011-02-13T02:56:28.695004+01:00 phy005 kernel: Pid: 4756, comm: qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 2011-02-13T02:56:28.695008+01:00 phy005 kernel: Call Trace: 2011-02-13T02:56:28.695013+01:00 phy005 kernel: [] warn_slowpath_common+0x7c/0x94 2011-02-13T02:56:28.695020+01:00 phy005 kernel: [] warn_slowpath_null+0x14/0x16 2011-02-13T02:56:28.695024+01:00 phy005 kernel: [] handle_ept_misconfig+0x152/0x1d8 [kvm_intel] 
2011-02-13T02:56:28.695028+01:00 phy005 kernel: [] vmx_handle_exit+0x204/0x23a [kvm_intel]
2011-02-13T02:56:28.695033+01:00 phy005 kernel: [] kvm_arch_vcpu_ioctl_run+0x7cd/0xa74 [kvm]
2011-02-13T02:56:28.695037+01:00 phy005 kernel: [] kvm_vcpu_ioctl+0xfd/0x56e [kvm]
2011-02-13T02:56:28.695042+01:00 phy005 kernel: [] ? virt_to_head_page+0xe/0x2f
2011-02-13T02:56:28.695046+01:00 phy005 kernel: [] ? mempool_kfree+0xe/0x10
2011-02-13T02:56:28.695051+01:00 phy005 kernel: [] ? mempool_free+0x76/0x7b
2011-02-13T02:56:28.695055+01:00 phy005 kernel: [] vfs_ioctl+0x32/0xa6
2011-02-13T02:56:28.695060+01:00 phy005 kernel: [] do_vfs_ioctl+0x483/0x4c9
2011-02-13T02:56:28.695065+01:00 phy005 kernel: [] sys_ioctl+0x56/0x79
2011-02-13T02:56:28.695070+01:00 phy005 kernel: [] system_call_fastpath+0x16/0x1b
2011-02-13T02:56:28.695074+01:00 phy005 kernel: ---[ end trace d95032626ea304ca ]---

Any help would be much appreciated. It seems very strange that I'm the first one who runs into this.
I've found two bugreports which report the same, the first one at https://partner-bugzilla.redhat.com/show_bug.cgi?format=multiple&id=613691, but that's a duplicate of https://partner-bugzilla.redhat.com/show_bug.cgi?id=606131 which I'm not authorized to see...

These don't appear to be related. Are you running ksm, btw?

--
error compiling committee.c: too many arguments to function
Re: EPT: Misconfiguration
On 02/10/2011 05:23 PM, Ruben Kerkhof wrote: This machine has been running for a week without problems, but then we started to get the following oopses again: 2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle kernel paging request at ea71929180e0 2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP: [] gup_pte_range+0x94/0xd3 2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0 2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops: [#1] SMP 2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings 2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4 2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded: scsi_wait_scan] 2011-02-06T19:45:35.31+01:00 phy005 kernel: 2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm: qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU 2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP: 0010:[] [] gup_pte_range+0x94/0xd3 2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP: 0018:88060b9bda78 EFLAGS: 00010082 2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0 RBX: 3000 RCX: 0005 2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40 RSI: 7fe54e3ff000 RDI: 1603a07305004067 2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98 R08: 880b94384560 R09: 88060b9bdb44 2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8 R11: ea00 R12: 0205 2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff R14: 0005 R15: 2011-02-06T19:45:35.55+01:00 phy005 kernel: FS: 7fe64cb0e700() GS:88065540() knlGS: 2011-02-06T19:45:35.59+01:00 phy005 kernel: CS: 0010 DS: 002b ES: 002b CR0: 80050033 2011-02-06T19:45:35.63+01:00 phy005 kernel: CR2: ea71929180e0 CR3: 
000bff06d000 CR4: 26e0 2011-02-06T19:45:35.67+01:00 phy005 kernel: DR0: DR1: DR2: 2011-02-06T19:45:35.71+01:00 phy005 kernel: DR3: DR6: 0ff0 DR7: 0400 2011-02-06T19:45:35.74+01:00 phy005 kernel: Process qemu-kvm (pid: 3650, threadinfo 88060b9bc000, task 880623ed2ee0) 2011-02-06T19:45:35.78+01:00 phy005 kernel: Stack: 2011-02-06T19:45:35.81+01:00 phy005 kernel: 7fe54e40 7fe54e40 7fe54e40 88053a0d2388 2011-02-06T19:45:35.85+01:00 phy005 kernel:<0> 88060b9bdaf8 81034a15 7fe54e3f 7fe54e3f 2011-02-06T19:45:35.89+01:00 phy005 kernel:<0> 88060b9bdb44 880b94384560 880bff06eca8 880bff06d7f8 2011-02-06T19:45:35.92+01:00 phy005 kernel: Call Trace: 2011-02-06T19:45:35.96+01:00 phy005 kernel: [] gup_pud_range+0x156/0x192 2011-02-06T19:45:35.222300+01:00 phy005 kernel: [] get_user_pages_fast+0xc4/0x172 2011-02-06T19:45:35.222304+01:00 phy005 kernel: [] ? bio_add_page+0x36/0x38 2011-02-06T19:45:35.222308+01:00 phy005 kernel: [] dio_get_page+0x54/0x127 2011-02-06T19:45:35.222312+01:00 phy005 kernel: [] __blockdev_direct_IO+0x41d/0xa36 2011-02-06T19:45:35.222316+01:00 phy005 kernel: [] ? x86_emulate_insn+0x1ff8/0x2d61 [kvm] 2011-02-06T19:45:35.222320+01:00 phy005 kernel: [] blkdev_direct_IO+0x4e/0x50 2011-02-06T19:45:35.222324+01:00 phy005 kernel: [] ? blkdev_get_blocks+0x0/0x8d 2011-02-06T19:45:35.222328+01:00 phy005 kernel: [] generic_file_direct_write+0xed/0x16d 2011-02-06T19:45:35.222331+01:00 phy005 kernel: [] __generic_file_aio_write+0x196/0x281 2011-02-06T19:45:35.222335+01:00 phy005 kernel: [] ? file_has_perm+0xa4/0xc6 2011-02-06T19:45:35.222339+01:00 phy005 kernel: [] ? blkdev_aio_write+0x0/0x69 2011-02-06T19:45:35.222343+01:00 phy005 kernel: [] blkdev_aio_write+0x2a/0x69 2011-02-06T19:45:35.222347+01:00 phy005 kernel: [] ? 
blkdev_aio_write+0x0/0x69 2011-02-06T19:45:35.222351+01:00 phy005 kernel: [] aio_rw_vect_retry+0x85/0x18e 2011-02-06T19:45:35.222355+01:00 phy005 kernel: [] aio_run_iocb+0x77/0x10f 2011-02-06T19:45:35.222359+01:00 phy005 kernel: [] do_io_submit+0x558/0x7ce 2011-02-06T19:45:35.222363+01:00 phy005 kernel: [] sys_io_submit+0x10/0x12 2011-02-06T19:45:35.222366+01:00 phy005 kernel: [] system_call_fastpath+0x16/0x1b 2011-02-06T19:45:35.222372+01:00 phy005 kernel: Code: 21 d8 49 01 c2 49 8b 3a 49 89 fe 4d 21 ee 4d 21 e6 49 39 ce 75 49 48 89 f8 0f 1f 40 00 48 21 d8 48 c1 e8 0c 48 6b c0 38 4c 01 d8<66> 83 38 00 48 89 c7 79 04 48 8b 78 10 f0 ff 47 08 49 63 39 48 2011-02-06T19:45:35.222376+01:00 phy005 kernel: RIP [] gup_pte_range+0x94/0xd3 2011-02-06T19:45:35.222379+01:00 ph
Re: EPT: Misconfiguration
On Thu, Feb 10, 2011 at 16:23, Ruben Kerkhof wrote: > On Wed, Jan 26, 2011 at 16:00, Ruben Kerkhof wrote: >> On Wed, Jan 26, 2011 at 10:52, Avi Kivity wrote: >>> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote: > When you say "suddenly", this was with no changes to software and > hardware? The host software and hardware hasn't changed in the two months since the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13. We host customer vms on it though, so virtual machines come and go. Various operating systems, a mixture of Linux, FreeBSD and Windows 2008 R2. We have other machines with the same config without these problems though. >>> >>> Are those other machines running a similar workload? >> >> Yes, similar, or they're more heavily loaded. >> >> On this machine, about half of the 48GB memory was used for virtual machines. >> >>> The traces look awfully like bad hardware, though that can also be explained >>> by random memory corruption due to a bug. >> >> Yeah, that's what I'm expecting. We already replaced the memory, next >> step is to move the disks over to another server to make sure it's not >> the board or cpu's. >> This time I have a few different messages though: 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault: [#1] SMP RSI: RDI: 1603a07305001568 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 4f 08 0f 94 c0 84 c0 74 10 85 f6 75 07 e8 63 fe ff ff eb >>> >>> lock decl 0x8(%rdi) >>> >>> %rdi is completely crap, looks like corruption again. Strangely, it is >>> similar to the bad spte from the previous trace: 0x1603a0730500d277. 
The >>> upper 48 bits are identical, the lower 16 bits are different.: 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted page table at address 7f37b37ff000 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD 94e538067 PMD 61e5bf067 PTE 1603a0730500e067 >>> >>> Here are those magic 48 bits again, in the PTE entry. 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration. 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038 2011-01-25T12:38:49.417526+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4 2011-01-25T12:38:49.417532+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x5db595007 level 3 2011-01-25T12:38:49.417553+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2 2011-01-25T12:38:49.417558+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1 >>> >>> Again. >>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in process qemu-kvm pte:1603a0730500d067 pmd:61059f067 >>> >>> Again. >>> >>> However, these all came from a single boot, yes? >> >> Correct. >> >>> If so they can be the same >>> corruption. Please collect more traces, with reboots in between. 
> > This machine has been running for a week without problems, but then we > started to get the following oopses again: > > 2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle > kernel paging request at ea71929180e0 > 2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP: > [] gup_pte_range+0x94/0xd3 > 2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0 > 2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops: [#1] SMP > 2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file: > /sys/devices/system/cpu/cpu15/topology/thread_siblings > 2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4 > 2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun > ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding > xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter > ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb > iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded: > scsi_wait_scan] > 2011-02-06T19:45:35.31+01:00 phy005 kernel: > 2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm: > qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU > 2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP: > 0010:[] [] > gup_pte_range+0x94/0xd3 > 2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP: > 0018:88060b9bda78 EFLAGS: 00010082 > 2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0 > RBX: 3000 RCX: 0005 > 2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40 > RSI: 7fe54e3ff000 RDI: 1603a07305004067 > 2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98 > R08: 880b94384560 R09: 88060b9bdb44 > 2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff
Re: EPT: Misconfiguration
On Wed, Jan 26, 2011 at 16:00, Ruben Kerkhof wrote: > On Wed, Jan 26, 2011 at 10:52, Avi Kivity wrote: >> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote: >>> >>> > When you say "suddenly", this was with no changes to software and >>> > hardware? >>> >>> The host software and hardware hasn't changed in the two months since >>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13. >>> >>> We host customer vms on it though, so virtual machines come and go. >>> Various operating systems, a mixture of Linux, FreeBSD and Windows >>> 2008 R2. We have other machines with the same config without these >>> problems though. >> >> Are those other machines running a similar workload? > > Yes, similar, or they're more heavily loaded. > > On this machine, about half of the 48GB memory was used for virtual machines. > >> The traces look awfully like bad hardware, though that can also be explained >> by random memory corruption due to a bug. > > Yeah, that's what I'm expecting. We already replaced the memory, next > step is to move the disks over to another server to make sure it's not > the board or cpu's. > >>> This time I have a few different messages though: >>> >>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault: >>> [#1] SMP >>> >>> RSI: RDI: 1603a07305001568 >>> >>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46 >>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d >>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 4f 08 0f 94 c0 84 >>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb >> >> lock decl 0x8(%rdi) >> >> %rdi is completely crap, looks like corruption again. Strangely, it is >> similar to the bad spte from the previous trace: 0x1603a0730500d277. 
The >> upper 48 bits are identical, the lower 16 bits are different.: >>> >>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted >>> page table at address 7f37b37ff000 >>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD >>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067 >> >> Here are those magic 48 bits again, in the PTE entry. >>> >>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration. >>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038 >>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4 >>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3 >>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2 >>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel: >>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1 >> >> Again. >> >>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in >>> process qemu-kvm pte:1603a0730500d067 pmd:61059f067 >> >> Again. >> >> However, these all came from a single boot, yes? > > Correct. > >> If so they can be the same >> corruption. Please collect more traces, with reboots in between. 
This machine has been running for a week without problems, but then we started to get the following oopses again: 2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle kernel paging request at ea71929180e0 2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP: [] gup_pte_range+0x94/0xd3 2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0 2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops: [#1] SMP 2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings 2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4 2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded: scsi_wait_scan] 2011-02-06T19:45:35.31+01:00 phy005 kernel: 2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm: qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU 2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP: 0010:[] [] gup_pte_range+0x94/0xd3 2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP: 0018:88060b9bda78 EFLAGS: 00010082 2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0 RBX: 3000 RCX: 0005 2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40 RSI: 7fe54e3ff000 RDI: 1603a07305004067 2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98 R08: 880b94384560 R09: 88060b9bdb44 2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8 R11: ea00 R12: 0205 2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff R14: 0005 R15: 2011-02-06T19:45:35.55+01:00 phy0
Re: EPT: Misconfiguration
On Wed, Jan 26, 2011 at 10:52, Avi Kivity wrote:
> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:
>>
>> > When you say "suddenly", this was with no changes to software and
>> > hardware?
>>
>> The host software and hardware hasn't changed in the two months since
>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>>
>> We host customer vms on it though, so virtual machines come and go.
>> Various operating systems, a mixture of Linux, FreeBSD and Windows
>> 2008 R2. We have other machines with the same config without these
>> problems though.
>
> Are those other machines running a similar workload?

Yes, similar, or they're more heavily loaded. On this machine, about half
of the 48GB memory was used for virtual machines.

> The traces look awfully like bad hardware, though that can also be explained
> by random memory corruption due to a bug.

Yeah, that's what I'm expecting. We already replaced the memory; the next
step is to move the disks over to another server to make sure it's not the
board or the CPUs.

>> This time I have a few different messages though:
>>
>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
>> [#1] SMP
>>
>> RSI: RDI: 1603a07305001568
>>
>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 4f 08 0f 94 c0 84
>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb
>
> lock decl 0x8(%rdi)
>
> %rdi is completely crap, looks like corruption again. Strangely, it is
> similar to the bad spte from the previous trace: 0x1603a0730500d277. The
> upper 48 bits are identical, the lower 16 bits are different.:
>>
>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
>> page table at address 7f37b37ff000
>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067
>
> Here are those magic 48 bits again, in the PTE entry.
>>
>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1
>
> Again.
>
>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
>> process qemu-kvm pte:1603a0730500d067 pmd:61059f067
>
> Again.
>
> However, these all came from a single boot, yes?

Correct.

> If so they can be the same
> corruption. Please collect more traces, with reboots in between.

Ok, thanks, will do.

Kind regards,

Ruben
Re: EPT: Misconfiguration
On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:
>
> > When you say "suddenly", this was with no changes to software and
> > hardware?
>
> The host software and hardware hasn't changed in the two months since
> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>
> We host customer vms on it though, so virtual machines come and go.
> Various operating systems, a mixture of Linux, FreeBSD and Windows
> 2008 R2. We have other machines with the same config without these
> problems though.

Are those other machines running a similar workload?

The traces look awfully like bad hardware, though that can also be explained
by random memory corruption due to a bug.

> This time I have a few different messages though:
>
> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
> [#1] SMP
>
> RSI: RDI: 1603a07305001568
>
> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 ff 4f 08 0f 94 c0 84
> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb

lock decl 0x8(%rdi)

%rdi is completely crap, looks like corruption again. Strangely, it is
similar to the bad spte from the previous trace: 0x1603a0730500d277. The
upper 48 bits are identical, the lower 16 bits are different.:

> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
> page table at address 7f37b37ff000
> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067

Here are those magic 48 bits again, in the PTE entry.

> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1

Again.

> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
> process qemu-kvm pte:1603a0730500d067 pmd:61059f067

Again.

However, these all came from a single boot, yes? If so they can be the same
corruption. Please collect more traces, with reboots in between.

--
error compiling committee.c: too many arguments to function
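The observation that the bad values share their upper 48 bits can be checked mechanically. The Python sketch below is only an illustration: the five constants are copied from the traces quoted in this thread, while the labels and the helper name split_48_16 are made up for the example.

```python
# Illustrative sketch: split each suspicious 64-bit value from the traces
# into its upper 48 bits and lower 16 bits, then compare the upper parts.
# The labels and the helper name are hypothetical; the constants come from
# the log lines quoted in the thread.
VALUES = {
    "spte, EPT misconfig (Jan 20)": 0x1603A0730500D277,
    "RDI, general protection fault": 0x1603A07305001568,
    "PTE, corrupted page table": 0x1603A0730500E067,
    "spte, EPT misconfig (Jan 25)": 0x1603A07305006277,
    "pte, bad page map": 0x1603A0730500D067,
}


def split_48_16(value):
    """Return (upper 48 bits, lower 16 bits) of a 64-bit value."""
    return value >> 16, value & 0xFFFF


# Collect the distinct upper-48-bit patterns across all values.
upper_parts = {split_48_16(v)[0] for v in VALUES.values()}

for name, v in VALUES.items():
    hi, lo = split_48_16(v)
    print(f"{name:32s} upper48={hi:#014x} lower16={lo:#06x}")

print("all upper 48 bits identical:", len(upper_parts) == 1)
```

A single shared upper pattern across unrelated structures (a register, a page-table entry, a shadow page-table entry) is what makes the values look like one recurring corruption rather than independent bit flips.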
Re: EPT: Misconfiguration
Hi Avi,

On Tue, Jan 25, 2011 at 18:39, Avi Kivity wrote:
> On 01/25/2011 04:44 PM, Ruben Kerkhof wrote:
>>
>> Hi Marcello,
>>
>> On Fri, Jan 21, 2011 at 14:22, Marcelo Tosatti
>> wrote:
>> > On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
>> >> I'm suddenly getting lots of the following errors on a server running
>> >> 2.36.7, but I have no idea what it means:
>> >>
>> >> 2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
>> >> 2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
>> >> 2011-01-20T12:41:18.358624+01:00 phy005 kernel:
>> >> ept_misconfig_inspect_spte: spte 0x50743e007 level 4
>> >> 2011-01-20T12:41:18.358627+01:00 phy005 kernel:
>> >> ept_misconfig_inspect_spte: spte 0x523de2007 level 3
>> >> 2011-01-20T12:41:18.358629+01:00 phy005 kernel:
>> >> ept_misconfig_inspect_spte: spte 0x62336f007 level 2
>> >> 2011-01-20T12:41:18.360109+01:00 phy005 kernel:
>> >> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
>> >> 2011-01-20T12:41:18.360137+01:00 phy005 kernel:
>> >> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
>> >> 2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
>> >> ]
>> >
>> > A shadow pagetable entry in memory has bits 45-49 set, which is not
>> > allowed. Its probably bad memory if this errors were not present before
>> > with the same workload and host software. Would be useful to see what
>> > memtest86 says.
>>
>> I did 2 memtest86+ passes, but no errors were found.
>>
>> Just to be save, we replaced all memory. The machine has been running
>> stable over the weekend, but now gives exactly the same error.
>>
>> Is there anything else which could cause this?
>
> Try updating the BIOS.

That's the first thing we did. It's a Supermicro with an X8DTU-F board,
updated to BIOS version 2.0b (which includes the latest microcode). The
processors are Intel 5620s.

> When you say "suddenly", this was with no changes to software and hardware?
The host software and hardware haven't changed in the two months since
the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.

We host customer vms on it though, so virtual machines come and go.
Various operating systems, a mixture of Linux, FreeBSD and Windows
2008 R2. We have other machines with the same config without these
problems though.

> Is cooling adequate?

Yes.

> How much memory is on that machine? Even outside the reserved bits the
> address looks way too large.

48GB.

This time I have a few different messages though:

2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault: [#1] SMP
2011-01-25T11:58:50.001310+01:00 phy005 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings
2011-01-25T11:58:50.001316+01:00 phy005 kernel: CPU 12
2011-01-25T11:58:50.001323+01:00 phy005 kernel: Modules linked in: tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm igb i2c_i801 iTCO_wdt i2c_core ioatdma joydev iTCO_vendor_support dca serio_raw 3w_9xxx [last unloaded: scsi_wait_scan]
2011-01-25T11:58:50.001327+01:00 phy005 kernel:
2011-01-25T11:58:50.001331+01:00 phy005 kernel: Pid: 1849, comm: qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU
2011-01-25T11:58:50.001336+01:00 phy005 kernel: RIP: 0010:[] [] __free_pages+0x9/0x26
2011-01-25T11:58:50.001339+01:00 phy005 kernel: RSP: 0018:8802fbe45ab8 EFLAGS: 00010216
2011-01-25T11:58:50.001343+01:00 phy005 kernel: RAX: 88061ef8c000 RBX: 8803131ec100 RCX:
2011-01-25T11:58:50.001348+01:00 phy005 kernel: RDX: 00ff RSI: RDI: 1603a07305001568
2011-01-25T11:58:50.001352+01:00 phy005 kernel: RBP: 8802fbe45ab8 R08: ea000a83b7f0 R09: 0004
2011-01-25T11:58:50.001356+01:00 phy005 kernel: R10: R11: 8802fbe45b38 R12: 0100
2011-01-25T11:58:50.001359+01:00 phy005 kernel: R13: 0001 R14: 8802e934c010 R15: 8802e934c010
2011-01-25T11:58:50.001363+01:00 phy005 kernel: FS: 7f1f14844700() GS:88065548() knlGS:
2011-01-25T11:58:50.001366+01:00 phy005 kernel: CS: 0010 DS: ES: CR0: 8005003b
2011-01-25T11:58:50.001370+01:00 phy005 kernel: CR2: b72f6cb0 CR3: 000ba561c000 CR4: 26e0
2011-01-25T11:58:50.001374+01:00 phy005 kernel: DR0: DR1: DR2:
2011-01-25T11:58:50.001378+01:00 phy005 kernel: DR3: DR6: 0ff0 DR7: 0400
2011-01-25T11:58:50.001382+01:00 phy005 kernel: Process qemu-kvm (pid: 1849, threadinfo 8802fbe44000, task 8802ea11aee0)
2011-01-25T11:58:50.001385+01:00 phy005 kernel: Stack:
2011-01-25T11:58:50.001389+01:00 phy005 kernel: 8802fbe45af8 810ee455 0206 c9001e2d4000
2011-01-25T11:58:50.00
Re: EPT: Misconfiguration
On 01/25/2011 04:44 PM, Ruben Kerkhof wrote:
> Hi Marcello,
>
> On Fri, Jan 21, 2011 at 14:22, Marcelo Tosatti wrote:
> > On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
> >> I'm suddenly getting lots of the following errors on a server running
> >> 2.36.7, but I have no idea what it means:
> >>
> >> 2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
> >> 2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
> >> 2011-01-20T12:41:18.358624+01:00 phy005 kernel:
> >> ept_misconfig_inspect_spte: spte 0x50743e007 level 4
> >> 2011-01-20T12:41:18.358627+01:00 phy005 kernel:
> >> ept_misconfig_inspect_spte: spte 0x523de2007 level 3
> >> 2011-01-20T12:41:18.358629+01:00 phy005 kernel:
> >> ept_misconfig_inspect_spte: spte 0x62336f007 level 2
> >> 2011-01-20T12:41:18.360109+01:00 phy005 kernel:
> >> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
> >> 2011-01-20T12:41:18.360137+01:00 phy005 kernel:
> >> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
> >> 2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
> >> ]
> >
> > A shadow pagetable entry in memory has bits 45-49 set, which is not
> > allowed. Its probably bad memory if this errors were not present before
> > with the same workload and host software. Would be useful to see what
> > memtest86 says.
>
> I did 2 memtest86+ passes, but no errors were found.
>
> Just to be save, we replaced all memory. The machine has been running
> stable over the weekend, but now gives exactly the same error.
>
> Is there anything else which could cause this?

Try updating the BIOS.

When you say "suddenly", this was with no changes to software and hardware?

Is cooling adequate?

How much memory is on that machine? Even outside the reserved bits the
address looks way too large.

--
error compiling committee.c: too many arguments to function
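To put a rough number on "the address looks way too large", the Python sketch below extracts the page-frame address from the level-1 spte and compares it with the 48GB reported for this host elsewhere in the thread. Several things are assumptions made for illustration, not facts from the thread: the entry is treated as a 4 KiB mapping whose address starts at bit 12, the two physical-address widths (40 and 52 bits) are just example values, "48GB" is read as 48 GiB, and the helper name frame_address is hypothetical.

```python
# Illustrative sketch under stated assumptions: treat the level-1 spte as a
# 4 KiB mapping and pull out the physical address it would point to.
SPTE = 0x1603A0730500D277          # level-1 spte from the log
INSTALLED_RAM = 48 * 2**30         # "48GB" from the thread, read here as GiB


def frame_address(entry, phys_bits):
    """Address held in bits 12..phys_bits-1 of a 4 KiB page-table entry.

    phys_bits is an assumed physical-address width, not a value taken
    from the host in question.
    """
    mask = ((1 << phys_bits) - 1) & ~0xFFF
    return entry & mask


for phys_bits in (40, 52):
    addr = frame_address(SPTE, phys_bits)
    beyond = addr >= INSTALLED_RAM
    print(f"phys_bits={phys_bits}: address={addr:#x} beyond installed RAM: {beyond}")
```

Physical addresses can legitimately sit somewhat above the installed RAM size because of memory holes and device mappings, so this comparison is only a plausibility check, but under either assumed width the address here exceeds 48 GiB by a wide margin.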
Re: EPT: Misconfiguration
Hi Marcelo,

On Fri, Jan 21, 2011 at 14:22, Marcelo Tosatti wrote:
> On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
>> I'm suddenly getting lots of the following errors on a server running
>> 2.36.7, but I have no idea what it means:
>>
>> 2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
>> 2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
>> 2011-01-20T12:41:18.358624+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x50743e007 level 4
>> 2011-01-20T12:41:18.358627+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x523de2007 level 3
>> 2011-01-20T12:41:18.358629+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x62336f007 level 2
>> 2011-01-20T12:41:18.360109+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
>> 2011-01-20T12:41:18.360137+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
>> 2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
>> ]
>
> A shadow pagetable entry in memory has bits 45-49 set, which is not
> allowed. Its probably bad memory if this errors were not present before
> with the same workload and host software. Would be useful to see what
> memtest86 says.

I did 2 memtest86+ passes, but no errors were found.

Just to be safe, we replaced all memory. The machine has been running
stable over the weekend, but now gives exactly the same error.

Is there anything else which could cause this?

Kind regards,

Ruben
Re: EPT: Misconfiguration
On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
> I'm suddenly getting lots of the following errors on a server running
> 2.36.7, but I have no idea what it means:
>
> 2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
> 2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
> 2011-01-20T12:41:18.358624+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x50743e007 level 4
> 2011-01-20T12:41:18.358627+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x523de2007 level 3
> 2011-01-20T12:41:18.358629+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x62336f007 level 2
> 2011-01-20T12:41:18.360109+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
> 2011-01-20T12:41:18.360137+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
> 2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
> ]

A shadow pagetable entry in memory has bits 45-49 set, which is not
allowed. It's probably bad memory if these errors were not present before
with the same workload and host software. Would be useful to see what
memtest86 says.
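The bit positions mentioned above can be read directly off the level-1 spte. The Python sketch below only lists which bits of that value are set in a given range; it does not reproduce the kernel's own reserved-bit check, and the helper name set_bits is made up for the example.

```python
# Illustrative sketch: list the set bits of the level-1 spte from the log
# within a chosen bit range. This is plain bit arithmetic, not the check
# that the kernel itself performs.
SPTE = 0x1603A0730500D277


def set_bits(value: int, lo: int, hi: int) -> list[int]:
    """Return the positions of the bits set in value within [lo, hi]."""
    return [bit for bit in range(lo, hi + 1) if (value >> bit) & 1]


if __name__ == "__main__":
    print("bits set in 45..49:", set_bits(SPTE, 45, 49))
    print("bits set in 52..63:", set_bits(SPTE, 52, 63))
```

Within the 45-49 range every bit except 46 is set, and several bits above bit 51 are set as well, which fits the picture of a value that was never a valid page-table entry to begin with.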
Re: EPT: Misconfiguration
On Thu, Jan 20, 2011 at 12:48, Ruben Kerkhof wrote:
> I'm suddenly getting lots of the following errors on a server running
> 2.36.7, but I have no idea what it means:

Sorry, that should be 2.34.7.

Kind regards,

Ruben