Re: EPT: Misconfiguration

2011-03-05 Thread Ruben Kerkhof
On Sun, Feb 27, 2011 at 11:46, Avi Kivity  wrote:
>
> Copying netdev: looks like memory corruption in the networking stack.
>
> Archive link: http://www.spinics.net/lists/kvm/msg50651.html (for the
> attachment).

There's now only a single guest running on this host (Ubuntu Maverick).
I've also upgraded the host kernel to 2.6.38-rc6, and this just
happened (after a day or so):

2011-03-05T19:41:58.328866+01:00 phy005 kernel: [85271.656862] BUG
kmalloc-2048 (Not tainted): Object padding overwritten
2011-03-05T19:41:58.328870+01:00 phy005 kernel: [85271.656864]
-
2011-03-05T19:41:58.328875+01:00 phy005 kernel: [85271.656866]
2011-03-05T19:41:58.328885+01:00 phy005 kernel: [85271.656870] INFO:
0x880c0d52a960-0x880c0d52a967. First byte 0x0 instead of 0x5a
2011-03-05T19:41:58.328890+01:00 phy005 kernel: [85271.656880] INFO:
Allocated in __netdev_alloc_skb+0x1f/0x3b age=16039 cpu=5 pid=0
2011-03-05T19:41:58.328894+01:00 phy005 kernel: [85271.656886] INFO:
Freed in skb_release_data+0xa5/0xaa age=0 cpu=5 pid=1766
2011-03-05T19:41:58.328898+01:00 phy005 kernel: [85271.656890] INFO:
Slab 0xea002a2ea0c0 objects=15 used=13 fp=0x880c0d52a120
flags=0xc040c1
2011-03-05T19:41:58.328902+01:00 phy005 kernel: [85271.656894] INFO:
Object 0x880c0d52a120 @offset=8480 fp=0x880c0d52d2d0
2011-03-05T19:41:58.328905+01:00 phy005 kernel: [85271.656895]
2011-03-05T19:41:58.328909+01:00 phy005 kernel: [85271.656897] Bytes
b4 0x880c0d52a110:  14 89 12 05 01 00 00 00 5a 5a 5a 5a 5a 5a 5a
5a 
2011-03-05T19:41:58.328913+01:00 phy005 kernel: [85271.656909]
Object 0x880c0d52a120:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 
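
For readers following the report above: with slab poisoning enabled, SLUB fills padding with 0x5a (POISON_INUSE) and freed objects with 0x6b (POISON_FREE). That is why the dump shows runs of those bytes and complains about a 0x0 where 0x5a was expected. The Python sketch below illustrates that kind of check. It is not the kernel's code; the function name and the sample buffer are made up for illustration.

```python
# Poison values as defined in the kernel's include/linux/poison.h.
POISON_INUSE = 0x5A  # fill for padding and not-yet-initialised objects
POISON_FREE = 0x6B   # fill for freed objects


def find_overwritten(buf, expected=POISON_INUSE):
    """Return (first, last) offsets of bytes that differ from `expected`,
    or None if the whole buffer still holds the poison value."""
    bad = [i for i, b in enumerate(buf) if b != expected]
    if not bad:
        return None
    return bad[0], bad[-1]


# Hypothetical 16-byte padding area in which 8 bytes were zeroed,
# similar in shape to the 8-byte range reported in the log.
padding = bytes([POISON_INUSE] * 4 + [0x00] * 8 + [POISON_INUSE] * 4)

span = find_overwritten(padding)
if span is not None:
    first, last = span
    print("overwritten offsets %d-%d, first byte 0x%x instead of 0x%x"
          % (first, last, padding[first], POISON_INUSE))
```

The real report gives kernel virtual addresses rather than offsets, but the idea is the same: the first and last bytes that no longer hold the poison value bound the damaged range.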

We have quite a complex network stack: two interfaces (igb) attached
to bond0, with two bridges on top of that, and two VLANs on top of those.
The guest is running a VPN and an IPv6 tunnel.

Let me know if more info is needed.

Kind regards,

Ruben
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: EPT: Misconfiguration

2011-02-27 Thread Avi Kivity


Copying netdev: looks like memory corruption in the networking stack.

Archive link: http://www.spinics.net/lists/kvm/msg50651.html (for the 
attachment).


On 02/24/2011 11:15 PM, Ruben Kerkhof wrote:

>
>  On Tue, Feb 15, 2011 at 18:16, Marcelo Tosatti  wrote:

>>  This and the others reported. So yes, it looks like something is corrupting
>>  memory. Ruben, you can try to boot with the slub_debug=ZFPU kernel option.

Ok, there are now only 6 vms left on this host, and I've booted it
with the slub_debug=ZFPU option.
After a few hours, I got the following result:

2011-02-24T21:41:30.818496+01:00 phy005 kernel:
=
2011-02-24T21:41:30.818517+01:00 phy005 kernel: BUG kmalloc-2048 (Not
tainted): Object padding overwritten
2011-02-24T21:41:30.818523+01:00 phy005 kernel:
-
2011-02-24T21:41:30.818526+01:00 phy005 kernel:
2011-02-24T21:41:30.818530+01:00 phy005 kernel: INFO:
0x8806230752ca-0x8806230752cf. First byte 0x0 instead of 0x5a
2011-02-24T21:41:30.818534+01:00 phy005 kernel: INFO: Allocated in
__netdev_alloc_skb+0x34/0x51 age=2231 cpu=8 pid=0
2011-02-24T21:41:30.818537+01:00 phy005 kernel: INFO: Freed in
skb_release_data+0xc9/0xce age=2368 cpu=8 pid=2159
2011-02-24T21:41:30.818541+01:00 phy005 kernel: INFO: Slab
0xea00157a9880 objects=15 used=13 fp=0x8806230752d0
flags=0x404083
2011-02-24T21:41:30.818545+01:00 phy005 kernel: INFO: Object
0x880623074a88 @offset=19080 fp=0x8806230752d0

The rest of the output is attached since it's quite large.

Kind regards,

Ruben



--
error compiling committee.c: too many arguments to function



Re: EPT: Misconfiguration

2011-02-15 Thread Ruben Kerkhof
Hi Marcelo,

On Tue, Feb 15, 2011 at 18:16, Marcelo Tosatti  wrote:
> On Sun, Feb 13, 2011 at 03:03:40PM +0200, Avi Kivity wrote:
>> On 02/13/2011 04:07 AM, Ruben Kerkhof wrote:
>> >And tonight we had another one of those errors we had a few weeks ago:
>> >
>> >2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration.
>> >2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000
>>
>> This GPA indexes into the 511th entry of the spte.  Marcelo, does
>> this remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052
>> by any chance?
>
> This and the others reported. So yes, it looks like something is corrupting
> memory. Ruben, you can try to boot with the slub_debug=ZFPU kernel option.

Sure, but not for a while; I'm first moving all my customers off this
machine. We've had to reboot it 5 or 6 times in the last couple
of weeks.
As soon as that's done I'm going to test the hell out of it.

Now that we've moved a few of the VMs we don't see any oopses, so it
could be either that it only triggers under load, or that there's a
specific guest triggering it.

> Is there any reason for not upgrading to FC14?

I haven't had a reason to upgrade yet; all our other machines are
running fine, using the same kernel.
Plus I'm still finding lots of issues on F14 unrelated to kvm: broken
ssh in combination with openldap, ipmi bugs, selinux policy etc.
On top of that, it takes a lot of time to test all our images.

I'll probably skip the F14 kernel and go straight to 2.6.38, since that
should bring significant improvements like THP, async page faults, etc.

Kind regards,

Ruben


Re: EPT: Misconfiguration

2011-02-15 Thread Marcelo Tosatti
On Sun, Feb 13, 2011 at 03:03:40PM +0200, Avi Kivity wrote:
> On 02/13/2011 04:07 AM, Ruben Kerkhof wrote:
> >And tonight we had another one of those errors we had a few weeks ago:
> >
> >2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration.
> >2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000
> 
> This GPA indexes into the 511th entry of the spte.  Marcelo, does
> this remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052
> by any chance?

This and the others reported. So yes, it looks like something is corrupting
memory. Ruben, you can try to boot with the slub_debug=ZFPU kernel option.
Is there any reason for not upgrading to FC14?
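
As a reading aid for the slub_debug=ZFPU suggestion above, the small Python sketch below spells out what each flag letter asks SLUB to do. The descriptions follow the kernel's SLUB documentation; the helper name is hypothetical and the table covers only the commonly documented letters.

```python
# Flag letters as described in the kernel's SLUB documentation.
SLUB_DEBUG_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning (object and padding)",
    "U": "user tracking (allocation and free)",
    "T": "trace",
}


def describe_slub_debug(option):
    """Expand a 'slub_debug=<flags>[,<slabs>]' boot option into
    (letter, description) pairs, in the order the letters were given."""
    name, _, value = option.partition("=")
    if name != "slub_debug":
        raise ValueError("not a slub_debug option: %r" % option)
    flags = value.split(",")[0]  # an optional slab list may follow a comma
    return [(c, SLUB_DEBUG_FLAGS.get(c, "unknown flag")) for c in flags]


for letter, meaning in describe_slub_debug("slub_debug=ZFPU"):
    print(letter, "=", meaning)
```

With these flags, user tracking is what produces the "Allocated in" and "Freed in" lines of a SLUB report, and poisoning is what allows overwritten padding to be detected.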



Re: EPT: Misconfiguration

2011-02-13 Thread Ruben Kerkhof
On Sun, Feb 13, 2011 at 14:03, Avi Kivity  wrote:
> On 02/13/2011 04:07 AM, Ruben Kerkhof wrote:
>>
>> And tonight we had another one of those errors we had a few weeks ago:
>>
>> 2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration.
>> 2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000
>
> This GPA indexes into the 511th entry of the spte.  Marcelo, does this
> remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052 by any
> chance?
>
>> 2011-02-13T02:56:28.694914+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x25602d007 level 4
>> 2011-02-13T02:56:28.694916+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x3df3e2007 level 3
>> 2011-02-13T02:56:28.694919+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5e90c7007 level 2
>> 2011-02-13T02:56:28.694925+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
>
> Magic 1603a073 pte.
>
>> 2011-02-13T02:56:28.694928+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
>> 2011-02-13T02:56:28.694930+01:00 phy005 kernel: [ cut here
>> ]
>> 2011-02-13T02:56:28.694933+01:00 phy005 kernel: WARNING: at
>> arch/x86/kvm/vmx.c:3425 handle_ept_misconfig+0x152/0x1d8 [kvm_intel]()
>> 2011-02-13T02:56:28.694936+01:00 phy005 kernel: Hardware name: X8DTU
>> 2011-02-13T02:56:28.694941+01:00 phy005 kernel: Modules linked in: tun
>> ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
>> xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
>> ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt igb ioatdma
>> dca iTCO_vendor_support joydev serio_raw microcode 3w_9xxx [last
>> unloaded: scsi_wait_scan]
>> 2011-02-13T02:56:28.695004+01:00 phy005 kernel: Pid: 4756, comm:
>> qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1
>> 2011-02-13T02:56:28.695008+01:00 phy005 kernel: Call Trace:
>> 2011-02-13T02:56:28.695013+01:00 phy005 kernel: []
>> warn_slowpath_common+0x7c/0x94
>> 2011-02-13T02:56:28.695020+01:00 phy005 kernel: []
>> warn_slowpath_null+0x14/0x16
>> 2011-02-13T02:56:28.695024+01:00 phy005 kernel: []
>> handle_ept_misconfig+0x152/0x1d8 [kvm_intel]
>> 2011-02-13T02:56:28.695028+01:00 phy005 kernel: []
>> vmx_handle_exit+0x204/0x23a [kvm_intel]
>> 2011-02-13T02:56:28.695033+01:00 phy005 kernel: []
>> kvm_arch_vcpu_ioctl_run+0x7cd/0xa74 [kvm]
>> 2011-02-13T02:56:28.695037+01:00 phy005 kernel: []
>> kvm_vcpu_ioctl+0xfd/0x56e [kvm]
>> 2011-02-13T02:56:28.695042+01:00 phy005 kernel: [] ?
>> virt_to_head_page+0xe/0x2f
>> 2011-02-13T02:56:28.695046+01:00 phy005 kernel: [] ?
>> mempool_kfree+0xe/0x10
>> 2011-02-13T02:56:28.695051+01:00 phy005 kernel: [] ?
>> mempool_free+0x76/0x7b
>> 2011-02-13T02:56:28.695055+01:00 phy005 kernel: []
>> vfs_ioctl+0x32/0xa6
>> 2011-02-13T02:56:28.695060+01:00 phy005 kernel: []
>> do_vfs_ioctl+0x483/0x4c9
>> 2011-02-13T02:56:28.695065+01:00 phy005 kernel: []
>> sys_ioctl+0x56/0x79
>> 2011-02-13T02:56:28.695070+01:00 phy005 kernel: []
>> system_call_fastpath+0x16/0x1b
>> 2011-02-13T02:56:28.695074+01:00 phy005 kernel: ---[ end trace
>> d95032626ea304ca ]---
>>
>> Any help would be much appreciated. It seems very strange that I'm the
>> first one who runs into this.
>> I've found two bugreports which report the same, the first one at
>>
>> https://partner-bugzilla.redhat.com/show_bug.cgi?format=multiple&id=613691,
>> but that's a duplicate of
>> https://partner-bugzilla.redhat.com/show_bug.cgi?id=606131 which I'm
>> not authorized to see...
>
> These don't appear to be related.  Are you running ksm, btw?

No.

Kind regards,

Ruben


Re: EPT: Misconfiguration

2011-02-13 Thread Ruben Kerkhof
Hi Avi,

On Sun, Feb 13, 2011 at 13:58, Avi Kivity  wrote:
> On 02/10/2011 05:23 PM, Ruben Kerkhof wrote:
>>
>> This machine has been running for a week without problems, but then we
>> started to get the following oopses again:
>>
>> 2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle
>> kernel paging request at ea71929180e0
>> 2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP:
>> [] gup_pte_range+0x94/0xd3
>> 2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0
>> 2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops:  [#1] SMP
>> 2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file:
>> /sys/devices/system/cpu/cpu15/topology/thread_siblings
>> 2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4
>> 2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun
>> ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
>> xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
>> ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb
>> iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded:
>> scsi_wait_scan]
>> 2011-02-06T19:45:35.31+01:00 phy005 kernel:
>> 2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm:
>> qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU
>> 2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP:
>> 0010:[]  []
>> gup_pte_range+0x94/0xd3
>> 2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP:
>> 0018:88060b9bda78  EFLAGS: 00010082
>> 2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0
>> RBX: 3000 RCX: 0005
>> 2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40
>> RSI: 7fe54e3ff000 RDI: 1603a07305004067
>> 2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98
>> R08: 880b94384560 R09: 88060b9bdb44
>> 2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8
>> R11: ea00 R12: 0205
>> 2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff
>> R14: 0005 R15: 
>> 2011-02-06T19:45:35.55+01:00 phy005 kernel: FS:
>> 7fe64cb0e700() GS:88065540()
>> knlGS:
>> 2011-02-06T19:45:35.59+01:00 phy005 kernel: CS:  0010 DS: 002b ES:
>> 002b CR0: 80050033
>> 2011-02-06T19:45:35.63+01:00 phy005 kernel: CR2: ea71929180e0
>> CR3: 000bff06d000 CR4: 26e0
>> 2011-02-06T19:45:35.67+01:00 phy005 kernel: DR0: 
>> DR1:  DR2: 
>> 2011-02-06T19:45:35.71+01:00 phy005 kernel: DR3: 
>> DR6: 0ff0 DR7: 0400
>> 2011-02-06T19:45:35.74+01:00 phy005 kernel: Process qemu-kvm (pid:
>> 3650, threadinfo 88060b9bc000, task 880623ed2ee0)
>> 2011-02-06T19:45:35.78+01:00 phy005 kernel: Stack:
>> 2011-02-06T19:45:35.81+01:00 phy005 kernel: 7fe54e40
>> 7fe54e40 7fe54e40 88053a0d2388
>> 2011-02-06T19:45:35.85+01:00 phy005 kernel:<0>  88060b9bdaf8
>> 81034a15 7fe54e3f 7fe54e3f
>> 2011-02-06T19:45:35.89+01:00 phy005 kernel:<0>  88060b9bdb44
>> 880b94384560 880bff06eca8 880bff06d7f8
>> 2011-02-06T19:45:35.92+01:00 phy005 kernel: Call Trace:
>> 2011-02-06T19:45:35.96+01:00 phy005 kernel: []
>> gup_pud_range+0x156/0x192
>> 2011-02-06T19:45:35.222300+01:00 phy005 kernel: []
>> get_user_pages_fast+0xc4/0x172
>> 2011-02-06T19:45:35.222304+01:00 phy005 kernel: [] ?
>> bio_add_page+0x36/0x38
>> 2011-02-06T19:45:35.222308+01:00 phy005 kernel: []
>> dio_get_page+0x54/0x127
>> 2011-02-06T19:45:35.222312+01:00 phy005 kernel: []
>> __blockdev_direct_IO+0x41d/0xa36
>> 2011-02-06T19:45:35.222316+01:00 phy005 kernel: [] ?
>> x86_emulate_insn+0x1ff8/0x2d61 [kvm]
>> 2011-02-06T19:45:35.222320+01:00 phy005 kernel: []
>> blkdev_direct_IO+0x4e/0x50
>> 2011-02-06T19:45:35.222324+01:00 phy005 kernel: [] ?
>> blkdev_get_blocks+0x0/0x8d
>> 2011-02-06T19:45:35.222328+01:00 phy005 kernel: []
>> generic_file_direct_write+0xed/0x16d
>> 2011-02-06T19:45:35.222331+01:00 phy005 kernel: []
>> __generic_file_aio_write+0x196/0x281
>> 2011-02-06T19:45:35.222335+01:00 phy005 kernel: [] ?
>> file_has_perm+0xa4/0xc6
>> 2011-02-06T19:45:35.222339+01:00 phy005 kernel: [] ?
>> blkdev_aio_write+0x0/0x69
>> 2011-02-06T19:45:35.222343+01:00 phy005 kernel: []
>> blkdev_aio_write+0x2a/0x69
>> 2011-02-06T19:45:35.222347+01:00 phy005 kernel: [] ?
>> blkdev_aio_write+0x0/0x69
>> 2011-02-06T19:45:35.222351+01:00 phy005 kernel: []
>> aio_rw_vect_retry+0x85/0x18e
>> 2011-02-06T19:45:35.222355+01:00 phy005 kernel: []
>> aio_run_iocb+0x77/0x10f
>> 2011-02-06T19:45:35.222359+01:00 phy005 kernel: []
>> do_io_submit+0x558/0x7ce
>> 2011-02-06T19:45:35.222363+01:00 phy005 kernel: []
>> sys_io_submit+0x10/0x12
>> 2011-02-06T19:45:35.222366+01:00 phy005 kernel: []
>> system_call_fastpath+0x16/0x1b
>> 2011-02-06T19:

Re: EPT: Misconfiguration

2011-02-13 Thread Avi Kivity

On 02/13/2011 04:07 AM, Ruben Kerkhof wrote:

And tonight we had another one of those errors we had a few weeks ago:

2011-02-13T02:56:28.694496+01:00 phy005 kernel: EPT: Misconfiguration.
2011-02-13T02:56:28.694908+01:00 phy005 kernel: EPT: GPA: 0x2edff000


This GPA indexes into the 511th entry of the spte.  Marcelo, does this 
remind you of https://bugzilla.kernel.org/show_bug.cgi?id=27052 by any 
chance?
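
The index arithmetic behind the remark above can be checked with a few lines of Python. This is a minimal sketch, assuming the usual 4-level layout with 4 KiB pages and 9 index bits per level; the function name is made up for illustration.

```python
def ept_indexes(gpa):
    """Split a guest-physical address into per-level table indexes,
    assuming 4 KiB pages and 512-entry tables at each of 4 levels."""
    return {level: (gpa >> (12 + 9 * (level - 1))) & 0x1FF
            for level in (4, 3, 2, 1)}


# The two GPAs reported in this thread.
for gpa in (0x2EDFF000, 0x2ABFF038):
    print(hex(gpa), ept_indexes(gpa))
```

Under those assumptions, both reported GPAs select index 511, the last entry of their level-1 table.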



2011-02-13T02:56:28.694914+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x25602d007 level 4
2011-02-13T02:56:28.694916+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x3df3e2007 level 3
2011-02-13T02:56:28.694919+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x5e90c7007 level 2
2011-02-13T02:56:28.694925+01:00 phy005 kernel:
ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1


Magic 1603a073 pte.
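
To see why this value stands out, the Python sketch below splits a leaf EPT entry into its fields. It is a hedged illustration: the bit layout is the one given in the Intel SDM for EPT page-table entries (bits 2:0 permissions, bits 5:3 memory type, bit 6 ignore-PAT), while the function and key names are invented here.

```python
# EPT memory type encodings; other values are reserved.
EPT_MEMTYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB"}


def decode_ept_leaf(spte):
    """Split a level-1 EPT entry into permission, type and address fields."""
    return {
        "read": bool(spte & 0x1),
        "write": bool(spte & 0x2),
        "execute": bool(spte & 0x4),
        "memtype": EPT_MEMTYPES.get((spte >> 3) & 0x7, "reserved"),
        "ignore_pat": bool(spte & 0x40),
        "bits_51_12": (spte >> 12) & ((1 << 40) - 1),  # frame number field
        "bits_63_52": spte >> 52,
    }


fields = decode_ept_leaf(0x1603A0730500D277)
for name, value in fields.items():
    print(name, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

The low bits decode as an ordinary read/write/execute, write-back mapping, but the frame-number field comes out as 0x3a0730500d, which seems implausibly large for the 48 GB host mentioned elsewhere in the thread.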


2011-02-13T02:56:28.694928+01:00 phy005 kernel:
ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
2011-02-13T02:56:28.694930+01:00 phy005 kernel: [ cut here
]
2011-02-13T02:56:28.694933+01:00 phy005 kernel: WARNING: at
arch/x86/kvm/vmx.c:3425 handle_ept_misconfig+0x152/0x1d8 [kvm_intel]()
2011-02-13T02:56:28.694936+01:00 phy005 kernel: Hardware name: X8DTU
2011-02-13T02:56:28.694941+01:00 phy005 kernel: Modules linked in: tun
ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt igb ioatdma
dca iTCO_vendor_support joydev serio_raw microcode 3w_9xxx [last
unloaded: scsi_wait_scan]
2011-02-13T02:56:28.695004+01:00 phy005 kernel: Pid: 4756, comm:
qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1
2011-02-13T02:56:28.695008+01:00 phy005 kernel: Call Trace:
2011-02-13T02:56:28.695013+01:00 phy005 kernel: []
warn_slowpath_common+0x7c/0x94
2011-02-13T02:56:28.695020+01:00 phy005 kernel: []
warn_slowpath_null+0x14/0x16
2011-02-13T02:56:28.695024+01:00 phy005 kernel: []
handle_ept_misconfig+0x152/0x1d8 [kvm_intel]
2011-02-13T02:56:28.695028+01:00 phy005 kernel: []
vmx_handle_exit+0x204/0x23a [kvm_intel]
2011-02-13T02:56:28.695033+01:00 phy005 kernel: []
kvm_arch_vcpu_ioctl_run+0x7cd/0xa74 [kvm]
2011-02-13T02:56:28.695037+01:00 phy005 kernel: []
kvm_vcpu_ioctl+0xfd/0x56e [kvm]
2011-02-13T02:56:28.695042+01:00 phy005 kernel: [] ?
virt_to_head_page+0xe/0x2f
2011-02-13T02:56:28.695046+01:00 phy005 kernel: [] ?
mempool_kfree+0xe/0x10
2011-02-13T02:56:28.695051+01:00 phy005 kernel: [] ?
mempool_free+0x76/0x7b
2011-02-13T02:56:28.695055+01:00 phy005 kernel: []
vfs_ioctl+0x32/0xa6
2011-02-13T02:56:28.695060+01:00 phy005 kernel: []
do_vfs_ioctl+0x483/0x4c9
2011-02-13T02:56:28.695065+01:00 phy005 kernel: []
sys_ioctl+0x56/0x79
2011-02-13T02:56:28.695070+01:00 phy005 kernel: []
system_call_fastpath+0x16/0x1b
2011-02-13T02:56:28.695074+01:00 phy005 kernel: ---[ end trace
d95032626ea304ca ]---

Any help would be much appreciated. It seems very strange that I'm the
first one who runs into this.
I've found two bugreports which report the same, the first one at
https://partner-bugzilla.redhat.com/show_bug.cgi?format=multiple&id=613691,
but that's a duplicate of
https://partner-bugzilla.redhat.com/show_bug.cgi?id=606131 which I'm
not authorized to see...


These don't appear to be related.  Are you running ksm, btw?

--
error compiling committee.c: too many arguments to function



Re: EPT: Misconfiguration

2011-02-13 Thread Avi Kivity

On 02/10/2011 05:23 PM, Ruben Kerkhof wrote:


This machine has been running for a week without problems, but then we
started to get the following oopses again:

2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle
kernel paging request at ea71929180e0
2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP:
[] gup_pte_range+0x94/0xd3
2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0
2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops:  [#1] SMP
2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file:
/sys/devices/system/cpu/cpu15/topology/thread_siblings
2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4
2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun
ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb
iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded:
scsi_wait_scan]
2011-02-06T19:45:35.31+01:00 phy005 kernel:
2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm:
qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU
2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP:
0010:[]  []
gup_pte_range+0x94/0xd3
2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP:
0018:88060b9bda78  EFLAGS: 00010082
2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0
RBX: 3000 RCX: 0005
2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40
RSI: 7fe54e3ff000 RDI: 1603a07305004067
2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98
R08: 880b94384560 R09: 88060b9bdb44
2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8
R11: ea00 R12: 0205
2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff
R14: 0005 R15: 
2011-02-06T19:45:35.55+01:00 phy005 kernel: FS:
7fe64cb0e700() GS:88065540()
knlGS:
2011-02-06T19:45:35.59+01:00 phy005 kernel: CS:  0010 DS: 002b ES:
002b CR0: 80050033
2011-02-06T19:45:35.63+01:00 phy005 kernel: CR2: ea71929180e0
CR3: 000bff06d000 CR4: 26e0
2011-02-06T19:45:35.67+01:00 phy005 kernel: DR0: 
DR1:  DR2: 
2011-02-06T19:45:35.71+01:00 phy005 kernel: DR3: 
DR6: 0ff0 DR7: 0400
2011-02-06T19:45:35.74+01:00 phy005 kernel: Process qemu-kvm (pid:
3650, threadinfo 88060b9bc000, task 880623ed2ee0)
2011-02-06T19:45:35.78+01:00 phy005 kernel: Stack:
2011-02-06T19:45:35.81+01:00 phy005 kernel: 7fe54e40
7fe54e40 7fe54e40 88053a0d2388
2011-02-06T19:45:35.85+01:00 phy005 kernel:<0>  88060b9bdaf8
81034a15 7fe54e3f 7fe54e3f
2011-02-06T19:45:35.89+01:00 phy005 kernel:<0>  88060b9bdb44
880b94384560 880bff06eca8 880bff06d7f8
2011-02-06T19:45:35.92+01:00 phy005 kernel: Call Trace:
2011-02-06T19:45:35.96+01:00 phy005 kernel: []
gup_pud_range+0x156/0x192
2011-02-06T19:45:35.222300+01:00 phy005 kernel: []
get_user_pages_fast+0xc4/0x172
2011-02-06T19:45:35.222304+01:00 phy005 kernel: [] ?
bio_add_page+0x36/0x38
2011-02-06T19:45:35.222308+01:00 phy005 kernel: []
dio_get_page+0x54/0x127
2011-02-06T19:45:35.222312+01:00 phy005 kernel: []
__blockdev_direct_IO+0x41d/0xa36
2011-02-06T19:45:35.222316+01:00 phy005 kernel: [] ?
x86_emulate_insn+0x1ff8/0x2d61 [kvm]
2011-02-06T19:45:35.222320+01:00 phy005 kernel: []
blkdev_direct_IO+0x4e/0x50
2011-02-06T19:45:35.222324+01:00 phy005 kernel: [] ?
blkdev_get_blocks+0x0/0x8d
2011-02-06T19:45:35.222328+01:00 phy005 kernel: []
generic_file_direct_write+0xed/0x16d
2011-02-06T19:45:35.222331+01:00 phy005 kernel: []
__generic_file_aio_write+0x196/0x281
2011-02-06T19:45:35.222335+01:00 phy005 kernel: [] ?
file_has_perm+0xa4/0xc6
2011-02-06T19:45:35.222339+01:00 phy005 kernel: [] ?
blkdev_aio_write+0x0/0x69
2011-02-06T19:45:35.222343+01:00 phy005 kernel: []
blkdev_aio_write+0x2a/0x69
2011-02-06T19:45:35.222347+01:00 phy005 kernel: [] ?
blkdev_aio_write+0x0/0x69
2011-02-06T19:45:35.222351+01:00 phy005 kernel: []
aio_rw_vect_retry+0x85/0x18e
2011-02-06T19:45:35.222355+01:00 phy005 kernel: []
aio_run_iocb+0x77/0x10f
2011-02-06T19:45:35.222359+01:00 phy005 kernel: []
do_io_submit+0x558/0x7ce
2011-02-06T19:45:35.222363+01:00 phy005 kernel: []
sys_io_submit+0x10/0x12
2011-02-06T19:45:35.222366+01:00 phy005 kernel: []
system_call_fastpath+0x16/0x1b
2011-02-06T19:45:35.222372+01:00 phy005 kernel: Code: 21 d8 49 01 c2
49 8b 3a 49 89 fe 4d 21 ee 4d 21 e6 49 39 ce 75 49 48 89 f8 0f 1f 40
00 48 21 d8 48 c1 e8 0c 48 6b c0 38 4c 01 d8<66>  83 38 00 48 89 c7 79
04 48 8b 78 10 f0 ff 47 08 49 63 39 48
2011-02-06T19:45:35.222376+01:00 phy005 kernel: RIP
[] gup_pte_range+0x94/0xd3
2011-02-06T19:45:35.222379+01:00 ph

Re: EPT: Misconfiguration

2011-02-12 Thread Ruben Kerkhof
On Thu, Feb 10, 2011 at 16:23, Ruben Kerkhof  wrote:
> On Wed, Jan 26, 2011 at 16:00, Ruben Kerkhof  wrote:
>> On Wed, Jan 26, 2011 at 10:52, Avi Kivity  wrote:
>>> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:

>>>> >  When you say "suddenly", this was with no changes to software and
>>>> > hardware?
>>>>
>>>> The host software and hardware hasn't changed in the two months since
>>>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>>>>
>>>> We host customer vms on it though, so virtual machines come and go.
>>>> Various operating systems, a mixture of Linux, FreeBSD and Windows
>>>> 2008 R2. We have other machines with the same config without these
>>>> problems though.
>>>
>>> Are those other machines running a similar workload?
>>
>> Yes, similar, or they're more heavily loaded.
>>
>> On this machine, about half of the 48GB memory was used for virtual machines.
>>
>>> The traces look awfully like bad hardware, though that can also be explained
>>> by random memory corruption due to a bug.
>>
>> Yeah, that's what I'm expecting. We already replaced the memory, next
>> step is to move the disks over to another server to make sure it's not
>> the board or cpu's.
>>
>>>> This time I have a few different messages though:
>>>>
>>>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
>>>>  [#1] SMP
>>>>
>>>> RSI:  RDI: 1603a07305001568
>>>>
>>>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
>>>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
>>>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00  ff 4f 08 0f 94 c0 84
>>>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb
>>>
>>> lock decl 0x8(%rdi)
>>>
>>> %rdi is completely crap, looks like corruption again.  Strangely, it is
>>> similar to the bad spte from the previous trace: 0x1603a0730500d277.  The
>>> upper 48 bits are identical, the lower 16 bits are different:

>>>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
>>>> page table at address 7f37b37ff000
>>>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
>>>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067
>>>
>>> Here are those magic 48 bits again, in the PTE entry.

>>>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
>>>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
>>>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
>>>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
>>>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
>>>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
>>>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
>>>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
>>>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
>>>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1
>>>
>>> Again.
>>>
>>>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
>>>> process qemu-kvm  pte:1603a0730500d067 pmd:61059f067
>>>
>>> Again.
>>>
>>> However, these all came from a single boot, yes?
>>
>> Correct.
>>
>>> If so they can be the same
>>> corruption.  Please collect more traces, with reboots in between.
>
> This machine has been running for a week without problems, but then we
> started to get the following oopses again:
>
> 2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle
> kernel paging request at ea71929180e0
> 2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP:
> [] gup_pte_range+0x94/0xd3
> 2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0
> 2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops:  [#1] SMP
> 2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file:
> /sys/devices/system/cpu/cpu15/topology/thread_siblings
> 2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4
> 2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun
> ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
> xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
> ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb
> iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded:
> scsi_wait_scan]
> 2011-02-06T19:45:35.31+01:00 phy005 kernel:
> 2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm:
> qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU
> 2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP:
> 0010:[]  []
> gup_pte_range+0x94/0xd3
> 2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP:
> 0018:88060b9bda78  EFLAGS: 00010082
> 2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0
> RBX: 3000 RCX: 0005
> 2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40
> RSI: 7fe54e3ff000 RDI: 1603a07305004067
> 2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98
> R08: 880b94384560 R09: 88060b9bdb44
> 2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff

Re: EPT: Misconfiguration

2011-02-10 Thread Ruben Kerkhof
On Wed, Jan 26, 2011 at 16:00, Ruben Kerkhof  wrote:
> On Wed, Jan 26, 2011 at 10:52, Avi Kivity  wrote:
>> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:
>>>
>>> >  When you say "suddenly", this was with no changes to software and
>>> > hardware?
>>>
>>> The host software and hardware hasn't changed in the two months since
>>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>>>
>>> We host customer vms on it though, so virtual machines come and go.
>>> Various operating systems, a mixture of Linux, FreeBSD and Windows
>>> 2008 R2. We have other machines with the same config without these
>>> problems though.
>>
>> Are those other machines running a similar workload?
>
> Yes, similar, or they're more heavily loaded.
>
> On this machine, about half of the 48GB memory was used for virtual machines.
>
>> The traces look awfully like bad hardware, though that can also be explained
>> by random memory corruption due to a bug.
>
> Yeah, that's what I'm expecting. We already replaced the memory, next
> step is to move the disks over to another server to make sure it's not
> the board or cpu's.
>
>>> This time I have a few different messages though:
>>>
>>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
>>>  [#1] SMP
>>>
>>> RSI:  RDI: 1603a07305001568
>>>
>>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
>>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
>>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00  ff 4f 08 0f 94 c0 84
>>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb
>>
>> lock decl 0x8(%rdi)
>>
>> %rdi is completely crap, looks like corruption again.  Strangely, it is
>> similar to the bad spte from the previous trace: 0x1603a0730500d277.  The
>> upper 48 bits are identical, the lower 16 bits are different:
>>>
>>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
>>> page table at address 7f37b37ff000
>>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
>>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067
>>
>> Here are those magic 48 bits again, in the PTE entry.
>>>
>>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
>>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
>>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
>>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
>>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
>>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
>>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1
>>
>> Again.
>>
>>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
>>> process qemu-kvm  pte:1603a0730500d067 pmd:61059f067
>>
>> Again.
>>
>> However, these all came from a single boot, yes?
>
> Correct.
>
>> If so they can be the same
>> corruption.  Please collect more traces, with reboots in between.
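
The "same upper 48 bits" observation quoted above is easy to verify mechanically. The Python sketch below collects the corrupted values that appear in this thread and compares their upper 48 and lower 16 bits; the labels and helper names are chosen here for illustration only.

```python
# Corrupted 64-bit values as they appear in the traces in this thread.
samples = {
    "GPF, RDI": 0x1603A07305001568,
    "EPT spte (earlier trace)": 0x1603A0730500D277,
    "corrupted page table, PTE": 0x1603A0730500E067,
    "EPT spte (2011-01-25)": 0x1603A07305006277,
    "bad page map, pte": 0x1603A0730500D067,
    "gup_pte_range oops, RDI": 0x1603A07305004067,
}


def upper48(value):
    """Bits 63:16 of a 64-bit value."""
    return value >> 16


def lower16(value):
    """Bits 15:0 of a 64-bit value."""
    return value & 0xFFFF


for label, value in samples.items():
    print("%-28s upper48=%#014x lower16=%#06x"
          % (label, upper48(value), lower16(value)))

print("distinct upper-48 patterns:", len({upper48(v) for v in samples.values()}))
```

All six values share the pattern 0x1603a0730500 in their upper 48 bits, including the one from the later gup_pte_range oops, which came after a reboot.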

This machine has been running for a week without problems, but then we
started to get the following oopses again:

2011-02-06T19:45:35.221555+01:00 phy005 kernel: BUG: unable to handle
kernel paging request at ea71929180e0
2011-02-06T19:45:35.222194+01:00 phy005 kernel: IP:
[] gup_pte_range+0x94/0xd3
2011-02-06T19:45:35.222199+01:00 phy005 kernel: PGD 118600067 PUD 0
2011-02-06T19:45:35.03+01:00 phy005 kernel: Oops:  [#1] SMP
2011-02-06T19:45:35.21+01:00 phy005 kernel: last sysfs file:
/sys/devices/system/cpu/cpu15/topology/thread_siblings
2011-02-06T19:45:35.24+01:00 phy005 kernel: CPU 4
2011-02-06T19:45:35.29+01:00 phy005 kernel: Modules linked in: tun
ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
ip6_tables ipv6 kvm_intel kvm i2c_i801 i2c_core iTCO_wdt serio_raw igb
iTCO_vendor_support joydev ioatdma dca 3w_9xxx [last unloaded:
scsi_wait_scan]
2011-02-06T19:45:35.31+01:00 phy005 kernel:
2011-02-06T19:45:35.33+01:00 phy005 kernel: Pid: 3650, comm:
qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU
2011-02-06T19:45:35.36+01:00 phy005 kernel: RIP:
0010:[]  []
gup_pte_range+0x94/0xd3
2011-02-06T19:45:35.39+01:00 phy005 kernel: RSP:
0018:88060b9bda78  EFLAGS: 00010082
2011-02-06T19:45:35.41+01:00 phy005 kernel: RAX: ea71929180e0
RBX: 3000 RCX: 0005
2011-02-06T19:45:35.43+01:00 phy005 kernel: RDX: 7fe54e40
RSI: 7fe54e3ff000 RDI: 1603a07305004067
2011-02-06T19:45:35.45+01:00 phy005 kernel: RBP: 88060b9bda98
R08: 880b94384560 R09: 88060b9bdb44
2011-02-06T19:45:35.48+01:00 phy005 kernel: R10: 880606b2fff8
R11: ea00 R12: 0205
2011-02-06T19:45:35.51+01:00 phy005 kernel: R13: cfff
R14: 0005 R15: 
2011-02-06T19:45:35.55+01:00 phy0

Re: EPT: Misconfiguration

2011-01-26 Thread Ruben Kerkhof
On Wed, Jan 26, 2011 at 10:52, Avi Kivity  wrote:
> On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:
>>
>> >  When you say "suddenly", this was with no changes to software and
>> > hardware?
>>
>> The host software and hardware haven't changed in the two months since
>> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>>
>> We host customer vms on it though, so virtual machines come and go.
>> Various operating systems, a mixture of Linux, FreeBSD and Windows
>> 2008 R2. We have other machines with the same config without these
>> problems though.
>
> Are those other machines running a similar workload?

Yes, similar, or they're more heavily loaded.

On this machine, about half of the 48GB memory was used for virtual machines.

> The traces look awfully like bad hardware, though that can also be explained
> by random memory corruption due to a bug.

Yeah, that's what I'm expecting. We already replaced the memory; the next
step is to move the disks over to another server to make sure it's not
the board or the CPUs.

>> This time I have a few different messages though:
>>
>> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
>>  [#1] SMP
>>
>> RSI:  RDI: 1603a07305001568
>>
>> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
>> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
>> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00  ff 4f 08 0f 94 c0 84
>> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb
>
> lock decl 0x8(%rdi)
>
> %rdi is completely crap, looks like corruption again.  Strangely, it is
> similar to the bad spte from the previous trace: 0x1603a0730500d277.  The
> upper 48 bits are identical, the lower 16 bits are different.:
>>
>> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
>> page table at address 7f37b37ff000
>> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
>> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067
>
> Here are those magic 48 bits again, in the PTE entry.
>>
>> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
>> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
>> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
>> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
>> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
>> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1
>
> Again.
>
>> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
>> process qemu-kvm  pte:1603a0730500d067 pmd:61059f067
>
> Again.
>
> However, these all came from a single boot, yes?

Correct.

> If so they can be the same
> corruption.  Please collect more traces, with reboots in between.

Ok, thanks, will do.

Kind regards,

Ruben
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: EPT: Misconfiguration

2011-01-26 Thread Avi Kivity

On 01/25/2011 08:29 PM, Ruben Kerkhof wrote:
>
> > When you say "suddenly", this was with no changes to software and
> > hardware?
>
> The host software and hardware haven't changed in the two months since
> the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.
>
> We host customer vms on it though, so virtual machines come and go.
> Various operating systems, a mixture of Linux, FreeBSD and Windows
> 2008 R2. We have other machines with the same config without these
> problems though.

Are those other machines running a similar workload?

The traces look awfully like bad hardware, though that can also be
explained by random memory corruption due to a bug.

> This time I have a few different messages though:
>
> 2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection fault:
> [#1] SMP
>
> RSI:  RDI: 1603a07305001568
>
> 2011-01-25T11:58:50.001486+01:00 phy005 kernel: Code: ff ff 41 8b 46
> 08 41 29 06 4c 89 e7 57 9d 0f 1f 44 00 00 48 83 c4 18 5b 41 5c 41 5d
> 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00  ff 4f 08 0f 94 c0 84
> c0 74 10 85 f6 75 07 e8 63 fe ff ff eb

lock decl 0x8(%rdi)

%rdi is completely crap, looks like corruption again.  Strangely, it is
similar to the bad spte from the previous trace: 0x1603a0730500d277.
The upper 48 bits are identical, the lower 16 bits are different.:

> 2011-01-25T12:06:32.673937+01:00 phy005 kernel: qemu-kvm: Corrupted
> page table at address 7f37b37ff000
> 2011-01-25T12:06:32.673959+01:00 phy005 kernel: PGD c201d1067 PUD
> 94e538067 PMD 61e5bf067 PTE 1603a0730500e067

Here are those magic 48 bits again, in the PTE entry.

> 2011-01-25T12:38:49.416943+01:00 phy005 kernel: EPT: Misconfiguration.
> 2011-01-25T12:38:49.417518+01:00 phy005 kernel: EPT: GPA: 0x2abff038
> 2011-01-25T12:38:49.417526+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x5f49e9007 level 4
> 2011-01-25T12:38:49.417532+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x5db595007 level 3
> 2011-01-25T12:38:49.417553+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x5d5da7007 level 2
> 2011-01-25T12:38:49.417558+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x1603a07305006277 level 1

Again.

> 2011-01-25T13:16:58.192440+01:00 phy005 kernel: BUG: Bad page map in
> process qemu-kvm  pte:1603a0730500d067 pmd:61059f067


Again.

However, these all came from a single boot, yes?  If so they can be the 
same corruption.  Please collect more traces, with reboots in between.
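
The "identical upper 48 bits" observation can be checked mechanically. Below is a minimal sketch in Python that uses only values quoted in this thread; the helper names upper48 and lower16 are illustrative and do not come from any kernel source.

```python
# Corrupted values quoted in this thread.
values = [
    0x1603a0730500d277,  # spte, EPT misconfiguration of 2011-01-20
    0x1603a07305001568,  # RDI in the general protection fault
    0x1603a0730500e067,  # PTE in "Corrupted page table"
    0x1603a07305006277,  # spte, EPT misconfiguration of 2011-01-25
    0x1603a0730500d067,  # pte in "Bad page map"
]

def upper48(v):
    """Return bits 16..63 of a 64-bit value."""
    return v >> 16

def lower16(v):
    """Return bits 0..15 of a 64-bit value."""
    return v & 0xFFFF

# Collect the distinct upper-48-bit patterns; one element means all match.
common = {upper48(v) for v in values}

for v in values:
    print(f"{v:#018x}  upper48={upper48(v):#014x}  lower16={lower16(v):#06x}")
print("identical upper 48 bits:", len(common) == 1)
```

Every value shares one upper pattern and differs only in the low 16 bits, which fits the suggestion that these traces may all stem from a single corruption.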


--
error compiling committee.c: too many arguments to function



Re: EPT: Misconfiguration

2011-01-25 Thread Ruben Kerkhof
Hi Avi,

On Tue, Jan 25, 2011 at 18:39, Avi Kivity  wrote:
> On 01/25/2011 04:44 PM, Ruben Kerkhof wrote:
>>
>> Hi Marcelo,
>>
>> On Fri, Jan 21, 2011 at 14:22, Marcelo Tosatti
>>  wrote:
>> >  On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
>> >>  I'm suddenly getting lots of the following errors on a server running
>> >>  2.36.7, but I have no idea what it means:
>> >>
>> >>  2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
>> >>  2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
>> >>  2011-01-20T12:41:18.358624+01:00 phy005 kernel:
>> >>  ept_misconfig_inspect_spte: spte 0x50743e007 level 4
>> >>  2011-01-20T12:41:18.358627+01:00 phy005 kernel:
>> >>  ept_misconfig_inspect_spte: spte 0x523de2007 level 3
>> >>  2011-01-20T12:41:18.358629+01:00 phy005 kernel:
>> >>  ept_misconfig_inspect_spte: spte 0x62336f007 level 2
>> >>  2011-01-20T12:41:18.360109+01:00 phy005 kernel:
>> >>  ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
>> >>  2011-01-20T12:41:18.360137+01:00 phy005 kernel:
>> >>  ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
>> >>  2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
>> >>  ]
>> >
>> >  A shadow pagetable entry in memory has bits 45-49 set, which is not
>> >  allowed. It's probably bad memory if these errors were not present before
>> >  with the same workload and host software. Would be useful to see what
>> >  memtest86 says.
>>
>> I did 2 memtest86+ passes, but no errors were found.
>>
>> Just to be safe, we replaced all memory. The machine has been running
>> stable over the weekend, but now gives exactly the same error.
>>
>> Is there anything else which could cause this?
>
> Try updating the BIOS.

That's the first thing we did. It's a Supermicro with an X8DTU-F
board, updated to bios version 2.0b (which includes the latest
microcode). The procs are Intel 5620's

> When you say "suddenly", this was with no changes to software and hardware?

The host software and hardware haven't changed in the two months since
the machine has been running. 2.6.34.7 kernel and qemu-kvm 0.13.

We host customer vms on it though, so virtual machines come and go.
Various operating systems, a mixture of Linux, FreeBSD and Windows
2008 R2. We have other machines with the same config without these
problems though.

> Is cooling adequate?

Yes.

> How much memory is on that machine?  Even outside the reserved bits the
> address looks way too large.

48GB.
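
To put "way too large" into numbers, here is a rough sketch in Python that pulls the physical frame address out of the bad level-1 spte and compares it with the installed memory. It assumes the usual x86-64 entry layout, with the frame address in bits 12..51. The constant names are illustrative only.

```python
SPTE = 0x1603a0730500d277              # level-1 spte from the 2011-01-20 log

# ASSUMPTION: standard x86-64 entry layout, frame address in bits 12..51.
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF

INSTALLED_RAM = 48 * 2**30             # 48GB, as stated above

# Mask off the low flag bits and the high bits above bit 51.
addr = SPTE & ADDR_MASK

print(f"frame address: {addr:#x} (about {addr / 2**40:.0f} TiB)")
print("beyond installed RAM:", addr > INSTALLED_RAM)
```

The physical memory map can place some RAM above the 48GB mark because of holes, but not by several orders of magnitude, so the entry cannot point at real memory on this box.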

This time I have a few different messages though:

2011-01-25T11:58:50.001208+01:00 phy005 kernel: general protection
fault:  [#1] SMP
2011-01-25T11:58:50.001310+01:00 phy005 kernel: last sysfs file:
/sys/devices/system/cpu/cpu15/topology/thread_siblings
2011-01-25T11:58:50.001316+01:00 phy005 kernel: CPU 12
2011-01-25T11:58:50.001323+01:00 phy005 kernel: Modules linked in: tun
ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding
xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter
ip6_tables ipv6 kvm_intel kvm igb i2c_i801 iTCO_wdt i2c_core ioatdma
joydev iTCO_vendor_support dca serio_raw 3w_9xxx [last unloaded:
scsi_wait_scan]
2011-01-25T11:58:50.001327+01:00 phy005 kernel:
2011-01-25T11:58:50.001331+01:00 phy005 kernel: Pid: 1849, comm:
qemu-kvm Not tainted 2.6.34.7-66.tilaa.fc13.x86_64 #1 X8DTU/X8DTU
2011-01-25T11:58:50.001336+01:00 phy005 kernel: RIP:
0010:[]  [] __free_pages+0x9/0x26
2011-01-25T11:58:50.001339+01:00 phy005 kernel: RSP:
0018:8802fbe45ab8  EFLAGS: 00010216
2011-01-25T11:58:50.001343+01:00 phy005 kernel: RAX: 88061ef8c000
RBX: 8803131ec100 RCX: 
2011-01-25T11:58:50.001348+01:00 phy005 kernel: RDX: 00ff
RSI:  RDI: 1603a07305001568
2011-01-25T11:58:50.001352+01:00 phy005 kernel: RBP: 8802fbe45ab8
R08: ea000a83b7f0 R09: 0004
2011-01-25T11:58:50.001356+01:00 phy005 kernel: R10: 
R11: 8802fbe45b38 R12: 0100
2011-01-25T11:58:50.001359+01:00 phy005 kernel: R13: 0001
R14: 8802e934c010 R15: 8802e934c010
2011-01-25T11:58:50.001363+01:00 phy005 kernel: FS:
7f1f14844700() GS:88065548()
knlGS:
2011-01-25T11:58:50.001366+01:00 phy005 kernel: CS:  0010 DS:  ES:
 CR0: 8005003b
2011-01-25T11:58:50.001370+01:00 phy005 kernel: CR2: b72f6cb0
CR3: 000ba561c000 CR4: 26e0
2011-01-25T11:58:50.001374+01:00 phy005 kernel: DR0: 
DR1:  DR2: 
2011-01-25T11:58:50.001378+01:00 phy005 kernel: DR3: 
DR6: 0ff0 DR7: 0400
2011-01-25T11:58:50.001382+01:00 phy005 kernel: Process qemu-kvm (pid:
1849, threadinfo 8802fbe44000, task 8802ea11aee0)
2011-01-25T11:58:50.001385+01:00 phy005 kernel: Stack:
2011-01-25T11:58:50.001389+01:00 phy005 kernel: 8802fbe45af8
810ee455 0206 c9001e2d4000
2011-01-25T11:58:50.00

Re: EPT: Misconfiguration

2011-01-25 Thread Avi Kivity

On 01/25/2011 04:44 PM, Ruben Kerkhof wrote:

> Hi Marcelo,
>
> On Fri, Jan 21, 2011 at 14:22, Marcelo Tosatti  wrote:
> >  On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
> >>  I'm suddenly getting lots of the following errors on a server running
> >>  2.36.7, but I have no idea what it means:
> >>
> >>  2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
> >>  2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
> >>  2011-01-20T12:41:18.358624+01:00 phy005 kernel:
> >>  ept_misconfig_inspect_spte: spte 0x50743e007 level 4
> >>  2011-01-20T12:41:18.358627+01:00 phy005 kernel:
> >>  ept_misconfig_inspect_spte: spte 0x523de2007 level 3
> >>  2011-01-20T12:41:18.358629+01:00 phy005 kernel:
> >>  ept_misconfig_inspect_spte: spte 0x62336f007 level 2
> >>  2011-01-20T12:41:18.360109+01:00 phy005 kernel:
> >>  ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
> >>  2011-01-20T12:41:18.360137+01:00 phy005 kernel:
> >>  ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
> >>  2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
> >>  ]
> >
> >  A shadow pagetable entry in memory has bits 45-49 set, which is not
> >  allowed. It's probably bad memory if these errors were not present before
> >  with the same workload and host software. Would be useful to see what
> >  memtest86 says.
>
> I did 2 memtest86+ passes, but no errors were found.
>
> Just to be safe, we replaced all memory. The machine has been running
> stable over the weekend, but now gives exactly the same error.
>
> Is there anything else which could cause this?


Try updating the BIOS.

When you say "suddenly", this was with no changes to software and hardware?

Is cooling adequate?

How much memory is on that machine?  Even outside the reserved bits the 
address looks way too large.


--
error compiling committee.c: too many arguments to function



Re: EPT: Misconfiguration

2011-01-25 Thread Ruben Kerkhof
Hi Marcelo,

On Fri, Jan 21, 2011 at 14:22, Marcelo Tosatti  wrote:
> On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
>> I'm suddenly getting lots of the following errors on a server running
>> 2.36.7, but I have no idea what it means:
>>
>> 2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
>> 2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
>> 2011-01-20T12:41:18.358624+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x50743e007 level 4
>> 2011-01-20T12:41:18.358627+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x523de2007 level 3
>> 2011-01-20T12:41:18.358629+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x62336f007 level 2
>> 2011-01-20T12:41:18.360109+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
>> 2011-01-20T12:41:18.360137+01:00 phy005 kernel:
>> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
>> 2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
>> ]
>
> A shadow pagetable entry in memory has bits 45-49 set, which is not
> allowed. It's probably bad memory if these errors were not present before
> with the same workload and host software. Would be useful to see what
> memtest86 says.

I did 2 memtest86+ passes, but no errors were found.

Just to be safe, we replaced all memory. The machine has been running
stable over the weekend, but now gives exactly the same error.

Is there anything else which could cause this?

Kind regards,

Ruben


Re: EPT: Misconfiguration

2011-01-21 Thread Marcelo Tosatti
On Thu, Jan 20, 2011 at 12:48:00PM +0100, Ruben Kerkhof wrote:
> I'm suddenly getting lots of the following errors on a server running
> 2.36.7, but I have no idea what it means:
> 
> 2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
> 2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
> 2011-01-20T12:41:18.358624+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x50743e007 level 4
> 2011-01-20T12:41:18.358627+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x523de2007 level 3
> 2011-01-20T12:41:18.358629+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x62336f007 level 2
> 2011-01-20T12:41:18.360109+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
> 2011-01-20T12:41:18.360137+01:00 phy005 kernel:
> ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
> 2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here
> ]

A shadow pagetable entry in memory has bits 45-49 set, which is not
allowed. It's probably bad memory if these errors were not present before
with the same workload and host software. Would be useful to see what
memtest86 says.
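
For anyone wanting to reproduce the bit analysis, here is a minimal sketch in Python that lists which reserved bits are set in the level-1 spte. The reserved range used (bits 40..51, i.e. a 40-bit physical address width) is an assumption; the thread does not state the CPU's address width. The constant names are illustrative as well.

```python
SPTE = 0x1603a0730500d277   # level-1 spte from the log above

# ASSUMPTION: the CPU implements 40 physical address bits, so bits 40..51
# of an entry are reserved. This width is not stated in the thread.
PHYS_BITS = 40
RSVD_MASK = ((1 << 52) - 1) & ~((1 << PHYS_BITS) - 1)

# Keep only the bits of the spte that fall inside the reserved range.
rsvd = SPTE & RSVD_MASK

# List the positions of the reserved bits that are set.
set_bits = [b for b in range(64) if rsvd >> b & 1]

print(f"reserved bits set: {rsvd:#x} -> bit positions {set_bits}")
```

Under that assumption the set bits all fall in the 45-49 range mentioned above, and the masked value starts with the same 0x3a digits as the logged rsvd_bits.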



Re: EPT: Misconfiguration

2011-01-20 Thread Ruben Kerkhof
On Thu, Jan 20, 2011 at 12:48, Ruben Kerkhof  wrote:
> I'm suddenly getting lots of the following errors on a server running
> 2.36.7, but I have no idea what it means:

Sorry, that should be 2.34.7.

Kind regards,

Ruben