Re: general protection fault in vmx_vcpu_run (2)
On Thu, Feb 25, 2021, Dmitry Vyukov wrote:
> On Wed, Feb 24, 2021 at 7:08 PM 'Sean Christopherson' via
> syzkaller-bugs wrote:
> >
> > On Wed, Feb 24, 2021, Borislav Petkov wrote:
> > > Hi Dmitry,
> > >
> > > On Wed, Feb 24, 2021 at 06:12:57PM +0100, Dmitry Vyukov wrote:
> > > > Looking at the bisection log, the bisection was distracted by something
> > > > else.
> > >
> > > Meaning the bisection result:
> > >
> > > 167dcfc08b0b ("x86/mm: Increase pgt_buf size for 5-level page tables")
> > >
> > > is bogus?
> >
> > Ya, looks 100% bogus.
> >
> > > > You can always find the original reported issue over the dashboard link:
> > > > https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
> > > > or on lore:
> > > > https://lore.kernel.org/lkml/7ff56205ba985...@google.com/
> > >
> > > Ok, so this looks like this is trying to run kvm ioctls *in* a guest,
> > > i.e., nested. Right?
> >
> > Yep. I tried to run the reproducer yesterday, but the kernel config
> > wouldn't boot my VM. I haven't had time to dig in. Anyways, I think
> > you can safely assume this is a KVM issue unless more data comes
> > along that says otherwise.
>
> Interesting. What happens? Does the kernel crash? Userspace crash?
> Rootfs is not mounted? Or something else?

Not sure, it ended up in the EFI shell instead of the kernel (running with
QEMU's -kernel). My QEMU+KVM setup does a variety of shenanigans, I'm
guessing it's an incompatibility in my setup.
Re: general protection fault in vmx_vcpu_run (2)
On Wed, Feb 24, 2021 at 7:08 PM 'Sean Christopherson' via syzkaller-bugs wrote:
>
> On Wed, Feb 24, 2021, Borislav Petkov wrote:
> > Hi Dmitry,
> >
> > On Wed, Feb 24, 2021 at 06:12:57PM +0100, Dmitry Vyukov wrote:
> > > Looking at the bisection log, the bisection was distracted by something
> > > else.
> >
> > Meaning the bisection result:
> >
> > 167dcfc08b0b ("x86/mm: Increase pgt_buf size for 5-level page tables")
> >
> > is bogus?
>
> Ya, looks 100% bogus.
>
> > > You can always find the original reported issue over the dashboard link:
> > > https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
> > > or on lore:
> > > https://lore.kernel.org/lkml/7ff56205ba985...@google.com/
> >
> > Ok, so this looks like this is trying to run kvm ioctls *in* a guest,
> > i.e., nested. Right?
>
> Yep. I tried to run the reproducer yesterday, but the kernel config
> wouldn't boot my VM. I haven't had time to dig in. Anyways, I think
> you can safely assume this is a KVM issue unless more data comes
> along that says otherwise.

Interesting. What happens? Does the kernel crash? Userspace crash?
Rootfs is not mounted? Or something else?
Re: general protection fault in vmx_vcpu_run (2)
On Wed, Feb 24, 2021 at 6:49 PM Borislav Petkov wrote:
>
> Hi Dmitry,
>
> On Wed, Feb 24, 2021 at 06:12:57PM +0100, Dmitry Vyukov wrote:
> > Looking at the bisection log, the bisection was distracted by something
> > else.
>
> Meaning the bisection result:
>
> 167dcfc08b0b ("x86/mm: Increase pgt_buf size for 5-level page tables")
>
> is bogus?
>
> > You can always find the original reported issue over the dashboard link:
> > https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
> > or on lore:
> > https://lore.kernel.org/lkml/7ff56205ba985...@google.com/
>
> Ok, so this looks like this is trying to run kvm ioctls *in* a guest,
> i.e., nested. Right?

Yes, testing happens in a VM, but the kernel that crashes is the one that
receives the ioctls.
Re: general protection fault in vmx_vcpu_run (2)
On Wed, Feb 24, 2021, Borislav Petkov wrote:
> Hi Dmitry,
>
> On Wed, Feb 24, 2021 at 06:12:57PM +0100, Dmitry Vyukov wrote:
> > Looking at the bisection log, the bisection was distracted by something
> > else.
>
> Meaning the bisection result:
>
> 167dcfc08b0b ("x86/mm: Increase pgt_buf size for 5-level page tables")
>
> is bogus?

Ya, looks 100% bogus.

> > You can always find the original reported issue over the dashboard link:
> > https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
> > or on lore:
> > https://lore.kernel.org/lkml/7ff56205ba985...@google.com/
>
> Ok, so this looks like this is trying to run kvm ioctls *in* a guest,
> i.e., nested. Right?

Yep. I tried to run the reproducer yesterday, but the kernel config
wouldn't boot my VM. I haven't had time to dig in. Anyways, I think you
can safely assume this is a KVM issue unless more data comes along that
says otherwise.
Re: general protection fault in vmx_vcpu_run (2)
Hi Dmitry,

On Wed, Feb 24, 2021 at 06:12:57PM +0100, Dmitry Vyukov wrote:
> Looking at the bisection log, the bisection was distracted by something
> else.

Meaning the bisection result:

167dcfc08b0b ("x86/mm: Increase pgt_buf size for 5-level page tables")

is bogus?

> You can always find the original reported issue over the dashboard link:
> https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
> or on lore:
> https://lore.kernel.org/lkml/7ff56205ba985...@google.com/

Ok, so this looks like this is trying to run kvm ioctls *in* a guest,
i.e., nested. Right?

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: general protection fault in vmx_vcpu_run (2)
On Wed, Feb 24, 2021 at 1:27 PM Borislav Petkov wrote:
>
> On Tue, Feb 23, 2021 at 03:17:07PM -0800, syzbot wrote:
> > syzbot has bisected this issue to:
> >
> > commit 167dcfc08b0b1f964ea95d410aa496fd78adf475
> > Author: Lorenzo Stoakes
> > Date:   Tue Dec 15 20:56:41 2020 +
> >
> >     x86/mm: Increase pgt_buf size for 5-level page tables
> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13fe3ea8d0
> > start commit:   a99163e9 Merge tag 'devicetree-for-5.12' of git://git.kern..
> > git tree:       upstream
> > final oops:     https://syzkaller.appspot.com/x/report.txt?x=10013ea8d0
>
> No oops here.
>
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17fe3ea8d0
>
> Nothing special here either.
>
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=49116074dd53b631
>
> Tried this on two boxes, the Intel one doesn't even boot with that
> config - and it is a pretty standard one - and on the AMD one the
> reproducer doesn't trigger anything. It probably won't because the GP
> is in vmx_vcpu_run() but since the ioctls were doing something with
> IRQCHIP, I thought it is probably vendor-agnostic.
>
> So, all in all, I could use some more info on how you're reproducing and
> maybe you could show the oops too.

Hi Boris,

Looking at the bisection log, the bisection was distracted by something else.

You can always find the original reported issue over the dashboard link:
https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
or on lore:
https://lore.kernel.org/lkml/7ff56205ba985...@google.com/
Re: general protection fault in vmx_vcpu_run (2)
On Tue, Feb 23, 2021 at 03:17:07PM -0800, syzbot wrote:
> syzbot has bisected this issue to:
>
> commit 167dcfc08b0b1f964ea95d410aa496fd78adf475
> Author: Lorenzo Stoakes
> Date:   Tue Dec 15 20:56:41 2020 +
>
>     x86/mm: Increase pgt_buf size for 5-level page tables
>
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13fe3ea8d0
> start commit:   a99163e9 Merge tag 'devicetree-for-5.12' of git://git.kern..
> git tree:       upstream
> final oops:     https://syzkaller.appspot.com/x/report.txt?x=10013ea8d0

No oops here.

> console output: https://syzkaller.appspot.com/x/log.txt?x=17fe3ea8d0

Nothing special here either.

> kernel config:  https://syzkaller.appspot.com/x/.config?x=49116074dd53b631

Tried this on two boxes, the Intel one doesn't even boot with that
config - and it is a pretty standard one - and on the AMD one the
reproducer doesn't trigger anything. It probably won't because the GP
is in vmx_vcpu_run() but since the ioctls were doing something with
IRQCHIP, I thought it is probably vendor-agnostic.

So, all in all, I could use some more info on how you're reproducing and
maybe you could show the oops too.

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
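[For context on the reproducer shape Boris is poking at: syzkaller drives
KVM purely through ioctls on /dev/kvm. Below is a minimal sketch of that
flow in C; it is a hypothetical illustration, not the actual syzkaller C
reproducer linked above (which also maps guest memory and seeds register
state), but it shows the path that ends up in vmx_vcpu_run().]

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
        int kvm  = open("/dev/kvm", O_RDWR);        /* KVM system fd */
        int vm   = ioctl(kvm, KVM_CREATE_VM, 0);    /* VM fd */
        ioctl(vm, KVM_CREATE_IRQCHIP, 0);           /* in-kernel PIC/IOAPIC */
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);   /* vCPU fd */
        ioctl(vcpu, KVM_RUN, 0);                    /* -> vcpu_run() */
        return 0;
}

[On VMX hosts, KVM_RUN is what reaches vcpu_enter_guest() and ultimately
vmx_vcpu_run(), which is why the GP shows up there even though the
triggering ioctls look IRQCHIP-related.]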
Re: general protection fault in vmx_vcpu_run (2)
syzbot has bisected this issue to:

commit 167dcfc08b0b1f964ea95d410aa496fd78adf475
Author: Lorenzo Stoakes
Date:   Tue Dec 15 20:56:41 2020 +

    x86/mm: Increase pgt_buf size for 5-level page tables

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13fe3ea8d0
start commit:   a99163e9 Merge tag 'devicetree-for-5.12' of git://git.kern..
git tree:       upstream
final oops:     https://syzkaller.appspot.com/x/report.txt?x=10013ea8d0
console output: https://syzkaller.appspot.com/x/log.txt?x=17fe3ea8d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=49116074dd53b631
dashboard link: https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=141f3f04d0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17de4f12d0

Reported-by: syzbot+42a71c84ef04577f1...@syzkaller.appspotmail.com
Fixes: 167dcfc08b0b ("x86/mm: Increase pgt_buf size for 5-level page tables")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection
Re: general protection fault in vmx_vcpu_run (2)
syzbot has found a reproducer for the following issue on:

HEAD commit:    a99163e9 Merge tag 'devicetree-for-5.12' of git://git.kern..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15cd357f50
kernel config:  https://syzkaller.appspot.com/x/.config?x=49116074dd53b631
dashboard link: https://syzkaller.appspot.com/bug?extid=42a71c84ef04577f1aef
compiler:       Debian clang version 11.0.1-2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12c7f8a8d0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=137fc232d0

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+42a71c84ef04577f1...@syzkaller.appspotmail.com

RBP: 00402ed0 R08: 00400488 R09: 00400488
R10: 00400488 R11: 0246 R12: 00402f60
R13: R14: 004ac018 R15: 00400488
==================================================================
BUG: KASAN: global-out-of-bounds in atomic_switch_perf_msrs arch/x86/kvm/vmx/vmx.c:6604 [inline]
BUG: KASAN: global-out-of-bounds in vmx_vcpu_run+0x4f1/0x13f0 arch/x86/kvm/vmx/vmx.c:6771
Read of size 8 at addr 89a000e9 by task syz-executor198/8346

CPU: 0 PID: 8346 Comm: syz-executor198 Not tainted 5.11.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x125/0x19e lib/dump_stack.c:120
 print_address_description+0x5f/0x3a0 mm/kasan/report.c:230
 __kasan_report mm/kasan/report.c:396 [inline]
 kasan_report+0x15e/0x200 mm/kasan/report.c:413
 atomic_switch_perf_msrs arch/x86/kvm/vmx/vmx.c:6604 [inline]
 vmx_vcpu_run+0x4f1/0x13f0 arch/x86/kvm/vmx/vmx.c:6771
 vcpu_enter_guest+0x2ed9/0x8f10 arch/x86/kvm/x86.c:9074
 vcpu_run+0x316/0xb70 arch/x86/kvm/x86.c:9225
 kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9453
 kvm_vcpu_ioctl+0x62a/0xa30 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3295
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x43eee9
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:7ffe7ad00d38 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 00400488 RCX: 0043eee9
RDX: RSI: ae80 RDI: 0005
RBP: 00402ed0 R08: 00400488 R09: 00400488
R10: 00400488 R11: 0246 R12: 00402f60
R13: R14: 004ac018 R15: 00400488

The buggy address belongs to the variable:
 str__initcall__trace_system_name+0x9/0x40

Memory state around the buggy address:
 899fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 89a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>89a00080: 00 00 00 00 00 00 00 00 00 00 00 00 00 01 f9 f9
                                                        ^
 89a00100: f9 f9 f9 f9 07 f9 f9 f9 f9 f9 f9 f9 00 03 f9 f9
 89a00180: f9 f9 f9 f9 00 06 f9 f9 f9 f9 f9 f9 00 00 00 00
==================================================================
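[The fault site above is the loop over the MSR array handed back by
perf_guest_get_msrs(). For orientation, this is roughly the shape of
atomic_switch_perf_msrs() around v5.11, paraphrased from the vmx.c of
that era rather than copied verbatim:]

static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
{
        int i, nr_msrs;
        struct perf_guest_switch_msr *msrs;

        /* array and count come from perf's globals */
        msrs = perf_guest_get_msrs(&nr_msrs);
        if (!msrs)
                return;

        /* KASAN flags the msrs[i] read: it walked past the end of
         * the perf MSR array into a neighbouring global. */
        for (i = 0; i < nr_msrs; i++)
                if (msrs[i].host == msrs[i].guest)
                        clear_atomic_switch_msr(vmx, msrs[i].msr);
                else
                        add_atomic_switch_msr(vmx, msrs[i].msr,
                                              msrs[i].guest, msrs[i].host,
                                              false);
}

[The "buggy address belongs to the variable: str__initcall__trace_system_name"
line in the report is consistent with that reading: the out-of-bounds
access lands in an unrelated global that happens to sit next to the perf
MSR array.]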
Re: general protection fault in vmx_vcpu_run
On Wed, Jul 4, 2018 at 9:31 PM, Raslan, KarimAllah wrote:
> Dmitry,
>
> Can you share the host kernel version?
>
> I can not reproduce any of these crash signatures and I think it's
> really a nested virtualization bug. So I will need the exact host
> kernel version as well.
>
> I am currently getting all sorts of:
>
> "KVM: entry failed, hardware error 0x7"
>
> ... instead of the crash signatures that you are posting.

Hi Raslan,

The tested kernel runs as a GCE VM.
Jim, how can we describe the host kernel for GCE? Potentially only we
can debug this.

> On Sat, 2018-06-30 at 08:09 +, Raslan, KarimAllah wrote:
>> Looking also at the other crash [0]:
>>
>>     msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
>> 811f65b7: e8 44 cb 57 00          callq  81773100 <__sanitizer_cov_trace_pc>
>> 811f65bc: 48 8b 54 24 08          mov    0x8(%rsp),%rdx
>> 811f65c1: 48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>> 811f65c8: fc ff df
>> 811f65cb: 48 c1 ea 03             shr    $0x3,%rdx
>> 811f65cf: 80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)    <- fault here.
>> 811f65d3: 0f 85 36 19 00 00       jne    811f7f0f
>>
>> %rdx should contain a pointer to loaded_vmcs. It is directly loaded
>> from the stack [0x8(%rsp)]. This same stack location was just used
>> before the inlined assembly for VMRESUME/VMLAUNCH here:
>>
>>     vmx->__launched = vmx->loaded_vmcs->launched;
>> 811f639f: e8 5c cd 57 00          callq  81773100 <__sanitizer_cov_trace_pc>
>> 811f63a4: 48 8b 54 24 08          mov    0x8(%rsp),%rdx
>> 811f63a9: 48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
>> 811f63b0: fc ff df
>> 811f63b3: 48 c1 ea 03             shr    $0x3,%rdx
>> 811f63b7: 80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)    <- used here.
>>
>> ... and this stack location was never touched by anything in between!
>> So something must have corrupted the stack itself, not really the
>> kvm_vcpu struct.
>>
>> Obviously the inlined assembly block is using the stack as well, but I
>> can not see anything that would cause this corruption there.
>>
>> That being said, looking at the %rsp and %rbp values that are dumped
>> in the stack trace:
>>
>> RSP: 8801b7d7f380
>> RBP: 8801b8260140
>>
>> ... they are almost 4.8 MiB apart! Should not these two registers be a
>> bit closer to each other? :)
>>
>> So 2 possibilities here:
>>
>> 1- %rsp is wrong
>>
>> That would explain why the loaded_vmcs was NULL. However, it is a bit
>> harder to understand how it became wrong! It should have been restored
>> during the VMEXIT from the HOST_RSP value in the VMCS!
>>
>> Is this a nested setup?
>>
>> 2- %rbp is wrong
>>
>> That would also explain why the loaded_vmcs was NULL. Whatever
>> corrupted the stack that caused loaded_vmcs to be NULL could have also
>> corrupted the %rbp saved on the stack. That would mean that it happened
>> during a function call. All function calls that happened between the
>> point when the stack was sane (just before the "asm" block for
>> VMLAUNCH) and the crash site are only kcov related. Looking at kcov, I
>> can not see where the stack would get corrupted though! Obviously
>> another source of corruption can be a completely unrelated thread
>> directly corrupting this thread's memory.
>>
>> Maybe it would be easier to just try to repro it first and see which
>> one is true (if at all).
>>
>> [0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
>>
>> On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
>> >
>> >   22: 0f 01 c3                vmresume
>> >   25: 48 89 4c 24 08          mov    %rcx,0x8(%rsp)
>> >   2a: 59                      pop    %rcx
>> >
>> > :
>> >   2b: 0f 96 81 88 56 00 00    setbe  0x5688(%rcx)
>> >   32: 48 89 81 00 03 00 00    mov    %rax,0x300(%rcx)
>> >   39: 48 89 99 18 03 00 00    mov    %rbx,0x318(%rcx)
>> >
>> > %rcx should be pointing to the vcpu_vmx structure, but it's not even
>> > canonical: 110035842e78.
>
> Amazon Development Center Germany GmbH
> Berlin - Dresden - Aachen
> main office: Krausenstr. 38, 10117 Berlin
> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> Ust-ID: DE289237879
> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
Re: general protection fault in vmx_vcpu_run
Dmitry,

Can you share the host kernel version?

I can not reproduce any of these crash signatures and I think it's
really a nested virtualization bug. So I will need the exact host
kernel version as well.

I am currently getting all sorts of:

"KVM: entry failed, hardware error 0x7"

... instead of the crash signatures that you are posting.

Regards.

On Sat, 2018-06-30 at 08:09 +, Raslan, KarimAllah wrote:
> Looking also at the other crash [0]:
>
>     msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
> 811f65b7: e8 44 cb 57 00          callq  81773100 <__sanitizer_cov_trace_pc>
> 811f65bc: 48 8b 54 24 08          mov    0x8(%rsp),%rdx
> 811f65c1: 48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> 811f65c8: fc ff df
> 811f65cb: 48 c1 ea 03             shr    $0x3,%rdx
> 811f65cf: 80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)    <- fault here.
> 811f65d3: 0f 85 36 19 00 00       jne    811f7f0f
>
> %rdx should contain a pointer to loaded_vmcs. It is directly loaded
> from the stack [0x8(%rsp)]. This same stack location was just used
> before the inlined assembly for VMRESUME/VMLAUNCH here:
>
>     vmx->__launched = vmx->loaded_vmcs->launched;
> 811f639f: e8 5c cd 57 00          callq  81773100 <__sanitizer_cov_trace_pc>
> 811f63a4: 48 8b 54 24 08          mov    0x8(%rsp),%rdx
> 811f63a9: 48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
> 811f63b0: fc ff df
> 811f63b3: 48 c1 ea 03             shr    $0x3,%rdx
> 811f63b7: 80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)    <- used here.
>
> ... and this stack location was never touched by anything in between!
> So something must have corrupted the stack itself, not really the
> kvm_vcpu struct.
>
> Obviously the inlined assembly block is using the stack as well, but I
> can not see anything that would cause this corruption there.
>
> That being said, looking at the %rsp and %rbp values that are dumped
> in the stack trace:
>
> RSP: 8801b7d7f380
> RBP: 8801b8260140
>
> ... they are almost 4.8 MiB apart! Should not these two registers be a
> bit closer to each other? :)
>
> So 2 possibilities here:
>
> 1- %rsp is wrong
>
> That would explain why the loaded_vmcs was NULL. However, it is a bit
> harder to understand how it became wrong! It should have been restored
> during the VMEXIT from the HOST_RSP value in the VMCS!
>
> Is this a nested setup?
>
> 2- %rbp is wrong
>
> That would also explain why the loaded_vmcs was NULL. Whatever
> corrupted the stack that caused loaded_vmcs to be NULL could have also
> corrupted the %rbp saved on the stack. That would mean that it happened
> during a function call. All function calls that happened between the
> point when the stack was sane (just before the "asm" block for
> VMLAUNCH) and the crash site are only kcov related. Looking at kcov, I
> can not see where the stack would get corrupted though! Obviously
> another source of corruption can be a completely unrelated thread
> directly corrupting this thread's memory.
>
> Maybe it would be easier to just try to repro it first and see which
> one is true (if at all).
>
> [0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
>
> On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
> >
> >   22: 0f 01 c3                vmresume
> >   25: 48 89 4c 24 08          mov    %rcx,0x8(%rsp)
> >   2a: 59                      pop    %rcx
> >
> > :
> >   2b: 0f 96 81 88 56 00 00    setbe  0x5688(%rcx)
> >   32: 48 89 81 00 03 00 00    mov    %rax,0x300(%rcx)
> >   39: 48 89 99 18 03 00 00    mov    %rbx,0x318(%rcx)
> >
> > %rcx should be pointing to the vcpu_vmx structure, but it's not even
> > canonical: 110035842e78.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
Re: general protection fault in vmx_vcpu_run
Looking also at the other crash [0]:

    msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
811f65b7: e8 44 cb 57 00          callq  81773100 <__sanitizer_cov_trace_pc>
811f65bc: 48 8b 54 24 08          mov    0x8(%rsp),%rdx
811f65c1: 48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
811f65c8: fc ff df
811f65cb: 48 c1 ea 03             shr    $0x3,%rdx
811f65cf: 80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)    <- fault here.
811f65d3: 0f 85 36 19 00 00       jne    811f7f0f

%rdx should contain a pointer to loaded_vmcs. It is directly loaded
from the stack [0x8(%rsp)]. This same stack location was just used
before the inlined assembly for VMRESUME/VMLAUNCH here:

    vmx->__launched = vmx->loaded_vmcs->launched;
811f639f: e8 5c cd 57 00          callq  81773100 <__sanitizer_cov_trace_pc>
811f63a4: 48 8b 54 24 08          mov    0x8(%rsp),%rdx
811f63a9: 48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
811f63b0: fc ff df
811f63b3: 48 c1 ea 03             shr    $0x3,%rdx
811f63b7: 80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)    <- used here.

... and this stack location was never touched by anything in between!
So something must have corrupted the stack itself, not really the
kvm_vcpu struct.

Obviously the inlined assembly block is using the stack as well, but I
can not see anything that would cause this corruption there.

That being said, looking at the %rsp and %rbp values that are dumped
in the stack trace:

RSP: 8801b7d7f380
RBP: 8801b8260140

... they are almost 4.8 MiB apart! Should not these two registers be a
bit closer to each other? :)

So 2 possibilities here:

1- %rsp is wrong

That would explain why the loaded_vmcs was NULL. However, it is a bit
harder to understand how it became wrong! It should have been restored
during the VMEXIT from the HOST_RSP value in the VMCS!

Is this a nested setup?

2- %rbp is wrong

That would also explain why the loaded_vmcs was NULL. Whatever
corrupted the stack that caused loaded_vmcs to be NULL could have also
corrupted the %rbp saved on the stack. That would mean that it happened
during a function call. All function calls that happened between the
point when the stack was sane (just before the "asm" block for
VMLAUNCH) and the crash site are only kcov related. Looking at kcov, I
can not see where the stack would get corrupted though! Obviously
another source of corruption can be a completely unrelated thread
directly corrupting this thread's memory.

Maybe it would be easier to just try to repro it first and see which
one is true (if at all).

[0] https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550

On Thu, 2018-06-28 at 10:18 -0700, Jim Mattson wrote:
>
>   22: 0f 01 c3                vmresume
>   25: 48 89 4c 24 08          mov    %rcx,0x8(%rsp)
>   2a: 59                      pop    %rcx
>
> :
>   2b: 0f 96 81 88 56 00 00    setbe  0x5688(%rcx)
>   32: 48 89 81 00 03 00 00    mov    %rax,0x300(%rcx)
>   39: 48 89 99 18 03 00 00    mov    %rbx,0x318(%rcx)
>
> %rcx should be pointing to the vcpu_vmx structure, but it's not even
> canonical: 110035842e78.

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
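[A side note on the instrumented sequence KarimAllah disassembled: the
movabs/shr/cmpb triple is generic KASAN's inline shadow check on x86-64,
where every 8 bytes of address space map to one shadow byte at
(addr >> 3) + 0xdffffc0000000000. A simplified C model of what the
compiler emits follows; this is a kernel-context sketch only, not
runnable userspace code, and kasan_report() is a hypothetical stand-in
for the real slow path.]

/* 0xdffffc0000000000 is the x86-64 KASAN shadow offset */
#define KASAN_SHADOW_OFFSET     0xdffffc0000000000UL

static inline void kasan_check_byte_at(unsigned long addr)
{
        /* one shadow byte describes an 8-byte granule */
        char shadow = *(char *)((addr >> 3) + KASAN_SHADOW_OFFSET);

        if (shadow)                     /* nonzero: redzone or partial granule */
                kasan_report(addr);     /* hypothetical stand-in */
}

[This also suggests why a corrupted pointer can fault at the cmpb itself:
garbage in %rdx yields a non-canonical shadow address, so the shadow load
takes the #GP before the real dereference ever happens, which matches the
"GPF could be caused by NULL-ptr deref or user memory access" hint KASAN
prints in these reports.]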
Re: general protection fault in vmx_vcpu_run
  22: 0f 01 c3                vmresume
  25: 48 89 4c 24 08          mov    %rcx,0x8(%rsp)
  2a: 59                      pop    %rcx

:
  2b: 0f 96 81 88 56 00 00    setbe  0x5688(%rcx)
  32: 48 89 81 00 03 00 00    mov    %rax,0x300(%rcx)
  39: 48 89 99 18 03 00 00    mov    %rbx,0x318(%rcx)

%rcx should be pointing to the vcpu_vmx structure, but it's not even
canonical: 110035842e78.
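[For reference, "canonical" here means that on x86-64, without LA57,
bits 63:47 of a virtual address must all equal bit 47, i.e. the value
sign-extends cleanly from bit 47. A small C sketch of that check; the
48-bit width is an assumption, since 5-level paging widens it:]

#include <stdbool.h>
#include <stdint.h>

/* True if va sign-extends from bit 47 (48-bit virtual addressing). */
static bool is_canonical_48(uint64_t va)
{
        return (uint64_t)((int64_t)(va << 16) >> 16) == va;
}

[Dereferencing a non-canonical address raises #GP rather than a page
fault, which is why a clobbered %rcx here produces a "general protection
fault" signature instead of "unable to handle kernel paging request".]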
Re: general protection fault in vmx_vcpu_run
On Sat, Apr 14, 2018 at 3:07 AM, syzbot wrote:
> syzbot has found a reproducer for the following crash on upstream commit
> 1bad9ce155a7c010a9a5f3261ad12a6a8eccfb2c (Fri Apr 13 19:27:11 2018 +)
> Merge tag 'sh-for-4.17' of git://git.libc.org/linux-sh
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550
>
> So far this crash happened 4 times on upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6257386297753600
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=4808329293463552
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=4943675322793984
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+cc483201a3c6436d3...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed.

#syz dup: BUG: unable to handle kernel paging request in vmx_vcpu_run

> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> 8021q: adding VLAN 0 to HW filter on device team0
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 6472 Comm: syzkaller667776 Not tainted 4.16.0+ #1
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746
> RSP: 0018:8801c95bf368 EFLAGS: 00010002
> RAX: 8801b44df6e8 RBX: 8801ada0ec40 RCX: 1100392b7e78
> RDX: RSI: 81467b15 RDI: 8801ada0ec50
> RBP: 8801b44df790 R08: 8801c4efe780 R09: fbfff1141218
> R10: fbfff1141218 R11: 88a090c3 R12: 8801b186aa90
> R13: 8801ae61e000 R14: dc00 R15: 8801ae61e3e0
> FS: 7fa147982700() GS:8801db00() knlGS:
> CS: 0010 DS: ES: CR0: 80050033
> CR2: CR3: 0001d780d000 CR4: 001426f0
> DR0: DR1: DR2:
> DR3: DR6: fffe0ff0 DR7: 0400
> Call Trace:
> Code: 8b a9 68 03 00 00 4c 8b b1 70 03 00 00 4c 8b b9 78 03 00 00 48 8b 89
> 08 03 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 48 89 4c 24 08 59 <0f> 96 81 88 56
> 00 00 48 89 81 00 03 00 00 48 89 99 18 03 00 00
> RIP: vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746 RSP: 8801c95bf368
> ---[ end trace ffd91ebc3bb06b01 ]---
> Kernel panic - not syncing: Fatal exception
> Shutting down cpus with NMI
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
Re: general protection fault in vmx_vcpu_run
syzbot has found a reproducer for the following crash on upstream commit
1bad9ce155a7c010a9a5f3261ad12a6a8eccfb2c (Fri Apr 13 19:27:11 2018 +)
Merge tag 'sh-for-4.17' of git://git.libc.org/linux-sh

syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=cc483201a3c6436d3550

So far this crash happened 4 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6257386297753600
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=4808329293463552
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=4943675322793984
Kernel config: https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+cc483201a3c6436d3...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 6472 Comm: syzkaller667776 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746
RSP: 0018:8801c95bf368 EFLAGS: 00010002
RAX: 8801b44df6e8 RBX: 8801ada0ec40 RCX: 1100392b7e78
RDX: RSI: 81467b15 RDI: 8801ada0ec50
RBP: 8801b44df790 R08: 8801c4efe780 R09: fbfff1141218
R10: fbfff1141218 R11: 88a090c3 R12: 8801b186aa90
R13: 8801ae61e000 R14: dc00 R15: 8801ae61e3e0
FS: 7fa147982700() GS:8801db00() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 0001d780d000 CR4: 001426f0
DR0: DR1: DR2:
DR3: DR6: fffe0ff0 DR7: 0400
Call Trace:
Code: 8b a9 68 03 00 00 4c 8b b1 70 03 00 00 4c 8b b9 78 03 00 00 48 8b 89 08 03 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 48 89 4c 24 08 59 <0f> 96 81 88 56 00 00 48 89 81 00 03 00 00 48 89 99 18 03 00 00
RIP: vmx_vcpu_run+0x95f/0x25f0 arch/x86/kvm/vmx.c:9746 RSP: 8801c95bf368
---[ end trace ffd91ebc3bb06b01 ]---
Kernel panic - not syncing: Fatal exception
Shutting down cpus with NMI
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..