Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8
On Wed, Apr 19, 2017 at 08:39:05PM +0100, Ben Hutchings wrote:
> On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
> [...]
> > Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
> > minutes even without doing any 'xl vcpu-set'?
>
> The MCE polling timer for each CPU runs every 5 minutes, so this is
> presumably the first time it runs. Perhaps this domain is configured
> such that CPUs are hot-removed shortly after boot?

I didn't explicitly set anything like that, but I guess it could also be
a default configuration in Xen.

> In the first crash, it looks like the timer for CPU x!=0 is being
> called on CPU 0. In general this can happen if CPU x is hot-removed;
> its timers are migrated to another CPU. This should *not* be possible
> with the MCE timer, as there is a hotplug callback that removes the
> timer when a CPU is removed. There is a check for the timer having
> been migrated anyway, which triggers the WARNING. The timer function
> then tries to re-add the timer for the current CPU, but that's still
> pending, which triggers the BUG. Either the hotplug callback was not
> called, or the timer was migrated before being removed resulting in a
> race condition.
>
> > With "maxvcpus" set larger "vcpus", xl vcpu-set seems to work most of
> > the time (between 1 and 16 vcpus), but after several tries, I got the
> > attached trace.
>
> I'm not sure what's going on in this crash, but as it's a null
> dereference in migrate_timer_list it seems somewhat related.
>
> I didn't find any changes that would explain how this was fixed between
> 4.0 and 4.2. I suggest you work around it by adding 'nomce' to the
> kernel command line as I would expect Xen or dom0 to handle MCEs.

Thanks a lot Ben, I can't reproduce the issue with 'nomce'.

Thanks,
Vincent
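To confirm from inside the guest that the 'nomce' workaround is actually in effect, the running kernel's command line can be inspected. The following is a minimal sketch, not part of the original report: the helper names `has_kernel_flag` and `read_cmdline` are made up for illustration, and the sample command line in the usage note is hypothetical. It relies only on `/proc/cmdline`, which Linux exposes for the booted kernel.

```python
def has_kernel_flag(cmdline, flag):
    """Return True if `flag` is a standalone token on the kernel command line.

    Splitting on whitespace avoids false matches such as 'foo=nomce'.
    """
    return flag in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    # Inside the domU: report whether the workaround is active.
    state = "set" if has_kernel_flag(read_cmdline(), "nomce") else "not set"
    print(f"nomce is {state}")
```

For example, `has_kernel_flag("root=/dev/xvda1 ro nomce", "nomce")` is true for that hypothetical command line, while the same call without the trailing `nomce` is false.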
Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8
On Fri, 2017-04-14 at 11:18 +0200, Vincent Legout wrote:
[...]
> Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
> minutes even without doing any 'xl vcpu-set'?

The MCE polling timer for each CPU runs every 5 minutes, so this is
presumably the first time it runs. Perhaps this domain is configured
such that CPUs are hot-removed shortly after boot?

In the first crash, it looks like the timer for CPU x!=0 is being
called on CPU 0. In general this can happen if CPU x is hot-removed;
its timers are migrated to another CPU. This should *not* be possible
with the MCE timer, as there is a hotplug callback that removes the
timer when a CPU is removed. There is a check for the timer having
been migrated anyway, which triggers the WARNING. The timer function
then tries to re-add the timer for the current CPU, but that's still
pending, which triggers the BUG. Either the hotplug callback was not
called, or the timer was migrated before being removed resulting in a
race condition.

> With "maxvcpus" set larger "vcpus", xl vcpu-set seems to work most of
> the time (between 1 and 16 vcpus), but after several tries, I got the
> attached trace.

I'm not sure what's going on in this crash, but as it's a null
dereference in migrate_timer_list it seems somewhat related.

I didn't find any changes that would explain how this was fixed between
4.0 and 4.2. I suggest you work around it by adding 'nomce' to the
kernel command line as I would expect Xen or dom0 to handle MCEs.

Ben.

-- 
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily Tomlin
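The sequence described above (timer migrated on hot-remove, WARNING in the timer function, BUG on re-adding a pending timer) can be modelled in a few lines of Python. This is only a toy sketch of that explanation, not kernel code: the `Timer` class, `TimerBug`, `migrate_timers`, and `simulate` are invented for illustration, and `add_timer_on` / `mce_timer_fn` merely borrow the names seen in the traces. It assumes two CPUs and that the handler re-arms the timer of the CPU it is running on, as the explanation states.

```python
class TimerBug(Exception):
    """Stands in for the kernel's BUG(): a pending timer was added again."""


class Timer:
    def __init__(self, owner_cpu):
        self.owner_cpu = owner_cpu  # CPU this per-CPU timer belongs to
        self.queued_on = None       # CPU whose timer list holds it, or None


def add_timer_on(timer, cpu):
    # Models the check behind the reported BUG: adding a pending timer is fatal.
    if timer.queued_on is not None:
        raise TimerBug(f"timer for CPU {timer.owner_cpu} is already pending")
    timer.queued_on = cpu


def migrate_timers(timers, dead_cpu, target_cpu):
    # On hot-remove, timers still queued on the dead CPU move to another CPU.
    for t in timers:
        if t.queued_on == dead_cpu:
            t.queued_on = target_cpu


def mce_timer_fn(timers, running_cpu, fired, log):
    fired.queued_on = None  # a firing timer is dequeued before its handler runs
    if fired.owner_cpu != running_cpu:
        log.append(f"WARNING: timer for CPU {fired.owner_cpu} ran on CPU {running_cpu}")
    # The handler re-arms the *current* CPU's timer, not the one that fired.
    add_timer_on(timers[running_cpu], running_cpu)


def simulate(remove_timer_on_hotplug):
    log = []
    timers = {cpu: Timer(cpu) for cpu in (0, 1)}
    for cpu, t in timers.items():
        add_timer_on(t, cpu)
    # CPU 1 is hot-removed; the hotplug callback is meant to delete its timer.
    if remove_timer_on_hotplug:
        timers[1].queued_on = None
    migrate_timers(timers.values(), dead_cpu=1, target_cpu=0)
    # First poll interval expires: everything queued on CPU 0 fires there.
    fired = [t for t in timers.values() if t.queued_on == 0]
    try:
        for t in fired:
            mce_timer_fn(timers, 0, t, log)
    except TimerBug as exc:
        log.append(f"BUG: {exc}")
    return log
```

With `simulate(True)` the callback removes CPU 1's timer before migration and the log stays empty. With `simulate(False)` the stray timer lands on CPU 0, which produces first the WARNING and then the BUG, in the same order as in the explanation.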
Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8
On Fri, Apr 14, 2017 at 09:15:58AM +0200, Vincent Legout wrote:
> On Thu, Apr 13, 2017 at 11:41:37PM +0100, Ben Hutchings wrote:
> > Control: tag -1 moreinfo
> >
> > On Thu, 2017-04-13 at 11:18 +0200, Vincent Legout wrote:
> > > Package: src:linux
> > > Version: 3.16.39-1+deb8u2
> > > Severity: normal
> > >
> > > Hi,
> > >
> > > A xen jessie domU crashes around 5 minutes after the boot with the
> > > attached backtrace (at every boot). dom0 is also a Debian jessie running
> > > Xen 4.8.
> > >
> > > It only happens when the guest is in pv mode, it works fine with pvhvm.
> > >
> > > It also crashes with older 3.16 kernels and 4.0.2-1, but not with
> > > 4.2.1-1 (last 2 kernels from snapshot.debian.org).
> > >
> > > # uname -a
> > > 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64
> > > GNU/Linux
> >
> > From the crash log:
> >
> > > [ 300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW
> > > 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
> >
> > This indicates there was an earlier WARNING message; what was that?
>
> Thanks for the answer.
>
> I got this WARNING after I increased verbosity in the command line:

The WARNING and BUG disappear if "maxvcpus" is disabled in the guest
configuration (which prevents adding or removing vcpus).

Could cpu hotplug be buggy in 3.16? And Xen triggers this bug after 5
minutes even without doing any 'xl vcpu-set'?

With "maxvcpus" set larger "vcpus", xl vcpu-set seems to work most of
the time (between 1 and 16 vcpus), but after several tries, I got the
attached trace.
Vincent

[ 62.000210] BUG: unable to handle kernel NULL pointer dereference at 0008
[ 62.000229] IP: [] migrate_timer_list+0x3b/0xc0
[ 62.000246] PGD 0
[ 62.000251] Oops: 0002 [#1] SMP
[ 62.000261] Modules linked in: x86_pkg_temp_thermal thermal_sys intel_rapl coretemp crc32_pclmul evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper pcspkr cryptd autofs4 ext4 crc16 mbcache jbd2 xen_netfront xen_blkfront crct10dif_pclmul crct10dif_common crc32c_intel
[ 62.000306] CPU: 9 PID: 89 Comm: xenwatch Not tainted 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[ 62.000318] task: 88003d597370 ti: 88003d598000 task.ti: 88003d598000
[ 62.000326] RIP: e030:[] [] migrate_timer_list+0x3b/0xc0
[ 62.000338] RSP: e02b:88003d59bd70 EFLAGS: 00010087
[ 62.000344] RAX: dead0200 RBX: RCX: 223a
[ 62.000351] RDX: RSI: 88003f96ca00 RDI: 88003daac000
[ 62.000357] RBP: 88003f96ca00 R08: 4000 R09: fff8
[ 62.000364] R10: R11: R12: 88003e3a5430
[ 62.000375] R13: 88003daac000 R14: 818e2fa0 R15: 88003e3a5030
[ 62.000387] FS: () GS:88003f92() knlGS:
[ 62.000395] CS: e033 DS: ES: CR0: 80050033
[ 62.000401] CR2: 0008 CR3: 01813000 CR4: 00042660
[ 62.000408] Stack:
[ 62.000411] 88003daac000 88003e3a5c30 88003e3a5830
[ 62.000422] 88003e3a5430 81075188 88003e3a4000 fff2
[ 62.000432] 8184c1a0 0007 0001
[ 62.000443] Call Trace:
[ 62.000455] [] ? timer_cpu_notify+0xf8/0x2e0
[ 62.000465] [] ? notifier_call_chain+0x4e/0x70
[ 62.000478] [] ? cpu_notify+0x1f/0x40
[ 62.000486] [] ? cpu_notify_nofail+0xa/0x20
[ 62.000499] [] ? _cpu_down+0x17b/0x290
[ 62.000512] [] ? unregister_xenbus_watch+0x210/0x210
[ 62.000520] [] ? cpu_down+0x2d/0x40
[ 62.000530] [] ? handle_vcpu_hotplug_event+0xa7/0xd0
[ 62.000538] [] ? xenwatch_thread+0x92/0x130
[ 62.000550] [] ? prepare_to_wait_event+0xf0/0xf0
[ 62.000565] [] ? kthread+0xbd/0xe0
[ 62.000572] [] ? kthread_create_on_node+0x180/0x180
[ 62.000586] [] ? ret_from_fork+0x58/0x90
[ 62.000594] [] ? kthread_create_on_node+0x180/0x180
[ 62.000600] Code: 49 89 fd 41 54 49 89 f4 55 53 48 8b 2e 48 39 ee 74 4a 66 0f 1f 44 00 00 0f 1f 44 00 00 48 8b 45 08 48 8b 55 00 48 89 ee 4c 89 ef <48> 89 42 08 48 89 10 48 b8 00 02 00 00 00 00 ad de 48 89 45 08
[ 62.000680] RIP [] migrate_timer_list+0x3b/0xc0
[ 62.000692] RSP
[ 62.000696] CR2: 0008
[ 62.000703] ---[ end trace b62387850d17f99e ]---
[ 84.492006] INFO: rcu_sched detected stalls on CPUs/tasks: { 2 8 9} (detected by 4, t=5255 jiffies, g=614, c=613, q=59)
[ 84.492039] sending NMI to all CPUs:
[ 63.481417] NMI backtrace for cpu 0
[ 63.481417] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[ 63.481417] task: 8181a460 ti: 8180 task.ti: 8180
[ 63.481417] RIP: e030:[] [] _raw_spin_lock+0x28/0x30
[ 63.481417] RSP: e02b:88003f803b58 EFLAGS: 0093
[ 63.481417] RAX: 0198 RBX: 88003af9d3d8 RCX: 019b
[ 63.481417]
Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8
On Thu, Apr 13, 2017 at 11:41:37PM +0100, Ben Hutchings wrote:
> Control: tag -1 moreinfo
>
> On Thu, 2017-04-13 at 11:18 +0200, Vincent Legout wrote:
> > Package: src:linux
> > Version: 3.16.39-1+deb8u2
> > Severity: normal
> >
> > Hi,
> >
> > A xen jessie domU crashes around 5 minutes after the boot with the
> > attached backtrace (at every boot). dom0 is also a Debian jessie running
> > Xen 4.8.
> >
> > It only happens when the guest is in pv mode, it works fine with pvhvm.
> >
> > It also crashes with older 3.16 kernels and 4.0.2-1, but not with
> > 4.2.1-1 (last 2 kernels from snapshot.debian.org).
> >
> > # uname -a
> > 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux
>
> From the crash log:
>
> > [ 300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW
> > 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
>
> This indicates there was an earlier WARNING message; what was that?

Thanks for the answer.

I got this WARNING after I increased verbosity in the command line:

[ 300.636063] [ cut here ]
[ 300.636102] WARNING: CPU: 0 PID: 0 at /build/linux-GSgHvp/linux-3.16.39/arch/x86/kernel/cpu/mcheck/mce.c:1307 mce_timer_fn+0x132/0x140()
[ 300.636116] Modules linked in: x86_pkg_temp_thermal thermal_sys intel_rapl coretemp crc32_pclmul evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper pcspkr cryptd autofs4 ext4 crc16 mbcache jbd2 xen_netfront xen_blkfront crct10dif_pclmul crct10dif_common crc32c_intel
[ 300.636167] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[ 300.636178] 81514c81 0009
[ 300.636188] 81068867 88003f80ca00 88003f9eca00 0100
[ 300.636199] 81038a30 000f 81038b62 81a66e00
[ 300.636211] Call Trace:
[ 300.636216][] ? dump_stack+0x5d/0x78
[ 300.636242] [] ? warn_slowpath_common+0x77/0x90
[ 300.636250] [] ? mce_cpu_restart+0x40/0x40
[ 300.636257] [] ? mce_timer_fn+0x132/0x140
[ 300.636267] [] ? call_timer_fn+0x31/0x140
[ 300.636274] [] ? mce_cpu_restart+0x40/0x40
[ 300.636284] [] ? run_timer_softirq+0x1e9/0x2f0
[ 300.636292] [] ? __do_softirq+0xf1/0x2d0
[ 300.636299] [] ? irq_exit+0x95/0xa0
[ 300.636309] [] ? xen_evtchn_do_upcall+0x35/0x50
[ 300.636319] [] ? xen_do_hypervisor_callback+0x1e/0x30
[ 300.636324][] ? xen_hypercall_sched_op+0xc/0x20
[ 300.636339] [] ? xen_hypercall_sched_op+0xc/0x20
[ 300.636349] [] ? xen_safe_halt+0xc/0x20
[ 300.636360] [] ? default_idle+0x19/0xd0
[ 300.636370] [] ? cpu_startup_entry+0x374/0x470
[ 300.636384] [] ? start_kernel+0x497/0x4a2
[ 300.636392] [] ? set_init_arg+0x4e/0x4e
[ 300.636400] [] ? xen_start_kernel+0x569/0x573
[ 300.636413] ---[ end trace 7131ef713ca84161 ]---

Then, the same BUG as before. It always happens after 300 seconds.

Vincent
Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8
Control: tag -1 moreinfo

On Thu, 2017-04-13 at 11:18 +0200, Vincent Legout wrote:
> Package: src:linux
> Version: 3.16.39-1+deb8u2
> Severity: normal
>
> Hi,
>
> A xen jessie domU crashes around 5 minutes after the boot with the
> attached backtrace (at every boot). dom0 is also a Debian jessie running
> Xen 4.8.
>
> It only happens when the guest is in pv mode, it works fine with pvhvm.
>
> It also crashes with older 3.16 kernels and 4.0.2-1, but not with
> 4.2.1-1 (last 2 kernels from snapshot.debian.org).
>
> # uname -a
> 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux

From the crash log:

> [ 300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW
> 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2

This indicates there was an earlier WARNING message; what was that?

Ben.

-- 
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.
Bug#860236: xen pv domU crash with 3.16 kernel and xen 4.8
Package: src:linux
Version: 3.16.39-1+deb8u2
Severity: normal

Hi,

A xen jessie domU crashes around 5 minutes after the boot with the
attached backtrace (at every boot). dom0 is also a Debian jessie running
Xen 4.8.

It only happens when the guest is in pv mode, it works fine with pvhvm.

It also crashes with older 3.16 kernels and 4.0.2-1, but not with
4.2.1-1 (last 2 kernels from snapshot.debian.org).

# uname -a
3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux

Vincent

[ 300.632313] kernel BUG at /build/linux-GSgHvp/linux-3.16.39/kernel/timer.c:946!
[ 300.632320] invalid opcode: [#1] SMP
[ 300.632326] Modules linked in: fuse btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs libcrc32c crc32c_generic dm_mod x86_pkg_temp_thermal thermal_sys intel_rapl coretemp crc32_pclmul evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper pcspkr ablk_helper cryptd autofs4 ext4 crc16 mbcache jbd2 crct10dif_pclmul crct10dif_common xen_netfront xen_blkfront crc32c_intel
[ 300.632389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
[ 300.632396] task: 8181a460 ti: 8180 task.ti: 8180
[ 300.632403] RIP: e030:[] [] add_timer_on+0xea/0x100
[ 300.632415] RSP: e02b:88003f603e78 EFLAGS: 00010282
[ 300.632422] RAX: RBX: 81a66e00 RCX: 0001000125c4
[ 300.632428] RDX: 88003f60 RSI: RDI: 88003f60ca00
[ 300.632434] RBP: 88003f60ca00 R08: 0001009f R09: 88003f603de0
[ 300.632441] R10: 88003f603de4 R11: dfbfefff R12: 81a66e00
[ 300.632448] R13: 81038a30 R14: R15:
[ 300.632462] FS: () GS:88003f60() knlGS:88003f60
[ 300.632469] CS: e033 DS: ES: CR0: 80050033
[ 300.632476] CR2: 01bf6808 CR3: 0008 CR4: 00042660
[ 300.632483] Stack:
[ 300.632487] 81a66e00 88003f7eca00 0100 81038a30
[ 300.632499] 000f 81073ea1 81a66e00
[ 300.632509] 88003f7eca00 0001 81038a30 000f
[ 300.632521] Call Trace:
[ 300.632525]
[ 300.632530] [] ? mce_cpu_restart+0x40/0x40
[ 300.632543] [] ? call_timer_fn+0x31/0x140
[ 300.632553] [] ? mce_cpu_restart+0x40/0x40
[ 300.632563] [] ? run_timer_softirq+0x1e9/0x2f0
[ 300.632570] [] ? __do_softirq+0xf1/0x2d0
[ 300.632577] [] ? irq_exit+0x95/0xa0
[ 300.632584] [] ? xen_evtchn_do_upcall+0x35/0x50
[ 300.632595] [] ? xen_do_hypervisor_callback+0x1e/0x30
[ 300.632600]
[ 300.632603] [] ? xen_hypercall_sched_op+0xc/0x20
[ 300.632614] [] ? xen_hypercall_sched_op+0xc/0x20
[ 300.632623] [] ? xen_safe_halt+0xc/0x20
[ 300.632631] [] ? default_idle+0x19/0xd0
[ 300.632640] [] ? cpu_startup_entry+0x374/0x470
[ 300.632650] [] ? start_kernel+0x497/0x4a2
[ 300.632657] [] ? set_init_arg+0x4e/0x4e
[ 300.632665] [] ? xen_start_kernel+0x569/0x573
[ 300.632674] Code: a6 85 00 48 85 db 74 21 48 8b 03 66 0f 1f 44 00 00 48 8b 7b 08 48 83 c3 10 4c 89 ea 48 89 ee ff d0 48 8b 03 48 85 c0 75 e8 eb 87 <0f> 0b 48 8b 74 24 30 e8 3a fe ff ff e9 3e ff ff ff 0f 1f 44 00
[ 300.632756] RIP [] add_timer_on+0xea/0x100
[ 300.632766] RSP
[ 300.632779] ---[ end trace 77fe5db1be9d3b29 ]---
[ 300.632790] Kernel panic - not syncing: Fatal exception in interrupt
[ 300.632803] Kernel Offset: 0x0 from 0x8100 (relocation range: 0x8000-0x9fff)