[Kernel-packages] [Bug 1834505] Re: Random kernel panics in QEMU instances when running on EPYC architecture

2019-06-28 Thread Louis Bouchard
** Changed in: linux (Ubuntu)
   Status: Incomplete => Opinion

** Changed in: linux (Ubuntu)
   Status: Opinion => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834505

Title:
  Random kernel panics in QEMU instances when running on EPYC
  architecture

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  We have been seeing many kernel panics in QEMU instances on newly
  deployed servers, all running on the EPYC architecture. Many of the
  panics occur shortly after the QEMU process starts or within a few hours.

  All the servers are running an up-to-date Bionic. After the first few
  panics on 4.15 kernels, we upgraded to 4.18 and still had panics.

  The underlying servers run nothing but QEMU processes. Here is a
  typical backtrace of a kernel panic:

  [58034.598930] BUG: unable to handle kernel paging request at 943276c49f64
  [58034.612039] PGD 8363067 P4D 8363067 PUD 8367067 PMD 759ac063 PTE 800076c49163
  [58034.612992] Oops:  [#1] SMP NOPTI
  [58034.613462] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-20-generic #21~18.04.1-Ubuntu
  [58034.614685] Hardware name: Scaleway Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
  [58034.615803] RIP: 0010:sched_ttwu_pending+0x6b/0xe0
  [58034.616385] Code: 4b a1 93 00 41 83 a4 24 98 09 00 00 03 4c 89 e7 48 89 45 d8 c7 45 e0 00 00 00 00 e8 ef ca ff ff 48 8d 73 d0 48 83 fe d0 74 2c <0f> b6 96 64 08 00 00 48 8b 46 30 48 8d 4d d8 4c 89 e7 48 8d 5>
  [58034.618537] RSP: 0018:94327f803f90 EFLAGS: 00010087
  [58034.619134] RAX: 34c83ba48a83 RBX: 943276c49730 RCX:
  [58034.619992] RDX: 2a12 RSI: 943276c49700 RDI: 94327fe3e000
  [58034.620913] RBP: 94327f803fb8 R08: 3753c7b87c93 R09:
  [58034.621835] R10:  R11:  R12: 94327f822c00
  [58034.622723] R13:  R14:  R15:
  [58034.623539] FS:  () GS:94327f80() knlGS:
  [58034.624513] CS:  0010 DS:  ES:  CR0: 80050033
  [58034.625214] CR2: 943276c49f64 CR3: 76544000 CR4: 003406f0
  [58034.626268] Call Trace:
  [58034.626605]  
  [58034.626924]  scheduler_ipi+0xa9/0x130
  [58034.627400]  smp_reschedule_interrupt+0x39/0xe0
  [58034.627925]  reschedule_interrupt+0xf/0x20
  [58034.628398]  
  [58034.628659] RIP: 0010:rcu_idle_exit+0x40/0x70
  [58034.629181] Code: fa 66 0f 1f 44 00 00 48 c7 c3 80 a6 01 00 65 48 03 1d ec 4b 50 4e 48 8b 03 48 85 c0 74 16 48 83 c0 01 48 89 03 4c 89 e7 57 9d <0f> 1f 44 00 00 5b 41 5c 5d c3 e8 a1 c6 ff ff 48 b8 00 00 00 0>
  [58034.631577] RSP: 0018:b2e03e48 EFLAGS: 0202 ORIG_RAX: ff02
  [58034.632566] RAX: 4000 RBX: 94327f81a680 RCX: 94327f81a680
  [58034.633557] RDX:  RSI: 94327f81a680 RDI: 0202
  [58034.634416] RBP: b2e03e58 R08: 0001cc00 R09: 0001
  [58034.635292] R10: b2e03d98 R11:  R12: 0202
  [58034.636107] R13:  R14:  R15: 7e369c93
  [58034.637052]  do_idle+0x13f/0x280
  [58034.637453]  cpu_startup_entry+0x73/0x80
  [58034.638067]  rest_init+0xae/0xb0
  [58034.638506]  start_kernel+0x539/0x55a
  [58034.638938]  x86_64_start_reservations+0x24/0x26
  [58034.639470]  x86_64_start_kernel+0x74/0x77
  [58034.639958]  secondary_startup_64+0xa5/0xb0
  [58034.640441] Modules linked in: cfg80211 veth xt_nat xt_tcpudp ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_fi>

  None of the underlying servers has experienced a kernel panic itself.

  After analysing many of the crashes and seeing that many of them
  occurred while handling interrupts, we decided to reboot the instances
  with the noapic kernel parameter.
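
  The triage described here, checking whether a crash happened while
  handling an interrupt, could be roughly automated. What follows is a
  minimal Python sketch, not the method actually used for this report:
  the INTERRUPT_HINTS substrings and both function names are assumptions
  made for illustration, and the sample lines are copied from the trace
  in this report.

```python
import re

# Matches "symbol+0xOFFSET/0xSIZE" entries as printed in the trace.
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Assumption: hand-picked substrings that suggest interrupt handling.
INTERRUPT_HINTS = ("interrupt", "_ipi", "irq")


def frames(oops_text):
    """Return every symbol named in a symbol+offset/size entry, in order."""
    return FRAME_RE.findall(oops_text)


def looks_interrupt_related(oops_text):
    """Heuristic: True if any symbol name contains one of the hints."""
    return any(h in name for name in frames(oops_text) for h in INTERRUPT_HINTS)


# A few lines copied from the backtrace in this report.
SAMPLE = """\
[58034.615803] RIP: 0010:sched_ttwu_pending+0x6b/0xe0
[58034.626924]  scheduler_ipi+0xa9/0x130
[58034.627400]  smp_reschedule_interrupt+0x39/0xe0
[58034.637052]  do_idle+0x13f/0x280
"""

if __name__ == "__main__":
    print(frames(SAMPLE))
    print(looks_interrupt_related(SAMPLE))
```

  Matching on symbol names is only a heuristic; it is used here because
  it needs nothing beyond the symbol+offset entries that every frame in
  the trace carries.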

  No kernel panic has been seen since, but this is only a workaround
  until a solution is found.
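
  Since the workaround only holds if every instance really boots with
  noapic, a guest can be checked by reading its kernel command line from
  /proc/cmdline. A minimal Python sketch, assuming a Linux guest; the
  helper names are hypothetical.

```python
def has_kernel_param(cmdline, param="noapic"):
    """True if `param` appears as a standalone token on the command line."""
    # Token comparison avoids matching longer parameters that merely
    # start with the same letters.
    return param in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Return the command line the running Linux kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    if has_kernel_param(running_cmdline()):
        print("noapic is set")
    else:
        print("noapic is NOT set")
```

  Run inside each instance after the reboot; splitting on whitespace
  means a parameter such as noapictimer is not mistaken for noapic.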

  AMD has been informed of the issue.

  TIA,

  ...Louis (aka Caribou)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1834505/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1834505] Re: Random kernel panics in QEMU instances when running on EPYC architecture

2019-06-27 Thread Louis Bouchard
** Description changed:

  We have been seeing many kernel panics in QEMU instances on newly
  deployed servers, all running on the EPYC architecture. Many of the
  panics occur shortly after the QEMU process starts or within a few hours.
+ 
+ All the servers are running an up-to-date Bionic. After the first few
+ panics on 4.15 kernels, we upgraded to 4.18 and still had panics.
  
  The underlying servers run nothing but QEMU processes. Here is a
  typical backtrace of a kernel panic:
  
  [58034.598930] BUG: unable to handle kernel paging request at 943276c49f64
  [58034.612039] PGD 8363067 P4D 8363067 PUD 8367067 PMD 759ac063 PTE 800076c49163
  [58034.612992] Oops:  [#1] SMP NOPTI
  [58034.613462] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-20-generic #21~18.04.1-Ubuntu
  [58034.614685] Hardware name: Scaleway Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
  [58034.615803] RIP: 0010:sched_ttwu_pending+0x6b/0xe0
  [58034.616385] Code: 4b a1 93 00 41 83 a4 24 98 09 00 00 03 4c 89 e7 48 89 45 d8 c7 45 e0 00 00 00 00 e8 ef ca ff ff 48 8d 73 d0 48 83 fe d0 74 2c <0f> b6 96 64 08 00 00 48 8b 46 30 48 8d 4d d8 4c 89 e7 48 8d 5>
  [58034.618537] RSP: 0018:94327f803f90 EFLAGS: 00010087
  [58034.619134] RAX: 34c83ba48a83 RBX: 943276c49730 RCX:
  [58034.619992] RDX: 2a12 RSI: 943276c49700 RDI: 94327fe3e000
  [58034.620913] RBP: 94327f803fb8 R08: 3753c7b87c93 R09:
  [58034.621835] R10:  R11:  R12: 94327f822c00
  [58034.622723] R13:  R14:  R15:
  [58034.623539] FS:  () GS:94327f80() knlGS:
  [58034.624513] CS:  0010 DS:  ES:  CR0: 80050033
  [58034.625214] CR2: 943276c49f64 CR3: 76544000 CR4: 003406f0
  [58034.626268] Call Trace:
  [58034.626605]  
  [58034.626924]  scheduler_ipi+0xa9/0x130
  [58034.627400]  smp_reschedule_interrupt+0x39/0xe0
  [58034.627925]  reschedule_interrupt+0xf/0x20
  [58034.628398]  
  [58034.628659] RIP: 0010:rcu_idle_exit+0x40/0x70
  [58034.629181] Code: fa 66 0f 1f 44 00 00 48 c7 c3 80 a6 01 00 65 48 03 1d ec 4b 50 4e 48 8b 03 48 85 c0 74 16 48 83 c0 01 48 89 03 4c 89 e7 57 9d <0f> 1f 44 00 00 5b 41 5c 5d c3 e8 a1 c6 ff ff 48 b8 00 00 00 0>
  [58034.631577] RSP: 0018:b2e03e48 EFLAGS: 0202 ORIG_RAX: ff02
  [58034.632566] RAX: 4000 RBX: 94327f81a680 RCX: 94327f81a680
  [58034.633557] RDX:  RSI: 94327f81a680 RDI: 0202
  [58034.634416] RBP: b2e03e58 R08: 0001cc00 R09: 0001
  [58034.635292] R10: b2e03d98 R11:  R12: 0202
  [58034.636107] R13:  R14:  R15: 7e369c93
  [58034.637052]  do_idle+0x13f/0x280
  [58034.637453]  cpu_startup_entry+0x73/0x80
  [58034.638067]  rest_init+0xae/0xb0
  [58034.638506]  start_kernel+0x539/0x55a
  [58034.638938]  x86_64_start_reservations+0x24/0x26
  [58034.639470]  x86_64_start_kernel+0x74/0x77
  [58034.639958]  secondary_startup_64+0xa5/0xb0
  [58034.640441] Modules linked in: cfg80211 veth xt_nat xt_tcpudp ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_fi>
  
  None of the underlying servers has experienced a kernel panic itself.
  
  After analysing many of the crashes and seeing that many of them
  occurred while handling interrupts, we decided to reboot the instances
  with the noapic kernel parameter.
  
  No kernel panic has been seen since, but this is only a workaround
  until a solution is found.
  
  AMD has been informed of the issue.
  
  TIA,
  
  ...Louis (aka Caribou)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834505

Title:
  Random kernel panics in QEMU instances when running on EPYC
  architecture

Status in linux package in Ubuntu:
  Incomplete
