Re: [CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state

2017-11-15 Thread George Dunlap
Nathan,

Thanks for the report.  Would you mind re-posting this to the
xen-users mailing list?  You're much more likely to get someone there
who's seen such a bug before.

 -George

On Tue, Nov 7, 2017 at 11:12 PM, Nathan March wrote:
> Since moving from 4.4 to 4.6, I’ve been seeing an increasing number of
> stability issues on our hypervisors. I’m not clear if there’s a singular
> root cause here, or if I’m dealing with multiple bugs…
>
> One of the more common ones I've seen is that a VM will, on shutdown,
> remain in the null state and a kernel bug is thrown:
>
> xen001 log # xl list
>
> Name                                        ID   Mem VCPUs      State   Time(s)
> Domain-0                                     0  614424          r-       6639.7
> (null)                                       3     0     1      --pscd     36.3
>
> [89920.839074] BUG: unable to handle kernel paging request at 88020ee9a000
> [89920.839546] IP: [] __memcpy+0x12/0x20
> [89920.839933] PGD 2008067
> [89920.840022] PUD 17f43f067
> [89920.840390] PMD 1e0976067
> [89920.840469] PTE 0
> [89920.840833]
> [89920.841123] Oops:  [#1] SMP
> [89920.841417] Modules linked in: ebt_ip ebtable_filter ebtables arptable_filter arp_tables bridge xen_pciback xen_gntalloc nfsd auth_rpcgss nfsv3 nfs_acl nfs fscache lockd sunrpc grace 8021q mrp garp stp llc bonding xen_acpi_processor blktap xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd dcdbas fjes pcspkr ipmi_devintf ipmi_si ipmi_msghandler joydev i2c_i801 i2c_smbus lpc_ich shpchp mei_me mei ioatdma ixgbe mdio igb dca ptp pps_core uas usb_storage wmi ttm
> [89920.847080] CPU: 4 PID: 1471 Comm: loop6 Not tainted 4.9.58-29.el6.x86_64 #1
> [89920.847381] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS 2.7.1 03/04/2015
> [89920.847893] task: 8801b75e0700 task.stack: c900460e
> [89920.848192] RIP: e030:[]  [] __memcpy+0x12/0x20
> [89920.848783] RSP: e02b:c900460e3b20  EFLAGS: 00010246
> [89920.849081] RAX: 88018916d000 RBX: 8801b75e0700 RCX: 0200
> [89920.849384] RDX:  RSI: 88020ee9a000 RDI: 88018916d000
> [89920.849686] RBP: c900460e3b38 R08: 88011da9fcf8 R09: 0002
> [89920.849989] R10: 88019535bddc R11: ea0006245b5c R12: 1000
> [89920.850294] R13: 88018916e000 R14: 1000 R15: c900460e3b68
> [89920.850605] FS:  7fb865c30700() GS:880204b0() knlGS:
> [89920.851118] CS:  e033 DS:  ES:  CR0: 80050033
> [89920.851418] CR2: 88020ee9a000 CR3: 0001ef03b000 CR4: 00042660
> [89920.851720] Stack:
> [89920.852009]  814375ca c900460e3b38 c900460e3d08 c900460e3bb8
> [89920.852821]  814381c5 c900460e3b68 c900460e3d08 1000
> [89920.853633]  c900460e3d88  1000 ea00
> [89920.854445] Call Trace:
> [89920.854741]  [] ? memcpy_from_page+0x3a/0x70
> [89920.855043]  [] iov_iter_copy_from_user_atomic+0x265/0x290
> [89920.855354]  [] generic_perform_write+0xf3/0x1d0
> [89920.855673]  [] ? xen_load_tls+0xaa/0x160
> [89920.855992]  [] nfs_file_write+0xdb/0x200 [nfs]
> [89920.856297]  [] vfs_iter_write+0xa2/0xf0
> [89920.856599]  [] lo_write_bvec+0x65/0x100
> [89920.856899]  [] do_req_filebacked+0x195/0x300
> [89920.857202]  [] loop_queue_work+0x5b/0x80
> [89920.857505]  [] kthread_worker_fn+0x98/0x1b0
> [89920.857808]  [] ? schedule+0x3a/0xa0
> [89920.858108]  [] ? _raw_spin_unlock_irqrestore+0x16/0x20
> [89920.858411]  [] ? kthread_probe_data+0x40/0x40
> [89920.858713]  [] kthread+0xe5/0x100
> [89920.859014]  [] ? __kthread_init_worker+0x40/0x40
> [89920.859317]  [] ret_from_fork+0x25/0x30
> [89920.859615] Code: 81 f3 00 00 00 00 e9 1e ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07  48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
> [89920.864410] RIP  [] __memcpy+0x12/0x20
> [89920.864749]  RSP 
> [89920.865021] CR2: 88020ee9a000
> [89920.865294] ---[ end trace b77d2ce5646284d1 ]---
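>
> (For completeness: the faulting instruction should be decodable from the
> Code: line above with the decodecode helper that ships in the kernel source
> tree; something like the following, assuming a checkout matching 4.9.58:)
>
>   # paste the oops text (or just the Code: line) on stdin
>   scripts/decodecode < oops.txt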
>
> Wondering if anyone has advice on how to troubleshoot the above, or might
> have some insight into what the issue could be? This hypervisor had only
> been up for a day and had almost no VMs running on it since boot; I booted
> a single Windows test VM, which BSOD'ed, and then this happened.
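>
> In case it helps frame any suggestions: the only poking I know to do from
> the toolstack side is roughly the following (a sketch, not a recipe; I'm
> not sure how much any of it reveals for a (null)/zombie domain):
>
>   # try to force-kill the stuck domain by its id (3 in the listing above)
>   xl destroy 3
>
>   # have Xen dump its view of every domain ('q' is the "dump domain and
>   # guest state" debug key), then read the output from the Xen console log
>   xl debug-keys q
>   xl dmesg
>
>   # see whether any xenstore state was left behind for the domain
>   xenstore-ls /local/domain/3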
>
> This is on Xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these
> issues across a wide range of systems from both Dell and Supermicro,
> although we run the same Intel X540 10Gb NICs in each system with the same
> NetApp NFS backend storage.
>
> Cheers,
>
> Nathan

Re: [CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state

2017-11-07 Thread Sarah Newman
On 11/07/2017 04:57 PM, Sarah Newman wrote:
> On 11/07/2017 03:12 PM, Nathan March wrote:
>> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
>> stability issues on our hypervisors. I'm not clear if there's a singular
>> root cause here, or if I'm dealing with multiple bugs.
>>
>> One of the more common ones I've seen is that a VM will, on shutdown,
>> remain in the null state and a kernel bug is thrown:
>>
>> xen001 log # xl list
>>
>> Name                                        ID   Mem VCPUs      State   Time(s)
>> Domain-0                                     0  614424          r-       6639.7
>> (null)                                       3     0     1      --pscd     36.3
>>
>> [89920.839074] BUG: unable to handle kernel paging request at 88020ee9a000
>>
>> [...]
>>
>> This is on Xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these
>> issues across a wide range of systems from both Dell and Supermicro,
>> although we run the same Intel X540 10Gb NICs in each system with the same
>> NetApp NFS backend storage.
> 
> We don't use NFS and have not seen the exact same issue.

Additionally, we aren't using Xen 4.6 any more (we're on 4.8), but we didn't
see issues like this when we were running 4.6. We're also still on kernel
4.9.39. You might try an older kernel or a newer version of Xen, in addition
to looking for NFS-specific issues.
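
For example, something along these lines on CentOS 6 (untested from here, and
assuming the Virt SIG repo/package names; adjust to whatever your mirrors
actually carry):

  # roll the dom0 kernel back to e.g. 4.9.39 (the version we still run)
  yum downgrade kernel-4.9.39

  # or move to the Xen 4.8 packages from the Virt SIG
  yum install centos-release-xen-48
  yum update xen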

--Sarah


Re: [CentOS-virt] Stability issues since moving to 4.6 - Kernel paging request bug + VM left in null state

2017-11-07 Thread Sarah Newman
On 11/07/2017 03:12 PM, Nathan March wrote:
> Since moving from 4.4 to 4.6, I've been seeing an increasing number of
> stability issues on our hypervisors. I'm not clear if there's a singular
> root cause here, or if I'm dealing with multiple bugs.
>
> One of the more common ones I've seen is that a VM will, on shutdown,
> remain in the null state and a kernel bug is thrown:
>
> xen001 log # xl list
>
> Name                                        ID   Mem VCPUs      State   Time(s)
> Domain-0                                     0  614424          r-       6639.7
> (null)                                       3     0     1      --pscd     36.3
>
> [89920.839074] BUG: unable to handle kernel paging request at 88020ee9a000
>
> [...]
>
> This is on Xen 4.6.6-4.el6 with kernel 4.9.58-29.el6.x86_64. I see these
> issues across a wide range of systems from both Dell and Supermicro,
> although we run the same Intel X540 10Gb NICs in each system with the same
> NetApp NFS backend storage.

We don't use NFS and have not seen the exact same issue.

--Sarah