Bug#786936: xen-hypervisor-4.4-amd64: Upgrade dom0 from wheezy to jessie on Dell R610 results in dom0 unaccessible with xen_netback issue

2015-05-26 Thread Andrew Perry
Package: xen-hypervisor-4.4-amd64
Version: 4.4.1-9
Severity: critical
Justification: breaks the whole system

Dear Maintainer,

After upgrading the R610 server from Debian 7 to Debian 8, the dom0 becomes 
unresponsive via ssh after an hour or so, although the domUs still remain 
accessible.

Initially we thought it might be a disk space issue on / or /boot, so we
increased those partition sizes, but that had no effect.

We get the following trace in /var/log/syslog:

May 26 09:18:59 servername kernel: [31526.937788] BUG: unable to handle kernel paging request at c90013a4b158
May 26 09:18:59 servername kernel: [31526.937798] IP: [] xenvif_get_ethtool_stats+0x50/0x80 [xen_netback]
May 26 09:18:59 servername kernel: [31526.937807] PGD b243c067 PUD b243d067 PMD 8a56c067 PTE 0
May 26 09:18:59 servername kernel: [31526.937813] Oops:  [#1] SMP
May 26 09:18:59 servername kernel: [31526.937817] Modules linked in: dm_snapshot dm_bufio binfmt_misc xt_tcpudp xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge stp llc nls_utf8 nls_cp437 vfat fat joydev intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel ttm evdev aesni_intel ipmi_devintf iTCO_wdt iTCO_vendor_support aes_x86_64 drm_kms_helper acpi_power_meter dcdbas lrw gf128mul glue_helper tpm_tis tpm drm i2c_algo_bit ablk_helper processor i2c_core lpc_ich ipmi_si ipmi_msghandler i7core_edac thermal_sys cryptd mfd_core button psmouse pcspkr serio_raw shpchp wmi edac_core loop autofs4 ext4 crc16 mbcache jbd2 dm_mod hid_generic usbhid hid sg sr_mod cdrom ses sd_mod enclosure ata_generic crc32c_intel lpfc crc_t10dif crct10dif_generic ehci_pci uhci_hcd crct10dif_pclmul ata_piix ehci_hcd scsi_transport_fc libata megaraid_sas scsi_tgt usbcore scsi_mod usb_common crct10dif_common bnx2
May 26 09:18:59 servername kernel: [31526.937917] CPU: 0 PID: 1311 Comm: snmpd Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt9-3~deb8u1
May 26 09:18:59 servername kernel: [31526.937922] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.4.0 07/23/2013
May 26 09:18:59 servername kernel: [31526.937927] task: 88008a86a250 ti: 880002b4c000 task.ti: 880002b4c000
May 26 09:18:59 servername kernel: [31526.937931] RIP: e030:[]  [] xenvif_get_ethtool_stats+0x50/0x80 [xen_netback]
May 26 09:18:59 servername kernel: [31526.937939] RSP: e02b:880002b4fd70  EFLAGS: 00010283
May 26 09:18:59 servername kernel: [31526.937942] RAX: c90013a14f38 RBX: 0230f940 RCX: 92008ea28c88
May 26 09:18:59 servername kernel: [31526.937946] RDX: 88008ecadc00 RSI: c90013a4b190 RDI: 88008da7c000
May 26 09:18:59 servername kernel: [31526.937949] RBP: 880002b4fe10 R08: a06827e0 R09: 0006
May 26 09:18:59 servername kernel: [31526.937953] R10: 0010ebb8 R11: 0246 R12: 0005
May 26 09:18:59 servername kernel: [31526.937957] R13: 88008da7c000 R14: a0682640 R15: 88008ecadc00
May 26 09:18:59 servername kernel: [31526.937965] FS:  7f93bcc9e700() GS:8800b2a0() knlGS:
May 26 09:18:59 servername kernel: [31526.937969] CS:  e033 DS:  ES:  CR0: 8005003b
May 26 09:18:59 servername kernel: [31526.937973] CR2: c90013a4b158 CR3: 899ff000 CR4: 2660
May 26 09:18:59 servername kernel: [31526.937977] Stack:
May 26 09:18:59 servername kernel: [31526.937979]  814225f1 000400114813 7fff3fff32a8
May 26 09:18:59 servername kernel: [31526.937985]  880002b4ff18 001d3fff32a0 880002b4fde0 814039a6
May 26 09:18:59 servername kernel: [31526.937990]  0005001d 8805 81420455 7fff3fff3280
May 26 09:18:59 servername kernel: [31526.937995] Call Trace:
May 26 09:18:59 servername kernel: [31526.938003]  [] ? dev_ethtool+0x921/0x1ac0
May 26 09:18:59 servername kernel: [31526.938009]  [] ? ___sys_recvmsg+0x136/0x2a0
May 26 09:18:59 servername kernel: [31526.938014]  [] ? netdev_run_todo+0x55/0x2f0
May 26 09:18:59 servername kernel: [31526.938020]  [] ? dev_ioctl+0x19f/0x590
May 26 09:18:59 servername kernel: [31526.938026]  [] ? kfree+0x118/0x220
May 26 09:18:59 servername kernel: [31526.938033]  [] ? fsnotify_clear_marks_by_inode+0x2a/0x110
May 26 09:18:59 servername kernel: [31526.938038]  [] ? sock_do_ioctl+0x3d/0x50
May 26 09:18:59 servername kernel: [31526.938043]  [] ? sock_ioctl+0x1e8/0x2c0
May 26 09:18:59 servername kernel: [31526.938048]  [] ? do_vfs_ioctl+0x2cf/0x4b0
May 26 09:18:59 servername kernel: [31526.938054]  [] ? task_work_run+0x9c/0xd0
May 26 09:18:59 servername kernel: [31526.938059]  [] ? SyS_ioctl+0x81/0xa0
May 26 09:18:59 servername kernel: [31526.938065]  [] ? int_sign

Bug#786936: [Pkg-xen-devel] Bug#786936: xen-hypervisor-4.4-amd64: Upgrade dom0 from wheezy to jessie on Dell R610 results in dom0 unaccessible with xen_netback issue

2015-05-30 Thread Ian Campbell
Control: reassign -1 linux-image-3.16.0-4-amd64 3.16.7-ckt9-3~deb8u1

On Wed, 2015-05-27 at 08:44 +1000, Andrew Perry wrote:
> Package: xen-hypervisor-4.4-amd64
> Version: 4.4.1-9
> Severity: critical
> Justification: breaks the whole system
> 
> Dear Maintainer,
> 
> After upgrading the R610 server from Debian 7 to Debian 8, the dom0
> becomes unresponsive via ssh after an hour or so, although the domUs
> still remain accessible.
> 
> Initially we thought it might be a disk space issue on / or /boot, so
> we increased those partition sizes, but that had no effect.
> 
> We get the following trace in /var/log/syslog:
> 
> May 26 09:18:59 servername kernel: [31526.937788] BUG: unable to handle kernel paging request at c90013a4b158
> May 26 09:18:59 servername kernel: [31526.937798] IP: [] xenvif_get_ethtool_stats+0x50/0x80 [xen_netback]

This appears to be a dom0 kernel issue rather than a hypervisor issue,
so I've (hopefully) reassigned it accordingly.

While we work out a proper fix: the error appears to be in the ethtool
stats gathering code, so I suspect a workaround would be to disable
whichever process in dom0 is calling this path (a monitoring daemon like
nagios, perhaps?).
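
To help work out which process is hitting this path, the oops itself
can be mined: the "Comm:" field names the task that was running, and the
"IP:" line names the faulting function and module. Below is a rough
sketch in Python, not a definitive tool. The function name
summarize_oops and both regular expressions are my own illustrations,
and they assume the oops layout seen in the trace in this report; other
kernel versions may print these lines differently.

```python
import re

# Hypothetical patterns, written against the oops layout quoted in this bug.
# Process line, e.g. "CPU: 0 PID: 1311 Comm: snmpd Not tainted ..."
COMM_RE = re.compile(r"PID:\s*(\d+)\s+Comm:\s*(\S+)")
# Symbol line, e.g. "xenvif_get_ethtool_stats+0x50/0x80 [xen_netback]"
SYMBOL_RE = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")
# Leading "> " markers added when a mail client quotes the log.
QUOTE_RE = re.compile(r"^\s*(?:>\s?)+")


def summarize_oops(log_text):
    """Return (pid, comm, function, module) for the first oops found.

    Any field that cannot be located is returned as None.
    """
    # Drop mail quote markers, then collapse all whitespace so that log
    # lines wrapped by a mail client are matched as if they were unwrapped.
    lines = [QUOTE_RE.sub("", line) for line in log_text.splitlines()]
    flat = " ".join(" ".join(lines).split())

    pid = comm = function = module = None

    match = COMM_RE.search(flat)
    if match:
        pid, comm = int(match.group(1)), match.group(2)

    # Only look for the symbol after the first "IP:" marker.
    ip_pos = flat.find("IP:")
    if ip_pos != -1:
        symbol = SYMBOL_RE.search(flat, ip_pos)
        if symbol:
            function, module = symbol.group(1), symbol.group(2)

    return pid, comm, function, module


SAMPLE = """\
May 26 09:18:59 servername kernel: [31526.937798] IP: []
xenvif_get_ethtool_stats+0x50/0x80 [xen_netback]
May 26 09:18:59 servername kernel: [31526.937917] CPU: 0 PID: 1311 Comm: snmpd
Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt9-3~deb8u1
"""

if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

Run against the two lines excerpted from the trace in this report, this
returns PID 1311, Comm snmpd, function xenvif_get_ethtool_stats and
module xen_netback, which suggests snmpd may be the caller worth
disabling first when testing the workaround.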
