Bug#929359: linux: instability on arm64 MP30-AR1 servers

2019-06-15 Thread Aurelien Jarno
On 2019-05-23 16:52, Julien Cristau wrote:
> Control: found -1 4.19.28-2
> 
> On Wed, May 22, 2019 at 11:58:15 +0200, Julien Cristau wrote:
> 
> > Source: linux
> > Version: 4.9.168-1
> > Severity: important
> > X-Debbugs-Cc: debian-...@lists.debian.org, debian-ad...@lists.debian.org
> > User: debian-ad...@lists.debian.org
> > Usertags: needed-by-DSA-Team
> > 
> > Hi,
> > 
> > ever since the 9.9 point release conova-node01.debian.org and
> > conova-node02.debian.org have been unstable.  They run for an hour or
> > three, and then things go bad.  Rebooting back to 4.9.144-3.1 makes them
> > stable again.
> > 
> Still happening after upgrading to the stretch-backports kernel:
> 

The problem appears to be related to openvswitch: after switching the
ganeti cluster from openvswitch to bridge mode, both machines are stable
again.
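
To check on a given node whether the openvswitch module is still loaded
after such a switch, something like the following could be used. This is
only a minimal sketch: it assumes a Linux host with a readable
/proc/modules, and the helper names (loaded_modules, openvswitch_loaded)
are made up for illustration.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each non-empty line starts with the module name, followed by size,
    reference count, dependants, state and load address.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def openvswitch_loaded(path="/proc/modules"):
    """Hypothetical helper: True if 'openvswitch' is listed in `path`."""
    return "openvswitch" in loaded_modules(Path(path).read_text())
```

A loaded module is not necessarily in use, so this only tells whether the
code is present in the running kernel, not whether any traffic goes
through it.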

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



2019-05-23 Thread Julien Cristau
Control: found -1 4.19.28-2

On Wed, May 22, 2019 at 11:58:15 +0200, Julien Cristau wrote:

> Source: linux
> Version: 4.9.168-1
> Severity: important
> X-Debbugs-Cc: debian-...@lists.debian.org, debian-ad...@lists.debian.org
> User: debian-ad...@lists.debian.org
> Usertags: needed-by-DSA-Team
> 
> Hi,
> 
> ever since the 9.9 point release conova-node01.debian.org and
> conova-node02.debian.org have been unstable.  They run for an hour or
> three, and then things go bad.  Rebooting back to 4.9.144-3.1 makes them
> stable again.
> 
Still happening after upgrading to the stretch-backports kernel:

[87461.376828] Bad mode in FIQ handler detected on CPU0, code 0x5600 -- SVC (AArch64)
[87461.376834] Internal error: Oops - bad mode: 0 [#1] SMP
[87461.389907] Modules linked in: openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat binfmt_misc nls_ascii nls_cp437 vfat fat dm_mod ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 nfnetlink_log nfnetlink xt_NFLOG xt_tcpudp xt_hashlimit xt_multiport xt_conntrack efi_pstore nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ast ttm drm_kms_helper drm xgene_hwmon i2c_algo_bit xgene_edac xgene_dma joydev evdev chaoskey sg xgene_rng mailbox_xgene_slimpro rng_core ipmi_ssif ipmi_devintf ipmi_msghandler efivars tun drbd lru_cache efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq crc32c_generic libcrc32c raid0 multipath linear raid1
[87461.460161]  md_mod sd_mod ahci_xgene libahci_platform libahci xhci_plat_hcd xgene_enet libata xhci_hcd i2c_xgene_slimpro marvell usbcore phy_xgene scsi_mod sdhci_of_arasan mdio_xgene sdhci_pltfm of_mdio cqhci fixed_phy sdhci libphy usb_common gpio_xgene_sb
[87461.482839] CPU: 0 PID: 1557 Comm: ovsdb-server Not tainted 4.19.0-0.bpo.4-arm64 #1 Debian 4.19.28-2~bpo9+1
[87461.492528] Hardware name: GIGABYTE R120-P31/MP30-AR1, BIOS D7b 08/26/2016
[87461.499367] pstate:  (nzcv daif -PAN -UAO)
[87461.504132] pc : 897e2910
[87461.507427] lr : 897e2918
[87461.510722] sp : e32d4440
[87461.514016] x29: e32d4440 x28: 015a 
[87461.519301] x27: 89928c20 x26:  
[87461.524586] x25: e32d44f8 x24: e32d4528 
[87461.529870] x23: 015a x22: 0090 
[87461.535154] x21: d73fd286 x20: 0001 
[87461.540439] x19: d743b560 x18: 0024 
[87461.545723] x17: 897d7fc0 x16: 899238e0 
[87461.551007] x15: 089e8439a422 x14: 0001 
[87461.556291] x13: 5ce6a4fa x12: 0018 
[87461.561576] x11: 26295eb7 x10: 000155a6 
[87461.566860] x9 : d741f300 x8 :  
[87461.572144] x7 : 0010 x6 :  
[87461.577429] x5 : e32d42b8 x4 : d73f0410 
[87461.582713] x3 : d743b568 x2 : d7403a20 
[87461.587997] x1 : 0001 x0 : d743b560 
[87461.593283] Process ovsdb-server (pid: 1557, stack limit = 0x03b97138)
[87461.600468] ---[ end trace 2ab4838ec3817e8e ]---
[87461.606271] Bad mode in FIQ handler detected on CPU0, code 0x5600 -- SVC (AArch64)
[87482.616230] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[87482.622133] rcu: 0-...0: (1 GPs behind) idle=9a6/1/0x4000 softirq=1153372/1153372 fqs=2456
[87482.631564] rcu: (detected by 4, t=5255 jiffies, g=6202645, q=14630)
[87482.637973] Task dump for CPU 0:
[87482.641182] ovsdb-server    R  running task        0  1557   1556 0x0002
[87482.648197] Call trace:
[87482.650636]  __switch_to+0x8c/0xd0
[87482.654018](null)
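
When going through several such traces, the task that was running at the
time of each oops can be pulled out of the "CPU: ... PID: ... Comm: ..."
line. A minimal sketch in Python follows; the function name oops_tasks is
made up for illustration, and it assumes the command name contains no
whitespace.

```python
import re

# Matches the per-oops summary line, e.g.
# "CPU: 0 PID: 1557 Comm: ovsdb-server Not tainted ..."
CPU_LINE = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+)"
)


def oops_tasks(log_text):
    """Return a list of (cpu, pid, comm) tuples, one per matching line."""
    return [
        (int(m.group("cpu")), int(m.group("pid")), m.group("comm"))
        for m in CPU_LINE.finditer(log_text)
    ]
```

Only the first part of the line is matched, so it does not matter whether
the mail client wrapped the line before the kernel version string.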

Cheers,
Julien



2019-05-22 Thread Julien Cristau
Source: linux
Version: 4.9.168-1
Severity: important
X-Debbugs-Cc: debian-...@lists.debian.org, debian-ad...@lists.debian.org
User: debian-ad...@lists.debian.org
Usertags: needed-by-DSA-Team

Hi,

ever since the 9.9 point release conova-node01.debian.org and
conova-node02.debian.org have been unstable.  They run for an hour or
three, and then things go bad.  Rebooting back to 4.9.144-3.1 makes them
stable again.

Latest example:

May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: PingAck did not arrive in time.
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: new current UUID 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: ack_receiver terminated
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Terminating drbd_a_resource
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Connection closed
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: conn( NetworkFailure -> Unconnected )
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: receiver terminated
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Restarting receiver thread
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: receiver (re)started
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: conn( Unconnected -> WFConnection )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Handshake successful: Agreed network protocol version 101
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Peer authenticated using 16 bytes HMAC
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: conn( WFConnection -> WFReportParams )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Starting ack_recv thread (from drbd_r_resource [8449])
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: drbd_sync_handshake:
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: self 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5 bits:4 flags:0
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: peer 0BEBDA613EA56FD6::D5BF70E0AA6560C4:D5BE70E0AA6560C5 bits:0 flags:0
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: uuid_compare()=1 by rule 70
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3 exit code 0 (0x0)
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: Began resync as SyncSource (will sync 16 KB [4 bits set]).
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: updated sync UUID 3EA2D1FA6B3ACD47:0BECDA613EA56FD7:0BEBDA613EA56FD7:D5BF70E0AA6560C5
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: Resync done (total 1 sec; paused 0 sec; 16 K/sec)
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: updated UUIDs 3EA2D1FA6B3ACD47::0BECDA613EA56FD7:0BEBDA613EA56FD7
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
May 22
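
The DRBD messages in the log above report state changes in the form
"name( Old -> New )", for example "conn( Connected -> NetworkFailure )".
To get an overview of how often a node drops its connection, these can be
extracted with a small script. A minimal sketch in Python follows; the
function name drbd_transitions is made up for illustration, and the
pattern is based only on the lines shown in this report.

```python
import re

# One state change, e.g. "conn( Connected -> NetworkFailure )".
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def drbd_transitions(log_text):
    """Return (field, old_state, new_state) tuples in order of appearance."""
    return TRANSITION.findall(log_text)


def network_failures(log_text):
    """Count transitions of the 'conn' field into 'NetworkFailure'."""
    return sum(
        1
        for field, _old, new in drbd_transitions(log_text)
        if field == "conn" and new == "NetworkFailure"
    )
```

Messages such as "[Bytes(packets)]" or "(re)started" contain parentheses
but no "->", so they are not picked up.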