Bug#929359: linux: instability on arm64 MP30-AR1 servers
On 2019-05-23 16:52, Julien Cristau wrote:
> Control: found -1 4.19.28-2
>
> On Wed, May 22, 2019 at 11:58:15 +0200, Julien Cristau wrote:
>
> > Source: linux
> > Version: 4.9.168-1
> > Severity: important
> > X-Debbugs-Cc: debian-...@lists.debian.org, debian-ad...@lists.debian.org
> > User: debian-ad...@lists.debian.org
> > Usertags: needed-by-DSA-Team
> >
> > Hi,
> >
> > ever since the 9.9 point release conova-node01.debian.org and
> > conova-node02.debian.org have been unstable. They run for an hour or
> > three, and then things go bad. Rebooting back to 4.9.144-3.1 makes them
> > stable again.
> >
> Still happening after upgrading to the stretch-backports kernel:
>

The problem is somehow related to openvswitch. After switching the
ganeti cluster from openvswitch to bridge mode, both machines run stable
again.

--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
Bug#929359: linux: instability on arm64 MP30-AR1 servers
Control: found -1 4.19.28-2

On Wed, May 22, 2019 at 11:58:15 +0200, Julien Cristau wrote:
> Source: linux
> Version: 4.9.168-1
> Severity: important
> X-Debbugs-Cc: debian-...@lists.debian.org, debian-ad...@lists.debian.org
> User: debian-ad...@lists.debian.org
> Usertags: needed-by-DSA-Team
>
> Hi,
>
> ever since the 9.9 point release conova-node01.debian.org and
> conova-node02.debian.org have been unstable. They run for an hour or
> three, and then things go bad. Rebooting back to 4.9.144-3.1 makes them
> stable again.
>
Still happening after upgrading to the stretch-backports kernel:

[87461.376828] Bad mode in FIQ handler detected on CPU0, code 0x5600 -- SVC (AArch64)
[87461.376834] Internal error: Oops - bad mode: 0 [#1] SMP
[87461.389907] Modules linked in: openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat binfmt_misc nls_ascii nls_cp437 vfat fat dm_mod ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 nfnetlink_log nfnetlink xt_NFLOG xt_tcpudp xt_hashlimit xt_multiport xt_conntrack efi_pstore nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ast ttm drm_kms_helper drm xgene_hwmon i2c_algo_bit xgene_edac xgene_dma joydev evdev chaoskey sg xgene_rng mailbox_xgene_slimpro rng_core ipmi_ssif ipmi_devintf ipmi_msghandler efivars tun drbd lru_cache efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic usbhid hid xor raid6_pq crc32c_generic libcrc32c raid0 multipath linear raid1
[87461.460161] md_mod sd_mod ahci_xgene libahci_platform libahci xhci_plat_hcd xgene_enet libata xhci_hcd i2c_xgene_slimpro marvell usbcore phy_xgene scsi_mod sdhci_of_arasan mdio_xgene sdhci_pltfm of_mdio cqhci fixed_phy sdhci libphy usb_common gpio_xgene_sb
[87461.482839] CPU: 0 PID: 1557 Comm: ovsdb-server Not tainted 4.19.0-0.bpo.4-arm64 #1 Debian 4.19.28-2~bpo9+1
[87461.492528] Hardware name: GIGABYTE R120-P31/MP30-AR1, BIOS D7b 08/26/2016
[87461.499367] pstate: (nzcv daif -PAN -UAO)
[87461.504132] pc : 897e2910
[87461.507427] lr : 897e2918
[87461.510722] sp : e32d4440
[87461.514016] x29: e32d4440 x28: 015a
[87461.519301] x27: 89928c20 x26:
[87461.524586] x25: e32d44f8 x24: e32d4528
[87461.529870] x23: 015a x22: 0090
[87461.535154] x21: d73fd286 x20: 0001
[87461.540439] x19: d743b560 x18: 0024
[87461.545723] x17: 897d7fc0 x16: 899238e0
[87461.551007] x15: 089e8439a422 x14: 0001
[87461.556291] x13: 5ce6a4fa x12: 0018
[87461.561576] x11: 26295eb7 x10: 000155a6
[87461.566860] x9 : d741f300 x8 :
[87461.572144] x7 : 0010 x6 :
[87461.577429] x5 : e32d42b8 x4 : d73f0410
[87461.582713] x3 : d743b568 x2 : d7403a20
[87461.587997] x1 : 0001 x0 : d743b560
[87461.593283] Process ovsdb-server (pid: 1557, stack limit = 0x03b97138)
[87461.600468] ---[ end trace 2ab4838ec3817e8e ]---
[87461.606271] Bad mode in FIQ handler detected on CPU0, code 0x5600 -- SVC (AArch64)
[87482.616230] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[87482.622133] rcu: 0-...0: (1 GPs behind) idle=9a6/1/0x4000 softirq=1153372/1153372 fqs=2456
[87482.631564] rcu: (detected by 4, t=5255 jiffies, g=6202645, q=14630)
[87482.637973] Task dump for CPU 0:
[87482.641182] ovsdb-server R running task 0 1557 1556 0x0002
[87482.648197] Call trace:
[87482.650636] __switch_to+0x8c/0xd0
[87482.654018] (null)

Cheers,
Julien
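
For anyone triaging similar logs, here is a minimal Python sketch (not part of the original report) that scans kernel log lines for the two signatures quoted above: the "Bad mode in ... handler" oops and the rcu_sched stall notice. The function name `scan_kernel_log` and the tuple layout it returns are hypothetical, chosen only for illustration; the regular expressions are derived solely from the lines shown in this message and may not match other kernel versions.

```python
import re

# Patterns derived from the log lines quoted in this report (an assumption:
# other kernels may word these messages differently).
OOPS_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+Bad mode in (\w+) handler detected on CPU(\d+)"
)
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+rcu: INFO: rcu_sched detected stalls"
)


def scan_kernel_log(lines):
    """Return (timestamp, kind, detail) tuples for lines matching either pattern."""
    events = []
    for line in lines:
        m = OOPS_RE.search(line)
        if m:
            detail = f"{m.group(2)} on CPU{m.group(3)}"
            events.append((float(m.group(1)), "bad_mode", detail))
            continue
        m = STALL_RE.search(line)
        if m:
            events.append((float(m.group(1)), "rcu_stall", ""))
    return events


# Sample input copied from the log above.
sample = [
    "[87461.376828] Bad mode in FIQ handler detected on CPU0, code 0x5600 -- SVC (AArch64)",
    "[87461.376834] Internal error: Oops - bad mode: 0 [#1] SMP",
    "[87482.616230] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:",
]
events = scan_kernel_log(sample)
for ts, kind, detail in events:
    print(ts, kind, detail)
```

In practice the same function could be fed the lines of a saved `dmesg` or console capture, so that uptime-to-failure can be compared across kernel versions.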
Bug#929359: linux: instability on arm64 MP30-AR1 servers
Source: linux
Version: 4.9.168-1
Severity: important
X-Debbugs-Cc: debian-...@lists.debian.org, debian-ad...@lists.debian.org
User: debian-ad...@lists.debian.org
Usertags: needed-by-DSA-Team

Hi,

ever since the 9.9 point release conova-node01.debian.org and
conova-node02.debian.org have been unstable. They run for an hour or
three, and then things go bad. Rebooting back to 4.9.144-3.1 makes them
stable again.

Latest example:

May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: PingAck did not arrive in time.
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: new current UUID 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: ack_receiver terminated
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Terminating drbd_a_resource
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Connection closed
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: conn( NetworkFailure -> Unconnected )
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: receiver terminated
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Restarting receiver thread
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: receiver (re)started
May 22 04:17:37 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: conn( Unconnected -> WFConnection )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Handshake successful: Agreed network protocol version 101
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Peer authenticated using 16 bytes HMAC
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: conn( WFConnection -> WFReportParams )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: drbd resource3: Starting ack_recv thread (from drbd_r_resource [8449])
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: drbd_sync_handshake:
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: self 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5 bits:4 flags:0
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: peer 0BEBDA613EA56FD6::D5BF70E0AA6560C4:D5BE70E0AA6560C5 bits:0 flags:0
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: uuid_compare()=1 by rule 70
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3 exit code 0 (0x0)
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: Began resync as SyncSource (will sync 16 KB [4 bits set]).
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: updated sync UUID 3EA2D1FA6B3ACD47:0BECDA613EA56FD7:0BEBDA613EA56FD7:D5BF70E0AA6560C5
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: Resync done (total 1 sec; paused 0 sec; 16 K/sec)
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: updated UUIDs 3EA2D1FA6B3ACD47::0BECDA613EA56FD7:0BEBDA613EA56FD7
May 22 04:17:38 conova-node01/conova-node01/:::217.196.149.227 kernel: block drbd3: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
May 22