I attempted to bisect this, using the following process:
 - Run the kernel-build-reboot-loop test on 3 machines in parallel.
   I used 2 CRB1S systems (anuchin, bestovius) and 1 R120-T33 (seidel).
 - If any machine crashes with the parity error message, consider it failed.
 - If all machines survive overnight, consider it "OK".

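The pass/fail rule above can be sketched in Python. This is only an illustration of the decision logic, not code from kernel-build-reboot-loop: the function names and the idea of passing each machine's console log as a string are assumptions made for the example. The error string is the one from the bug description.

```python
from typing import Dict, List

# Message taken from the oops in the bug description.
PARITY_ERROR = "Synchronous External Abort: synchronous parity or ECC error"


def failed_machines(console_logs: Dict[str, str]) -> List[str]:
    """Return the (sorted) names of machines whose console log shows the error.

    console_logs maps a machine name to its captured console output.
    How the output is captured is outside the scope of this sketch.
    """
    return sorted(
        name for name, log in console_logs.items() if PARITY_ERROR in log
    )


def step_verdict(console_logs: Dict[str, str]) -> str:
    """Apply the rule: any machine failing fails the step, else it is OK."""
    return "failed" if failed_machines(console_logs) else "OK"


if __name__ == "__main__":
    # Hypothetical overnight results for the three machines.
    logs = {
        "anuchin": "[ 282.360376] " + PARITY_ERROR + " (0x96000018)",
        "bestovius": "build loop still running",
        "seidel": "build loop still running",
    }
    print(step_verdict(logs), failed_machines(logs))
```

Note that a single flaky machine decides the verdict under this rule, so one system with a hardware problem can steer the whole bisect.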
Unfortunately, the commit it landed on looks bogus:

# first bad commit: [852643165aea0999bb862b36511c5b9f6b11449f] fs/binfmt_elf.c: move variables initialization closer to their usage

(Reverse bisect - this would in theory be the commit that *fixed* it)

Just in case, I tried reverting that commit from 5.5-rc6. As noted in comment #2, 5.5-rc6 seems immune to this problem. Reverting the commit didn't change that - 5.5-rc6 still survived overnight.

Note: Of the 3 systems, anuchin was usually the one that failed during the bisect. It could be that this is a generic hw issue, and anuchin is just more severely impacted than the others. It could also be that this symptom can be caused by both a sw and a hw issue, and anuchin is impacted by the hw part, making it a bad choice for a bisect. Either way, bisection seems like a poor strategy for identifying the issue.

** Attachment added: "bisect.log"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1860013/+attachment/5323904/+files/bisect.log

You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1860013

Title:
  [thunderx] Synchronous External Abort: synchronous parity or ECC error

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Disco:
  Triaged
Status in linux source package in Eoan:
  Triaged
Status in linux source package in Focal:
  Triaged

Bug description:
  [Impact]
  Under load, ThunderX systems eventually fail with:

  [ 282.360376] Synchronous External Abort: synchronous parity or ECC error (0x96000018) at 0x0000ffffa6eb7000
  [ 282.372351] Internal error: : 96000018 [#1] SMP
  [ 282.379152] Modules linked in: nls_iso8859_1 thunderx_edac thunderx_zip shpchp cavium_rng_vf cavium_rng gpio_keys uio_pdrv_genirq uio ipmi_ssif ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nicvf nicpf uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt aes_ce_blk fb_sys_fops aes_ce_cipher drm crc32_ce crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce ahci libahci thunder_bgx thunder_xcv i2c_thunderx mdio_thunder thunderx_mmc mdio_cavium aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
  [ 282.467284] Process cc1 (pid: 39700, stack limit = 0x00000000e0c44146)
  [ 282.477172] CPU: 25 PID: 39700 Comm: cc1 Not tainted 4.15.0-75-generic #85+lp1857074.1
  [ 282.488379] Hardware name: Cavium ThunderX CRB/To be filled by O.E.M., BIOS 5.11 12/12/2012
  [ 282.500121] pstate: 80000005 (Nzcv daif -PAN -UAO)
  [ 282.508297] pc : __arch_copy_to_user+0x13c/0x248
  [ 282.516430] lr : cp_new_stat+0x140/0x178
  [ 282.523768] sp : ffff00002e4d3d40
  [ 282.530369] x29: ffff00002e4d3d40 x28: ffff801f51fa2d00
  [ 282.538988] x27: ffff000008b52000 x26: 0000000000000050
  [ 282.548031] x25: 0000000000000124 x24: 0000000000000015
  [ 282.556872] x23: 0000000000000000 x22: 000000002e4d3d88
  [ 282.565449] x21: ffff801f51fa2d00 x20: ffff000009588000
  [ 282.574109] x19: ffff00002e4d3e30 x18: 0000ffffa87e7a70
  [ 282.582790] x17: 0000ffffa8756110 x16: ffff0000082f4448
  [ 282.591433] x15: 0000000000000000 x14: 0000000000000012
  [ 282.599986] x13: 00682e6c746e6366 x12: 2f78756e696c2f69
  [ 282.608730] x11: 0000000000000000 x10: 0000000000000cf0
  [ 282.617283] x9 : 0000000000001000 x8 : 00000001000081a4
  [ 282.625839] x7 : 0000000001001a2b x6 : 000000002e4d3da0
  [ 282.634238] x5 : 000000002e4d3e08 x4 : 0000000000000008
  [ 282.642754] x3 : 0000000000000802 x2 : fffffffffffffff8
  [ 282.651250] x1 : ffff00002e4d3d90 x0 : 000000002e4d3d88
  [ 282.660013] Call trace:
  [ 282.665421] __arch_copy_to_user+0x13c/0x248
  [ 282.672979] SyS_newfstat+0x58/0x88
  [ 282.679272] el0_svc_naked+0x30/0x34
  [ 282.685605] Code: a8c12027 a88120c7 d503201f d503201f (a8c12829)
  [ 282.694411] ---[ end trace 863693cf0c3fd297 ]---

  [Test Case]
  We found this by doing a reboot/kernel build loop. (The reboot may be unnecessary.) Code to automate this setup is at:
  https://code.launchpad.net/~dannf/+git/kernel-build-reboot-loop

  [Fix]

  [Regression Risk]