On 06/03/2019 08:23 PM, Stewart Smith wrote:
> On my two socket POWER9 system (powernv) with 842 zswap set up, I
> recently got a crash with the Ubuntu kernel (I haven't tried with
> upstream, and this is the first time the system has died like this, so
> I'm not sure how repeatable it is).
>
> [ 2.891463] zswap: loaded using pool 842-nx/zbud
> ...
> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000
> us, giving up : 00 00 00 00 00000000
> [16868.932913] Unable to handle kernel paging request for data at address
> 0x6655f67da816cdb8
> [16868.933726] Faulting instruction address: 0xc000000000391600
>
>
> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c000001c9d98b9a0]
> pc: c000000000391600: kmem_cache_alloc+0x2e0/0x340
> lr: c0000000003915ec: kmem_cache_alloc+0x2cc/0x340
> sp: c000001c9d98bc20
> msr: 900000000280b033
> dar: 6655f67da816cdb8
> current = 0xc000001ad43cb400
> paca = 0xc00000000fac7800 softe: 0 irq_happened: 0x01
> pid = 8319, comm = make
> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0
> (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu
> 4.15.0-50.54-generic 4.15.18)
>
> 68:mon> t
> [c000001c9d98bc20] c0000000003914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
> [c000001c9d98bc80] c0000000003b1e14 __khugepaged_enter+0x54/0x220
> [c000001c9d98bcc0] c00000000010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
> [c000001c9d98bda0] c00000000010fe4c _do_fork+0xec/0x510
> [c000001c9d98be30] c00000000000b584 ppc_clone+0x8/0xc
> --- Exception: c00 (System Call) at 00007afe9daf87f4
> SP (7fffca606880) is in userspace
>
> So, it looks like there could be a problem in the error path, plausibly
> fixed by this patch:
>
> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> Author: Haren Myneni <ha...@linux.vnet.ibm.com>
> Date: Wed Jun 13 00:32:40 2018 -0700
>
>     crypto/nx: Initialize 842 high and normal RxFIFO control registers
>
>     NX increments readOffset by FIFO size in receive FIFO control register
>     when CRB is read. But the index in RxFIFO has to match with the
>     corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>     may be processing incorrect CRBs and can cause CRB timeout.
>
>     VAS FIFO offset is 0 when the receive window is opened during
>     initialization. When the module is reloaded or in kexec boot, readOffset
>     in FIFO control register may not match with VAS entry. This patch adds
>     nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>     control register for both high and normal FIFOs.
>
>     Signed-off-by: Haren Myneni <ha...@us.ibm.com>
>     [mpe: Fixup uninitialized variable warning]
>     Signed-off-by: Michael Ellerman <m...@ellerman.id.au>
>
> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
> v4.19-rc1~24^2~50
>
>
> Which was never backported to any stable release, so probably needs to
> be for v4.14 through v4.18. Notably, Ubuntu is on v4.15 and it doesn't
> seem to have picked up the patch. I'm opening an Ubuntu bug for this.
>
> Haren, is this something you can drive through the stable process
> (assuming my above crash looks like this failure)?
>
Thanks Stewart. I missed this for the stable releases and will work on it. It was merged into the Ubuntu 18.04.x kernel recently and will be in the next update.

This commit is also needed:

commit 6e708000ec2c93c2bde6a46aa2d6c3e80d4eaeb9
Author: Haren Myneni <ha...@linux.vnet.ibm.com>
Date: Wed Jun 13 00:28:57 2018 -0700

    powerpc/powernv: Export opal_check_token symbol

    Export opal_check_token symbol for modules to check the availability of
    OPAL calls before using them.

    Signed-off-by: Haren Myneni <ha...@us.ibm.com>
    Signed-off-by: Michael Ellerman <m...@ellerman.id.au>