Hi Tejas,

Sorry for the delayed response, but we have been busy with the bring-up of custom hardware based on the ZynqMP.
We have recently upgraded to the 2018.1 release and are now using the Xilinx 2018.1-tagged Linux kernel with quite a few patches to try to make the UBI stress test run successfully on both NAND and QSPI-NOR. Most of these patches are backported from linux-xlnx and linux-mtd. You can find the .bbappend file and all the patches we use here: https://github.com/lundmar/xilinx-mtd-debug

We have spent quite a lot of time testing, fixing, and stabilizing UBI on NAND, and have done so successfully by fixing a bug in the Arasan NAND driver and also backporting some fixes to the UBI subsystem. We have communicated the NAND driver bug fix via IRC (#mtd) and I believe it has already been picked up by the upstream Arasan driver maintainer.

However, for QSPI-NOR we still see the same crash in the spi-zynqmp-gqspi driver when running the UBI stress test, and we can now reproduce it on two different hardware configurations: the ZCU102 development board (zu9eg, 128 MB QSPI-NOR: 2 x 25Q512A, no Micron logo, looks like Chinese copies) and our own custom board (zu6eg, 512 MB QSPI-NOR: 2 x MT25QL02GCBB).

To reiterate, this is the crash we see when running the UBI stress test on QSPI-NOR:

<-- cut -->
Running io_paral /dev/ubi0
[38054.190578] Unable to handle kernel paging request at virtual address ffffff800f1fa000
[38054.198422] Mem abort info:
[38054.201192] Exception class = DABT (current EL), IL = 32 bits
[38054.207094] SET = 0, FnV = 0
[38054.210130] EA = 0, S1PTW = 0
[38054.213255] Data abort info:
[38054.216119] ISV = 0, ISS = 0x00000047
[38054.219939] CM = 0, WnR = 1
[38054.222892] swapper pgtable: 4k pages, 39-bit VAs, pgd = ffffff800908e000
[38054.229663] [ffffff800f1fa000] *pgd=000000087fffe003, *pud=000000087fffe003, *pmd=000000087a1c0003, *pte=0000000000000000
[38054.240604] Internal error: Oops: 96000047 [#1] SMP
[38054.245459] Modules linked in:
[38054.248500] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-xilinx-v2018.1 #1
[38054.255703] Hardware name: ZynqMP ZCU102 Rev1.0 (DT)
[38054.260650] task: ffffff8008de1880 task.stack: ffffff8008dd0000
[38054.266558] PC is at __memcpy+0xac/0x180
[38054.270463] LR is at zynqmp_qspi_readrxfifo.constprop.5+0x98/0xc0
[38054.276535] pc : [<ffffff8008a238ac>] lr : [<ffffff8008635a68>] pstate: 800001c5
[38054.283913] sp : ffffff8008003e00
[38054.287209] x29: ffffff8008003e00 x28: ffffff8008de1880
[38054.292504] x27: 0000000000000001 x26: ffffff8008c169e0
[38054.297800] x25: ffffff8008e58688 x24: ffffffc87b09ba00
[38054.303094] x23: 0000000000000090 x22: ffffffc87a473000
[38054.308389] x21: 0000000000000001 x20: 0000000000000001
[38054.313684] x19: ffffffc87a4735c8 x18: 0000000000000007
[38054.318978] x17: 0000000000000001 x16: 0000000000000019
[38054.324273] x15: 0000000000000033 x14: 000000000000004c
[38054.329568] x13: 0000000000000068 x12: ffffff8008ae4c40
[38054.334863] x11: ffffffc87ff6a7c8 x10: 0000000000000040
[38054.340158] x9 : ffffff8008dec668 x8 : ffffffc87b400240
[38054.345453] x7 : ffffffc87b400268 x6 : ffffff800f1fa000
[38054.350747] x5 : ffffffc87b400240 x4 : 00000048771ab000
[38054.356042] x3 : 00000000000000ac x2 : 0000000000000001
[38054.361337] x1 : ffffff8008003e39 x0 : ffffff800f1fa000
[38054.366633] Process swapper/0 (pid: 0, stack limit = 0xffffff8008dd0000)
[38054.373316] Call trace:
[38054.375747] Exception stack(0xffffff8008003cc0 to 0xffffff8008003e00)
[38054.382172] 3cc0: ffffff800f1fa000 ffffff8008003e39 0000000000000001 00000000000000ac
[38054.389986] 3ce0: 00000048771ab000 ffffffc87b400240 ffffff800f1fa000 ffffffc87b400268
[38054.397798] 3d00: ffffffc87b400240 ffffff8008dec668 0000000000000040 ffffffc87ff6a7c8
[38054.405609] 3d20: ffffff8008ae4c40 0000000000000068 000000000000004c 0000000000000033
[38054.413421] 3d40: 0000000000000019 0000000000000001 0000000000000007 ffffffc87a4735c8
[38054.421233] 3d60: 0000000000000001 0000000000000001 ffffffc87a473000 0000000000000090
[38054.429046] 3d80: ffffffc87b09ba00 ffffff8008e58688 ffffff8008c169e0 0000000000000001
[38054.436857] 3da0: ffffff8008de1880 ffffff8008003e00 ffffff8008635a68 ffffff8008003e00
[38054.444669] 3dc0: ffffff8008a238ac 00000000800001c5 ffffff8009072000 ffffff8008e09b48
[38054.452482] 3de0: 0000008000000000 0000000001cbd426 ffffff8008003e00 ffffff8008a238ac
[38054.460294] [<ffffff8008a238ac>] __memcpy+0xac/0x180
[38054.465239] [<ffffff8008635c30>] zynqmp_qspi_irq+0x1a0/0x2b8
[38054.470884] [<ffffff80080dbebc>] __handle_irq_event_percpu+0x5c/0x148
[38054.477305] [<ffffff80080dbfc4>] handle_irq_event_percpu+0x1c/0x58
[38054.483467] [<ffffff80080dc044>] handle_irq_event+0x44/0x78
[38054.489023] [<ffffff80080dfe20>] handle_fasteoi_irq+0xa0/0x188
[38054.494838] [<ffffff80080dafcc>] generic_handle_irq+0x24/0x38
[38054.500567] [<ffffff80080db67c>] __handle_domain_irq+0x5c/0xb8
[38054.506383] [<ffffff8008081500>] gic_handle_irq+0x68/0xc0
[38054.511762] Exception stack(0xffffff8008dd3d80 to 0xffffff8008dd3ec0)
[38054.518187] 3d80: 0000000000000000 ffffffc87ff6ce80 00000048771ab000 0000200000bc4b14
[38054.526001] 3da0: 0000000000000015 00ffffffffffffff 0000033360f7a043 0000000000000013
[38054.533813] 3dc0: ffffffc87ff6be84 000000000000000a ffffffc87ff6be64 000000000000003c
[38054.541625] 3de0: 0000000000000000 000000000001133c 071c71c71c71c71c ffffff8008dd8000
[38054.549437] 3e00: 0000000000000001 ffffff8008a58630 ffffffc87ff6cee0 0000229c2e78a95a
[38054.557249] 3e20: 0000000000000000 ffffffc87ac4aa00 0000000000000000 ffffffc87ac1c400
[38054.565061] 3e40: 0000229c2e784e88 ffffffc87ac1c400 ffffff8008de1880 0000000000000400
[38054.572873] 3e60: 0000000000d50018 ffffff8008dd3ec0 ffffff80087754a0 ffffff8008dd3ec0
[38054.580685] 3e80: ffffff80087754a4 0000000060000145 ffffffc87ac1c400 0000000000000000
[38054.588497] 3ea0: ffffffffffffffff 0000000000000000 ffffff8008dd3ec0 ffffff80087754a4
[38054.596309] [<ffffff80080830f0>] el1_irq+0xb0/0x140
[38054.601170] [<ffffff80087754a4>] cpuidle_enter_state+0x154/0x200
[38054.607158] [<ffffff8008775588>] cpuidle_enter+0x18/0x20
[38054.612453] [<ffffff80080cfea4>] call_cpuidle+0x1c/0x40
[38054.617660] [<ffffff80080d00f4>] do_idle+0x1a4/0x1e0
[38054.622607] [<ffffff80080d02a0>] cpu_startup_entry+0x20/0x28
[38054.628250] [<ffffff8008a3752c>] rest_init+0xac/0xb8
[38054.633198] [<ffffff8008d50b78>] start_kernel+0x398/0x3ac
[38054.638580] Code: 78402423 780024c3 36000562 38401423 (380014c3)
[38054.644655] ---[ end trace b83027f2d72fed45 ]---
[38054.649254] Kernel panic - not syncing: Fatal exception in interrupt
[38054.655591] SMP: stopping secondary CPUs
[38054.659560] Kernel Offset: disabled
[38054.662967] CPU features: 0x002004
[38054.666351] Memory Limit: none
[38054.669392] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
<-- cut -->

For a full crash log including boot, see https://pastebin.com/raw/rzS8wYHE

Please notice that we are utilizing the full capacity of the chips (kernel command line: "mtdparts=spi0.0:-(qspi-all)", i.e. a single partition named qspi-all spanning the remainder, here the whole device). Also, we run the CPUs at 600 MHz, but we see the same crash at full speed (1.2 GHz), with and without preemption enabled.
Sometimes the UBI stress test succeeds, but often it crashes. When running the UBI stress test we have also seen SPI transfer timeouts, clearly caused by the system being highly loaded/stressed. To work around this we simply added patches that increase the SPI transfer and chip-select timeouts. However, the increased timeouts seem to make no difference to the specific crash we see.

Also, please notice that one of our patches reconfigures the device tree to run the QSPI-NOR at a safe 50 MHz, and that it adds "has-io-mode" to avoid using DMA, which is not compatible with UBI because UBI uses vmalloc buffers (a short illustration of this follows further below, after the crash analysis).

We have translated some of the relevant crash addresses involved:

Link register:

aarch64-gnu-linux-addr2line -e ./vmlinux ffffff8008635a68
/usr/src/kernel/drivers/spi/spi-zynqmp-gqspi.c:430

426 static void zynqmp_qspi_copy_read_data(struct zynqmp_qspi *xqspi,
427                                        ulong data, u8 size)
428 {
429         memcpy(xqspi->rxbuf, &data, size);
430         xqspi->rxbuf += size;
431         xqspi->bytes_to_receive -= size;
432 }

aarch64-gnu-linux-addr2line -e ./vmlinux ffffff8008635c30
/usr/src/kernel/drivers/spi/spi-zynqmp-gqspi.c:796

790         if (!(mask & GQSPI_IER_RXEMPTY_MASK) &&
791             (mask & GQSPI_IER_GENFIFOEMPTY_MASK)) {
792                 zynqmp_qspi_readrxfifo(xqspi, GQSPI_RX_FIFO_FILL);
793                 ret = IRQ_HANDLED;
794         }
795
796         if ((xqspi->bytes_to_receive == 0) && (xqspi->bytes_to_transfer == 0)
797             && ((status & GQSPI_IRQ_MASK) == GQSPI_IRQ_MASK)) {
798                 zynqmp_disable_intr(xqspi);
799                 xqspi->isinstr = false;
800                 spi_finalize_current_transfer(master);
801                 ret = IRQ_HANDLED;
802         }
803
804         return ret;
805 }

Decoding the AArch64 data abort information (ISS = 0x00000047 = 0b1000111: WnR, bit 6, is 1, i.e. a write access; DFSC, bits 5:0, is 0b000111, a translation fault at level 3) tells us that the data abort is a level 3 translation fault on a write operation. So it seems we are crashing in the memcpy() operation when it is writing 1 byte (x2) from source address ffffff8008003e39 (x1) to destination address ffffff800f1fa000 (x0), meaning that at some point the destination address held in xqspi->rxbuf is no longer valid for writing.

We still don't know what causes rxbuf to go haywire, but since it only seems to happen occasionally under stress testing, I suspect it might be a synchronization or buffer allocation issue. Unfortunately it is not easy to debug, because the test runs for hours before the crash occurs. Perhaps the maintainers of this driver can provide some insight that might narrow down the issue or make it easier to reproduce.
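To make our working theory concrete, here is a small, purely illustrative userspace sketch (our assumption, not the actual driver code): if the RX path is entered once more after bytes_to_receive has reached zero, for example via a stale or racing interrupt, the unclamped memcpy() in zynqmp_qspi_copy_read_data() keeps advancing rxbuf past the end of the caller's buffer until it eventually hits an unmapped page. The clamp shown is a hypothetical guard we have not verified, not a known fix:

/*
 * Illustrative sketch only (NOT the driver): models how an extra,
 * spurious RX FIFO read could walk rxbuf off the end of the buffer,
 * and what a defensive clamp might look like.
 */
#include <stdio.h>
#include <string.h>

struct fake_qspi {
        unsigned char *rxbuf;
        int bytes_to_receive;
};

/* Mirrors the shape of zynqmp_qspi_copy_read_data(), plus a clamp. */
static void copy_read_data_clamped(struct fake_qspi *xqspi,
                                   unsigned long data, unsigned char size)
{
        if (size > xqspi->bytes_to_receive)
                size = xqspi->bytes_to_receive; /* assumption: don't trust FIFO fill level */
        if (size == 0)
                return;                         /* stale interrupt: nothing left to copy */
        memcpy(xqspi->rxbuf, &data, size);
        xqspi->rxbuf += size;
        xqspi->bytes_to_receive -= size;
}

int main(void)
{
        unsigned char buf[8];
        struct fake_qspi q = { .rxbuf = buf, .bytes_to_receive = sizeof(buf) };

        copy_read_data_clamped(&q, 0x11223344UL, 4); /* first FIFO word */
        copy_read_data_clamped(&q, 0x55667788UL, 4); /* second word: buffer now full */

        /* A third, spurious read: without the clamp this would advance
         * rxbuf past buf[] exactly like the crashing memcpy() in the oops. */
        copy_read_data_clamped(&q, 0xdeadbeefUL, 4);

        printf("bytes_to_receive = %d\n", q.bytes_to_receive);
        return 0;
}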
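And regarding the "has-io-mode" note above, a minimal kernel-module sketch (again only our illustration, with hypothetical names) of why DMA and UBI's vmalloc buffers don't mix: vmalloc memory is only virtually contiguous and lies outside the linear map, so it must not be handed to dma_map_single(); a driver either has to fall back to PIO, as "has-io-mode" does, or map it page by page via vmalloc_to_page():

/* Hypothetical demo module: shows that kmalloc buffers pass the
 * linear-map check while vmalloc buffers (as used by UBI) do not. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

/* True only for linear-map buffers that dma_map_single() can handle. */
static bool buffer_is_dma_safe(const void *buf)
{
        return virt_addr_valid(buf) && !is_vmalloc_addr(buf);
}

static int __init dma_safe_demo_init(void)
{
        void *kbuf = kmalloc(64, GFP_KERNEL);
        void *vbuf = vmalloc(64);

        if (kbuf && vbuf) {
                /* Expected: kmalloc -> 1 (DMA ok), vmalloc -> 0 (use PIO). */
                pr_info("kmalloc buffer DMA-safe: %d\n", buffer_is_dma_safe(kbuf));
                pr_info("vmalloc buffer DMA-safe: %d\n", buffer_is_dma_safe(vbuf));
        }

        kfree(kbuf);
        vfree(vbuf);
        return 0;
}

static void __exit dma_safe_demo_exit(void)
{
}

module_init(dma_safe_demo_init);
module_exit(dma_safe_demo_exit);
MODULE_LICENSE("GPL");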
We don't believe the bug is related to UBI, but running the UBI stress test is currently the only way we have to reproduce the crash. We are using the mtd-utils-tests package provided by mtd-utils from the poky sumo branch: https://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/recipes-devtools/mtd/mtd-utils_git.bb?h=sumo

Any help solving this issue is appreciated - it is the last bug preventing us from trusting/using the Xilinx spi-zynqmp-gqspi driver in Linux.

/Martin

From: Tejas Prajapati Rameshchandra <tejas...@xilinx.com>
Sent: Monday, July 2, 2018 5:24 AM
To: meta-xilinx@yoctoproject.org; Martin Lund
Subject: Re: [meta-xilinx] ZynqMP-zcu102: UBI stress test fails on qspinor flash

Hi Mark,

I have tried reproducing the issue with the latest images but have not been able to reproduce it for the past two weeks. Can you please let me know which patches you have ported from the 2018.1 release to 2017.4 and, so that we are on the same setup, what changes you have made to the device tree? Can you please also let me know which flash is used on the board, and whether you have any other changes related to mtd-utils?

Thanks,
Tejas Prajapati