Hi Tejas,

Sorry for the delayed response, but we have been busy with the bring-up of custom hardware based on the ZynqMP.
We have recently upgraded to the 2018.1 release and are now using the Xilinx 2018.1-tagged Linux kernel with quite a few patches to try to make the UBI stress test run successfully on both NAND and QSPI-NOR. Most of these patches are backported from linux-xlnx and linux-mtd. You can find the .bbappend file and all the patches we use here: https://github.com/lundmar/xilinx-mtd-debug

We have spent quite a lot of time testing, fixing, and stabilizing UBI on NAND, and have done so successfully by fixing a bug in the Arasan NAND driver and also backporting some fixes to the UBI subsystem. We have communicated the NAND driver bug fix via IRC (#mtd) and I believe it has already been picked up by the upstream Arasan driver maintainer.

However, for QSPI-NOR we still see the same crash in the spi-zynqmp-gqspi driver when running the UBI stress test, and we can now reproduce it on two different hardware configurations: the ZCU102 development board (zu9eg, 128 MB QSPI-NOR: 2 x 25Q512A, no Micron logo, looks like Chinese copies) and our own custom board (zu6eg, 512 MB QSPI-NOR: 2 x MT25QL02GCBB).

To reiterate, this is the crash we see when running the UBI stress test on QSPI-NOR:

<-- cut -->
Running io_paral /dev/ubi0
[38054.190578] Unable to handle kernel paging request at virtual address ffffff800f1fa000
[38054.198422] Mem abort info:
[38054.201192] Exception class = DABT (current EL), IL = 32 bits
[38054.207094] SET = 0, FnV = 0
[38054.210130] EA = 0, S1PTW = 0
[38054.213255] Data abort info:
[38054.216119] ISV = 0, ISS = 0x00000047
[38054.219939] CM = 0, WnR = 1
[38054.222892] swapper pgtable: 4k pages, 39-bit VAs, pgd = ffffff800908e000
[38054.229663] [ffffff800f1fa000] *pgd=000000087fffe003, *pud=000000087fffe003, *pmd=000000087a1c0003, *pte=0000000000000000
[38054.240604] Internal error: Oops: 96000047 [#1] SMP
[38054.245459] Modules linked in:
[38054.248500] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-xilinx-v2018.1 #1
[38054.255703] Hardware name: ZynqMP ZCU102 Rev1.0 (DT)
[38054.260650] task: ffffff8008de1880 task.stack: ffffff8008dd0000
[38054.266558] PC is at __memcpy+0xac/0x180
[38054.270463] LR is at zynqmp_qspi_readrxfifo.constprop.5+0x98/0xc0
[38054.276535] pc : [<ffffff8008a238ac>] lr : [<ffffff8008635a68>] pstate: 800001c5
[38054.283913] sp : ffffff8008003e00
[38054.287209] x29: ffffff8008003e00 x28: ffffff8008de1880
[38054.292504] x27: 0000000000000001 x26: ffffff8008c169e0
[38054.297800] x25: ffffff8008e58688 x24: ffffffc87b09ba00
[38054.303094] x23: 0000000000000090 x22: ffffffc87a473000
[38054.308389] x21: 0000000000000001 x20: 0000000000000001
[38054.313684] x19: ffffffc87a4735c8 x18: 0000000000000007
[38054.318978] x17: 0000000000000001 x16: 0000000000000019
[38054.324273] x15: 0000000000000033 x14: 000000000000004c
[38054.329568] x13: 0000000000000068 x12: ffffff8008ae4c40
[38054.334863] x11: ffffffc87ff6a7c8 x10: 0000000000000040
[38054.340158] x9 : ffffff8008dec668 x8 : ffffffc87b400240
[38054.345453] x7 : ffffffc87b400268 x6 : ffffff800f1fa000
[38054.350747] x5 : ffffffc87b400240 x4 : 00000048771ab000
[38054.356042] x3 : 00000000000000ac x2 : 0000000000000001
[38054.361337] x1 : ffffff8008003e39 x0 : ffffff800f1fa000
[38054.366633] Process swapper/0 (pid: 0, stack limit = 0xffffff8008dd0000)
[38054.373316] Call trace:
[38054.375747] Exception stack(0xffffff8008003cc0 to 0xffffff8008003e00)
[38054.382172] 3cc0: ffffff800f1fa000 ffffff8008003e39 0000000000000001 00000000000000ac
[38054.389986] 3ce0: 00000048771ab000 ffffffc87b400240 ffffff800f1fa000 ffffffc87b400268
[38054.397798] 3d00: ffffffc87b400240 ffffff8008dec668 0000000000000040 ffffffc87ff6a7c8
[38054.405609] 3d20: ffffff8008ae4c40 0000000000000068 000000000000004c 0000000000000033
[38054.413421] 3d40: 0000000000000019 0000000000000001 0000000000000007 ffffffc87a4735c8
[38054.421233] 3d60: 0000000000000001 0000000000000001 ffffffc87a473000 0000000000000090
[38054.429046] 3d80: ffffffc87b09ba00 ffffff8008e58688 ffffff8008c169e0 0000000000000001
[38054.436857] 3da0: ffffff8008de1880 ffffff8008003e00 ffffff8008635a68 ffffff8008003e00
[38054.444669] 3dc0: ffffff8008a238ac 00000000800001c5 ffffff8009072000 ffffff8008e09b48
[38054.452482] 3de0: 0000008000000000 0000000001cbd426 ffffff8008003e00 ffffff8008a238ac
[38054.460294] [<ffffff8008a238ac>] __memcpy+0xac/0x180
[38054.465239] [<ffffff8008635c30>] zynqmp_qspi_irq+0x1a0/0x2b8
[38054.470884] [<ffffff80080dbebc>] __handle_irq_event_percpu+0x5c/0x148
[38054.477305] [<ffffff80080dbfc4>] handle_irq_event_percpu+0x1c/0x58
[38054.483467] [<ffffff80080dc044>] handle_irq_event+0x44/0x78
[38054.489023] [<ffffff80080dfe20>] handle_fasteoi_irq+0xa0/0x188
[38054.494838] [<ffffff80080dafcc>] generic_handle_irq+0x24/0x38
[38054.500567] [<ffffff80080db67c>] __handle_domain_irq+0x5c/0xb8
[38054.506383] [<ffffff8008081500>] gic_handle_irq+0x68/0xc0
[38054.511762] Exception stack(0xffffff8008dd3d80 to 0xffffff8008dd3ec0)
[38054.518187] 3d80: 0000000000000000 ffffffc87ff6ce80 00000048771ab000 0000200000bc4b14
[38054.526001] 3da0: 0000000000000015 00ffffffffffffff 0000033360f7a043 0000000000000013
[38054.533813] 3dc0: ffffffc87ff6be84 000000000000000a ffffffc87ff6be64 000000000000003c
[38054.541625] 3de0: 0000000000000000 000000000001133c 071c71c71c71c71c ffffff8008dd8000
[38054.549437] 3e00: 0000000000000001 ffffff8008a58630 ffffffc87ff6cee0 0000229c2e78a95a
[38054.557249] 3e20: 0000000000000000 ffffffc87ac4aa00 0000000000000000 ffffffc87ac1c400
[38054.565061] 3e40: 0000229c2e784e88 ffffffc87ac1c400 ffffff8008de1880 0000000000000400
[38054.572873] 3e60: 0000000000d50018 ffffff8008dd3ec0 ffffff80087754a0 ffffff8008dd3ec0
[38054.580685] 3e80: ffffff80087754a4 0000000060000145 ffffffc87ac1c400 0000000000000000
[38054.588497] 3ea0: ffffffffffffffff 0000000000000000 ffffff8008dd3ec0 ffffff80087754a4
[38054.596309] [<ffffff80080830f0>] el1_irq+0xb0/0x140
[38054.601170] [<ffffff80087754a4>] cpuidle_enter_state+0x154/0x200
[38054.607158] [<ffffff8008775588>] cpuidle_enter+0x18/0x20
[38054.612453] [<ffffff80080cfea4>] call_cpuidle+0x1c/0x40
[38054.617660] [<ffffff80080d00f4>] do_idle+0x1a4/0x1e0
[38054.622607] [<ffffff80080d02a0>] cpu_startup_entry+0x20/0x28
[38054.628250] [<ffffff8008a3752c>] rest_init+0xac/0xb8
[38054.633198] [<ffffff8008d50b78>] start_kernel+0x398/0x3ac
[38054.638580] Code: 78402423 780024c3 36000562 38401423 (380014c3)
[38054.644655] ---[ end trace b83027f2d72fed45 ]---
[38054.649254] Kernel panic - not syncing: Fatal exception in interrupt
[38054.655591] SMP: stopping secondary CPUs
[38054.659560] Kernel Offset: disabled
[38054.662967] CPU features: 0x002004
[38054.666351] Memory Limit: none
[38054.669392] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
<-- cut -->

For a full crash log including boot, see https://pastebin.com/raw/rzS8wYHE

Please notice that we are utilizing the full capacity of the chips (kernel command line: "mtdparts=spi0.0:-(qspi-all)", i.e. a single partition named qspi-all spanning the remainder, here the whole device). Also, we run the CPUs at 600 MHz, but we see the same crash at full speed (1.2 GHz), with and without preemption enabled.
Sometimes the UBI stress test succeeds, but often it crashes. When running the UBI stress test we have also seen SPI transfer timeouts, clearly caused by the system being highly loaded/stressed. To work around this we simply added patches that increase the SPI transfer and chip-select timeouts. However, the increased timeouts seem to make no difference to the specific crash we see.

Also, please notice that one of our patches reconfigures the device tree to run the QSPI-NOR at a safe 50 MHz, and that it adds "has-io-mode" to avoid using DMA, which is not compatible with UBI because UBI uses vmalloc buffers (a short illustration of this follows further below, after the crash analysis).

We have translated some of the relevant crash addresses involved:

Link register:

aarch64-gnu-linux-addr2line -e ./vmlinux ffffff8008635a68
/usr/src/kernel/drivers/spi/spi-zynqmp-gqspi.c:430

426 static void zynqmp_qspi_copy_read_data(struct zynqmp_qspi *xqspi,
427                                        ulong data, u8 size)
428 {
429         memcpy(xqspi->rxbuf, &data, size);
430         xqspi->rxbuf += size;
431         xqspi->bytes_to_receive -= size;
432 }

aarch64-gnu-linux-addr2line -e ./vmlinux ffffff8008635c30
/usr/src/kernel/drivers/spi/spi-zynqmp-gqspi.c:796

790         if (!(mask & GQSPI_IER_RXEMPTY_MASK) &&
791             (mask & GQSPI_IER_GENFIFOEMPTY_MASK)) {
792                 zynqmp_qspi_readrxfifo(xqspi, GQSPI_RX_FIFO_FILL);
793                 ret = IRQ_HANDLED;
794         }
795
796         if ((xqspi->bytes_to_receive == 0) && (xqspi->bytes_to_transfer == 0)
797             && ((status & GQSPI_IRQ_MASK) == GQSPI_IRQ_MASK)) {
798                 zynqmp_disable_intr(xqspi);
799                 xqspi->isinstr = false;
800                 spi_finalize_current_transfer(master);
801                 ret = IRQ_HANDLED;
802         }
803
804         return ret;
805 }

Decoding the AArch64 data abort information (ISS = 0x00000047 = 0b1000111: WnR, bit 6, is 1, i.e. a write access; DFSC, bits 5:0, is 0b000111, a translation fault at level 3) tells us that the data abort is a level 3 translation fault on a write operation. So it seems we are crashing in the memcpy() operation when it is writing 1 byte (x2) from source address ffffff8008003e39 (x1) to destination address ffffff800f1fa000 (x0), meaning that at some point the destination address held in xqspi->rxbuf is no longer valid for writing.

We still don't know what causes rxbuf to go haywire, but since it only seems to happen occasionally under stress testing, I suspect it might be a synchronization or buffer allocation issue. Unfortunately it is not easy to debug, because the test runs for hours before the crash occurs. Perhaps the maintainers of this driver can provide some insight that might narrow down the issue or make it easier to reproduce.
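To make our working theory concrete, here is a small, purely illustrative userspace sketch (our assumption, not the actual driver code): if the RX path is entered once more after bytes_to_receive has reached zero, for example via a stale or racing interrupt, the unclamped memcpy() in zynqmp_qspi_copy_read_data() keeps advancing rxbuf past the end of the caller's buffer until it eventually hits an unmapped page. The clamp shown is a hypothetical guard we have not verified, not a known fix:

/*
 * Illustrative sketch only (NOT the driver): models how an extra,
 * spurious RX FIFO read could walk rxbuf off the end of the buffer,
 * and what a defensive clamp might look like.
 */
#include <stdio.h>
#include <string.h>

struct fake_qspi {
        unsigned char *rxbuf;
        int bytes_to_receive;
};

/* Mirrors the shape of zynqmp_qspi_copy_read_data(), plus a clamp. */
static void copy_read_data_clamped(struct fake_qspi *xqspi,
                                   unsigned long data, unsigned char size)
{
        if (size > xqspi->bytes_to_receive)
                size = xqspi->bytes_to_receive; /* assumption: don't trust FIFO fill level */
        if (size == 0)
                return;                         /* stale interrupt: nothing left to copy */
        memcpy(xqspi->rxbuf, &data, size);
        xqspi->rxbuf += size;
        xqspi->bytes_to_receive -= size;
}

int main(void)
{
        unsigned char buf[8];
        struct fake_qspi q = { .rxbuf = buf, .bytes_to_receive = sizeof(buf) };

        copy_read_data_clamped(&q, 0x11223344UL, 4); /* first FIFO word */
        copy_read_data_clamped(&q, 0x55667788UL, 4); /* second word: buffer now full */

        /* A third, spurious read: without the clamp this would advance
         * rxbuf past buf[] exactly like the crashing memcpy() in the oops. */
        copy_read_data_clamped(&q, 0xdeadbeefUL, 4);

        printf("bytes_to_receive = %d\n", q.bytes_to_receive);
        return 0;
}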
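And regarding the "has-io-mode" note above, a minimal kernel-module sketch (again only our illustration, with hypothetical names) of why DMA and UBI's vmalloc buffers don't mix: vmalloc memory is only virtually contiguous and lies outside the linear map, so it must not be handed to dma_map_single(); a driver either has to fall back to PIO, as "has-io-mode" does, or map it page by page via vmalloc_to_page():

/* Hypothetical demo module: shows that kmalloc buffers pass the
 * linear-map check while vmalloc buffers (as used by UBI) do not. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>

/* True only for linear-map buffers that dma_map_single() can handle. */
static bool buffer_is_dma_safe(const void *buf)
{
        return virt_addr_valid(buf) && !is_vmalloc_addr(buf);
}

static int __init dma_safe_demo_init(void)
{
        void *kbuf = kmalloc(64, GFP_KERNEL);
        void *vbuf = vmalloc(64);

        if (kbuf && vbuf) {
                /* Expected: kmalloc -> 1 (DMA ok), vmalloc -> 0 (use PIO). */
                pr_info("kmalloc buffer DMA-safe: %d\n", buffer_is_dma_safe(kbuf));
                pr_info("vmalloc buffer DMA-safe: %d\n", buffer_is_dma_safe(vbuf));
        }

        kfree(kbuf);
        vfree(vbuf);
        return 0;
}

static void __exit dma_safe_demo_exit(void)
{
}

module_init(dma_safe_demo_init);
module_exit(dma_safe_demo_exit);
MODULE_LICENSE("GPL");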
We don't believe the bug is related to UBI, but running the UBI stress test is currently the only way we have to reproduce the crash. We are using the mtd-utils-tests package provided by mtd-utils from the poky sumo branch: https://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/recipes-devtools/mtd/mtd-utils_git.bb?h=sumo

Any help solving this issue is appreciated - it is the last bug preventing us from trusting/using the Xilinx spi-zynqmp-gqspi driver in Linux.

/Martin

From: Tejas Prajapati Rameshchandra <tejas...@xilinx.com>
Sent: Monday, July 2, 2018 5:24 AM
To: meta-xilinx@yoctoproject.org; Martin Lund
Subject: Re: [meta-xilinx] ZynqMP-zcu102: UBI stress test fails on qspinor flash

Hi Mark,

I have tried reproducing the issue with the latest images but have not been able to reproduce it for the past two weeks. Can you please let me know which patches you have ported from the 2018.1 release to 2017.4 and, so that we are on the same setup, what changes you have made to the device tree? Can you please also let me know which flash is used on the board, and whether you have any other changes related to mtd-utils?

Thanks,
Tejas Prajapati