Re: Non deterministic kernel crashes after minimal devicetree changes.

2019-07-22 Thread Maik Nassauer
Any ideas how to deal with this problem? 

Best regards

-- 
kernel concepts GmbH   Maik Nassauer
Hauptstraße 16   
maik.nassa...@kernelconcepts.de
D-57074 Siegen Tel: +49 271-338857-21
http://www.kernelconcepts.de/  
HR Siegen, HR B 9613
Geschäftsführer: Ole Reinhardt




Non deterministic kernel crashes after minimal devicetree changes.

2019-07-16 Thread Maik Nassauer
Dear everyone,

We are currently developing a kernel upgrade for older hardware. The
system is to be upgraded from kernel 2.6.24 to the current stable
vanilla kernel (4.19).

With our new kernel we are facing strange, non-deterministic kernel
crashes that occur more or less randomly when we modify our devicetree
(even small changes may lead to crashes).

The setup:
- CPU Platform: MPC5121 on a custom board, somewhat similar to ADS5121
eval board

- Bootloader: u-boot 1.3.2
CPU:   MPC5121e rev. 2.0, Core e300c4 at 400 MHz, CSB at 200 MHz
Board: CCS5121
DRAM:  256 MB
FLASH: 32 MB
In:    serial
Out:   serial
Err:   serial
I2C:   PMC KEY
ETH:   (eeprom) 00:30:d6:00:00:00
Net:   FEC ETHERNET

- Vanilla Kernel: 4.19 (based on git commit 84df9525) with custom
modifications, mostly devicetree and some drivers and board setup.

Kernel command line: root=/dev/nfs rw
nfsroot=192.168.2.85:/srv/nfs_rootfs,v3,tcp video=fslfb:800x480-32@68
ip=192.168.2.230:192.168.2.85:192.168.2.254:255.255.255.0:dhcp28.kc.loc
:eth0:off panic=1 console=ttyPSC0,115200 no_console_suspend
video=fslfb:800x480-32@68


We are currently building the kernel and devicetree using a PowerPC
cross toolchain:

powerpc-linux-gnu-gcc 9.1.0-1
https://aur.archlinux.org/packages/powerpc-linux-gnu-gcc/



In the u-boot code, we changed CFG_BOOTMAPSZ from 8 to 64 MB, because
the 4.19 kernel is much bigger than the old (2.6.x) one that we
previously booted on our system.

However the CFG_BOOTMAPSZ setting does not seem to have any influence
on the problem itself.
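
To make the constraint behind CFG_BOOTMAPSZ concrete: data handed to the
kernel (such as the devicetree blob) has to end below that limit. A
minimal sketch of the arithmetic follows. The load addresses used here
are purely hypothetical and are not taken from our board; only the
8 MB / 64 MB limits and the 131072 byte blob size come from this mail.

```python
MiB = 1024 * 1024

def fits_below_limit(load_addr: int, size: int, limit: int) -> bool:
    """Return True if the region [load_addr, load_addr + size) ends at or below limit."""
    return load_addr + size <= limit

# Hypothetical placements of a 131072 byte (padded) devicetree blob.
DTB_SIZE = 131072
for addr in (0x007E0000, 0x00800000):   # assumed addresses, for illustration only
    for limit in (8 * MiB, 64 * MiB):   # old and new CFG_BOOTMAPSZ
        ok = fits_below_limit(addr, DTB_SIZE, limit)
        print(f"addr=0x{addr:08x} limit={limit // MiB:2d} MiB fits={ok}")
```

The real addresses would have to be read from the u-boot output of the
board in question.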

We are also padding the devicetree with --space 131072, so the actual
size of the (padded) device tree binary should not have any impact on
the kernel crashes.
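
One way to verify what the padding actually produced is to decode the
header of the resulting blob. The following is only a sketch: it
assumes a version 17 flattened devicetree header (ten big-endian 32-bit
fields), and the helper names are made up for illustration.

```python
import struct

FDT_MAGIC = 0xD00DFEED

# Field order of the flattened devicetree header (version 17 assumed).
FDT_HEADER_FIELDS = (
    "magic", "totalsize", "off_dt_struct", "off_dt_strings",
    "off_mem_rsvmap", "version", "last_comp_version",
    "boot_cpuid_phys", "size_dt_strings", "size_dt_struct",
)

def parse_fdt_header(blob: bytes) -> dict:
    """Decode the first 40 bytes of a devicetree blob into a dict."""
    if len(blob) < 40:
        raise ValueError("blob is shorter than an FDT header")
    header = dict(zip(FDT_HEADER_FIELDS, struct.unpack(">10I", blob[:40])))
    if header["magic"] != FDT_MAGIC:
        raise ValueError("not a flattened devicetree (bad magic)")
    return header

def padding_bytes(header: dict) -> int:
    """Bytes between the end of the last used block and totalsize."""
    used_end = max(
        header["off_dt_struct"] + header["size_dt_struct"],
        header["off_dt_strings"] + header["size_dt_strings"],
    )
    return header["totalsize"] - used_end

def describe(path: str) -> None:
    """Print all header fields and the free padding of the blob at path."""
    with open(path, "rb") as f:
        header = parse_fdt_header(f.read())
    for name in FDT_HEADER_FIELDS:
        print(f"{name:18s} 0x{header[name]:08x}")
    print(f"{'padding':18s} {padding_bytes(header)} bytes")
```

Calling describe() on a working and on a crashing blob would show
whether totalsize really stays at 131072 and how the block offsets
differ between the two.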

It looks like these crashes may be caused by an alignment error or
something similar: when we add, e.g., `a;` to node `usb@4000` the
kernel boots, but when we add a further line like `b;` it crashes. It
also matters where we put these lines. We can't put the a/b lines at
the top of the device tree, because that also causes a crash, even with
just a single `a;` at the top. The crashes also change, or may even
disappear, if we add more lines.
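
For reference, the amount by which a single property changes the blob
can be estimated from the flattened devicetree layout: each property in
the structure block is a 4-byte token, a 4-byte length and a 4-byte
name offset, followed by the value padded to 4 bytes, and its name goes
into the strings block as a NUL-terminated string. A rough sketch (the
helper names are made up, and it assumes the name is not already
present in the strings block):

```python
def align4(n: int) -> int:
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3

def prop_struct_bytes(value_len: int) -> int:
    """Structure-block bytes for one property with a value of value_len bytes."""
    # token (4) + len (4) + nameoff (4) + value padded to 4 bytes
    return 12 + align4(value_len)

def prop_string_bytes(name: str) -> int:
    """Strings-block bytes for a property name that is not yet in the block."""
    return len(name.encode("ascii")) + 1  # name plus terminating NUL

# Growth caused by the two empty test properties from this mail.
for name in ("a", "b"):
    print(name, prop_struct_bytes(0), prop_string_bytes(name))
```

So `a;` and `b;` together grow the structure block by 24 bytes and the
strings block by up to 4 bytes. With --space padding the total size
stays the same, but everything stored after the insertion point moves,
which would fit an offset- or alignment-dependent problem.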



Here is an example of what we changed:

Original:

/* USB0 using internal UTMI PHY */
usb@4000 {
	dr_mode = "otg";
	fsl,invert-drvvbus;
	fsl,invert-pwr-fault;
	ccs5121-front-and-back-port;
	ccs5121-otg-switch;
};

Modified, crashes:

/* USB0 using internal UTMI PHY */
usb@4000 {
	a;	// "nonsense nodes", but these lines cause the crash.
	b;
	dr_mode = "otg";
	fsl,invert-drvvbus;
	fsl,invert-pwr-fault;
	ccs5121-front-and-back-port;
	ccs5121-otg-switch;
};

The actual node where we apply these changes does not matter. Also,
a and b are just examples. You can add whatever you want; even "real"
properties may lead to crashes.

Furthermore, it is not certain that just two lines will cause the
crash. Sometimes even a single line with a longer property name, or
multiple added lines, may lead to crashes. Removing nodes or just
properties may also lead to crashes.

In other words: modifying the devicetree in any way may lead to
crashes.

If we boot multiple times, we may even get different crash reports...

I hope this, in conjunction with the attached logs, is detailed enough
to illustrate the problem. Does anyone have an idea what exactly might
cause this, or how to debug it further?

A full bootlog of a _working_ boot is attached at the end of this mail.


Thanks and best regards,

Maik Nassauer





Attachments:

Some crashes:
=============


Faulting instruction address: 0x
Oops: Kernel access of bad area, sig: 11 [#1]
BE MPC5121 CCS 0
Modules linked in:
CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 4.19.0-00023-g1077d91e4c12-dirty #2
NIP:   LR:  CTR: c005cf30
REGS: cf837e50 TRAP: 0400   Not tainted  (4.19.0-00023-g1077d91e4c12-dirty)
MSR:  20009032   CR: 22000844  XER: 

GPR00: c00248d8 cf837f00 cf822aa0 c0040430 0002 0005 
 
GPR08: c005cf30   1032 42000842  0004
0100 
GPR16: cf836000 c07b55c4 c07b55c0 0001 0002 0004 04208040
 
GPR24: c07a c060e4cc c06b5334 000a fffb7619 c076cfa0 c0770d10
c06b4f60 
NIP []   (null)
LR []   (null)
Call Trace:
Instruction dump:
      
 
      
 
---[ end trace 9de0a50b44704278 ]---

Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 1 seconds..


-------------


Unrecoverable FP Unavailable Exception 801 at c005a6e8
Oops: Unrecoverable FP Unavailable Exception, sig: 6 [#1]
BE MPC5121 CCS 0
Modules linked in:
CPU: 0 PID: 430 Comm: kworker/u2:5 Not tainted 4.19.0-00023-g1077d91e4c12-dirty #2
Workqueue: rpciod rpc