Update: stack dump and register analysis are in fact pointing to a crash in mm_alloc

I have enabled memory management debug:

mm_initialize: Heap: start=0x10000000 size=65536
mm_addregion: Region 1: base=0x10000154 size=65184
stm32_netinitialize: Enabling PHY power
stm32_netinitialize: PHY reset...
stm32_netinitialize: PHY reset done.
stm32_netinitialize: Configuring PHY int
F
mm_free: Freeing 0x70fb460b
irq_unexpected_isr: ERROR irq: 3
up_assert: Assertion failed at file:irq/irq_unexpectedisr.c line: 50
up_registerdump: R0: 00000001 2000737c c00000f2 08000101 00000000 00000000 00000000 200073c8 up_registerdump: R8: 00000000 00000000 00000000 00000000 00000000 200073c8 080126ad 080126f8
up_registerdump: xPSR: 21000000 PRIMASK: 00000000 CONTROL: 00000000
up_registerdump: EXC_RETURN: fffffff9
up_dumpstate: sp:         200072c8
up_dumpstate: stack base: 20007078
up_dumpstate: stack size: 00000400

The fact that mm_initialize only shows one region is weird... where is the heap for the main RAM at 0x20000000?

the mm_free(0x70fb460b) is not what causes the hardfault (it comes later), but what the hell is is this invalid address!

This is the first call to mm_free, here is the backtrace:

Breakpoint 1, mm_free (heap=0x200060b4 <g_mmheap>, mem=0x70fb460b) at mm_heap/mm_free.c:85
85        if (!mem)
(gdb) bt
#0  mm_free (heap=0x200060b4 <g_mmheap>, mem=0x70fb460b) at mm_heap/mm_free.c:85 #1  0x0801264a in mm_free_delaylist (heap=0x200060b4 <g_mmheap>) at mm_heap/mm_malloc.c:82 #2  0x08012672 in mm_malloc (heap=0x200060b4 <g_mmheap>, size=24) at mm_heap/mm_malloc.c:115 #3  0x08012a32 in mm_zalloc (heap=0x200060b4 <g_mmheap>, size=24) at mm_heap/mm_zalloc.c:45
#4  0x080123ac in zalloc (size=24) at umm_heap/umm_zalloc.c:68
#5  0x080399fa in inode_alloc (name=0x8059a78 "") at inode/fs_inodereserve.c:78
#6  0x08039a5c in inode_root_reserve () at inode/fs_inodereserve.c:129
#7  0x080398cc in inode_initialize () at inode/fs_inode.c:92
#8  0x08039284 in fs_initialize () at fs_initialize.c:47
#9  0x08007eb4 in nx_start () at init/nx_start.c:600
#10 0x0800421e in __start () at chip/stm32_start.c:338

As previously analyzed, this happens in fs_initialize through inode_root_reserve, so I was on the right track.

Caller shows mm_free called with that weird address:

(gdb) f 1
#1  0x0801264a in mm_free_delaylist (heap=0x200060b4 <g_mmheap>) at mm_heap/mm_malloc.c:82
82            mm_free(heap, address);
(gdb) list
77
78            /* The address should always be non-NULL since that was checked in the
79             * 'while' condition above.
80             */
81
82            mm_free(heap, address); <-- address == 0x70fb460b
83          }
84      #endif
85      }
86

(gdb) print &g_mmheap
$3 = (struct mm_heap_s *) 0x200060b4 <g_mmheap>
(gdb) print g_mmheap
$4 = {mm_impl = 0x0}

this is not good!

This is not a timing or IRQ related issue but a heap issue.

R15 = 080126f8 translates to here:

https://github.com/apache/incubator-nuttx/blob/master/mm/mm_heap/mm_malloc.c#L199

=> this free() has corrupted a badly initialized heap, and the next malloc fails, giving a hardfault because that address is invalid.

Horrific mess!

==>

I think that my old board code does not initialize the board properly, I probably have to check for differences between my code and the stm32f429i-disco built-in board (on which I based my board).

Sebastien

Le 25/05/2021 à 21:26, Nathan Hartman a écrit :
On Tue, May 25, 2021 at 12:02 PM Sebastien Lorquet <sebast...@lorquet.fr>
wrote:

Back to the business
After this we managed to recompile our project using the latest NuttX
sources, but it fails when trying to init the PHY irq on our STM32F427
board: We get "unexpected IRQ".

Yes I know that's pretty vague :-)

Is there anything obvious I should have been careful with in this
domain, before I dig the jtag probe to fix it (tomorrow) ?
I would first start by looking through the Release Notes between v7.30
and v10.1. Many big improvements and bug fixes happened and some of
them are mentioned in Compatibility Concerns along with some changes
you might need to make to configuration etc.

Also another thing you can try: Has this board and PHY worked
correctly with v7.30? If so, you can bisect and with very few tests
(I'm guessing fewer than 20) find the exact commit that broke it.
Release notes are hard to read but I did not find anything special about
phy interrupts.

Note that it may not be the phy interrupt. Here is my log:

stm32_netinitialize: Enabling PHY power
stm32_netinitialize: PHY reset...
stm32_netinitialize: PHY reset done.
stm32_netinitialize: Configuring PHY int
F
irq_unexpected_isr: ERROR irq: 3
up_assert: Assertion failed at file:irq/irq_unexpectedisr.c line: 50
up_registerdump: R0: 00000001 2000737c c00000f2 08000101 00000000
00000000 00000000 200073c8
up_registerdump: R8: 00000000 00000000 00000000 00000000 00000000
200073c8 080126ad 080126f8
up_registerdump: xPSR: 21000000 PRIMASK: 00000000 CONTROL: 00000000
up_registerdump: EXC_RETURN: fffffff9

A lot of OS initialization things happen at the point, marked by the
letter F.

It seems that an unexpected IRQ happens in this interval, around the
time the filesystem is initialized. The backtrace goes down to memory
allocation routines through the initialization of the root inode.

My guess is that AN external IRQ is triggered (possibly not the PHY IRQ)
but the ISR handler for that one is not ready yet. I will add debug
messages.


I would expect that situation to be a simple NOP, but it seems that
undefined handlers are set to this function "irq_unexpected_isr"

Is that a new behaviour? a default config that I did not set properly
when porting our old defconfig?

Sebastien

Nathan
Did you try disabling the PHY (or networking) in Kconfig to see if removing
it from the build will eliminate the hardfault?

Have you seen this about hardfault debugging:
https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=139629445#content/view/139629445

Nathan

Reply via email to