If the corrupted memory location is consistent from run to run, then you can monitor that location to find the culprit (perhaps using debug output).  If your debugger supports it, setting a watchpoint on that address could also trigger a break at the moment the corruption occurs.
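A minimal sketch of the debug-output approach (plain C, nothing NuttX-specific; the function names and the idea of sprinkling checks at scheduler hooks or between boot steps are illustrative assumptions):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical software "watchpoint": snapshot a suspect heap word and
 * report the first point at which it changes.  Names and call sites
 * are illustrative, not NuttX APIs. */

static uint64_t *g_watch_addr;
static uint64_t g_watch_expected;

void watch_arm(uint64_t *addr)
{
  g_watch_addr = addr;
  g_watch_expected = *addr;        /* Snapshot the known-good value */
}

/* Call from strategic points (e.g. before/after each boot step).  The
 * first call that reports a change brackets the wild write. */

int watch_check(const char *where)
{
  if (g_watch_addr != NULL && *g_watch_addr != g_watch_expected)
    {
      printf("Corruption detected at %s: %p changed from %llx to %llx\n",
             where, (void *)g_watch_addr,
             (unsigned long long)g_watch_expected,
             (unsigned long long)*g_watch_addr);
      return 1;
    }

  return 0;
}
```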

Maybe you can also try disabling features until you find the feature logic that is corrupting the heap.  There is no easy way to accomplish this.

On 3/11/2024 11:27 AM, Nathan Hartman wrote:
What's needed is some way to binary search where the culprit is.

If I understand correctly, it looks like the crash is happening in the
later stages of board bring-up? What is running before that? Can parts
be disabled or skipped to see if the problem goes away?

Another idea is to try running a static analysis tool on the sources
and see if it finds anything suspicious to be looked into more
carefully.


On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt <spudan...@gmail.com> wrote:
The reason that the error is confusing is because the error probably did
not occur at the time of the assertion; it probably occurred much earlier.

In most crashes due to heap corruption there are two players:  the
culprit thread and the victim thread.  The culprit thread actually causes the
corruption, but at the time of the corruption, no error occurs.  The
error will not occur until later.

So sometime later, the victim thread runs, encounters the clobbered heap,
and crashes.  In this case, "AppBringup" and "rptun" are potential
victim threads.  The fact that they crash tells you very little about the
culprit.
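As a toy illustration of this two-player pattern (a plain C model with an invented layout, not NuttX code): the culprit overruns its buffer and silently clobbers the next node's header, and nothing fails until the victim consults that header later.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy model: two "allocations" back to back in one arena, each
 * preceded by an 8-byte size header, as in a typical heap layout. */

static uint8_t arena[64];

static uint64_t get_hdr(size_t off)
{
  uint64_t v;
  memcpy(&v, &arena[off], sizeof v);
  return v;
}

static void set_hdr(size_t off, uint64_t v)
{
  memcpy(&arena[off], &v, sizeof v);
}

/* The culprit writes write_len bytes into node 0's 24-byte payload.
 * Returns 1 if node 1's header is still intact afterwards; with
 * write_len > 24 the header is silently clobbered, and nothing fails
 * until the victim reads it later. */

int culprit_then_victim(size_t write_len)
{
  set_hdr(0, 24);                       /* node 0 header */
  set_hdr(32, 24);                      /* node 1 header at offset 32 */
  memset(&arena[8], 0xaa, write_len);   /* the (possibly wild) write */
  return get_hdr(32) == 24;             /* victim's later check */
}
```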

On 3/10/2024 6:51 PM, yfliu2008 wrote:
Gregory, thank you for the analysis.




The crashes happened during system boot, mostly in the "AppBringup" or "rptun" threads, as per the
assertion logs. The only other threads are the "idle" and "lpwork" threads, as per the sched logs;
there should be no others, as NSH creation is still ongoing. As for interrupts, the UART and IPI are handled
in kernel space and the MTIMER in NuttSBI space. NSH is loaded from an RPMSGFS volume, so there is a lot of RPMSG
communication.




Is KASAN suitable for use in Kernel mode?


With MM_KASAN_ALL it reports a read access error:



kasan_report: kasan detected a read access error, address at 0x708fe90, size is 8, return address: 0x701aeac

_assert: Assertion failed panic: at file: kasan/kasan.c:117 task: Idle_Task process: Kernel 0x70023c0


The call stack looks like:


#0  _assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic", regs=0x7082720 <g_last_regs>) at misc/assert.c:536
#1  0x0000000007010248 in __assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic") at assert/lib_assert.c:36
#2  0x00000000070141d6 in kasan_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac <riscv_swint+114>) at kasan/kasan.c:117
#3  0x0000000007014412 in kasan_check_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac <riscv_swint+114>) at kasan/kasan.c:190
#4  0x000000000701468c in __asan_load8_noabort (addr=0x708fe90) at kasan/kasan.c:315
#5  0x000000000701aeac in riscv_swint (irq=0, context=0x708fe40, arg=0x0) at common/riscv_swint.c:133
#6  0x000000000701b8fe in riscv_perform_syscall (regs=0x708fe40) at common/supervisor/riscv_perform_syscall.c:45
#7  0x0000000007000570 in sys_call6 ()



With MM_KASAN_DISABLE_READ_CHECKS=y, it reports:


_assert: Assertion failed : at file: mm_heap/mm_malloc.c:245 task: rptun process: Kernel 0x704a030


The call stack is:


#0  _assert (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, msg=0x0, regs=0x7082720 <g_last_regs>) at misc/assert.c:536
#1  0x000000000700df18 in __assert (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, msg=0x0) at assert/lib_assert.c:36
#2  0x0000000007013082 in mm_malloc (heap=0x7089c00, size=128) at mm_heap/mm_malloc.c:245
#3  0x0000000007011694 in kmm_malloc (size=128) at kmm_heap/kmm_malloc.c:51
#4  0x000000000704efd4 in metal_allocate_memory (size=128) at .../nuttx/include/metal/system/nuttx/alloc.h:27
#5  0x000000000704fd8a in rproc_virtio_create_vdev (role=1, notifyid=0, rsc=0x80200050, rsc_io=0x7080408 <metal_io_region_>, priv=0x708ecd8, notify=0x704e6d2 <remoteproc_virtio_notify>, rst_cb=0x0) at open-amp/lib/remoteproc/remoteproc_virtio.c:356
#6  0x000000000704e956 in remoteproc_create_virtio (rproc=0x708ecd8, vdev_id=0, role=1, rst_cb=0x0) at open-amp/lib/remoteproc/remoteproc.c:957
#7  0x000000000704b1ee in rptun_dev_start (rproc=0x708ecd8) at rptun/rptun.c:757
#8  0x0000000007049ff8 in rptun_start_worker (arg=0x708eac0) at rptun/rptun.c:233
#9  0x000000000704a0ac in rptun_thread (argc=3, argv=0x7092010) at rptun/rptun.c:253
#10 0x000000000700437e in nxtask_start () at task/task_start.c:107


It looks like the heap was already corrupted by this point.



I also noticed there is a "mm_checkcorruption()" function, but I am not sure how to use
it yet.
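My rough understanding is that it walks the heap node chain and asserts if the chain is inconsistent, so calling it before and after suspect code could bracket when the damage happens. A simplified standalone model of that kind of walk (not actual NuttX code; real mm_heap nodes pack an allocated bit into the size field and track the preceding node's size):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified model of a heap-consistency walk: nodes are laid out back
 * to back, each starting with its own size, and a consistent heap
 * chains exactly from the start of the region to the end. */

struct node
{
  size_t size;        /* Size of this node, including the header */
  int allocated;      /* Nonzero if the node is in use */
};

/* Returns 0 if the chain covering [start, end) is consistent,
 * -1 if some node's size field has been clobbered. */

int heap_walk(uint8_t *start, uint8_t *end)
{
  uint8_t *p = start;

  while (p < end)
    {
      struct node *n = (struct node *)p;
      if (n->size < sizeof(struct node) || n->size > (size_t)(end - p))
        {
          printf("Corrupt node at %p (size %zu)\n", (void *)p, n->size);
          return -1;
        }

      p += n->size;
    }

  return 0;
}
```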



Regards,
yf





Original

From: "Gregory Nutt" <spudan...@gmail.com>

Date: 2024/3/11 1:43

To: "dev" <dev@nuttx.apache.org>

Subject: Re: mm/mm_heap assertion error


On 3/10/2024 4:38 AM, yfliu2008 wrote:
> Dear experts,
>
>
> When doing regression check on K230 with a previously working Kernel mode configuration, I got assertion error like below:
>
>
> #0  _assert (filename=0x704c598 "mm_heap/mm_malloc.c", linenum=245, msg=0x0, regs=0x7082730
> #2  0x00000000070110f0 in mm_malloc (heap=0x7089c00, size=112) at mm_heap/mm_malloc.c:245
> #3  0x000000000700fd74 in kmm_malloc (size=112) at kmm_heap/kmm_malloc.c:51
> #4  0x0000000007028d4e in elf_loadphdrs (loadinfo=0x7090550) at libelf/libelf_sections.c:207
> #5  0x0000000007028b0c in elf_load (loadinfo=0x7090550) at libelf/libelf_load.c:337
> #6  0x00000000070278aa in elf_loadbinary (binp=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at elf.c:257
> #7  0x00000000070293ea in load_absmodule (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:115
> #8  0x0000000007029504 in load_module (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:219
> #9  0x0000000007027674 in exec_internal (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788, spawn=true) at binfmt_exec.c:98
> #10 0x000000000702779c in exec_spawn (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788) at binfmt_exec.c:220
> #11 0x000000000700299e in nx_start_application () at init/nx_bringup.c:375
> #12 0x00000000070029f0 in nx_start_task (argc=1, argv=0x7090010) at init/nx_bringup.c:403
> #13 0x0000000007003f84 in nxtask_start () at task/task_start.c:107
>
>
> It looks like mm/mm_heap data structure consistency was broken. As I am unfamiliar with these internals, I am looking forward to any hints about how to find the root cause.
>
>
> Regards,
>
> yf

This does indicate heap corruption:

      240       /* Node next must be alloced, otherwise it should be merged.
      241        * Its prenode (the founded node) must be free and preceding should
      242        * match with nodesize.
      243        */
      244
      245       DEBUGASSERT(MM_NODE_IS_ALLOC(next) && MM_PREVNODE_IS_FREE(next) &&
      246                   next->preceding == nodesize);

Heap corruption normally occurs when there is a wild write outside of
the allocated memory region.  These kinds of wild writes may clobber
some other thread's data and directly or indirectly clobber the heap
metadata.  Trying to traverse the damaged heap metadata is probably
the immediate cause of the assertion failure.

Only a kernel thread or interrupt handler could damage the heap.

The cause of this corruption can be really difficult to find because the
error is not reported when the heap is damaged; it may not
manifest itself until sometime later.

It is unlikely that anyone will be able to solve this by just talking
about it.  It might be worth increasing some kernel thread stack sizes
just to eliminate that common cause:  thread stacks are allocated from
the heap, so a stack overflow will clobber adjacent heap memory.
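For example, in the board defconfig (these option names appear in typical NuttX configurations, but the relevant symbols and the values here are illustrative and depend on the board):

```
CONFIG_IDLETHREAD_STACKSIZE=4096
CONFIG_DEFAULT_TASK_STACKSIZE=4096
CONFIG_RPTUN_STACKSIZE=4096
```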
