Re: mm/mm_heap assertion error

2024-03-12 Thread Gregory Nutt

On 3/12/2024 5:12 AM, Nathan Hartman wrote:

Try Alan's suggestion to use stack monitor, and that will help understand
if there is something wrong. (If it shows that old stack size was OK, while
we know corruption was happening, then we will know to look for some out of
bound write.)
Does the stack monitor work in kernel mode?  I.e., the stack monitor runs 
in user space.  Is the kernel heap exposed to the applications?   If it 
is, that could be a security issue, couldn't it?

Re: mm/mm_heap assertion error

2024-03-12 Thread Gregory Nutt

After enlarging the stack size of the "AppBringUp" thread, the remote node can boot NSH 
on RPMSGFS now. I am sorry for not trying this earlier. I was browsing "rpmsgfs.c" 
blindly and noticed a few auto variables defined on the stack... then I thought it might be worth a try, 
so I did it.


That is good news!  I usually try increasing stack sizes first thing, 
because it is easy to do and stack overrun is by far the most common 
cause of memory corruption.


Is this a configuration option?  If not, it should be.
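For reference, the per-thread stack size is just the stack_size argument given when the 
thread is created, so "enlarging the stack" usually means raising either that argument or 
the CONFIG_* stack-size option that feeds it.  A minimal sketch (the worker, the priority 
and the 4096-byte size below are illustrative placeholders, not the board's actual values):

#include <nuttx/config.h>
#include <nuttx/kthread.h>

/* Hypothetical bring-up worker, shown only to illustrate where the stack
 * size is chosen.
 */

static int bringup_worker(int argc, FAR char *argv[])
{
  /* ... mount rpmsgfs, start NSH, etc. ... */

  return 0;
}

int start_bringup_thread(void)
{
  /* Enlarging the stack amounts to passing a bigger stack_size here (the
   * third argument) or raising the corresponding CONFIG_* option; 4096 is
   * only an illustrative number.
   */

  int pid = kthread_create("AppBringUp", 100, 4096, bringup_worker, NULL);
  return (pid < 0) ? pid : OK;
}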


Now I am still unclear about why a small stack leads to heap corruption. Also, how 
can we read this stack issue from the stackdump logs? Let me know if you have 
any hints.
For a kernel thread, the stack is allocated on the heap.  When you 
overrun the stack, the metadata at the end of the stack allocation may be 
clobbered.  More confusingly, the metadata or actual data of the 
preceding (victim) chunk may be corrupted.  Often the symptoms of 
the failure are even more obscure than these.
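To make that concrete, here is a small host-side C sketch (plain malloc()/free(), not the 
NuttX mm internals): the wild write happens silently in one place, and the failure, if 
any, only shows up later when a neighboring allocation is touched.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  uint8_t *stack  = malloc(1024);  /* stands in for a heap-allocated thread stack */
  uint8_t *victim = malloc(64);    /* a neighboring allocation                    */

  /* The culprit: write past the end of "stack".  The extra 32 bytes land in
   * the allocator's chunk metadata and/or in the victim chunk.  Nothing
   * fails at this point.
   */

  memset(stack, 0xaa, 1024 + 32);

  /* The victim: much later, when the clobbered metadata is read back, the
   * allocator may assert or abort, or the corruption may go completely
   * unnoticed, depending on the allocator.
   */

  free(victim);
  free(stack);
  printf("no failure here only means the corruption went undetected\n");
  return 0;
}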


Re: mm/mm_heap assertion error

2024-03-12 Thread Gregory Nutt

On 3/12/2024 1:10 AM, yfliu2008 wrote:

On the other hand, if we choose not mounting NSH from the RPMSGFS, it can boot 
smoothly and after boot we can manually mount the RPMSGFS for playing.
That sounds like an initialization sequencing problem.  Perhaps 
something is getting used before it has been initialized?




Re: Re: Re:Re: mm/mm_heap assertion error

2024-03-12 Thread Alan C. Assis
Please watch video #54 on the NuttX Channel, where I explained how to use it.

I think we are missing documentation about it here:
Documentation/applications/system/stackmonitor/index.rst

Best Regards,

Alan

On Tue, Mar 12, 2024 at 9:15 AM yfliu2008  wrote:

> Alan, thank you!
>
>
> did you mean this SCHED_STACK_RECORD thing? I set 32 for that and can see
> things like below on the target:
>
>
>  remote cat /proc/3/stack 
> StackAlloc: 0x7092000
>  StackBase: 0x7092050
>  StackSize: 4016
>  StackMax:  118042624
>  SizeBacktrace
>  StackUsed: 1552
>
>
>
>
> The "StackMax" above is 0x7093000 (118042624). But how can this work
> for the short-lived threads like "AppBringUp" thread?
>
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From:"Alan C. Assis"< acas...@gmail.com ;
>
> Date:2024/3/12 18:56
>
> To:"dev"< dev@nuttx.apache.org ;
>
> Subject:Re: Re:Re: mm/mm_heap assertion error
>
>
> You can use the stack monitor to see the stack consumption.
>
> Best Regards,
>
> Alan
>
> On Tue, Mar 12, 2024 at 7:38 AM yfliu2008  wrote:
>
>  Dear experts,
> 
> 
> 
>  After enlarging the stack size of "AppBringUp" thread, the
> remote
>  node can boot NSH on RPMSGFS now. I am sorry for not trying this
> earlier. I
>  was browsing the "rpmsgfs.c" blindly and noticed a few auto variables
>  defined in the stack... then I thought it might worth a try so I did
> it.
> 
> 
>  Now I am still unclear about why small stack leads to heap corruption?
>  Also how we can read this stack issue from stackdump logs? Let
> me
>  know if you have any hints.
> 
> 
>  Regards,
>  yf
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  Original
> 
> 
> 
>  From:"yfliu2008"< yfliu2...@qq.com ;
> 
>  Date:2024/3/12 15:10
> 
>  To:"dev"< dev@nuttx.apache.org ;
> 
>  Subject:Re:Re: mm/mm_heap assertion error
> 
> 
>  Nathan,
> 
> 
>  Here I disabled RPMsg UART device initialization but the crash still
>  happens, I don't see other options to disable for now. On the other
> hand,
>  if we choose not mounting NSH from the RPMSGFS, it can boot smoothly
> and
>  after boot we can manually mount the RPMSGFS for playing.
> 
> 
> 
> 
>  I uploaded the logs, callstacks and ELFs at
>  https://github.com/yf13/hello/tree/debug-logs/nsh-rpmsgfs . There
> are two
>  sets from two ELFs created from same code base but with different
> DEBUG _xx
>  configs. The crash happens earlier in the build with more debug
>  options.
> 
> 
>  Please let me know if you have any more suggestions.
> 
> 
>  Regards,
>  yf
> 
> 
> 
> 
>  Original
> 
> 
> 
>  From:"Nathan Hartman"< hartman.nat...@gmail.com ;
> 
>  Date:2024/3/12 1:27
> 
>  To:"dev"< dev@nuttx.apache.org ;
> 
>  Subject:Re: mm/mm_heap assertion error
> 
> 
>  What's needed is some way to binary search where the culprit is.
> 
>  If I understand correctly, it looks like the crash is happening in the
>  later stages of board bring-up? What is running before that? Can parts
>  be disabled or skipped to see if the problem goes away?
> 
>  Another idea is to try running a static analysis tool on the sources
>  and see if it finds anything suspicious to be looked into more
>  carefully.
> 
> 
>  On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt  wrote:
>  
>   The reason that the error is confusing is because the error
> probably
>  did
>   not occur at the time of the assertion; it probably occurred much
>  earlier.
>  
>   In most crashes due to heap corruption there are two players:
> the
>   culprit and the victim threads.  The culprit thread actually
> cause the
>   corruption.  But at the time of the corruption, no error
> occurs.  The
>   error will not occur until later.
>  
>   So sometime later, the victim thread runs, encounters the
> clobbered
>  heap
>   and crashes.  In this case, "AppBringup" and "rptun" are
> potential
>   victim threads.  The fact that they crash tell you very little
> about
>  the
>   culprit.
>  
>   On 3/10/2024 6:51 PM, yfliu2008 wrote:
>Gregory, thank you for the analysis.
>   
>   
>   
>   
>The crashes happened during system booting up, mostly at
>  "AppBringup" or "rptun" threads, as per the assertion logs. The other
>  threads existing are the "idle" and the "lpwork" threads as per the
> sched
>  logs. There should 

Re: Re:Re: mm/mm_heap assertion error

2024-03-12 Thread Nathan Hartman
One possibility is stack was too small before and overflowed.

Another possibility is that stack size is OK but some code makes an
out-of-bound write to an array on the stack.

Try Alan's suggestion to use stack monitor, and that will help understand
if there is something wrong. (If it shows that old stack size was OK, while
we know corruption was happening, then we will know to look for some out of
bound write.)
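The stack monitor can tell those two cases apart because it reports a high-water mark 
rather than the instantaneous usage.  A rough sketch of the idea behind that watermark 
(this is the stack-coloration concept behind CONFIG_STACK_COLORATION; the helper names 
below are illustrative, not NuttX's actual internals):

#include <stddef.h>
#include <stdint.h>

#define STACK_COLOR 0xdeadbeefu

/* At thread creation: pre-fill the whole stack area with a known pattern. */

static void stack_color(uint32_t *stackbase, size_t nwords)
{
  for (size_t i = 0; i < nwords; i++)
    {
      stackbase[i] = STACK_COLOR;
    }
}

/* Later, or at thread exit: the peak usage is everything above the lowest
 * word that still holds the pattern, so even a short-lived thread leaves a
 * readable footprint.  Assumes a stack that grows downward toward stackbase.
 */

static size_t stack_peak_usage(const uint32_t *stackbase, size_t nwords)
{
  size_t untouched = 0;

  while (untouched < nwords && stackbase[untouched] == STACK_COLOR)
    {
      untouched++;
    }

  return (nwords - untouched) * sizeof(uint32_t);
}

If the watermark stays well below StackSize while the corruption still occurs, an 
out-of-bound array write (rather than plain stack exhaustion) becomes the prime suspect.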

Nathan


On Tue, Mar 12, 2024 at 6:38 AM yfliu2008  wrote:

> Dear experts,
>
>
>
> After enlarging the stack size of "AppBringUp" thread, the remote
> node can boot NSH on RPMSGFS now. I am sorry for not trying this earlier. I
> was browsing the "rpmsgfs.c" blindly and noticed a few auto variables
> defined in the stack... then I thought it might worth a try so I did it.
>
>
> Now I am still unclear about why small stack leads to heap corruption?
> Also how we can read this stack issue from stackdump logs? Let me
> know if you have any hints.
>
>
> Regards,
> yf
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Original
>
>
>
> From:"yfliu2008"< yfliu2...@qq.com ;
>
> Date:2024/3/12 15:10
>
> To:"dev"< dev@nuttx.apache.org ;
>
> Subject:Re:Re: mm/mm_heap assertion error
>
>
> Nathan,
>
>
> Here I disabled RPMsg UART device initialization but the crash still
> happens, I don't see other options to disable for now. On the other hand,
> if we choose not mounting NSH from the RPMSGFS, it can boot smoothly and
> after boot we can manually mount the RPMSGFS for playing.
>
>
>
>
> I uploaded the logs, callstacks and ELFs at
> https://github.com/yf13/hello/tree/debug-logs/nsh-rpmsgfs . There are two
> sets from two ELFs created from same code base but with different DEBUG _xx
> configs. The crash happens earlier in the build with more debug
> options.
>
>
> Please let me know if you have any more suggestions.
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From:"Nathan Hartman"< hartman.nat...@gmail.com ;
>
> Date:2024/3/12 1:27
>
> To:"dev"< dev@nuttx.apache.org ;
>
> Subject:Re: mm/mm_heap assertion error
>
>
> What's needed is some way to binary search where the culprit is.
>
> If I understand correctly, it looks like the crash is happening in the
> later stages of board bring-up? What is running before that? Can parts
> be disabled or skipped to see if the problem goes away?
>
> Another idea is to try running a static analysis tool on the sources
> and see if it finds anything suspicious to be looked into more
> carefully.
>
>
> On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt  wrote:
> 
>  The reason that the error is confusing is because the error probably
> did
>  not occur at the time of the assertion; it probably occurred much
> earlier.
> 
>  In most crashes due to heap corruption there are two players:  the
>  culprit and the victim threads.  The culprit thread actually cause the
>  corruption.  But at the time of the corruption, no error occurs.  The
>  error will not occur until later.
> 
>  So sometime later, the victim thread runs, encounters the clobbered
> heap
>  and crashes.  In this case, "AppBringup" and "rptun" are potential
>  victim threads.  The fact that they crash tell you very little about
> the
>  culprit.
> 
>  On 3/10/2024 6:51 PM, yfliu2008 wrote:
>   Gregory, thank you for the analysis.
>  
>  
>  
>  
>   The crashes happened during system booting up, mostly at
> "AppBringup" or "rptun" threads, as per the assertion logs. The other
> threads existing are the "idle" and the "lpwork" threads as per the sched
> logs. There should be no other threads as NSH creation is still
> ongoing. As for interruptions, the UART and IPI are running in kernel
> space and MTIMER are in NuttSBI space. The NSH is loaded from a
> RPMSGFS volume, thus there are a lot RPMSG communications.
>  
>  
>  
>  
>   Is the KASAN proper for use in Kernel mode?
>  
>  
>   With MM_KASAN_ALL it reports a read access error:
>  
>  
>  
>   BCkasan_report: kasan detected a read access error, address at
> 0x708fe90,size is 8, return address: 0x701aeac
>  
>   _assert: Assertion failed panic: at file: kasan/kasan.c:117
> task: Idle_Task process: Kernel 0x70023c0
>  
>  
>   The call stack looks like:
>  
>  
>   #0 _assert (filename=0x7060f78 "kasan/kasan.c",
> linenum=117, msg=0x7060ff0 "panic", regs=0x7082720   #2
> 0x070141d6 in kasan_report (addr=0x708fe90, size=8,
> is_write=false, return_address=0x7

Re: Re:Re: mm/mm_heap assertion error

2024-03-12 Thread Alan C. Assis
You can use the stack monitor to see the stack consumption.

Best Regards,

Alan

On Tue, Mar 12, 2024 at 7:38 AM yfliu2008  wrote:

> Dear experts,
>
>
>
> After enlarging the stack size of "AppBringUp" thread, the remote
> node can boot NSH on RPMSGFS now. I am sorry for not trying this earlier. I
> was browsing the "rpmsgfs.c" blindly and noticed a few auto variables
> defined in the stack... then I thought it might worth a try so I did it.
>
>
> Now I am still unclear about why small stack leads to heap corruption?
> Also how we can read this stack issue from stackdump logs? Let me
> know if you have any hints.
>
>
> Regards,
> yf
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Original
>
>
>
> From:"yfliu2008"< yfliu2...@qq.com ;
>
> Date:2024/3/12 15:10
>
> To:"dev"< dev@nuttx.apache.org ;
>
> Subject:Re:Re: mm/mm_heap assertion error
>
>
> Nathan,
>
>
> Here I disabled RPMsg UART device initialization but the crash still
> happens, I don't see other options to disable for now. On the other hand,
> if we choose not mounting NSH from the RPMSGFS, it can boot smoothly and
> after boot we can manually mount the RPMSGFS for playing.
>
>
>
>
> I uploaded the logs, callstacks and ELFs at
> https://github.com/yf13/hello/tree/debug-logs/nsh-rpmsgfs . There are two
> sets from two ELFs created from same code base but with different DEBUG _xx
> configs. The crash happens earlier in the build with more debug
> options.
>
>
> Please let me know if you have any more suggestions.
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From:"Nathan Hartman"< hartman.nat...@gmail.com ;
>
> Date:2024/3/12 1:27
>
> To:"dev"< dev@nuttx.apache.org ;
>
> Subject:Re: mm/mm_heap assertion error
>
>
> What's needed is some way to binary search where the culprit is.
>
> If I understand correctly, it looks like the crash is happening in the
> later stages of board bring-up? What is running before that? Can parts
> be disabled or skipped to see if the problem goes away?
>
> Another idea is to try running a static analysis tool on the sources
> and see if it finds anything suspicious to be looked into more
> carefully.
>
>
> On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt  wrote:
> 
>  The reason that the error is confusing is because the error probably
> did
>  not occur at the time of the assertion; it probably occurred much
> earlier.
> 
>  In most crashes due to heap corruption there are two players:  the
>  culprit and the victim threads.  The culprit thread actually cause the
>  corruption.  But at the time of the corruption, no error occurs.  The
>  error will not occur until later.
> 
>  So sometime later, the victim thread runs, encounters the clobbered
> heap
>  and crashes.  In this case, "AppBringup" and "rptun" are potential
>  victim threads.  The fact that they crash tell you very little about
> the
>  culprit.
> 
>  On 3/10/2024 6:51 PM, yfliu2008 wrote:
>   Gregory, thank you for the analysis.
>  
>  
>  
>  
>   The crashes happened during system booting up, mostly at
> "AppBringup" or "rptun" threads, as per the assertion logs. The other
> threads existing are the "idle" and the "lpwork" threads as per the sched
> logs. There should be no other threads as NSH creation is still
> ongoing. As for interruptions, the UART and IPI are running in kernel
> space and MTIMER are in NuttSBI space. The NSH is loaded from a
> RPMSGFS volume, thus there are a lot RPMSG communications.
>  
>  
>  
>  
>   Is the KASAN proper for use in Kernel mode?
>  
>  
>   With MM_KASAN_ALL it reports a read access error:
>  
>  
>  
>   BCkasan_report: kasan detected a read access error, address at
> 0x708fe90,size is 8, return address: 0x701aeac
>  
>   _assert: Assertion failed panic: at file: kasan/kasan.c:117
> task: Idle_Task process: Kernel 0x70023c0
>  
>  
>   The call stack looks like:
>  
>  
>   #0 _assert (filename=0x7060f78 "kasan/kasan.c",
> linenum=117, msg=0x7060ff0 "panic", regs=0x7082720   #2
> 0x070141d6 in kasan_report (addr=0x708fe90, size=8,
> is_write=false, return_address=0x701aeac   #3
> 0x07014412 in kasan_check_report (addr=0x708fe90, size=8,
> is_write=false, return_address=0x701aeac   #4
> 0x0701468c in __asan_load8_noabort (addr=0x708fe90) at
> kasan/kasan.c:315
>   #5 0x0701aeac in riscv_swint (irq=0,
> context=0x708fe40, arg=0x0) at common/riscv_swint.c:133
>   #6 0x

Re: mm/mm_heap assertion error

2024-03-11 Thread Gregory Nutt
ed there is a "mm_checkcorruption()" function, not sure

how to use it yet.



Regards,
yf





Original



From:"Gregory Nutt"< spudan...@gmail.com ;

Date:2024/3/11 1:43

To:"dev"< dev@nuttx.apache.org ;

Subject:Re: mm/mm_heap assertion error


On 3/10/2024 4:38 AM, yfliu2008 wrote:
 Dear experts,




 When doing regression check on K230 with a previously working
Kernel mode configuration, I got assertion error like below:


 #0 _assert (filename=0x704c598 "mm_heap/mm_malloc.c", linenum=245, msg=0x0,regs=0x7082730  #2 0x070110f0 in mm_malloc (heap=0x7089c00, size=112) at mm_heap/mm_malloc.c:245
 #3 0x0700fd74 in kmm_malloc (size=112) at kmm_heap/kmm_malloc.c:51
 #4 0x07028d4e in elf_loadphdrs (loadinfo=0x7090550) at libelf/libelf_sections.c:207
 #5 0x07028b0c in elf_load (loadinfo=0x7090550) at libelf/libelf_load.c:337
 #6 0x070278aa in elf_loadbinary (binp=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at elf.c:257
 #7 0x070293ea in load_absmodule (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:115
 #8 0x07029504 in load_module (bin=0x708f5d0, filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:219
 #9 0x07027674 in exec_internal (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788, spawn=true) at binfmt_exec.c:98
 #10 0x0702779c in exec_spawn (filename=0x704bca8 "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, attr=0x7090788) at binfmt_exec.c:220
 #11 0x0700299e in nx_start_application () at init/nx_bringup.c:375
 #12 0x070029f0 in nx_start_task (argc=1, argv=0x7090010) at init/nx_bringup.c:403
 #13 0x07003f84 in nxtask_start () at task/task_start.c:107



 It looks like mm/mm_heap data structure consistency was broken.

As I am unfamilar with these internals, I am looking forward to any
hints about how to find the root cause.








 Regards,

 yf

This does indicate heap corruption:

   240   /* Node next must be alloced, otherwise it should be merged.
   241    * Its prenode(the founded node) must be free and preceding should
   242    * match with nodesize.
   243    */
   244
   245   DEBUGASSERT(MM_NODE_IS_ALLOC(next) && MM_PREVNODE_IS_FREE(next) &&
   246               next->preceding == nodesize);

Heap corruption normally occurs when there is a wild write outside of
the allocated memory region. These kinds of wild writes may clobber
some other thread's data and directly or indirectly clobber the heap
metadata. Trying to traverse the damaged heap metadata is probably
the root cause of the problem.

Only a kernel thread or interrupt handler could damage the heap.

The cause of this corruption can be really difficult to find because the
reported error does not occur when the heap is damaged but may not
manifest itself until sometime later.

It is unlikely that anyone will be able to solve this by just talking
about it. It might be worth increasing some kernel thread stack sizes
just to eliminate that common cause.


Re: mm/mm_heap assertion error

2024-03-11 Thread Simon Filgis
ename=0x7056060 "mm_heap/mm_malloc.c",
> linenum=245, msg=0x0, regs=0x7082720  misc/assert.c:536#1 0x0700df18 in __assert
> (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, msg=0x0) at
> assert/lib_assert.c:36
> >>> #2 0x07013082 in mm_malloc (heap=0x7089c00, size=128) at
> mm_heap/mm_malloc.c:245
> >>> #3 0x07011694 in kmm_malloc (size=128) at
> kmm_heap/kmm_malloc.c:51
> >>> #4 0x0704efd4 in metal_allocate_memory (size=128) at
> .../nuttx/include/metal/system/nuttx/alloc.h:27
> >>> #5 0x0704fd8a in rproc_virtio_create_vdev (role=1,
> notifyid=0,
> >>>   rsc=0x80200050, rsc_io=0x7080408  priv=0x708ecd8,
> >>>   notify=0x704e6d2  rst_cb=0x0)
> >>>   at open-amp/lib/remoteproc/remoteproc_virtio.c:356
> >>> #6 0x0704e956 in remoteproc_create_virtio
> (rproc=0x708ecd8,
> >>>   vdev_id=0, role=1, rst_cb=0x0) at
> open-amp/lib/remoteproc/remoteproc.c:957
> >>> #7 0x0704b1ee in rptun_dev_start (rproc=0x708ecd8)
> >>>   at rptun/rptun.c:757
> >>> #8 0x07049ff8 in rptun_start_worker (arg=0x708eac0)
> >>>   at rptun/rptun.c:233
> >>> #9 0x0704a0ac in rptun_thread (argc=3, argv=0x7092010)
> >>>   at rptun/rptun.c:253
> >>> #10 0x0700437e in nxtask_start () at task/task_start.c:107
> >>>
> >>>
> >>> This looks like already corrupted.
> >>>
> >>>
> >>>
> >>> I also noticed there is a "mm_checkcorruption()" function, not sure
> how to use it yet.
> >>>
> >>>
> >>>
> >>> Regards,
> >>> yf
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Original
> >>>
> >>>
> >>>
> >>> From:"Gregory Nutt"< spudan...@gmail.com ;
> >>>
> >>> Date:2024/3/11 1:43
> >>>
> >>> To:"dev"< dev@nuttx.apache.org ;
> >>>
> >>> Subject:Re: mm/mm_heap assertion error
> >>>
> >>>
> >>> On 3/10/2024 4:38 AM, yfliu2008 wrote:
> >>>  Dear experts,
> >>> 
> >>> 
> >>> 
> >>> 
> >>>  When doing regression check on K230 with a previously working
> Kernel mode configuration, I got assertion error like below:
> >>> 
> >>> 
> >>> 
> >>>  #0 _assert (filename=0x704c598 "mm_heap/mm_malloc.c",
> linenum=245, msg=0x0,regs=0x7082730  #2 0x070110f0 in
> mm_malloc (heap=0x7089c00, size=112) at mm_heap/mm_malloc.c:245
> >>>  #3 0x0700fd74 in kmm_malloc (size=112) at
> kmm_heap/kmm_malloc.c:51
> >>>  #4 0x07028d4e in elf_loadphdrs (loadinfo=0x7090550)
> at libelf/libelf_sections.c:207
> >>>  #5 0x07028b0c in elf_load
> (loadinfo=0x7090550) at libelf/libelf_load.c:337
> >>>  #6 0x070278aa in elf_loadbinary (binp=0x708f5d0,
> filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at elf.c:257
> >>>  #7 0x070293ea in load_absmodule (bin=0x708f5d0,
> filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at
> binfmt_loadmodule.c:115
> >>>  #8 0x07029504 in load_module (bin=0x708f5d0,
> filename=0x704bca8 "/system/bin/init", exports=0x0, nexports=0) at
> binfmt_loadmodule.c:219
> >>>  #9 0x07027674 in exec_internal (filename=0x704bca8
> "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0,
> actions=0x0, attr=0x7090788, spawn=true) at binfmt_exec.c:98
> >>>  #10 0x0702779c in exec_spawn (filename=0x704bca8
> "/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0,
> actions=0x0, attr=0x7090788) at binfmt_exec.c:220
> >>>  #11 0x0700299e in nx_start_application () at
> init/nx_bringup.c:375
> >>>  #12 0x070029f0 in nx_start_task (argc=1, argv=0x7090010)
> at init/nx_bringup.c:403
> >>>  #13 0x07003f84 in nxtask_start () at task/task_start.c:107
> >>> 
> >>> 
> >>> 
> >>>  It looks like mm/mm_heap data structure consistency was broken.
> As I am unfamilar with these internals, I am looking forward to any
> hints about how to find the root cause.
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>>  Regards,
> >>> 
> >>>  yf
> >>>
> >>> This does indicate heap corruption:
> >>>
> >>>   240   /* Node next must be alloced, otherwise it should be merged.
> >>>   241    * Its prenode(the founded node) must be free and preceding should
> >>>   242    * match with nodesize.
> >>>   243    */
> >>>   244
> >>>   245   DEBUGASSERT(MM_NODE_IS_ALLOC(next) && MM_PREVNODE_IS_FREE(next) &&
> >>>   246               next->preceding == nodesize);
> >>>
> >>> Heap corruption normally occurs when there is a wild write outside of
> >>> the allocated memory region. These kinds of wild writes may clobber
> >>> some other thread's data and directly or indirectly clobber the heap
> >>> metadata. Trying to traverse the damaged heap metadata is probably
> >>> the root cause of the problem.
> >>>
> >>> Only a kernel thread or interrupt handler could damage the heap.
> >>>
> >>> The cause of this corruption can be really difficult to find because the
> >>> reported error does not occur when the heap is damaged but may not
> >>> manifest itself until sometime later.
> >>>
> >>> It is unlikely that anyone will be able to solve this by just talking
> >>> about it. It might be worth increasing some kernel thread stack sizes
> >>> just to eliminate that common cause.
> >>
>


Re: mm/mm_heap assertion error

2024-03-11 Thread Gregory Nutt
If the memory location that is corrupted is consistent, then you can 
monitor that location to find the culprit (perhaps using debug output).  
If your debugger supports it, then setting a watchpoint could also 
trigger a break when the corruption occurs.
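A software-only variant of that watchpoint is to poll the suspect address from as many 
places as practical (scheduler hooks, driver entry points, before and after suspect 
subsystems).  A minimal sketch, assuming you already know which address keeps getting 
clobbered; 0x708fe90 below is only a placeholder taken from the earlier KASAN report, 
and in NuttX the report would normally be a DEBUGASSERT() or syslog() rather than 
printf():

#include <stdint.h>
#include <stdio.h>

static const volatile uint32_t *g_watch =
  (const volatile uint32_t *)(uintptr_t)0x708fe90;   /* placeholder address */
static uint32_t g_expected;

/* Arm the check while the location is still known to be good. */

void corruption_watch_arm(void)
{
  g_expected = *g_watch;
}

/* The first call that trips brackets the culprit far more tightly than the
 * eventual mm_malloc() assertion does.
 */

void corruption_watch_check(const char *where)
{
  if (*g_watch != g_expected)
    {
      printf("monitored location clobbered before: %s\n", where);
      for (; ; );   /* stop here so the culprit's context is still intact */
    }
}

With a debugger attached, a hardware watchpoint on the same address (for example 
"watch *(uint32_t *)0x708fe90" in GDB) breaks exactly at the wild write.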


Maybe you can also try disabling features until you find the feature 
logic that is corrupting the heap.  There is no easy way to accomplish this.
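The "mm_checkcorruption()" function mentioned earlier fits this narrowing-down strategy 
as well: a consistency walk of the kernel heap placed between bring-up steps shows which 
step does the damage.  A minimal sketch, assuming the kernel-heap wrapper 
kmm_checkcorruption() is available in this configuration, and using a hypothetical 
helper that the bring-up code would call (check mm/ in your tree for the exact names 
and behavior):

#include <nuttx/config.h>
#include <nuttx/mm/mm.h>
#include <debug.h>

/* Hypothetical helper: run one bring-up step, then walk the kernel heap.
 * The first step after which the walk complains is the one that contains
 * (or immediately precedes) the wild write.
 */

static void bringup_step_checked(FAR const char *name, CODE void (*step)(void))
{
  step();

  _info("heap check after %s\n", name);
  kmm_checkcorruption();   /* expected to assert if the heap metadata is inconsistent */
}

Interleaving such checks with everything that runs before the crash (rptun start, the 
rpmsgfs mount, the ELF load of init) is effectively the binary search suggested above.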


On 3/11/2024 11:27 AM, Nathan Hartman wrote:

What's needed is some way to binary search where the culprit is.

If I understand correctly, it looks like the crash is happening in the
later stages of board bring-up? What is running before that? Can parts
be disabled or skipped to see if the problem goes away?

Another idea is to try running a static analysis tool on the sources
and see if it finds anything suspicious to be looked into more
carefully.


On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt  wrote:

The reason that the error is confusing is because the error probably did
not occur at the time of the assertion; it probably occurred much earlier.

In most crashes due to heap corruption there are two players:  the
culprit and the victim threads.  The culprit thread actually cause the
corruption.  But at the time of the corruption, no error occurs.  The
error will not occur until later.

So sometime later, the victim thread runs, encounters the clobbered heap
and crashes.  In this case, "AppBringup" and "rptun" are potential
victim threads.  The fact that they crash tell you very little about the
culprit.

On 3/10/2024 6:51 PM, yfliu2008 wrote:

Gregory, thank you for the analysis.




The crashes happened during system booting up, mostly at "AppBringup" or "rptun" threads, as per the 
assertion logs. The other threads existing are the "idle" and the "lpwork" threads as per the sched logs. 
There should be no other threads as NSH creation is still ongoing. As for interruptions, the UART and IPI are running 
in kernel space and MTIMER are in NuttSBI space. The NSH is loaded from a RPMSGFS volume, thus there are a lot RPMSG 
communications.




Is the KASAN proper for use in Kernel mode?


With MM_KASAN_ALL it reports a read access error:



kasan_report: kasan detected a read access error, address at 0x708fe90,size 
is 8, return address: 0x701aeac

_assert: Assertion failed panic: at file: kasan/kasan.c:117 task: Idle_Task 
process: Kernel 0x70023c0


The call stack looks like:


#0 _assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic", regs=0x7082720 




Re: mm/mm_heap assertion error

2024-03-11 Thread Nathan Hartman
What's needed is some way to binary search where the culprit is.

If I understand correctly, it looks like the crash is happening in the
later stages of board bring-up? What is running before that? Can parts
be disabled or skipped to see if the problem goes away?

Another idea is to try running a static analysis tool on the sources
and see if it finds anything suspicious to be looked into more
carefully.


On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt  wrote:
>
> The reason that the error is confusing is because the error probably did
> not occur at the time of the assertion; it probably occurred much earlier.
>
> In most crashes due to heap corruption there are two players:  the
> culprit and the victim threads.  The culprit thread actually cause the
> corruption.  But at the time of the corruption, no error occurs.  The
> error will not occur until later.
>
> So sometime later, the victim thread runs, encounters the clobbered heap
> and crashes.  In this case, "AppBringup" and "rptun" are potential
> victim threads.  The fact that they crash tell you very little about the
> culprit.
>
> On 3/10/2024 6:51 PM, yfliu2008 wrote:
> > Gregory, thank you for the analysis.
> >
> >
> >
> >
> > The crashes happened during system booting up, mostly at "AppBringup" or 
> > "rptun" threads, as per the assertion logs. The other threads existing are 
> > the "idle" and the "lpwork" threads as per the sched logs. There should be 
> > no other threads as NSH creation is still ongoing. As for 
> > interruptions, the UART and IPI are running in kernel space and MTIMER are 
> > in NuttSBI space. The NSH is loaded from a RPMSGFS volume, thus there 
> > are a lot RPMSG communications.
> >
> >
> >
> >
> > Is the KASAN proper for use in Kernel mode?
> >
> >
> > With MM_KASAN_ALL it reports a read access error:
> >
> >
> >
> > BCkasan_report: kasan detected a read access error, address at 
> > 0x708fe90,size is 8, return address: 0x701aeac
> >
> > _assert: Assertion failed panic: at file: kasan/kasan.c:117 task: Idle_Task 
> > process: Kernel 0x70023c0
> >
> >
> > The call stack looks like:
> >
> >
> > #0 _assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, 
> > msg=0x7060ff0 "panic", regs=0x7082720  > misc/assert.c:536#1 0x07010248 in __assert 
> > (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic") at 
> > assert/lib_assert.c:36
> > #2 0x070141d6 in kasan_report (addr=0x708fe90, size=8, 
> > is_write=false, return_address=0x701aeac  > kasan/kasan.c:117
> > #3 0x07014412 in kasan_check_report (addr=0x708fe90, size=8, 
> > is_write=false, return_address=0x701aeac  > kasan/kasan.c:190
> > #4 0x0701468c in __asan_load8_noabort (addr=0x708fe90) at 
> > kasan/kasan.c:315
> > #5 0x0701aeac in riscv_swint (irq=0, context=0x708fe40, 
> > arg=0x0) at common/riscv_swint.c:133
> > #6 0x0701b8fe in riscv_perform_syscall (regs=0x708fe40) at 
> > common/supervisor/riscv_perform_syscall.c:45
> > #7 0x07000570 in sys_call6 ()
> >
> >
> >
> > With MM_KASAN_DISABLE_READ_CHECKS=y, it reports:
> >
> >
> > _assert: Assertion failed : at file: mm_heap/mm_malloc.c:245 task: rptun 
> > process: Kernel 0x704a030
> >
> >
> > The call stack is:
> >
> >
> > #0 _assert (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, 
> > msg=0x0, regs=0x7082720  > 0x0700df18 in __assert (filename=0x7056060 
> > "mm_heap/mm_malloc.c", linenum=245, msg=0x0) at assert/lib_assert.c:36
> > #2 0x07013082 in mm_malloc (heap=0x7089c00, size=128) at 
> > mm_heap/mm_malloc.c:245
> > #3 0x07011694 in kmm_malloc (size=128) at 
> > kmm_heap/kmm_malloc.c:51
> > #4 0x0704efd4 in metal_allocate_memory (size=128) at 
> > .../nuttx/include/metal/system/nuttx/alloc.h:27
> > #5 0x0704fd8a in rproc_virtio_create_vdev (role=1, notifyid=0,
> >   rsc=0x80200050, rsc_io=0x7080408  > priv=0x708ecd8,
> >   notify=0x704e6d2  >   at open-amp/lib/remoteproc/remoteproc_virtio.c:356
> > #6 0x0704e956 in remoteproc_create_virtio (rproc=0x708ecd8,
> >   vdev_id=0, role=1, rst_cb=0x0) at 
> > open-amp/lib/remoteproc/remoteproc.c:957
> > #7 0x0704b1ee in rptun_dev_start (rproc=0x708ecd8)
> >   at rptun/rptun.c:757
> > #8 0x07049ff8 in rptun_start_worker (arg=0x708eac0)
> >   at rptun/rptun.c:233

Re: mm/mm_heap assertion error

2024-03-11 Thread Gregory Nutt
The reason that the error is confusing is because the error probably did 
not occur at the time of the assertion; it probably occurred much earlier.


In most crashes due to heap corruption there are two players:  the 
culprit and the victim threads.  The culprit thread actually causes the 
corruption.  But at the time of the corruption, no error occurs.  The 
error will not occur until later.


So sometime later, the victim thread runs, encounters the clobbered heap 
and crashes.  In this case, "AppBringup" and "rptun" are potential 
victim threads.  The fact that they crash tells you very little about the 
culprit.


On 3/10/2024 6:51 PM, yfliu2008 wrote:

Gregory, thank you for the analysis.




The crashes happened during system booting up, mostly at "AppBringup" or "rptun" threads, as per the 
assertion logs. The other threads existing are the "idle" and the "lpwork" threads as per the sched logs. 
There should be no other threads as NSH creation is still ongoing. As for interruptions, the UART and IPI are running 
in kernel space and MTIMER are in NuttSBI space. The NSH is loaded from a RPMSGFS volume, thus there are a lot RPMSG 
communications.




Is the KASAN proper for use in Kernel mode?


With MM_KASAN_ALL it reports a read access error:



kasan_report: kasan detected a read access error, address at 0x708fe90,size 
is 8, return address: 0x701aeac

_assert: Assertion failed panic: at file: kasan/kasan.c:117 task: Idle_Task 
process: Kernel 0x70023c0


The call stack looks like:


#0 _assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic", regs=0x7082720 

Original

  


From:"Gregory Nutt"< spudan...@gmail.com ;

Date:2024/3/11 1:43

To:"dev"< dev@nuttx.apache.org ;

Subject:Re: mm/mm_heap assertion error


On 3/10/2024 4:38 AM, yfliu2008 wrote:
 Dear experts,




 When doing regression check on K230 with a previously working Kernel mode 
configuration, I got assertion error like below:



 #0 _assert (filename=0x704c598 "mm_heap/mm_malloc.c", linenum=245, 
msg=0x0,regs=0x7082730  #2 0x070110f0 in mm_malloc (heap=0x7089c00, size=112) 
at mm_heap/mm_malloc.c:245
 #3 0x0700fd74 in kmm_malloc (size=112) at 
kmm_heap/kmm_malloc.c:51
 #4 0x07028d4e in elf_loadphdrs (loadinfo=0x7090550) at 
libelf/libelf_sections.c:207
 #5 0x07028b0c in elf_load (loadinfo=0x7090550) at 
libelf/libelf_load.c:337
 #6 0x070278aa in elf_loadbinary (binp=0x708f5d0, filename=0x704bca8 
"/system/bin/init", exports=0x0, nexports=0) at elf.c:257
 #7 0x070293ea in load_absmodule (bin=0x708f5d0, filename=0x704bca8 
"/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:115
 #8 0x07029504 in load_module (bin=0x708f5d0, filename=0x704bca8 
"/system/bin/init", exports=0x0, nexports=0) at binfmt_loadmodule.c:219
 #9 0x07027674 in exec_internal (filename=0x704bca8 
"/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, 
attr=0x7090788, spawn=true) at binfmt_exec.c:98
 #10 0x0702779c in exec_spawn (filename=0x704bca8 
"/system/bin/init", argv=0x70907a0, envp=0x0, exports=0x0, nexports=0, actions=0x0, 
attr=0x7090788) at binfmt_exec.c:220
 #11 0x0700299e in nx_start_application () at init/nx_bringup.c:375
 #12 0x070029f0 in nx_start_task (argc=1, argv=0x7090010) at 
init/nx_bringup.c:403
 #13 0x07003f84 in nxtask_start () at task/task_start.c:107



 It looks like mm/mm_heap data structure consistency was broken. As I am 
unfamilar with these internals, I am looking forward to any hints about how 
to find the root cause.







 Regards,

 yf

This does indicate heap corruption:

 240   /* Node next must be alloced, otherwise it should be merged.
 241    * Its prenode(the founded node) must be free and preceding should
 242    * match with nodesize.
 243    */
 244
 245   DEBUGASSERT(MM_NODE_IS_ALLOC(next) && MM_PREVNODE_IS_FREE(next) &&
 246               next->preceding == nodesize);

Heap corruption normally occurs when there is a wild write outside of
the allocated memory region. These kinds of wild writes may clobber
some other thread's data and directly or indirectly clobber the heap
metadata. Trying to traverse the damaged heap metadata is probably
the root cause of the problem.

Only a kernel thread or interrupt handler could damage the heap.

The cause of this corruption can be really difficult to find because the
reported error does not occur when the heap is damaged but may not
manifest itself until sometime later.

It is unlikely that anyone will be able to solve this by just talking
about it. It might be worth increasing some kernel thread stack sizes
just to eliminate that common cause.





Re: mm/mm_heap assertion error

2024-03-10 Thread Gregory Nutt

On 3/10/2024 4:38 AM, yfliu2008 wrote:

Dear experts,




When doing regression check on K230 with a previously working Kernel mode 
configuration, I got assertion error like below:



#0 _assert (filename=0x704c598 "mm_heap/mm_malloc.c", linenum=245, msg=0x0,regs=0x7082730 


This does indicate heap corruption:

   240   /* Node next must be alloced, otherwise it should be merged.
   241    * Its prenode(the founded node) must be free and preceding should
   242    * match with nodesize.
   243    */
   244
   245   DEBUGASSERT(MM_NODE_IS_ALLOC(next) && MM_PREVNODE_IS_FREE(next) &&
   246               next->preceding == nodesize);

Heap corruption normally occurs when there is a wild write outside of 
the allocated memory region.  These kinds of wild writes may clobber 
some other thread's data and directly or indirectly clobber the heap 
metadata.  Trying to traverse the damaged heap metadata is probably 
the root cause of the problem.


Only a kernel thread or interrupt handler could damage the heap.

The cause of this corruption can be really difficult to find because the 
reported error does not occur when the heap is damaged but may not 
manifest itself until sometime later.


It is unlikely that anyone will be able to solve this by just talking 
about it.  It might be worth increasing some kernel thread stack sizes 
just to eliminate that common cause.