Re: Re: Re:Re: mm/mm_heap assertion error

2024-03-12 Thread Alan C. Assis
Please watch video #54 on the NuttX Channel, where I explained how to use it.

I think we are missing documentation about it here:
Documentation/applications/system/stackmonitor/index.rst

Best Regards,

Alan

On Tue, Mar 12, 2024 at 9:15 AM yfliu2008  wrote:

> Alan, thank you!
>
>
> Did you mean this SCHED_STACK_RECORD thing? I set it to 32 and can see
> things like below on the target:
>
>
> remote cat /proc/3/stack
> StackAlloc: 0x7092000
> StackBase:  0x7092050
> StackSize:  4016
> StackMax:   118042624
> Size  Backtrace
> StackUsed:  1552
>
>
>
>
> The "StackMax" above is 0x7093000 (118042624). But how can this work
> for short-lived threads like the "AppBringUp" thread?
>
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From: "Alan C. Assis" <acas...@gmail.com>
>
> Date: 2024/3/12 18:56
>
> To: "dev" <dev@nuttx.apache.org>
>
> Subject: Re: Re:Re: mm/mm_heap assertion error
>
>
> You can use the stack monitor to see the stack consumption.
>
> Best Regards,
>
> Alan
Re: Re:Re: mm/mm_heap assertion error

2024-03-12 Thread Nathan Hartman
One possibility is that the stack was too small before and overflowed.

Another possibility is that the stack size is OK but some code makes an
out-of-bounds write to an array on the stack.

Try Alan's suggestion to use the stack monitor; that will help you
understand whether something is wrong. (If it shows that the old stack
size was OK while we know corruption was happening, then we will know to
look for an out-of-bounds write.)

Nathan



Re: Re:Re: mm/mm_heap assertion error

2024-03-12 Thread Alan C. Assis
You can use the stack monitor to see the stack consumption.

Best Regards,

Alan

On Tue, Mar 12, 2024 at 7:38 AM yfliu2008  wrote:

> Dear experts,
>
>
>
> After enlarging the stack size of the "AppBringUp" thread, the remote
> node can boot NSH on RPMSGFS now. I am sorry for not trying this earlier. I
> was browsing "rpmsgfs.c" blindly and noticed a few auto variables
> defined on the stack... then I thought it might be worth a try, so I did it.
>
>
> Now I am still unclear about why a small stack leads to heap corruption.
> Also, how can we read this stack issue from the stackdump logs? Let me
> know if you have any hints.
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From: "yfliu2008" <yfliu2...@qq.com>
>
> Date: 2024/3/12 15:10
>
> To: "dev" <dev@nuttx.apache.org>
>
> Subject: Re:Re: mm/mm_heap assertion error
>
>
> Nathan,
>
>
> Here I disabled the RPMsg UART device initialization, but the crash still
> happens; I don't see other options to disable for now. On the other hand,
> if we choose not to mount NSH from RPMSGFS, it can boot smoothly, and
> after boot we can mount RPMSGFS manually to play with it.
>
>
>
>
> I uploaded the logs, call stacks and ELFs at
> https://github.com/yf13/hello/tree/debug-logs/nsh-rpmsgfs . There are two
> sets from two ELFs created from the same code base but with different
> DEBUG_xx configs. The crash happens earlier in the build with more debug
> options.
>
>
> Please let me know if you have any more suggestions.
>
>
> Regards,
> yf
>
>
>
>
> Original
>
>
>
> From: "Nathan Hartman" <hartman.nat...@gmail.com>
>
> Date: 2024/3/12 1:27
>
> To: "dev" <dev@nuttx.apache.org>
>
> Subject: Re: mm/mm_heap assertion error
>
>
> What's needed is some way to binary search where the culprit is.
>
> If I understand correctly, it looks like the crash is happening in the
> later stages of board bring-up? What is running before that? Can parts
> be disabled or skipped to see if the problem goes away?
>
> Another idea is to try running a static analysis tool on the sources
> and see if it finds anything suspicious to be looked into more
> carefully.
>
>
> On Mon, Mar 11, 2024 at 10:00 AM Gregory Nutt  wrote:
> 
>  The reason that the error is confusing is that the error probably did
>  not occur at the time of the assertion; it probably occurred much
>  earlier.
> 
>  In most crashes due to heap corruption there are two players: the
>  culprit and the victim threads.  The culprit thread actually causes the
>  corruption.  But at the time of the corruption, no error occurs.  The
>  error will not occur until later.
> 
>  So sometime later, the victim thread runs, encounters the clobbered
>  heap and crashes.  In this case, "AppBringup" and "rptun" are potential
>  victim threads.  The fact that they crash tells you very little about
>  the culprit.
> 
>  On 3/10/2024 6:51 PM, yfliu2008 wrote:
>   Gregory, thank you for the analysis.
>  
>  
>  
>  
>   The crashes happened during system boot-up, mostly in the
> "AppBringup" or "rptun" threads, per the assertion logs. The only other
> threads present are "idle" and "lpwork", per the sched logs; there should
> be no others, as NSH creation is still ongoing. As for interrupts, the
> UART and IPI are handled in kernel space and the MTIMER in NuttSBI
> space. NSH is loaded from an RPMSGFS volume, so there is a lot of
> RPMSG communication.
>  
>  
>  
>  
>   Is KASAN suitable for use in kernel mode?
>  
>  
>   With MM_KASAN_ALL it reports a read access error:
>  
>  
>  
>   kasan_report: kasan detected a read access error, address at
> 0x708fe90, size is 8, return address: 0x701aeac
>  
>   _assert: Assertion failed panic: at file: kasan/kasan.c:117
> task: Idle_Task process: Kernel 0x70023c0
>  
>  
>   The call stack looks like:
>  
>  
>   #0 _assert (filename=0x7060f78 "kasan/kasan.c", linenum=117, msg=0x7060ff0 "panic", regs=0x7082720
>   #2 0x070141d6 in kasan_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac
>   #3 0x07014412 in kasan_check_report (addr=0x708fe90, size=8, is_write=false, return_address=0x701aeac
>   #4 0x0701468c in __asan_load8_noabort (addr=0x708fe90) at kasan/kasan.c:315
>   #5 0x0701aeac in riscv_swint (irq=0, context=0x708fe40, arg=0x0) at common/riscv_swint.c:133
>   #6 0x0701b8fe in riscv_perform_syscall (regs=0x708fe40) at common/supervisor/riscv_perform_syscall.c:45
>   #7 0x07000570 in sys_call6 ()
>  
>  
>  
>   With MM_KASAN_DISABLE_READ_CHECKS=y, it reports:
>  
>  
>   _assert: Assertion failed : at file: mm_heap/mm_malloc.c:245
> task: rptun process: Kernel 0x704a030
>  
>  
>   The call stack is:
>  
>  
>   #0 _assert (filename=0x7056060 "mm_heap/mm_malloc.c", linenum=245, msg=0x0, regs=0x7082720
>   #2 0x07013082 in mm_malloc (heap=0x7089c00, size=128) at mm_heap/mm_malloc.c:245
>   #3 0x07011694 in kmm_malloc (size=128) at kmm_heap/kmm_malloc.c:51