Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Nathan Hartman
On Tue, Apr 4, 2023 at 5:53 PM Gregory Nutt  wrote:

>
> On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote:
> > I drafted this change here:
> >
> > https://github.com/apache/nuttx/pull/8965
> >
> > Have a look, please.



PR 8965 looks good to me. In fact I had the exact same garbage access
problem with some in-house software (not related to NuttX) and the solution
was basically exactly the same.

Good catch on the error and the fix!

Cheers,
Nathan


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



On 4/4/2023 6:53 PM, Petro Karashchenko wrote:

Just posted a comment in GitHub, but duplicating it here for visibility.
The change might not be needed with
https://github.com/apache/nuttx/pull/8951 as it removes access to pid hash
table during sem_post()
The change is good in any event.  It is not a good idea to leave 
g_pidhash invalid across function calls.


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Petro Karashchenko
Just posted a comment in GitHub, but duplicating it here for visibility.
The change might not be needed with
https://github.com/apache/nuttx/pull/8951 as it removes access to pid hash
table during sem_post()

Best regards,
Petro

On Wed, Apr 5, 2023, 12:53 AM Gregory Nutt  wrote:

>
> On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote:
> > I drafted this change here:
> >
> > https://github.com/apache/nuttx/pull/8965
> >
> > Have a look, please.
> >
> > Although, I don't know if there will be any other places to be fixed like
> > that?
> This is a unique problem since kmm_free is the only thing that was
> called in the unhealthy state.  I think this is a pretty robust solution
> and I can't think of any broader implications.
> > Or if this is the best approach to this issue. This was just a quick
> > experiment.
> I think it is pretty clean.  It is really just an ordering problem, so
> changing the ordering seems like a good solution.
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote:

I drafted this change here:

https://github.com/apache/nuttx/pull/8965

Have a look, please.

Although, I don't know if there will be any other places to be fixed like
that?
This is a unique problem since kmm_free is the only thing that was 
called in the unhealthy state.  I think this is a pretty robust solution 
and I can't think of any broader implications.

Or if this is the best approach to this issue. This was just a quick
experiment.
I think it is pretty clean.  It is really just an ordering problem, so 
changing the ordering seems like a good solution.


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Fotis Panagiotopoulos
I drafted this change here:

https://github.com/apache/nuttx/pull/8965

Have a look, please.

Although, I don't know if there will be any other places to be fixed like
that?
Or if this is the best approach to this issue. This was just a quick
experiment.

On Wed, Apr 5, 2023 at 12:34 AM Gregory Nutt  wrote:

>
> > Sorry, maybe it was a bad example.
> >
> > Here is a much more minimal config that you can run directly.
> > https://pastebin.com/x775E7iF
> >
> > For me, it crashes almost immediately after starting.
> It actually calls nxsem_checkholder() many, many times before it finally
> triggers the KAZAN failure.
> > Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
> > along.
>
> That just removes the call to nxsched_get_tcb() that triggers the KAZAN
> failure.  It is really not otherwise related to the failure as far as I
> can tell.
>
> It appears that that the change generates the right condition to produce
> a failure from some existing problem.  But I might be missing
> something.  Maybe you or someone on the list can add some insight.
>
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



On 4/4/2023 3:32 PM, Fotis Panagiotopoulos wrote:

I am trying again on my original setup (not the simplified defconfig that I
provided later).

I can now see how this is related to Greg's commit.

Here is the stack trace at the time of the error:
https://pasteboard.co/9QEmhZJFvIHC.png


This is what I think happens.

nxtask_assign_pid() calls kmm_free(g_pidhash). Supposedly, right after
freeing it, it should set again g_pidhash to pidhash;
kmm_free however uses a semaphore. When free is complete, it posts the
semaphore.
nxsem_post() will internally call nxsem_checkholder() to perform the new
check.
This leads to a call to nxsched_get_tcb() that tries to access g_pidhash.

But! g_pidhash is deallocated at this point! And thus it points to garbage.

KASAN is right to complain.


That sounds right!  I am glad you nailed that.  Stale memory use 
problems can be insidious.


This is a chicken'n'egg problem:  nxsem_post and nxsem_checkholder need 
the pid hash table; freeing hash table entries need to post.  It sounds 
like you have a solution in mind.





Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt




Sorry, maybe it was a bad example.

Here is a much more minimal config that you can run directly.
https://pastebin.com/x775E7iF

For me, it crashes almost immediately after starting.
It actually calls nxsem_checkholder() many, many times before it finally 
triggers the KAZAN failure.

Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
along.


That just removes the call to nxsched_get_tcb() that triggers the KAZAN 
failure.  It is really not otherwise related to the failure as far as I 
can tell.


It appears that that the change generates the right condition to produce 
a failure from some existing problem.  But I might be missing 
something.  Maybe you or someone on the list can add some insight.




Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Fotis Panagiotopoulos
I am trying again on my original setup (not the simplified defconfig that I
provided later).

I can now see how this is related to Greg's commit.

Here is the stack trace at the time of the error:
https://pasteboard.co/9QEmhZJFvIHC.png


This is what I think happens.

nxtask_assign_pid() calls kmm_free(g_pidhash). Supposedly, right after
freeing it, it should set again g_pidhash to pidhash;
kmm_free however uses a semaphore. When free is complete, it posts the
semaphore.
nxsem_post() will internally call nxsem_checkholder() to perform the new
check.
This leads to a call to nxsched_get_tcb() that tries to access g_pidhash.

But! g_pidhash is deallocated at this point! And thus it points to garbage.

KASAN is right to complain.










On Wed, Apr 5, 2023 at 12:24 AM Gregory Nutt  wrote:

>
> On 4/4/2023 2:43 PM, Fotis Panagiotopoulos wrote:
> > Sorry, maybe it was a bad example.
> >
> > Here is a much more minimal config that you can run directly.
> > https://pastebin.com/x775E7iF
> >
> > For me, it crashes almost immediately after starting.
> >
> > Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
> > along.
> >
> Okay, I was able to replicate the error with that configuration, but I
> don't know what it means since I don't understand the kazan stuff:
>
> ...
> ostest_main: setenv(Variable3, BadValue2, FALSE)
> show_variable: Variable=Variable1 has value=GoodValue1
> show_variable: Variable=Variable2 has value=GoodValue2
> show_variable: Variable=Variable3 has value=GoodValue3
> kasan_report: kasan detected a read access error, address at
> 0x73db52c8, size is 8
>
> Breakpoint 1, _assert (
>  filename=0x55576863  "\220H\201\304",
> , linenum=0,
>  msg=0x73de8930 "\001") at misc/assert.c:423
> 423 {
> (gdb) gt
> Undefined command: "gt".  Try "help".
> (gdb) bt
> #0  _assert (filename=0x55576863  "\220H\201\304",
> ,
>  linenum=0, msg=0x73de8930 "\001") at misc/assert.c:423
> #1  0x5557053e in __assert (filename=0x555bd6d0
> "kasan/kasan.c", linenum=114,
>  msg=0x555bd735 "panic") at assert/lib_assert.c:36
> #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
> is_write=false)
>  at kasan/kasan.c:114
> #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
> size=8) at kasan/kasan.c:307
> #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
> at kasan/kasan.c:331
> #5  0xd6e8 in nxsched_get_tcb (pid=3) at
> sched/sched_gettcb.c:79
> #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
> semaphore/sem_holder.c:1106
> #7  0x5556eafe in nxsem_post (sem=0x73db5000) at
> semaphore/sem_post.c:85
> #8  0x5556edfb in sem_post (sem=0x73db5000) at
> semaphore/sem_post.c:256
> #9  0x55570ad5 in nxmutex_unlock (mutex=0x73db5000) at
> misc/lib_mutex.c:340
> #10 0x5557899b in mm_unlock (heap=0x73db5000) at
> mm_heap/mm_lock.c:117
>
> The call to mm_unlock() is thing that kicks of the assertion check.
> nxsem_checkholder() is the assertion check
>
> (gdb) up
> #1  0x5557053e in __assert (filename=0x555bd6d0
> "kasan/kasan.c", linenum=114,
>
>  msg=0x555bd735 "panic") at assert/lib_assert.c:36
> 36_assert(filename, linenum, msg);
> (gdb) up
> #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
> is_write=false)
>  at kasan/kasan.c:114
> 114   PANIC();
> (gdb) up
> #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
> size=8) at kasan/kasan.c:307
> 307   kasan_report(addr, size, false);
> (gdb) up
> #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
> at kasan/kasan.c:331
> 331   __asan_loadN_noabort(addr, 8);
>
> The following is the logic that triggered __asan_loadN_noabort(). But I
> don't see any problem.  Could this be a false alarm?
>
> (gdb) up
> #5  0xd6e8 in nxsched_get_tcb (pid=3) at
> sched/sched_gettcb.c:79
> 79if (g_pidhash[hash_ndx] != NULL && pid ==
> g_pidhash[hash_ndx]->pid)
> (gdb) p hash_ndx
> $8 = 3
> (gdb) p pid
> $9 = 3
> (gdb) p g_pidhash[hash_ndx]->pid
> $10 = 3
>
> (gdb) up
> #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
> semaphore/sem_holder.c:1106
> 1106  htcb = nxsched_get_tcb(tid);
> (gdb) p tid
> $11 = 3
>
> Can you explain what happened when __asan_loadN_noabort was called.
> That is the error.  As far as I can tell, everything else looks good to me.
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt


On 4/4/2023 2:43 PM, Fotis Panagiotopoulos wrote:

Sorry, maybe it was a bad example.

Here is a much more minimal config that you can run directly.
https://pastebin.com/x775E7iF

For me, it crashes almost immediately after starting.

Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
along.

Okay, I was able to replicate the error with that configuration, but I 
don't know what it means since I don't understand the kazan stuff:


   ...
   ostest_main: setenv(Variable3, BadValue2, FALSE)
   show_variable: Variable=Variable1 has value=GoodValue1
   show_variable: Variable=Variable2 has value=GoodValue2
   show_variable: Variable=Variable3 has value=GoodValue3
   kasan_report: kasan detected a read access error, address at
   0x73db52c8, size is 8

   Breakpoint 1, _assert (
    filename=0x55576863  "\220H\201\304",
   , linenum=0,
    msg=0x73de8930 "\001") at misc/assert.c:423
   423 {
   (gdb) gt
   Undefined command: "gt".  Try "help".
   (gdb) bt
   #0  _assert (filename=0x55576863  "\220H\201\304",
   ,
    linenum=0, msg=0x73de8930 "\001") at misc/assert.c:423
   #1  0x5557053e in __assert (filename=0x555bd6d0
   "kasan/kasan.c", linenum=114,
    msg=0x555bd735 "panic") at assert/lib_assert.c:36
   #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
   is_write=false)
    at kasan/kasan.c:114
   #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
   size=8) at kasan/kasan.c:307
   #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
   at kasan/kasan.c:331
   #5  0xd6e8 in nxsched_get_tcb (pid=3) at
   sched/sched_gettcb.c:79
   #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
   semaphore/sem_holder.c:1106
   #7  0x5556eafe in nxsem_post (sem=0x73db5000) at
   semaphore/sem_post.c:85
   #8  0x5556edfb in sem_post (sem=0x73db5000) at
   semaphore/sem_post.c:256
   #9  0x55570ad5 in nxmutex_unlock (mutex=0x73db5000) at
   misc/lib_mutex.c:340
   #10 0x5557899b in mm_unlock (heap=0x73db5000) at
   mm_heap/mm_lock.c:117

The call to mm_unlock() is thing that kicks of the assertion check.  
nxsem_checkholder() is the assertion check


   (gdb) up
   #1  0x5557053e in __assert (filename=0x555bd6d0
   "kasan/kasan.c", linenum=114,

    msg=0x555bd735 "panic") at assert/lib_assert.c:36
   36    _assert(filename, linenum, msg);
   (gdb) up
   #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
   is_write=false)
    at kasan/kasan.c:114
   114   PANIC();
   (gdb) up
   #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
   size=8) at kasan/kasan.c:307
   307   kasan_report(addr, size, false);
   (gdb) up
   #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
   at kasan/kasan.c:331
   331   __asan_loadN_noabort(addr, 8);

The following is the logic that triggered __asan_loadN_noabort(). But I 
don't see any problem.  Could this be a false alarm?


   (gdb) up
   #5  0xd6e8 in nxsched_get_tcb (pid=3) at
   sched/sched_gettcb.c:79
   79    if (g_pidhash[hash_ndx] != NULL && pid ==
   g_pidhash[hash_ndx]->pid)
   (gdb) p hash_ndx
   $8 = 3
   (gdb) p pid
   $9 = 3
   (gdb) p g_pidhash[hash_ndx]->pid
   $10 = 3

   (gdb) up
   #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
   semaphore/sem_holder.c:1106
   1106  htcb = nxsched_get_tcb(tid);
   (gdb) p tid
   $11 = 3

Can you explain what happened when __asan_loadN_noabort was called.  
That is the error.  As far as I can tell, everything else looks good to me.


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Fotis Panagiotopoulos
Sorry, maybe it was a bad example.

Here is a much more minimal config that you can run directly.
https://pastebin.com/x775E7iF

For me, it crashes almost immediately after starting.

Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
along.


On Tue, Apr 4, 2023 at 10:21 PM Gregory Nutt  wrote:

>
> > I understand that the commit is NOT the root cause, rather it
> > indicates an existing issue.
> > That's why I am posting here.
> It is difficult to see any relationship between that commit which only
> generates an assertion (and which does not happen in your case).  I
> think you will need to take some time to analyze the a bit.
> > It's impossible for me to track this down today. Maybe I will spend
> > some time on Friday.
> > But I think it indicates a critical problem, in case anyone has the
> > time to check on this.
> I would be interested if anyone else can replicate this problem.
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



I understand that the commit is NOT the root cause, rather it 
indicates an existing issue.

That's why I am posting here.
It is difficult to see any relationship between that commit which only 
generates an assertion (and which does not happen in your case).  I 
think you will need to take some time to analyze the a bit.
It's impossible for me to track this down today. Maybe I will spend 
some time on Friday.
But I think it indicates a critical problem, in case anyone has the 
time to check on this.

I would be interested if anyone else can replicate this problem.


os_test failure - Improper use of semaphores?

2023-04-04 Thread Fotis Panagiotopoulos
Hello,

Recently we saw the following commit get merged:

commit 758e88672b703bcb0d1251ae9731b2ec295e81c0
Author: Gregory Nutt 
Date:   Fri Mar 31 16:25:45 2023 -0600

Assert if a thread attempts to post a semaphore incorrectly.

Assert in nxsem_post if:

- Priority inheritance is enabled on a semaphore
- A thread that does not hold the semaphore attempts to post it

This will detect an error condition described in
https://cwiki.apache.org/confluence/display/NUTTX/Signaling+Semaphores+and+Priority+Inheritance

None.  The debug instrumentation is only enabled if
CONFIG_DEBUG_ASSERTIONS is enabled.

Use sim:ostest.  Verify that no assertions occur.



This makes os_test fail quickly.
In fact, almost all of our internal tests fail.

I understand that the commit is NOT the root cause, rather it indicates an
existing issue.
That's why I am posting here.

Here is the error output (running on sim):

[04/04/23 17:41:59] [ 7] [ EMERG] kasan_report: kasan detected a read
access error, address at 0x7fea58e44ea8, size is 8
[04/04/23 17:41:59] [ 7] [ EMERG] _assert: Current Version: NuttX 10.4.0
758e88672b Apr  4 2023 20:41:42 sim
[04/04/23 17:41:59] [ 7] [ EMERG] _assert: Assertion failed panic: at file:
kasan/kasan.c:114 task: ostest 0x4674fd

I am also attaching the full output.

I guess my defconfig uses options that are not enabled in CI tests?
I am attaching my defconfig too.

It's impossible for me to track this down today. Maybe I will spend some
time on Friday.
But I think it indicates a critical problem, in case anyone has the time to
check on this.
stdio_test: write fd=1
stdio_test: Standard I/O Check: printf
stdio_test: write fd=2
stdio_test: Standard I/O Check: fprintf to stderr
ostest_main: putenv(Variable1=BadValue3)
ostest_main: setenv(Variable1, GoodValue1, TRUE)
ostest_main: setenv(Variable2, BadValue1, FALSE)
ostest_main: setenv(Variable2, GoodValue2, TRUE)
ostest_main: setenv(Variable3, GoodValue3, FALSE)
ostest_main: setenv(Variable3, BadValue2, FALSE)
show_variable: Variable=Variable1 has value=GoodValue1
show_variable: Variable=Variable2 has value=GoodValue2
show_variable: Variable=Variable3 has value=GoodValue3
ostest_main: Started user_main at PID=7

user_main: Begin argument test
user_main: Started with argc=5
user_main: argv[0]="ostest"
user_main: argv[1]="Arg1"
user_main: argv[2]="Arg2"
user_main: argv[3]="Arg3"
user_main: argv[4]="Arg4"

End of test memory usage:
VARIABLE  BEFORE   AFTER
  
arena 3effd50  3effd50
ordblks 22
mxordblk  3e62640  3e62640
uordblks8cfe08cfe0
fordblks  3e72d70  3e72d70

user_main: getopt() test
getopt():  Simple test
getopt():  Invalid argument
getopt():  Missing optional argument
getopt_long():  Simple test
getopt_long():  No short options
getopt_long():  Argument for --option=argument
getopt_long():  Invalid long option
getopt_long():  Mixed long and short options
getopt_long():  Invalid short option
getopt_long():  Missing optional arguments
getopt_long_only():  Mixed long and short options
getopt_long_only():  Single hyphen long options

End of test memory usage:
VARIABLE  BEFORE   AFTER
  
arena 3effd50  3effd50
ordblks 22
mxordblk  3e62640  3e62640
uordblks8cfe08cfe0
fordblks  3e72d70  3e72d70
show_variable: Variable=Variable1 has value=GoodValue1
show_variable: Variable=Variable2 has value=GoodValue2
show_variable: Variable=Variable3 has value=GoodValue3
show_variable: Variable=Variable1 has no value
show_variable: Variable=Variable2 has value=GoodValue2
show_variable: Variable=Variable3 has value=GoodValue3

End of test memory usage:
VARIABLE  BEFORE   AFTER
  
arena 3effd50  3effd50
ordblks 23
mxordblk  3e62640  3e62640
uordblks8cfe08cfc0
fordblks  3e72d70  3e72d90
show_variable: Variable=Variable1 has no value
show_variable: Variable=Variable2 has no value
show_variable: Variable=Variable3 has no value

End of test memory usage:
VARIABLE  BEFORE   AFTER
  
arena 3effd50  3effd50
ordblks 33
mxordblk  3e62640  3e62640
uordblks8cfc08cf30
fordblks  3e72d90  3e72e20
tls: Successfully set 0
tls: Successfully set 
tls: Successfully set 
tls: Successfully set 

End of test memory usage:
VARIABLE  BEFORE   AFTER
  
arena 3effd50  3effd50
ordblks 33
mxordblk  3e62640  3e62640
uordblks8cf308cf30
fordblks  3e72e20  3e72e20

user_main: /dev/null test
dev_null: Read 0 bytes from /dev/null
dev_null: Wrote 1024 bytes to /dev/null

End of test memory usage:
VARIABLE  BEFORE   AFTER
  
arena 3effd50  3effd50
ordblks 33
mxordblk  3e62640  3e62640
uordblks8cf308cf30
fordblks  3e72e20  3e72e20

user_main: task_restart test

Test task_restart()
restart_main: setenv(VarName, VarValue, TRUE)
restart_main: Started