Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Nathan Hartman
On Tue, Apr 4, 2023 at 5:53 PM Gregory Nutt  wrote:

>
> On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote:
> > I drafted this change here:
> >
> > https://github.com/apache/nuttx/pull/8965
> >
> > Have a look, please.



PR 8965 looks good to me. In fact I had the exact same garbage access
problem with some in-house software (not related to NuttX) and the solution
was basically exactly the same.

Good catch on the error and the fix!

Cheers,
Nathan


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



On 4/4/2023 6:53 PM, Petro Karashchenko wrote:

Just posted a comment in GitHub, but duplicating it here for visibility.
The change might not be needed with
https://github.com/apache/nuttx/pull/8951 as it removes access to pid hash
table during sem_post()
The change is good in any event.  It is not a good idea to leave 
g_pidhash invalid across function calls.


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Petro Karashchenko
Just posted a comment in GitHub, but duplicating it here for visibility.
The change might not be needed with
https://github.com/apache/nuttx/pull/8951 as it removes access to pid hash
table during sem_post()

Best regards,
Petro

On Wed, Apr 5, 2023, 12:53 AM Gregory Nutt  wrote:

>
> On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote:
> > I drafted this change here:
> >
> > https://github.com/apache/nuttx/pull/8965
> >
> > Have a look, please.
> >
> > Although, I don't know if there will be any other places to be fixed like
> > that?
> This is a unique problem since kmm_free is the only thing that was
> called in the unhealthy state.  I think this is a pretty robust solution
> and I can't think of any broader implications.
> > Or if this is the best approach to this issue. This was just a quick
> > experiment.
> I think it is pretty clean.  It is really just an ordering problem, so
> changing the ordering seems like a good solution.
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote:

I drafted this change here:

https://github.com/apache/nuttx/pull/8965

Have a look, please.

Although, I don't know if there will be any other places to be fixed like
that?
This is a unique problem since kmm_free is the only thing that was 
called in the unhealthy state.  I think this is a pretty robust solution 
and I can't think of any broader implications.

Or if this is the best approach to this issue. This was just a quick
experiment.
I think it is pretty clean.  It is really just an ordering problem, so 
changing the ordering seems like a good solution.


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Fotis Panagiotopoulos
I drafted this change here:

https://github.com/apache/nuttx/pull/8965

Have a look, please.

Although, I don't know if there will be any other places to be fixed like
that?
Or if this is the best approach to this issue. This was just a quick
experiment.

On Wed, Apr 5, 2023 at 12:34 AM Gregory Nutt  wrote:

>
> > Sorry, maybe it was a bad example.
> >
> > Here is a much more minimal config that you can run directly.
> > https://pastebin.com/x775E7iF
> >
> > For me, it crashes almost immediately after starting.
> It actually calls nxsem_checkholder() many, many times before it finally
> triggers the KAZAN failure.
> > Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
> > along.
>
> That just removes the call to nxsched_get_tcb() that triggers the KAZAN
> failure.  It is really not otherwise related to the failure as far as I
> can tell.
>
> It appears that that the change generates the right condition to produce
> a failure from some existing problem.  But I might be missing
> something.  Maybe you or someone on the list can add some insight.
>
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



On 4/4/2023 3:32 PM, Fotis Panagiotopoulos wrote:

I am trying again on my original setup (not the simplified defconfig that I
provided later).

I can now see how this is related to Greg's commit.

Here is the stack trace at the time of the error:
https://pasteboard.co/9QEmhZJFvIHC.png


This is what I think happens.

nxtask_assign_pid() calls kmm_free(g_pidhash). Supposedly, right after
freeing it, it should set again g_pidhash to pidhash;
kmm_free however uses a semaphore. When free is complete, it posts the
semaphore.
nxsem_post() will internally call nxsem_checkholder() to perform the new
check.
This leads to a call to nxsched_get_tcb() that tries to access g_pidhash.

But! g_pidhash is deallocated at this point! And thus it points to garbage.

KASAN is right to complain.


That sounds right!  I am glad you nailed that.  Stale memory use 
problems can be insidious.


This is a chicken'n'egg problem:  nxsem_post and nxsem_checkholder need 
the pid hash table; freeing hash table entries need to post.  It sounds 
like you have a solution in mind.





Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt




Sorry, maybe it was a bad example.

Here is a much more minimal config that you can run directly.
https://pastebin.com/x775E7iF

For me, it crashes almost immediately after starting.
It actually calls nxsem_checkholder() many, many times before it finally 
triggers the KAZAN failure.

Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
along.


That just removes the call to nxsched_get_tcb() that triggers the KAZAN 
failure.  It is really not otherwise related to the failure as far as I 
can tell.


It appears that that the change generates the right condition to produce 
a failure from some existing problem.  But I might be missing 
something.  Maybe you or someone on the list can add some insight.




Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Fotis Panagiotopoulos
I am trying again on my original setup (not the simplified defconfig that I
provided later).

I can now see how this is related to Greg's commit.

Here is the stack trace at the time of the error:
https://pasteboard.co/9QEmhZJFvIHC.png


This is what I think happens.

nxtask_assign_pid() calls kmm_free(g_pidhash). Supposedly, right after
freeing it, it should set again g_pidhash to pidhash;
kmm_free however uses a semaphore. When free is complete, it posts the
semaphore.
nxsem_post() will internally call nxsem_checkholder() to perform the new
check.
This leads to a call to nxsched_get_tcb() that tries to access g_pidhash.

But! g_pidhash is deallocated at this point! And thus it points to garbage.

KASAN is right to complain.










On Wed, Apr 5, 2023 at 12:24 AM Gregory Nutt  wrote:

>
> On 4/4/2023 2:43 PM, Fotis Panagiotopoulos wrote:
> > Sorry, maybe it was a bad example.
> >
> > Here is a much more minimal config that you can run directly.
> > https://pastebin.com/x775E7iF
> >
> > For me, it crashes almost immediately after starting.
> >
> > Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
> > along.
> >
> Okay, I was able to replicate the error with that configuration, but I
> don't know what it means since I don't understand the kazan stuff:
>
> ...
> ostest_main: setenv(Variable3, BadValue2, FALSE)
> show_variable: Variable=Variable1 has value=GoodValue1
> show_variable: Variable=Variable2 has value=GoodValue2
> show_variable: Variable=Variable3 has value=GoodValue3
> kasan_report: kasan detected a read access error, address at
> 0x73db52c8, size is 8
>
> Breakpoint 1, _assert (
>  filename=0x55576863  "\220H\201\304",
> , linenum=0,
>  msg=0x73de8930 "\001") at misc/assert.c:423
> 423 {
> (gdb) gt
> Undefined command: "gt".  Try "help".
> (gdb) bt
> #0  _assert (filename=0x55576863  "\220H\201\304",
> ,
>  linenum=0, msg=0x73de8930 "\001") at misc/assert.c:423
> #1  0x5557053e in __assert (filename=0x555bd6d0
> "kasan/kasan.c", linenum=114,
>  msg=0x555bd735 "panic") at assert/lib_assert.c:36
> #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
> is_write=false)
>  at kasan/kasan.c:114
> #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
> size=8) at kasan/kasan.c:307
> #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
> at kasan/kasan.c:331
> #5  0xd6e8 in nxsched_get_tcb (pid=3) at
> sched/sched_gettcb.c:79
> #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
> semaphore/sem_holder.c:1106
> #7  0x5556eafe in nxsem_post (sem=0x73db5000) at
> semaphore/sem_post.c:85
> #8  0x5556edfb in sem_post (sem=0x73db5000) at
> semaphore/sem_post.c:256
> #9  0x55570ad5 in nxmutex_unlock (mutex=0x73db5000) at
> misc/lib_mutex.c:340
> #10 0x5557899b in mm_unlock (heap=0x73db5000) at
> mm_heap/mm_lock.c:117
>
> The call to mm_unlock() is thing that kicks of the assertion check.
> nxsem_checkholder() is the assertion check
>
> (gdb) up
> #1  0x5557053e in __assert (filename=0x555bd6d0
> "kasan/kasan.c", linenum=114,
>
>  msg=0x555bd735 "panic") at assert/lib_assert.c:36
> 36_assert(filename, linenum, msg);
> (gdb) up
> #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
> is_write=false)
>  at kasan/kasan.c:114
> 114   PANIC();
> (gdb) up
> #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
> size=8) at kasan/kasan.c:307
> 307   kasan_report(addr, size, false);
> (gdb) up
> #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
> at kasan/kasan.c:331
> 331   __asan_loadN_noabort(addr, 8);
>
> The following is the logic that triggered __asan_loadN_noabort(). But I
> don't see any problem.  Could this be a false alarm?
>
> (gdb) up
> #5  0xd6e8 in nxsched_get_tcb (pid=3) at
> sched/sched_gettcb.c:79
> 79if (g_pidhash[hash_ndx] != NULL && pid ==
> g_pidhash[hash_ndx]->pid)
> (gdb) p hash_ndx
> $8 = 3
> (gdb) p pid
> $9 = 3
> (gdb) p g_pidhash[hash_ndx]->pid
> $10 = 3
>
> (gdb) up
> #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
> semaphore/sem_holder.c:1106
> 1106  htcb = nxsched_get_tcb(tid);
> (gdb) p tid
> $11 = 3
>
> Can you explain what happened when __asan_loadN_noabort was called.
> That is the error.  As far as I can tell, everything else looks good to me.
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt


On 4/4/2023 2:43 PM, Fotis Panagiotopoulos wrote:

Sorry, maybe it was a bad example.

Here is a much more minimal config that you can run directly.
https://pastebin.com/x775E7iF

For me, it crashes almost immediately after starting.

Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
along.

Okay, I was able to replicate the error with that configuration, but I 
don't know what it means since I don't understand the kazan stuff:


   ...
   ostest_main: setenv(Variable3, BadValue2, FALSE)
   show_variable: Variable=Variable1 has value=GoodValue1
   show_variable: Variable=Variable2 has value=GoodValue2
   show_variable: Variable=Variable3 has value=GoodValue3
   kasan_report: kasan detected a read access error, address at
   0x73db52c8, size is 8

   Breakpoint 1, _assert (
    filename=0x55576863  "\220H\201\304",
   , linenum=0,
    msg=0x73de8930 "\001") at misc/assert.c:423
   423 {
   (gdb) gt
   Undefined command: "gt".  Try "help".
   (gdb) bt
   #0  _assert (filename=0x55576863  "\220H\201\304",
   ,
    linenum=0, msg=0x73de8930 "\001") at misc/assert.c:423
   #1  0x5557053e in __assert (filename=0x555bd6d0
   "kasan/kasan.c", linenum=114,
    msg=0x555bd735 "panic") at assert/lib_assert.c:36
   #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
   is_write=false)
    at kasan/kasan.c:114
   #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
   size=8) at kasan/kasan.c:307
   #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
   at kasan/kasan.c:331
   #5  0xd6e8 in nxsched_get_tcb (pid=3) at
   sched/sched_gettcb.c:79
   #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
   semaphore/sem_holder.c:1106
   #7  0x5556eafe in nxsem_post (sem=0x73db5000) at
   semaphore/sem_post.c:85
   #8  0x5556edfb in sem_post (sem=0x73db5000) at
   semaphore/sem_post.c:256
   #9  0x55570ad5 in nxmutex_unlock (mutex=0x73db5000) at
   misc/lib_mutex.c:340
   #10 0x5557899b in mm_unlock (heap=0x73db5000) at
   mm_heap/mm_lock.c:117

The call to mm_unlock() is thing that kicks of the assertion check.  
nxsem_checkholder() is the assertion check


   (gdb) up
   #1  0x5557053e in __assert (filename=0x555bd6d0
   "kasan/kasan.c", linenum=114,

    msg=0x555bd735 "panic") at assert/lib_assert.c:36
   36    _assert(filename, linenum, msg);
   (gdb) up
   #2  0x5557798e in kasan_report (addr=0x73db52c8, size=8,
   is_write=false)
    at kasan/kasan.c:114
   114   PANIC();
   (gdb) up
   #3  0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8,
   size=8) at kasan/kasan.c:307
   307   kasan_report(addr, size, false);
   (gdb) up
   #4  0x5557802b in __asan_load8_noabort (addr=0x73db52c8)
   at kasan/kasan.c:331
   331   __asan_loadN_noabort(addr, 8);

The following is the logic that triggered __asan_loadN_noabort(). But I 
don't see any problem.  Could this be a false alarm?


   (gdb) up
   #5  0xd6e8 in nxsched_get_tcb (pid=3) at
   sched/sched_gettcb.c:79
   79    if (g_pidhash[hash_ndx] != NULL && pid ==
   g_pidhash[hash_ndx]->pid)
   (gdb) p hash_ndx
   $8 = 3
   (gdb) p pid
   $9 = 3
   (gdb) p g_pidhash[hash_ndx]->pid
   $10 = 3

   (gdb) up
   #6  0xa258 in nxsem_checkholder (sem=0x73db5000) at
   semaphore/sem_holder.c:1106
   1106  htcb = nxsched_get_tcb(tid);
   (gdb) p tid
   $11 = 3

Can you explain what happened when __asan_loadN_noabort was called.  
That is the error.  As far as I can tell, everything else looks good to me.


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Fotis Panagiotopoulos
Sorry, maybe it was a bad example.

Here is a much more minimal config that you can run directly.
https://pastebin.com/x775E7iF

For me, it crashes almost immediately after starting.

Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves
along.


On Tue, Apr 4, 2023 at 10:21 PM Gregory Nutt  wrote:

>
> > I understand that the commit is NOT the root cause, rather it
> > indicates an existing issue.
> > That's why I am posting here.
> It is difficult to see any relationship between that commit which only
> generates an assertion (and which does not happen in your case).  I
> think you will need to take some time to analyze the a bit.
> > It's impossible for me to track this down today. Maybe I will spend
> > some time on Friday.
> > But I think it indicates a critical problem, in case anyone has the
> > time to check on this.
> I would be interested if anyone else can replicate this problem.
>


Re: os_test failure - Improper use of semaphores?

2023-04-04 Thread Gregory Nutt



I understand that the commit is NOT the root cause, rather it 
indicates an existing issue.

That's why I am posting here.
It is difficult to see any relationship between that commit which only 
generates an assertion (and which does not happen in your case).  I 
think you will need to take some time to analyze the a bit.
It's impossible for me to track this down today. Maybe I will spend 
some time on Friday.
But I think it indicates a critical problem, in case anyone has the 
time to check on this.

I would be interested if anyone else can replicate this problem.