Re: os_test failure - Improper use of semaphores?
On Tue, Apr 4, 2023 at 5:53 PM Gregory Nutt wrote: > > On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote: > > I drafted this change here: > > > > https://github.com/apache/nuttx/pull/8965 > > > > Have a look, please. PR 8965 looks good to me. In fact I had the exact same garbage access problem with some in-house software (not related to NuttX) and the solution was basically exactly the same. Good catch on the error and the fix! Cheers, Nathan
Re: os_test failure - Improper use of semaphores?
On 4/4/2023 6:53 PM, Petro Karashchenko wrote: Just posted a comment in GitHub, but duplicating it here for visibility. The change might not be needed with https://github.com/apache/nuttx/pull/8951 as it removes access to pid hash table during sem_post() The change is good in any event. It is not a good idea to leave g_pidhash invalid across function calls.
Re: os_test failure - Improper use of semaphores?
Just posted a comment in GitHub, but duplicating it here for visibility. The change might not be needed with https://github.com/apache/nuttx/pull/8951 as it removes access to pid hash table during sem_post() Best regards, Petro On Wed, Apr 5, 2023, 12:53 AM Gregory Nutt wrote: > > On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote: > > I drafted this change here: > > > > https://github.com/apache/nuttx/pull/8965 > > > > Have a look, please. > > > > Although, I don't know if there will be any other places to be fixed like > > that? > This is a unique problem since kmm_free is the only thing that was > called in the unhealthy state. I think this is a pretty robust solution > and I can't think of any broader implications. > > Or if this is the best approach to this issue. This was just a quick > > experiment. > I think it is pretty clean. It is really just an ordering problem, so > changing the ordering seems like a good solution. >
Re: os_test failure - Improper use of semaphores?
On 4/4/2023 3:43 PM, Fotis Panagiotopoulos wrote: I drafted this change here: https://github.com/apache/nuttx/pull/8965 Have a look, please. Although, I don't know if there will be any other places to be fixed like that? This is a unique problem since kmm_free is the only thing that was called in the unhealthy state. I think this is a pretty robust solution and I can't think of any broader implications. Or if this is the best approach to this issue. This was just a quick experiment. I think it is pretty clean. It is really just an ordering problem, so changing the ordering seems like a good solution.
Re: os_test failure - Improper use of semaphores?
I drafted this change here: https://github.com/apache/nuttx/pull/8965 Have a look, please. Although, I don't know if there will be any other places to be fixed like that? Or if this is the best approach to this issue. This was just a quick experiment. On Wed, Apr 5, 2023 at 12:34 AM Gregory Nutt wrote: > > > Sorry, maybe it was a bad example. > > > > Here is a much more minimal config that you can run directly. > > https://pastebin.com/x775E7iF > > > > For me, it crashes almost immediately after starting. > It actually calls nxsem_checkholder() many, many times before it finally > triggers the KAZAN failure. > > Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves > > along. > > That just removes the call to nxsched_get_tcb() that triggers the KAZAN > failure. It is really not otherwise related to the failure as far as I > can tell. > > It appears that that the change generates the right condition to produce > a failure from some existing problem. But I might be missing > something. Maybe you or someone on the list can add some insight. > >
Re: os_test failure - Improper use of semaphores?
On 4/4/2023 3:32 PM, Fotis Panagiotopoulos wrote: I am trying again on my original setup (not the simplified defconfig that I provided later). I can now see how this is related to Greg's commit. Here is the stack trace at the time of the error: https://pasteboard.co/9QEmhZJFvIHC.png This is what I think happens. nxtask_assign_pid() calls kmm_free(g_pidhash). Supposedly, right after freeing it, it should set again g_pidhash to pidhash; kmm_free however uses a semaphore. When free is complete, it posts the semaphore. nxsem_post() will internally call nxsem_checkholder() to perform the new check. This leads to a call to nxsched_get_tcb() that tries to access g_pidhash. But! g_pidhash is deallocated at this point! And thus it points to garbage. KASAN is right to complain. That sounds right! I am glad you nailed that. Stale memory use problems can be insidious. This is a chicken'n'egg problem: nxsem_post and nxsem_checkholder need the pid hash table; freeing hash table entries need to post. It sounds like you have a solution in mind.
Re: os_test failure - Improper use of semaphores?
Sorry, maybe it was a bad example. Here is a much more minimal config that you can run directly. https://pastebin.com/x775E7iF For me, it crashes almost immediately after starting. It actually calls nxsem_checkholder() many, many times before it finally triggers the KAZAN failure. Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves along. That just removes the call to nxsched_get_tcb() that triggers the KAZAN failure. It is really not otherwise related to the failure as far as I can tell. It appears that that the change generates the right condition to produce a failure from some existing problem. But I might be missing something. Maybe you or someone on the list can add some insight.
Re: os_test failure - Improper use of semaphores?
I am trying again on my original setup (not the simplified defconfig that I provided later). I can now see how this is related to Greg's commit. Here is the stack trace at the time of the error: https://pasteboard.co/9QEmhZJFvIHC.png This is what I think happens. nxtask_assign_pid() calls kmm_free(g_pidhash). Supposedly, right after freeing it, it should set again g_pidhash to pidhash; kmm_free however uses a semaphore. When free is complete, it posts the semaphore. nxsem_post() will internally call nxsem_checkholder() to perform the new check. This leads to a call to nxsched_get_tcb() that tries to access g_pidhash. But! g_pidhash is deallocated at this point! And thus it points to garbage. KASAN is right to complain. On Wed, Apr 5, 2023 at 12:24 AM Gregory Nutt wrote: > > On 4/4/2023 2:43 PM, Fotis Panagiotopoulos wrote: > > Sorry, maybe it was a bad example. > > > > Here is a much more minimal config that you can run directly. > > https://pastebin.com/x775E7iF > > > > For me, it crashes almost immediately after starting. > > > > Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves > > along. > > > Okay, I was able to replicate the error with that configuration, but I > don't know what it means since I don't understand the kazan stuff: > > ... > ostest_main: setenv(Variable3, BadValue2, FALSE) > show_variable: Variable=Variable1 has value=GoodValue1 > show_variable: Variable=Variable2 has value=GoodValue2 > show_variable: Variable=Variable3 has value=GoodValue3 > kasan_report: kasan detected a read access error, address at > 0x73db52c8, size is 8 > > Breakpoint 1, _assert ( > filename=0x55576863 "\220H\201\304", > , linenum=0, > msg=0x73de8930 "\001") at misc/assert.c:423 > 423 { > (gdb) gt > Undefined command: "gt". Try "help". > (gdb) bt > #0 _assert (filename=0x55576863 "\220H\201\304", > , > linenum=0, msg=0x73de8930 "\001") at misc/assert.c:423 > #1 0x5557053e in __assert (filename=0x555bd6d0 > "kasan/kasan.c", linenum=114, > msg=0x555bd735 "panic") at assert/lib_assert.c:36 > #2 0x5557798e in kasan_report (addr=0x73db52c8, size=8, > is_write=false) > at kasan/kasan.c:114 > #3 0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8, > size=8) at kasan/kasan.c:307 > #4 0x5557802b in __asan_load8_noabort (addr=0x73db52c8) > at kasan/kasan.c:331 > #5 0xd6e8 in nxsched_get_tcb (pid=3) at > sched/sched_gettcb.c:79 > #6 0xa258 in nxsem_checkholder (sem=0x73db5000) at > semaphore/sem_holder.c:1106 > #7 0x5556eafe in nxsem_post (sem=0x73db5000) at > semaphore/sem_post.c:85 > #8 0x5556edfb in sem_post (sem=0x73db5000) at > semaphore/sem_post.c:256 > #9 0x55570ad5 in nxmutex_unlock (mutex=0x73db5000) at > misc/lib_mutex.c:340 > #10 0x5557899b in mm_unlock (heap=0x73db5000) at > mm_heap/mm_lock.c:117 > > The call to mm_unlock() is thing that kicks of the assertion check. > nxsem_checkholder() is the assertion check > > (gdb) up > #1 0x5557053e in __assert (filename=0x555bd6d0 > "kasan/kasan.c", linenum=114, > > msg=0x555bd735 "panic") at assert/lib_assert.c:36 > 36_assert(filename, linenum, msg); > (gdb) up > #2 0x5557798e in kasan_report (addr=0x73db52c8, size=8, > is_write=false) > at kasan/kasan.c:114 > 114 PANIC(); > (gdb) up > #3 0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8, > size=8) at kasan/kasan.c:307 > 307 kasan_report(addr, size, false); > (gdb) up > #4 0x5557802b in __asan_load8_noabort (addr=0x73db52c8) > at kasan/kasan.c:331 > 331 __asan_loadN_noabort(addr, 8); > > The following is the logic that triggered __asan_loadN_noabort(). But I > don't see any problem. Could this be a false alarm? > > (gdb) up > #5 0xd6e8 in nxsched_get_tcb (pid=3) at > sched/sched_gettcb.c:79 > 79if (g_pidhash[hash_ndx] != NULL && pid == > g_pidhash[hash_ndx]->pid) > (gdb) p hash_ndx > $8 = 3 > (gdb) p pid > $9 = 3 > (gdb) p g_pidhash[hash_ndx]->pid > $10 = 3 > > (gdb) up > #6 0xa258 in nxsem_checkholder (sem=0x73db5000) at > semaphore/sem_holder.c:1106 > 1106 htcb = nxsched_get_tcb(tid); > (gdb) p tid > $11 = 3 > > Can you explain what happened when __asan_loadN_noabort was called. > That is the error. As far as I can tell, everything else looks good to me. >
Re: os_test failure - Improper use of semaphores?
On 4/4/2023 2:43 PM, Fotis Panagiotopoulos wrote: Sorry, maybe it was a bad example. Here is a much more minimal config that you can run directly. https://pastebin.com/x775E7iF For me, it crashes almost immediately after starting. Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves along. Okay, I was able to replicate the error with that configuration, but I don't know what it means since I don't understand the kazan stuff: ... ostest_main: setenv(Variable3, BadValue2, FALSE) show_variable: Variable=Variable1 has value=GoodValue1 show_variable: Variable=Variable2 has value=GoodValue2 show_variable: Variable=Variable3 has value=GoodValue3 kasan_report: kasan detected a read access error, address at 0x73db52c8, size is 8 Breakpoint 1, _assert ( filename=0x55576863 "\220H\201\304", , linenum=0, msg=0x73de8930 "\001") at misc/assert.c:423 423 { (gdb) gt Undefined command: "gt". Try "help". (gdb) bt #0 _assert (filename=0x55576863 "\220H\201\304", , linenum=0, msg=0x73de8930 "\001") at misc/assert.c:423 #1 0x5557053e in __assert (filename=0x555bd6d0 "kasan/kasan.c", linenum=114, msg=0x555bd735 "panic") at assert/lib_assert.c:36 #2 0x5557798e in kasan_report (addr=0x73db52c8, size=8, is_write=false) at kasan/kasan.c:114 #3 0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8, size=8) at kasan/kasan.c:307 #4 0x5557802b in __asan_load8_noabort (addr=0x73db52c8) at kasan/kasan.c:331 #5 0xd6e8 in nxsched_get_tcb (pid=3) at sched/sched_gettcb.c:79 #6 0xa258 in nxsem_checkholder (sem=0x73db5000) at semaphore/sem_holder.c:1106 #7 0x5556eafe in nxsem_post (sem=0x73db5000) at semaphore/sem_post.c:85 #8 0x5556edfb in sem_post (sem=0x73db5000) at semaphore/sem_post.c:256 #9 0x55570ad5 in nxmutex_unlock (mutex=0x73db5000) at misc/lib_mutex.c:340 #10 0x5557899b in mm_unlock (heap=0x73db5000) at mm_heap/mm_lock.c:117 The call to mm_unlock() is thing that kicks of the assertion check. nxsem_checkholder() is the assertion check (gdb) up #1 0x5557053e in __assert (filename=0x555bd6d0 "kasan/kasan.c", linenum=114, msg=0x555bd735 "panic") at assert/lib_assert.c:36 36 _assert(filename, linenum, msg); (gdb) up #2 0x5557798e in kasan_report (addr=0x73db52c8, size=8, is_write=false) at kasan/kasan.c:114 114 PANIC(); (gdb) up #3 0x55577f84 in __asan_loadN_noabort (addr=0x73db52c8, size=8) at kasan/kasan.c:307 307 kasan_report(addr, size, false); (gdb) up #4 0x5557802b in __asan_load8_noabort (addr=0x73db52c8) at kasan/kasan.c:331 331 __asan_loadN_noabort(addr, 8); The following is the logic that triggered __asan_loadN_noabort(). But I don't see any problem. Could this be a false alarm? (gdb) up #5 0xd6e8 in nxsched_get_tcb (pid=3) at sched/sched_gettcb.c:79 79 if (g_pidhash[hash_ndx] != NULL && pid == g_pidhash[hash_ndx]->pid) (gdb) p hash_ndx $8 = 3 (gdb) p pid $9 = 3 (gdb) p g_pidhash[hash_ndx]->pid $10 = 3 (gdb) up #6 0xa258 in nxsem_checkholder (sem=0x73db5000) at semaphore/sem_holder.c:1106 1106 htcb = nxsched_get_tcb(tid); (gdb) p tid $11 = 3 Can you explain what happened when __asan_loadN_noabort was called. That is the error. As far as I can tell, everything else looks good to me.
Re: os_test failure - Improper use of semaphores?
Sorry, maybe it was a bad example. Here is a much more minimal config that you can run directly. https://pastebin.com/x775E7iF For me, it crashes almost immediately after starting. Trying again with 4ff4562401401a3a86c74cb2bda9a1a2b8d94e6d and it moves along. On Tue, Apr 4, 2023 at 10:21 PM Gregory Nutt wrote: > > > I understand that the commit is NOT the root cause, rather it > > indicates an existing issue. > > That's why I am posting here. > It is difficult to see any relationship between that commit which only > generates an assertion (and which does not happen in your case). I > think you will need to take some time to analyze the a bit. > > It's impossible for me to track this down today. Maybe I will spend > > some time on Friday. > > But I think it indicates a critical problem, in case anyone has the > > time to check on this. > I would be interested if anyone else can replicate this problem. >
Re: os_test failure - Improper use of semaphores?
I understand that the commit is NOT the root cause, rather it indicates an existing issue. That's why I am posting here. It is difficult to see any relationship between that commit which only generates an assertion (and which does not happen in your case). I think you will need to take some time to analyze the a bit. It's impossible for me to track this down today. Maybe I will spend some time on Friday. But I think it indicates a critical problem, in case anyone has the time to check on this. I would be interested if anyone else can replicate this problem.