Hi Heinrich, On Sun, 6 Jun 2021 at 15:35, Heinrich Schuchardt <xypron.g...@gmx.de> wrote: > > Am 6. Juni 2021 20:07:31 MESZ schrieb Sean Anderson <sean...@gmail.com>: > >On 6/6/21 1:57 PM, Heinrich Schuchardt wrote: > >> On 6/6/21 7:52 PM, Sean Anderson wrote: > >>> On 6/6/21 1:28 PM, Heinrich Schuchardt wrote: > >>>> On 6/6/21 6:44 PM, Simon Glass wrote: > >>>>> Hi Heinrich, > >>>>> > >>>>> On Mon, 22 Mar 2021 at 18:56, Simon Glass <s...@chromium.org> > >wrote: > >>>>>> > >>>>>> Hi Heinrich, > >>>>>> > >>>>>> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt > >>>>>> <xypron.g...@gmx.de> wrote: > >>>>>>> > >>>>>>> On 22.03.21 06:21, Simon Glass wrote: > >>>>>>>> At present if sandbox crashes it prints a message and tries to > >>>>>>>> exit. But > >>>>>>>> with the recently introduced signal handler, it often seems to > >get > >>>>>>>> stuck > >>>>>>>> in a loop until the stack overflows: > >>>>>>>> > >>>>>>>> Segmentation violation > >>>>>>>> > >>>>>>>> Segmentation violation > >>>>>>>> > >>>>>>>> Segmentation violation > >>>>>>>> > >>>>>>>> Segmentation violation > >>>>>>>> > >>>>>>>> Segmentation violation > >>>>>>>> > >>>>>>>> Segmentation violation > >>>>>>>> > >>>>>>>> Segmentation violation > >>>>>>>> ... > >>>>>>> > >>>>>>> Hello Simon, > >>>>>>> > >>>>>>> do you have a reproducible example? I never have seen this. > >>>>>> > >>>>>> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433 > >>>>>> > >>>>>> You need to run that commit with pytest though...it does not > >happen > >>>>>> when run directly. > >>>>>> > >>>>>> BTW this sems to expose some rather nasty bug in dlmalloc or how > >it is > >>>>>> used. I notice that as soon as the first test is run, the 'top' > >value > >>>>>> in dlmalloc is outside the range of the malloc pool, which seems > >>>>>> wrong. I wonder if there is something broken with how > >>>>>> dm_test_pre_run() and dm_test_post_run() work. > >>>>>> > >>>>>>> > >>>>>>> Corrupting gd could cause an endless recursive loop, as these > >lines > >>>>>>> follow printing the observed string: > >>>>>>> > >>>>>>> printf("pc = 0x%lx, ", pc); > >>>>>>> printf("pc_reloc =0x%lx\n\n", pc - gd->reloc_off); > >>>>>> > >>>>>> Yes I suspect printf() is dead. > >>>>>> > >>>>>>> > >>>>>>> If we remove SA_NODEFER from the signal mask in > >arch/sandbox/cpu/os.c, > >>>>>>> recursion cannot occur anymore. If a segmentation violation > >occurs > >>>>>>> inside the handler it will be delegated to the default handler. > >>>>>>> > >>>>>>> Furthermore we could consider removing the signal handler at the > >start > >>>>>>> of os_signal_action(). > >>>>>> > >>>>>> The issue is that if you get a segfault you really don't know if > >you > >>>>>> can continue and do anything else. > >>>>>> > >>>>>> What is the goal with the signal handler? I don't think the user > >can > >>>>>> do anything about it. > >>>> > >>>> Hello Simon, > >>>> > >>>> the signal handler prints out the crash location and this makes > >>>> analyzing problems much easier. It proved valuable to me several > >times. > >>> > >>> Can't you just rerun with gdb? > >> > >> This would require that the problem is easily reproducible which may > >not > >> be the case. > > > >Hm, perhaps you could keep track of how many times we've tried to catch > >a signal, and bail if this is the second time around. E.g. something > >like > > > > Removing SA_NODEFER from the signal mask will let the OS kick in if an > exception occurs in a signal handler. > > No counter is needed.
Yes that is correct. However I am still going to apply this patch, since I would prefer that the default behaviour is to exit with a signal, rather than continue. It indicates some sort of error (except if we are running a strange test), so it seems wrong to try to continue. It may produce other issues and sandbox is in an unknown state. I am happy to discuss another way / patch for doing what you need. Regards, Simon [..] Applied to u-boot-dm/next, thanks!