Hi Heinrich, On Sun, 6 Jun 2021 at 11:28, Heinrich Schuchardt <xypron.g...@gmx.de> wrote: > > On 6/6/21 6:44 PM, Simon Glass wrote: > > Hi Heinrich, > > > > On Mon, 22 Mar 2021 at 18:56, Simon Glass <s...@chromium.org> wrote: > >> > >> Hi Heinrich, > >> > >> On Mon, 22 Mar 2021 at 23:02, Heinrich Schuchardt <xypron.g...@gmx.de> > >> wrote: > >>> > >>> On 22.03.21 06:21, Simon Glass wrote: > >>>> At present if sandbox crashes it prints a message and tries to exit. But > >>>> with the recently introduced signal handler, it often seems to get stuck > >>>> in a loop until the stack overflows: > >>>> > >>>> Segmentation violation > >>>> > >>>> Segmentation violation > >>>> > >>>> Segmentation violation > >>>> > >>>> Segmentation violation > >>>> > >>>> Segmentation violation > >>>> > >>>> Segmentation violation > >>>> > >>>> Segmentation violation > >>>> ... > >>> > >>> Hello Simon, > >>> > >>> do you have a reproducible example? I never have seen this. > >> > >> https://source.denx.de/u-boot/custodians/u-boot-dm/-/jobs/242433 > >> > >> You need to run that commit with pytest though...it does not happen > >> when run directly. > >> > >> BTW this sems to expose some rather nasty bug in dlmalloc or how it is > >> used. I notice that as soon as the first test is run, the 'top' value > >> in dlmalloc is outside the range of the malloc pool, which seems > >> wrong. I wonder if there is something broken with how > >> dm_test_pre_run() and dm_test_post_run() work. > >> > >>> > >>> Corrupting gd could cause an endless recursive loop, as these lines > >>> follow printing the observed string: > >>> > >>> printf("pc = 0x%lx, ", pc); > >>> printf("pc_reloc = 0x%lx\n\n", pc - gd->reloc_off); > >> > >> Yes I suspect printf() is dead. > >> > >>> > >>> If we remove SA_NODEFER from the signal mask in arch/sandbox/cpu/os.c, > >>> recursion cannot occur anymore. If a segmentation violation occurs > >>> inside the handler it will be delegated to the default handler. > >>> > >>> Furthermore we could consider removing the signal handler at the start > >>> of os_signal_action(). > >> > >> The issue is that if you get a segfault you really don't know if you > >> can continue and do anything else. > >> > >> What is the goal with the signal handler? I don't think the user can > >> do anything about it. > > Hello Simon, > > the signal handler prints out the crash location and this makes > analyzing problems much easier. It proved valuable to me several times.
Well I think we are at a draw on that point, as the patch has caused me pain many times! > > > > > I keep hitting this problem during development with sandbox, so I > > think I need to apply this patch. > > > > Does anything need to be updated in the tests? > > > > Regards, > > Simon > > > > Did you try removing SA_NODEFER as proposed? But what is the goal here...do you mean you want it to crash later? I just want it to crash immediately. What's actually wrong with putting this behaviour behind a flag? You could always run with the flag enabled if needed. But I just don't think it makes sense for the default behaviour to be to try to continue operation. Regards, Simon