On 11/15/2023 3:41 AM, Markus Armbruster wrote: > Daniel P. Berrangé <berra...@redhat.com> writes: > >> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote: >>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote: >>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote: >>>>> Buffered monitor output is lost when abort() is called. The pattern >>>>> error_report() followed by abort() occurs about 60 times, so valuable >>>>> information is being lost when the abort is called in the context of a >>>>> monitor command. >>>> >>>> I'm curious, was there a particular abort() scenario that you hit ? >>> >>> Yes, while tweaking the suspended state, and forgetting to add transitions: >>> >>> error_report("invalid runstate transition: '%s' -> '%s'", >>> abort(); >>> >>> But I have previously hit this for other errors. > > Can you provide a reproducer?
I sometimes hit this when developing new code. I do not have a reproducer for upstream branches. The patch is aimed at helping developers, not users. >>>> For some crude statistics: >>>> >>>> $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after >>>> 1 error_report | grep $i | wc -l ; done >>>> abort: 47 >>>> return: 512 >>>> exit: 458 >>>> goto: 177 >>>> >>>> to me those numbers say that calling "abort()" after error_report >>>> should be considered a bug, and we can blanket replace all the >>>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to >>>> special case flushing the monitor. >>> >>> And presumably add an atexit handler to flush the monitor ala monitor_abort. >>> AFAICT currently no destructor is called for the monitor at exit time. >> >> The HMP monitor flushes at each newline, and exit() will take care of >> flushing stdout, so I don't think there's anything else needed. > > Correct. > >>>> Also I think there's a decent case to be made for error_report() >>>> to call monitor_flush(). > > No, because the messages printed by error_report() all end in newline, > and printing a newline to a monitor triggers monitor_flush_locked(). > >>> A good start, but that would not help for monitors with skip_flush=true, >>> which >>> need to format the buffered string in a json response, which is the case I >>> tripped over. >> >> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking >> "hmp-monitor-command". > > Correct. > >> In such a case, the error message needs to be built into a JSON error >> reply and sent over the socket. Your patch doesn't help this case >> since you've just printed to stderr. I don't think it is reasonable >> to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't >> think the skip_flush=true scenario is a problem to be concerned with. >> >>>>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr. >>>>> >>>>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com> >>>>> --- >>>>> monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++ >>>>> 1 file changed, 38 insertions(+) >>>>> >>>>> diff --git a/monitor/monitor.c b/monitor/monitor.c >>>>> index dc352f9..65dace0 100644 >>>>> --- a/monitor/monitor.c >>>>> +++ b/monitor/monitor.c >>>>> @@ -701,6 +701,43 @@ void monitor_cleanup(void) >>>>> } >>>>> } >>>>> >>>>> +#ifdef CONFIG_LINUX >>>>> + >>>>> +static void monitor_abort(int signal, siginfo_t *info, void *c) >>>>> +{ >>>>> + Monitor *mon = monitor_cur(); >>>>> + >>>>> + if (!mon || qemu_mutex_trylock(&mon->mon_lock)) { >>>>> + return; >>>>> + } >>>>> + >>>>> + if (mon->outbuf && mon->outbuf->len) { >>>>> + fputs("SIGABRT received: ", stderr); >>>>> + fputs(mon->outbuf->str, stderr); >>>>> + if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') { >>>>> + fputc('\n', stderr); >>>>> + } >>>>> + } >>>>> + >>>>> + qemu_mutex_unlock(&mon->mon_lock); >>>> >>>> The SIGABRT handling does not only fire in response to abort() >>>> calls, but also in response to bad memory scenarios, so we have >>>> to be careful what we do in signal handlers. >>>> >>>> In particular using mutexes in signal handlers is a big red >>>> flag generally. Mutex APIs are not declare async signal >>>> safe, so this code is technically a POSIX compliance >>>> violation. > > "Technically a POSIX compliance violation" sounds like something only > pedants would care about. It's actually a recipe for deadlocks and > crashes. > >>> Righto. I would need to mask all signals in the sigaction to be on the >>> safe(r) side. >> >> This is still doomed, because SIGABRT could fire while 'mon_lock' is >> already held, and so this code would deadlock trying to acquire the >> lock. > > Yup. > > There is no way to make async signal unsafe code safe. The handler calls trylock, not lock. If it cannot get the lock, it bails. However, I suppose pthread_mutex_trylock could in theory take and briefly hold some internal lock as part of its implementation. - Steve >>>> So I think we'd be safer just eliminating the explicit abort() >>>> calls and adding monitor_flush call to error_report. >>> >>> I like adding a handler because it is future proof. No need to play >>> whack-a-mole when >>> developers re-introduce abort() calls in the future. A minor benefit is I >>> would not >>> need ack's from 50 maintainers to change 50 call sites from abort to exit. >> >> That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different >> subsystems, and we don't require explicit acks from every subsystem >> maintainer >> for trivial cleanups like this. >> >>> A slight risk of the exit solution is that something bad happened at the >>> call site, so >>> qemu state can no longer be trusted. Calling abort immediately may be >>> safer than calling >>> exit which will call the existing atexit handlers and could have side >>> effects. >> >> If that was a real problem, then we already face it because we have >> ~500 places already calling exit() and only 50 calling abort(). >> >>> A third option is to define qemu_abort() which flushes the monitor, and >>> replaces all abort >>> calls. That avoids async-signal-mutex hand wringing, but is still subject >>> to whack-a-mole. >>> >>> So: atexit, signal handler, or qemu_abort? I will go with your preference. >> >> Just replace abort -> exit. > > I disagree. > > Use exit(1) for fatal errors. > > Use abort() for programming errors (a.k.a. bugs). > >> I'm not seeing a need for an atexit handler on top. > > I'm not yet seeing a need for anything. A reproducer may change that. >