Re: [PATCH] monitor: flush messages on abort

Steven Sistare Wed, 15 Nov 2023 07:31:55 -0800

On 11/6/2023 5:10 AM, Daniel P. Berrangé wrote:
> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>>> On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>>>> Buffered monitor output is lost when abort() is called.  The pattern
>>>> error_report() followed by abort() occurs about 60 times, so valuable
>>>> information is being lost when the abort is called in the context of a
>>>> monitor command.
>>>
>>> I'm curious, was there a particular abort() scenario that you hit ?
>>
>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>>
>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>         abort();
>>
>> But I have previously hit this for other errors.
>>
>>> For some crude statistics:
>>>
>>>   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 
>>> 1 error_report | grep $i | wc -l ; done
>>>   abort: 47
>>>   return: 512
>>>   exit: 458
>>>   goto: 177
>>>
>>> to me those numbers say that calling "abort()" after error_report
>>> should be considered a bug, and we can blanket replace all the
>>> abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>>> special case flushing the monitor.
>>
>> And presumably add an atexit handler to flush the monitor ala monitor_abort.
>> AFAICT currently no destructor is called for the monitor at exit time.
> 
> The HMP monitor flushes at each newline,  and exit() will take care of
> flushing stdout, so I don't think there's anything else needed.
> 
>>> Also I think there's a decent case to be made for error_report()
>>> to call monitor_flush().
>>
>> A good start, but that would not help for monitors with skip_flush=true, 
>> which 
>> need to format the buffered string in a json response, which is the case I 
>> tripped over.
> 
> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
> "hmp-monitor-command".

OK, that is narrower than I thought.  Now I see that other QMP monitors send 
error_report() to stderr, hence it is visible after abort and exit:

int error_vprintf(const char *fmt, va_list ap) {
    if (cur_mon && !monitor_cur_is_qmp())
        return monitor_vprintf(cur_mon, fmt, ap);
    return vfprintf(stderr, fmt, ap);                <-- HERE

That surprises me, I thought that would be returned to the monitor caller in the
json response. I guess the rationale is that the "main" error, if any, will be
set and returned by the err object that is passed down the stack during command
evaluation.

> In such a case, the error message needs to be built into a JSON error
> reply and sent over the socket. Your patch doesn't help this case
> since you've just printed to stderr.  

Same as vfprintf above!

> I don't think it is reasonable
> to expect QMP monitors to send replies on SIG_ABRT anyway. 

I agree.  My patch causes the error to be seen somewhere, anywhere, instead
of being dropped on the floor.

> So I don't
> think the skip_flush=true scenario is a problem to be concerned with.

It is indeed a narrow case, and not worth much effort or code change.
I'm inclined to drop it, but I appreciate the time you have spent reviewing it.

- Steve

>>>> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
>>>> ---
>>>>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 38 insertions(+)
>>>>
>>>> diff --git a/monitor/monitor.c b/monitor/monitor.c
>>>> index dc352f9..65dace0 100644
>>>> --- a/monitor/monitor.c
>>>> +++ b/monitor/monitor.c
>>>> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>>>>      }
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_LINUX
>>>> +
>>>> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>>>> +{
>>>> +    Monitor *mon = monitor_cur();
>>>> +
>>>> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (mon->outbuf && mon->outbuf->len) {
>>>> +        fputs("SIGABRT received: ", stderr);
>>>> +        fputs(mon->outbuf->str, stderr);
>>>> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>>>> +            fputc('\n', stderr);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    qemu_mutex_unlock(&mon->mon_lock);
>>>
>>> The SIGABRT handling does not only fire in response to abort()
>>> calls, but also in response to bad memory scenarios, so we have
>>> to be careful what we do in signal handlers.
>>>
>>> In particular using mutexes in signal handlers is a big red
>>> flag generally. Mutex APIs are not declare async signal
>>> safe, so this code is technically a POSIX compliance
>>> violation.
>>
>> Righto.  I would need to mask all signals in the sigaction to be on the 
>> safe(r) side.
> 
> This is still doomed, because SIGABRT could fire while 'mon_lock' is
> already held, and so this code would deadlock trying to acquire the
> lock.
> 
>>> So I think we'd be safer just eliminating the explicit abort()
>>> calls and adding monitor_flush call to error_report.
>>
>> I like adding a handler because it is future proof.  No need to play 
>> whack-a-mole when
>> developers re-introduce abort() calls in the future.  A minor benefit is I 
>> would not
>> need ack's from 50 maintainers to change 50 call sites from abort to exit.
> 
> That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different
> subsystems, and we don't require explicit acks from every subsystem maintainer
> for trivial cleanups like this.
> 
>> A slight risk of the exit solution is that something bad happened at the 
>> call site, so 
>> qemu state can no longer be trusted.  Calling abort immediately may be safer 
>> than calling 
>> exit which will call the existing atexit handlers and could have side 
>> effects.
> 
> If that was a real problem, then we already face it because we have
> ~500 places already calling exit() and only 50 calling abort().
> 
>> A third option is to define qemu_abort() which flushes the monitor, and 
>> replaces all abort
>> calls.  That avoids async-signal-mutex hand wringing, but is still subject 
>> to whack-a-mole.
>>
>> So: atexit, signal handler, or qemu_abort?  I will go with your preference.
> 
> Just replace abort -> exit.
> 
> I'm not seeing a need for an atexit handler on top.
> 
> With regards,
> Daniel
Re: [PATCH] monitor: flush messages on abort

Reply via email to