Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event

Roman Kagan Tue, 21 Jun 2022 05:04:21 -0700

On Tue, Jun 21, 2022 at 01:55:25PM +0200, Markus Armbruster wrote:
> Roman Kagan <rvka...@yandex-team.ru> writes:
> 
> > On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> >> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> >> > Roman Kagan <rvka...@yandex-team.ru> writes:
> >> > 
> >> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> > >> Konstantin Khlebnikov <khlebni...@yandex-team.ru> writes:
> >> > >> 
> >> > >> > This event represents device runtime errors to give time and
> >> > >> > reason why device is broken.
> >> > >> 
> >> > >> Can you give one or more examples of the "device runtime errors" you
> >> > >> have in mind?
> >> > >
> >> > > Initially we wanted to address a situation when a vhost device
> >> > > discovered an inconsistency during virtqueue processing and silently
> >> > > stopped the virtqueue.  This resulted in device stall (partial for
> >> > > multiqueue devices) and we were the last to notice that.
> >> > >
> >> > > The solution appeared to be to employ errfd and, upon receiving a
> >> > > notification through it, to emit a QMP event which is actionable in the
> >> > > management layer or further up the stack.
> >> > >
> >> > > Then we observed that virtio (non-vhost) devices suffer from the same
> >> > > issue: they only log the error but don't signal it to the management
> >> > > layer.  The case was very similar so we thought it would make sense to
> >> > > share the infrastructure and the QMP event between virtio and vhost.
> >> > >
> >> > > Then Konstantin went a bit further and generalized the concept into
> >> > > generic "device runtime error".  I'm personally not completely
> >> > > convinced this generalization is appropriate here; we'd appreciate
> >> > > the opinions from the community on the matter.
> >> > 
> >> > "Device emulation sending an event on entering certain error states, so
> >> > that a management application can do something about it" feels
> >> > reasonable enough to me as a general concept.
> >> > 
> >> > The key point is of course "can do something": the event needs to be
> >> > actionable.  Can you describe possible actions for the cases you
> >> > implement?
> >> 
> >> The first one that we had in mind was informational, like triggering an
> >> alert in the monitoring system and/or painting the VM as malfunctioning
> >> in the owner's UI.
> >> 
> >> There can be more advanced scenarios like autorecovery by resetting the
> >> faulty VM, or fencing it if it's a cluster member.
> >
> > The discussion kind of stalled here.
> 
> My apologies...
> 
> >                                       Do you think the approach makes
> > sense or not?  Should we try and resubmit the series with a proper cover
> > letter and possibly other improvements or is it a dead end?
> 
> As QAPI schema maintainer, my concern is interface design.  To sell this
> interface to me (so to speak), you have to show it's useful and
> reasonably general.  Reasonably general, because we don't want to
> accumulate one-offs, even if they have their uses.
> 
> I think this is mostly a matter of commit message(s) and documentation
> here.  Explain your intended use cases.  Maybe hand-wave at other use
> cases you can think of.  Document that you're implementing the event
> only for the specific errors you need, but that it could be implemented
> more widely as needed.  "Complete" feels impractical, though.
> 
> Makes sense?


Absolutely.  We'll rework and resubmit the series addressing the issues
you've noted, and we'll see how it goes.
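
To make the intended use cases concrete, here is a rough sketch of how a
management application might react to such an event (alerting, resetting,
or fencing, as discussed above).  The outer QMP event envelope ("event",
"data", "timestamp") is the standard one; the "device" and "reason" payload
fields and the handle_qmp_line() helper are assumptions made up for
illustration, not the schema from the series.

```python
import json

# Actions mentioned in the thread: informational alert, VM reset, fencing.
KNOWN_ACTIONS = ("alert", "reset", "fence")


def handle_qmp_line(line, policy="alert"):
    """Decide what to do with one line read from a QMP socket.

    Returns None for anything that is not a DEVICE_RUNTIME_ERROR event,
    otherwise a dict describing the action the management layer should take.
    The "device" and "reason" keys are hypothetical payload fields.
    """
    if policy not in KNOWN_ACTIONS:
        raise ValueError("unknown policy: %s" % policy)

    msg = json.loads(line)
    if msg.get("event") != "DEVICE_RUNTIME_ERROR":
        return None  # command replies and other events are ignored here

    data = msg.get("data", {})
    timestamp = msg.get("timestamp", {})
    return {
        "action": policy,
        "device": data.get("device"),    # assumed field name
        "reason": data.get("reason"),    # assumed field name
        "seconds": timestamp.get("seconds"),
    }
```

The policy is deliberately a plain parameter: whether to merely alert or to
reset/fence the VM is a decision for the layer above QEMU, not for the event
itself.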

Thanks,
Roman.
