On Tue, Jun 21, 2022 at 01:55:25PM +0200, Markus Armbruster wrote:
> Roman Kagan <rvka...@yandex-team.ru> writes:
>
> > On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> >> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> >> > Roman Kagan <rvka...@yandex-team.ru> writes:
> >> >
> >> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> > >> Konstantin Khlebnikov <khlebni...@yandex-team.ru> writes:
> >> > >>
> >> > >> > This event represents device runtime errors to give time and
> >> > >> > reason why device is broken.
> >> > >>
> >> > >> Can you give one or more examples of the "device runtime errors"
> >> > >> you have in mind?
> >> > >
> >> > > Initially we wanted to address a situation when a vhost device
> >> > > discovered an inconsistency during virtqueue processing and silently
> >> > > stopped the virtqueue.  This resulted in device stall (partial for
> >> > > multiqueue devices) and we were the last to notice that.
> >> > >
> >> > > The solution appeared to be to employ errfd and, upon receiving a
> >> > > notification through it, to emit a QMP event which is actionable in the
> >> > > management layer or further up the stack.
> >> > >
> >> > > Then we observed that virtio (non-vhost) devices suffer from the same
> >> > > issue: they only log the error but don't signal it to the management
> >> > > layer.  The case was very similar so we thought it would make sense to
> >> > > share the infrastructure and the QMP event between virtio and vhost.
> >> > >
> >> > > Then Konstantin went a bit further and generalized the concept into
> >> > > generic "device runtime error".  I'm personally not completely
> >> > > convinced this generalization is appropriate here; we'd appreciate
> >> > > the opinions from the community on the matter.
> >> >
> >> > "Device emulation sending an event on entering certain error states, so
> >> > that a management application can do something about it" feels
> >> > reasonable enough to me as a general concept.
> >> >
> >> > The key point is of course "can do something": the event needs to be
> >> > actionable.  Can you describe possible actions for the cases you
> >> > implement?
> >>
> >> The first one that we had in mind was informational, like triggering an
> >> alert in the monitoring system and/or painting the VM as malfunctioning
> >> in the owner's UI.
> >>
> >> There can be more advanced scenarios like autorecovery by resetting the
> >> faulty VM, or fencing it if it's a cluster member.
> >
> > The discussion kind of stalled here.
>
> My apologies...
>
> > Do you think the approach makes
> > sense or not?  Should we try and resubmit the series with a proper cover
> > letter and possibly other improvements or is it a dead end?
>
> As QAPI schema maintainer, my concern is interface design.  To sell this
> interface to me (so to speak), you have to show it's useful and
> reasonably general.  Reasonably general, because we don't want to
> accumulate one-offs, even if they have their uses.
>
> I think this is mostly a matter of commit message(s) and documentation
> here.  Explain your intended use cases.  Maybe hand-wave at other use
> cases you can think of.  Document that you're implementing the event
> only for the specific errors you need, but that it could be implemented
> more widely as needed.  "Complete" feels impractical, though.
>
> Makes sense?

Absolutely.  We'll rework and resubmit the series addressing the issues
you've noted, and we'll see how it goes.

Thanks,
Roman.
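
For illustration only, the management-layer actions mentioned in the thread (alerting, autorecovery by reset, fencing a cluster member) could be sketched roughly as below. The event name DEVICE_ERROR, its "device" and "reason" fields, and the policy flags are hypothetical placeholders and are not taken from the series or from QEMU's schema; only the outer envelope (event, data, timestamp) follows the usual QMP event shape.

```python
import json

# Hypothetical event name; the thread does not say what the event is called.
HYPOTHETICAL_EVENT = "DEVICE_ERROR"


def choose_action(qmp_line, cluster_member=False, auto_recover=False):
    """Pick an action for one QMP message, following the use cases in the thread.

    The policy flags are assumptions made for this sketch, not an existing API.
    """
    msg = json.loads(qmp_line)
    if msg.get("event") != HYPOTHETICAL_EVENT:
        return "ignore"   # some other QMP event or a command reply
    if cluster_member:
        return "fence"    # isolate the faulty cluster member
    if auto_recover:
        return "reset"    # autorecovery by resetting the faulty VM
    return "alert"        # informational: monitoring alert / mark VM in the UI


# Example message; the "data" fields are hypothetical.
example = json.dumps({
    "event": HYPOTHETICAL_EVENT,
    "data": {"device": "virtio-net0", "reason": "virtqueue inconsistency"},
    "timestamp": {"seconds": 1655812525, "microseconds": 0},
})

print(choose_action(example))
```

Keeping the decision in one small function makes the "actionable" part of the event explicit: each branch corresponds to one of the scenarios named in the discussion.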