On 03/23/2017 06:09 PM, Michael S. Tsirkin wrote: > On Mon, Mar 06, 2017 at 05:04:41PM +0100, Cornelia Huck wrote: >> On Mon, 6 Mar 2017 16:21:13 +0100 >> Halil Pasic <pa...@linux.vnet.ibm.com> wrote: >> >>> On 03/06/2017 03:56 PM, Cornelia Huck wrote: >>>> On Fri, 3 Mar 2017 14:08:37 +0100 >>>> Halil Pasic <pa...@linux.vnet.ibm.com> wrote: >>>> >>>>> On 03/03/2017 01:50 PM, Cornelia Huck wrote: >>>>>> On Fri, 3 Mar 2017 13:43:32 +0100 >>>>>> Halil Pasic <pa...@linux.vnet.ibm.com> wrote: >>>>>> >>>>>>> On 03/03/2017 01:21 PM, Cornelia Huck wrote: >>>>>>>> On Thu, 2 Mar 2017 19:59:42 +0100 >>>>>>>> Halil Pasic <pa...@linux.vnet.ibm.com> wrote: >>> [...] >>>>> I admit, I did not investigate this thoroughly, also because the patch >>>>> is flawed regarding multi-thread anyway. After a quick investigation >>>>> it seems the linux guest won't auto-reset the device so the guest should >>>>> end up with a not working device. I think it's pretty likely that the >>>>> admin will check the logs if the device was important. >>>> >>>> Thinking a bit more about this, it seems setting the device broken is >>>> not the right solution for exactly that reason. Setting the virtio >>>> device broken is a way to signal the guest to 'you did something >>>> broken; please reset the device and start anew' (and that's how current >>>> callers use it). In our case, this is not the guest's fault. >>> >>> Do we have something to just say stuff broken without blaming the guest? >>> And device reset might not be that stupid at all in the given situation, >>> if we want to save what can be saved from the perspective of the guest. >>> (After reset stuff should work again until we hit the race again -- and >>> since turning ioeventfd on/off should not happen that often during normal >>> operation it could help limit damage suffered -- e.g. controlled shutdown). >> >> Checking again, the spec says >> >> DEVICE_NEEDS_RESET (64) Indicates that the device has experienced an >> error from which it can’t recover. >> >> Nothing about 'guest error'. >> >> The only problem is that legacy devices don't have that state, which >> means they'll have a broken device through no fault of their own. >> >>> >>>> >>>> Maybe go back to the assert 'solution'? But I'm not sure that's enough >>>> if production builds disable asserts... >>>> >>> >>> I will wait a bit, maybe other virtio folks are going to have an >>> opinion too. >>> >>> My concern about the assert solution is that for production it is >>> either too rigorous (kill off, hopefully with a dump) or not >>> enough (as you have mentioned, if NDEBUG assert does nothing). >>> >>> >>> I think there are setups where a loss of device does not have to be >>> fatal, and I would not like to be the one who makes it fatal (for the >>> guest). >> >> Basically, it's a host bug (and not a bug specific to a certain >> device). Moving the device which was impacted to a broken state may be >> a useful mitigation. >> >> But yes, let's hear some other opinions. > > We don't support NDEBUG really so I think an assert is fine for now.
Thanks for your reply! You are right, virtio.c makes sure we do not have NDEBUG defined, but at the very same place a todo comment suggests that not supporting NDEBUG is a temporary measure (and a so called technical debt). If my understanding is correct, I should use assert instead of virtio_error for the next version. I that right? > Handling unexpected errors more gracefully is laudable but I think we > want a more systematic approach than just open-coding it in > this specific place. > Sorry, I mistook virtio_error for the systematic approach for handling (expected and unexpected) errors in virtio. Should I create a patch which documents what is considered an (in)appropriate usage of virtio_error? AFAIU (based on Connie's comments) virtio_error should not be used for reporting QEMU errors (or should only be used to report guest errors). By the way I have read the exchange on '[PATCH 1/5] virtio: Error object based virtio_error()'. The things you suggest there do make a lot of sense to me. Unfortunately I can't tell if I'm going to have the time to come up with an appropriate patch(set), so I won't volunteer for now. Regards, Halil