On Mon, May 08 2023, Markus Armbruster <arm...@redhat.com> wrote:

> css_clear_io_interrupt() aborts on unexpected ioctl() errors, and I
> wonder whether that's appropriate.  Let's have a closer look:
>
>     static void css_clear_io_interrupt(uint16_t subchannel_id,
>                                        uint16_t subchannel_nr)
>     {
>         Error *err = NULL;
>         static bool no_clear_irq;
>         S390FLICState *fs = s390_get_flic();
>         S390FLICStateClass *fsc = s390_get_flic_class(fs);
>         int r;
>
>         if (unlikely(no_clear_irq)) {
>             return;
>         }
>         r = fsc->clear_io_irq(fs, subchannel_id, subchannel_nr);
>         switch (r) {
>         case 0:
>             break;
>         case -ENOSYS:
>             no_clear_irq = true;
>             /*
>             * Ignore unavailability, as the user can't do anything
>             * about it anyway.
>             */
>             break;
>         default:
>             error_setg_errno(&err, -r, "unexpected error condition");
>             error_propagate(&error_abort, err);
>         }
>     }
>
> The default case is abort() with a liberal amount of lipstick applied.
> Let's ignore the lipstick and focus on the abort().
>
> fsc->clear_io_irq ist either qemu_s390_clear_io_flic() order
> kvm_s390_clear_io_flic().
>
> Only kvm_s390_clear_io_flic() can return non-zero: -errno when ioctl()
> fails.
>
> The ioctl() is KVM_SET_DEVICE_ATTR for KVM_DEV_FLIC_CLEAR_IO_IRQ with
> subchannel_id and subchannel_nr.  I.e. we assume that this can only fail
> with ENOSYS, und crash hard when the assumption turns out to be wrong.
>
> Is this error condition a programming error?  I figure it can be one
> only if the ioctl()'s contract promises us it cannot fail in any other
> way unless we violate preconditions.
>
> Is the error condition fatal, i.e. continuing would be unsafe?
>
> If it's a fatal programming error, then abort() is appropriate.
>
> If it's fatal, but not a programming error, we should exit(1) instead.
>
> If it's a survivable programming error, use of abort() is a matter of
> taste.

>From what I remember, this was introduced to clean up a potentially
queued interrupt that is not supposed to be delivered, so the worst
thing that could happen on failure is a spurious interrupt (same as what
could happen if the kernel flic doesn't provide this function in the
first place.) My main worry would be changes/breakages on the kernel
side (while the QEMU side remains unchanged).

So, I think we should continue to log the error in any case; but I don't
have a strong opinion as to whether we should use exit(1) (as I wouldn't
consider it a programming error) or just continue. Halil, your choice :)


Reply via email to