On 08/31/2017 11:19 AM, Cornelia Huck wrote:
> On Wed, 30 Aug 2017 18:36:02 +0200
> Halil Pasic <pa...@linux.vnet.ibm.com> wrote:
>
>> According to the POP a start subchannel instruction (SSCH) returning with
>> cc 1 implies that the subchannel was status pending when SSCH executed.
>>
>> Due to a somewhat confusing error handling, where error codes are mapped
>> to cc value, sane looking error codes result in non AR compliant
>> behavior.
>>
>> Let's fix this! Instead of cc 1 we use cc 3 which means device not
>> operational, and is much closer to the truth in the given cases.
>>
>> Signed-off-by: Halil Pasic <pa...@linux.vnet.ibm.com>
>> Acked-by: Pierre Morel<pmo...@linux.vnet.ibm.com>
>> ---
>>
>> This patch turned out quite controversial. We did not reach a consensus
>> during the internal review.
>>
>> The most of the discussion revolved around the ORB flag which
>> architecturally must be supported, but are currently not supported by
>> vfio-ccw (not yet, or can't be). The idea showing the most promise for
>> consensus was to handle this via device status (along the lines better a
>> strange acting device than a non-conform machine) but since it's a
>> radical change we decided to first discuss upstream and then do whatever
>> needs to be done.
>> ---
>>  hw/s390x/css.c      | 15 ++++++---------
>>  hw/s390x/s390-ccw.c |  2 +-
>>  2 files changed, 7 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/s390x/css.c b/hw/s390x/css.c
>> index a50fb0727e..0822538cde 100644
>> --- a/hw/s390x/css.c
>> +++ b/hw/s390x/css.c
>> @@ -1034,7 +1034,7 @@ static int sch_handle_start_func_passthrough(SubchDev *sch)
>>       */
>>      if (!(orb->ctrl0 & ORB_CTRL0_MASK_PFCH) ||
>>          !(orb->ctrl0 & ORB_CTRL0_MASK_C64)) {
>> -        return -EINVAL;
>> +        return -ENODEV;
>
> This feels wrong. If we don't support this yet, doing something like a
> channel-program check or an operand exception feels closer to the
> architecture than indicating a gone device.

I disagree: a channel-program check, an operand exception, or cc 1 (the
current solution) makes the machine obviously non-conforming. My train of
thought was that architecturally you can lose the connection to the device
at any time (you can't prohibit admins from pulling cables or smashing
equipment with a 10 kg hammer). Also, from the guest OS perspective, I think
saying device not operational could provoke a proper reaction from the
guest OS: that is, just give up on the device. The things you propose would,
in my opinion, put the blame on the guest OS driver (for making
non-conforming requests), so in that case it would make sense to give up on
the driver (but the same driver could work wonderfully with, let's say, a
fully emulated device).

As I have stated in the cover letter of this patch, I would find setting
device status even better, but I wanted to discuss first before going from
setting cc to something else. Setting cc was not my idea in the first place
(AFAIK the -EINVAL here effectively triggers cc 1).

>
>>      }
>>
>>      ret = s390_ccw_cmd_request(orb, s, sch->driver_data);
>> @@ -1046,16 +1046,13 @@ static int sch_handle_start_func_passthrough(SubchDev *sch)
>>          break;
>>      case -ENODEV:
>>          break;
>> +    case -EFAULT:
>> +        break;
>>      case -EACCES:
>>          /* Let's reflect an inaccessible host device by cc 3. */
>> -        ret = -ENODEV;
>> -        break;
>>      default:
>> -       /*
>> -        * All other return codes will trigger a program check,
>> -        * or set cc to 1.
>> -        */
>> -       break;
>> +        /* Let's make all other return codes map to cc 3. */
>> +        ret = -ENODEV;
>
> Why? This feels wrong. For those cases where we want to signal an error
> but cc 1 is conceptually wrong, either an operand exception (for very
> few cases) or a channel-program check feels more in line with the
> architecture.

Do you mean that the original code feels wrong? I keep the program check
for -EFAULT (that's why the case is added) and only change cc 1 to cc 3 for
the error codes that are not handled explicitly (the reason is stated in
the commit message).
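
To make the mapping under discussion concrete, here is a minimal standalone
sketch. It is not the QEMU code: the names SschOutcome, demo_ret_to_outcome
and demo_filter_passthrough_ret are hypothetical, and the caller-side mapping
(0 gives cc 0, -ENODEV gives cc 3, -EFAULT gives a program check, anything
else gives cc 1) is an assumption pieced together from the comments quoted in
this thread. Cases that lie outside the quoted hunk, apart from success, are
not modelled.

```c
#include <errno.h>

/* Hypothetical guest-visible outcome of an SSCH (names are illustrative). */
typedef enum {
    SSCH_CC0,           /* start function initiated */
    SSCH_CC1,           /* status pending */
    SSCH_CC3,           /* device not operational */
    SSCH_PROGRAM_CHECK  /* program interruption instead of a cc */
} SschOutcome;

/*
 * Assumed caller-side mapping from an errno-style return value to the
 * outcome the guest sees. Anything not listed explicitly ends up as cc 1,
 * which is the behavior the patch wants to avoid.
 */
static SschOutcome demo_ret_to_outcome(int ret)
{
    switch (ret) {
    case 0:
        return SSCH_CC0;
    case -ENODEV:
        return SSCH_CC3;
    case -EFAULT:
        return SSCH_PROGRAM_CHECK;
    default:
        return SSCH_CC1;
    }
}

/*
 * Filter modelled on the patched switch: success, -ENODEV and -EFAULT pass
 * through unchanged; -EACCES and every other code are folded into -ENODEV.
 */
static int demo_filter_passthrough_ret(int ret)
{
    switch (ret) {
    case 0:
    case -ENODEV:
    case -EFAULT:
        return ret;
    case -EACCES:
        /* An inaccessible host device is reflected as cc 3. */
    default:
        return -ENODEV;
    }
}
```

With the filter in place, an -EINVAL from the lower layer reaches the guest
as cc 3; without it, the same code would fall into the default branch of the
caller and surface as cc 1.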

>
> That's a general problem with doing stuff in the hypervisor: We have
> sets of internal problems that obviously don't show up in the
> architecture, and we can either handle them internally or use what the
> architecture offers for problem signaling. z/VM has probably faced the
> same problems :)

I agree.

>
>>      };
>>
>>      return ret;
>> @@ -1115,7 +1112,7 @@ static int do_subchannel_work(SubchDev *sch)
>>      if (sch->do_subchannel_work) {
>>          return sch->do_subchannel_work(sch);
>>      } else {
>> -        return -EINVAL;
>> +        return -ENODEV;
>
> This rather seems like a job for an assert? If we don't have a function
> for the 'asynchronous' handling of the various functions assigned for a
> subchannel, that looks like an internal error.
>

IMHO it depends. Aborting qemu is heavy-handed, and as a user I would not
be happy about it. But it certainly is an assert situation. We can look for
an even better solution, but I think this is an improvement. The logic
behind it is that the device is broken and can't be talked to properly.

>>      }
>>  }
>>
>> diff --git a/hw/s390x/s390-ccw.c b/hw/s390x/s390-ccw.c
>> index 8614dda6f8..2b0741741c 100644
>> --- a/hw/s390x/s390-ccw.c
>> +++ b/hw/s390x/s390-ccw.c
>> @@ -25,7 +25,7 @@ int s390_ccw_cmd_request(ORB *orb, SCSW *scsw, void *data)
>>      if (cdc->handle_request) {
>>          return cdc->handle_request(orb, scsw, data);
>>      } else {
>> -        return -ENOSYS;
>> +        return -ENODEV;
>
> If we get here, it means that we called a request handler (which is
> only done for the passthrough variety) without having assigned a
> request handler beforehand. This also looks like an internal error to
> me...
>

Certainly. Again, I was not the one who wrote or accepted the original
code. My previous comment about whether or not to assert applies here as
well.

>>      }
>>  }
>>
>
>
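
The two options for a missing handler can be put side by side in a small
standalone sketch. Everything here is hypothetical (DemoDeviceClass,
demo_request_soft, demo_request_strict, demo_ok_handler); it only mirrors
the shape of the code in the quoted hunks and assumes that -ENODEV is later
turned into cc 3 by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device class with an optional handler. */
typedef struct DemoDeviceClass {
    int (*handle_request)(void *data);
} DemoDeviceClass;

/*
 * Option 1 (what the patch does): report a missing handler as -ENODEV,
 * so the guest sees a device that cannot be talked to.
 */
static int demo_request_soft(const DemoDeviceClass *cdc, void *data)
{
    if (cdc->handle_request) {
        return cdc->handle_request(data);
    }
    return -ENODEV;
}

/*
 * Option 2 (the reviewer's suggestion): treat a missing handler as an
 * internal error and abort via assert().
 */
static int demo_request_strict(const DemoDeviceClass *cdc, void *data)
{
    assert(cdc->handle_request != NULL);
    return cdc->handle_request(data);
}

/* Trivial handler used only to exercise the two variants. */
static int demo_ok_handler(void *data)
{
    (void)data;
    return 0;
}
```

The soft variant keeps the process running at the price of hiding a
programming error behind a guest-visible condition; the strict variant
exposes the error immediately but takes the whole process down with it.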