Re: [PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups

2021-12-17 Thread Pierre Morel




On 12/16/21 16:16, Matthew Rosato wrote:

On 12/16/21 3:15 AM, Pierre Morel wrote:



On 12/7/21 22:04, Matthew Rosato wrote:

Let's use the reserved pool of simulated PCI groups to allow intercept
devices to have separate groups from interpreted devices as some group
values may be different. If we run out of simulated PCI groups, subsequent
intercept devices just get the default group.
Furthermore, if we encounter any PCI groups from hostdevs that are marked
as simulated, let's just assign them to the default group to avoid
conflicts between host simulated groups and our own simulated groups.


I have a problem here.
We will have the same hardware viewed by 2 different VFIO
implementations (interpretation vs interception) reporting different
group IDs.


Yes -- To be clear, this patch proposes that the interpreted device will 
continue to report the passthrough group ID and the intercept device 
will use a simulated group ID.




The alternative is to have them reporting the same group ID with
different values.




I don't think we can do this.  For starters, we would have to throw out 
the group tracking we do in QEMU; but for all we know the guest could be 
doing similar tracking -- the implication of the group ID is that 
everyone shares the same values so I don't think we can get away with 
reporting different values for 2 members of the same group.


I think the other alternative is rather to always do something like...

1) host reports its value via vfio capabilities as 'this is what an 
interpreted device can use'
2) QEMU must accept those values as-is OR reduce them to some subset of 
what both interpretation and intercept can support, and report only 
those values for all devices in the group.  (More on this further down)
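A minimal sketch of that reduction, assuming the group's CLP response
carries maxstbl and dtsm fields as discussed in this thread, and using
made-up limits for what the intercept path can emulate:

/* Hypothetical limits for QEMU's intercept (emulated) path. */
#define QEMU_INTERCEPT_MAXSTBL 0x80  /* assumed value */
#define QEMU_INTERCEPT_DTSM    0x40  /* assumed bitmask */

/*
 * Reduce a group's CLP values to the subset that both interpretation
 * and interception can honor, so every member of the group can be
 * given the same answers.
 */
static void clamp_group_values(ClpRspQueryPciGrp *resgrp)
{
    if (resgrp->maxstbl > QEMU_INTERCEPT_MAXSTBL) {
        resgrp->maxstbl = QEMU_INTERCEPT_MAXSTBL;
    }
    /* dtsm is a bitmask of supported translation modes; keep the
     * intersection of what both paths support. */
    resgrp->dtsm &= QEMU_INTERCEPT_DTSM;
}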




I fear both are wrong.

On the other hand, should we have a difference in the QEMU command
line between intercepted and interpreted devices for default values?


I'm not sure I follow what you suggest here.  Even if we somehow 
provided a command-line means for specifying some of these values, they 
would still be presented to the guest via clp and if the guest has 2 
devices in the same group the clp results had better be the same.


If not, why not give up the host values so that in a hypothetical
future migration we are clean with the GID?




Well, the interpreted device will use the passthrough group ID so in a 
hypothetical future migration scenario we should be good there.


OK, sorry, I must be clearer.
Let me reformulate.
If we have the same hardware and use VFIO, shouldn't we have the same
hardware description for the guest whether the admin chooses interception
or interpretation?


If the answer is that we do not care, whatever it may be, then it is OK.
After all, interpretation should be the privileged configuration; we can
accept that there are drawbacks to using interception.


So I think I worry too much about this, and having different groups is fine.



And simulated devices will still use the default group, so we should 
also be OK there.


This really changes the behavior for 2 other classes of device:

1) Intercept passthrough devices -- Yes, I agree that doing this is a
bit weird.  But my thinking was that these devices should be the
exception case rather than the norm moving forward, and it would clearly
delineate the difference in Query PCI FNGRP values.


2) nested simulated devices -- These aren't using real GIDs anyway and I 
would expect them to also be using the default group already -- forcing 
these to the default group was basically to make sure they didn't 
conflict with the simulated groups being created for intercept devices 
above.



I am not sure about this; I just want to open a little discussion on it.


FWIW, I'm not 100% on this either, so a better idea is welcome.  One
thing I don't like, for example, is that we only have 16 simulated
groups to work with, and we might find it useful later to split
simulated devices into different groups based on type.




For example, what could go wrong if we keep the host values returned
by the CAP?


As-is, we risk advertising the wrong maxstbl and dtsm value for some 
devices in the group, depending on which device is plugged first. 
Imagine you have 2 devices on group 5; one will be interpreted and the 
other intercepted.


If the interpreted device plugs first, we will use the passthrough 
maxstbl and dtsm for all devices in the group; so the intercept device 
gets these values too.


If the intercept device plugs first, we will use the QEMU value for DTSM
and the smaller maxstbl required for intercept passthrough.  So the
interpreted device gets these values too.
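The order dependence boils down to first-plug-wins caching of the group
values; roughly (a sketch of the effect, not the actual
s390_pci_read_group() flow -- device_maxstbl/device_dtsm stand in for
whichever path's values the first device brings along):

/*
 * Sketch: the first member device to plug creates the group and caches
 * its own CLP values; every later member inherits them unchanged.
 */
static void read_group_sketch(S390PCIBusDevice *pbdev,
                              uint16_t device_maxstbl, uint8_t device_dtsm)
{
    S390PCIGroup *group = s390_group_find(pbdev->zpci_fn.pfgid);

    if (!group) {
        /* Interpreted and intercept devices disagree on these values. */
        group = s390_group_create(pbdev->zpci_fn.pfgid,
                                  pbdev->zpci_fn.pfgid);
        group->zpci_group.maxstbl = device_maxstbl;
        group->zpci_group.dtsm = device_dtsm;
    }
    /* Later members report group->zpci_group as-is. */
    pbdev->pci_group = group;
}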


Worth noting, we could have more of these differences later -- but if we
want to avoid splitting the group, then I think we have to circle back
to my 'alternative idea' above and provide equivalent support or
toleration for intercept devices so that we can report a single group
value that both types can support.


So …

Re: [PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups

2021-12-16 Thread Matthew Rosato

On 12/16/21 3:15 AM, Pierre Morel wrote:



On 12/7/21 22:04, Matthew Rosato wrote:

Let's use the reserved pool of simulated PCI groups to allow intercept
devices to have separate groups from interpreted devices as some group
values may be different. If we run out of simulated PCI groups, subsequent
intercept devices just get the default group.
Furthermore, if we encounter any PCI groups from hostdevs that are marked
as simulated, let's just assign them to the default group to avoid
conflicts between host simulated groups and our own simulated groups.


I have a problem here.
We will have the same hardware viewed by 2 different VFIO implementations
(interpretation vs interception) reporting different group IDs.


Yes -- To be clear, this patch proposes that the interpreted device will 
continue to report the passthrough group ID and the intercept device 
will use a simulated group ID.




The alternative is to have them reporting the same group ID with
different values.




I don't think we can do this.  For starters, we would have to throw out 
the group tracking we do in QEMU; but for all we know the guest could be 
doing similar tracking -- the implication of the group ID is that 
everyone shares the same values so I don't think we can get away with 
reporting different values for 2 members of the same group.


I think the other alternative is rather to always do something like...

1) host reports its value via vfio capabilities as 'this is what an 
interpreted device can use'
2) QEMU must accept those values as-is OR reduce them to some subset of 
what both interpretation and intercept can support, and report only 
those values for all devices in the group.  (More on this further down)




I fear both are wrong.

On the other hand, should we have a difference in the QEMU command line
between intercepted and interpreted devices for default values?


I'm not sure I follow what you suggest here.  Even if we somehow 
provided a command-line means for specifying some of these values, they 
would still be presented to the guest via clp and if the guest has 2 
devices in the same group the clp results had better be the same.


If not, why not give up the host values so that in a hypothetical future
migration we are clean with the GID?




Well, the interpreted device will use the passthrough group ID so in a 
hypothetical future migration scenario we should be good there.


And simulated devices will still use the default group, so we should 
also be OK there.


This really changes the behavior for 2 other classes of device:

1) Intercept passthrough devices -- Yes, I agree that doing this is a
bit weird.  But my thinking was that these devices should be the
exception case rather than the norm moving forward, and it would clearly
delineate the difference in Query PCI FNGRP values.


2) nested simulated devices -- These aren't using real GIDs anyway and I 
would expect them to also be using the default group already -- forcing 
these to the default group was basically to make sure they didn't 
conflict with the simulated groups being created for intercept devices 
above.



I am not sure about this; I just want to open a little discussion on it.


FWIW, I'm not 100% on this either, so a better idea is welcome.  One
thing I don't like, for example, is that we only have 16 simulated
groups to work with, and we might find it useful later to split
simulated devices into different groups based on type.




For example, what could go wrong if we keep the host values returned by
the CAP?


As-is, we risk advertising the wrong maxstbl and dtsm value for some 
devices in the group, depending on which device is plugged first. 
Imagine you have 2 devices on group 5; one will be interpreted and the 
other intercepted.


If the interpreted device plugs first, we will use the passthrough 
maxstbl and dtsm for all devices in the group; so the intercept device 
gets these values too.


If the intercept device plugs first, we will use the QEMU value for DTSM
and the smaller maxstbl required for intercept passthrough.  So the
interpreted device gets these values too.


Worth noting, we could have more of these differences later -- but if we
want to avoid splitting the group, then I think we have to circle back
to my 'alternative idea' above and provide equivalent support or
toleration for intercept devices so that we can report a single group
value that both types can support.


So insofar as dealing with the differences today...  maxstbl is pretty
easy: we can just tolerate supporting the larger maxstbl in QEMU by
adding logic to break up the I/O in pcistb_service_call.  We might have
to provide 2 different maxstbl values over vfio capabilities, however
(what interpretation can support vs what the kernel API supports for
intercept, as this could change between host kernel versions).
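Breaking up the I/O could look roughly like this (a sketch only -- the
helper name, signature, and host_maxstbl limit are assumptions, not the
actual pcistb_service_call code):

#include <stdint.h>

/* Assumed helper: issue one store block no larger than the host limit. */
extern int zpci_store_block_one(uint64_t gaddr, uint64_t offset,
                                uint16_t len);

/*
 * Split a guest PCISTB whose length exceeds what the intercept path
 * supports into host-sized pieces.
 */
static int pcistb_split(uint64_t gaddr, uint64_t offset, uint16_t len,
                        uint16_t host_maxstbl)
{
    while (len > 0) {
        uint16_t chunk = len < host_maxstbl ? len : host_maxstbl;

        if (zpci_store_block_one(gaddr, offset, chunk) < 0) {
            return -1;
        }
        gaddr += chunk;
        offset += chunk;
        len -= chunk;
    }
    return 0;
}

(Whether splitting preserves the architected semantics of a single store
block is a separate question.)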


DTSM is a little trickier.  We are actually OK today because both
intercept and interpreted devices will report the same …

Re: [PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups

2021-12-16 Thread Pierre Morel




On 12/7/21 22:04, Matthew Rosato wrote:

Let's use the reserved pool of simulated PCI groups to allow intercept
devices to have separate groups from interpreted devices as some group
values may be different. If we run out of simulated PCI groups, subsequent
intercept devices just get the default group.
Furthermore, if we encounter any PCI groups from hostdevs that are marked
as simulated, let's just assign them to the default group to avoid
conflicts between host simulated groups and our own simulated groups.


I have a problem here.
We will have the same hardware viewed by 2 different VFIO implementations
(interpretation vs interception) reporting different group IDs.


The alternative is to have them reporting the same group ID with
different values.


I fear both are wrong.

On the other hand, should we have a difference in the QEMU command line
between intercepted and interpreted devices for default values?
If not, why not give up the host values so that in a hypothetical future
migration we are clean with the GID?


I am not sure about this; I just want to open a little discussion on it.

For example, what could go wrong if we keep the host values returned by the CAP?




Signed-off-by: Matthew Rosato 
---
  hw/s390x/s390-pci-bus.c         | 19 ++--
  hw/s390x/s390-pci-vfio.c        | 40 ++---
  include/hw/s390x/s390-pci-bus.h |  6 -
  3 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index ab442f17fb..8b0f3ef120 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -747,13 +747,14 @@ static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn)
     object_unref(OBJECT(iommu));
 }
 
-S390PCIGroup *s390_group_create(int id)
+S390PCIGroup *s390_group_create(int id, int host_id)
 {
     S390PCIGroup *group;
     S390pciState *s = s390_get_phb();
 
     group = g_new0(S390PCIGroup, 1);
     group->id = id;
+    group->host_id = host_id;
     QTAILQ_INSERT_TAIL(&s->zpci_groups, group, link);
     return group;
 }
@@ -771,12 +772,25 @@ S390PCIGroup *s390_group_find(int id)
     return NULL;
 }
 
+S390PCIGroup *s390_group_find_host_sim(int host_id)
+{
+    S390PCIGroup *group;
+    S390pciState *s = s390_get_phb();
+
+    QTAILQ_FOREACH(group, &s->zpci_groups, link) {
+        if (group->id >= ZPCI_SIM_GRP_START && group->host_id == host_id) {
+            return group;
+        }
+    }
+    return NULL;
+}
+
 static void s390_pci_init_default_group(void)
 {
     S390PCIGroup *group;
     ClpRspQueryPciGrp *resgrp;
 
-    group = s390_group_create(ZPCI_DEFAULT_FN_GRP);
+    group = s390_group_create(ZPCI_DEFAULT_FN_GRP, ZPCI_DEFAULT_FN_GRP);
     resgrp = &group->zpci_group;
     resgrp->fr = 1;
     resgrp->dasm = 0;
@@ -824,6 +838,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
                                            NULL, g_free);
     s->zpci_table = g_hash_table_new_full(g_int_hash, g_int_equal, NULL, NULL);
     s->bus_no = 0;
+    s->next_sim_grp = ZPCI_SIM_GRP_START;
     QTAILQ_INIT(&s->pending_sei);
     QTAILQ_INIT(&s->zpci_devs);
     QTAILQ_INIT(&s->zpci_dma_limit);
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index c9269683f5..bdc5892287 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -305,13 +305,17 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
 {
     struct vfio_info_cap_header *hdr;
     struct vfio_device_info_cap_zpci_group *cap;
+    S390pciState *s = s390_get_phb();
     ClpRspQueryPciGrp *resgrp;
     VFIOPCIDevice *vpci =  container_of(pbdev->pdev, VFIOPCIDevice, pdev);
 
     hdr = vfio_get_device_info_cap(info, VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
 
-    /* If capability not provided, just use the default group */
-    if (hdr == NULL) {
+    /*
+     * If capability not provided or the underlying hostdev is simulated, just
+     * use the default group.
+     */
+    if (hdr == NULL || pbdev->zpci_fn.pfgid >= ZPCI_SIM_GRP_START) {
         trace_s390_pci_clp_cap(vpci->vbasedev.name,
                                VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
         pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
@@ -320,11 +324,41 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
     }
     cap = (void *) hdr;
 
+    /*
+     * For an intercept device, let's use an existing simulated group if one
+     * was already created for other intercept devices in this group.
+     * If not, create a new simulated group if any are still available.
+     * If all else fails, just fall back on the default group.
+     */
+    if (!pbdev->interp) {
+        pbdev->pci_group = s390_group_find_host_sim(pbdev->zpci_fn.pfgid);
+        if (pbdev->pci_group) {
+            /* Use existing simulated group */
+            pbdev->zpci_fn.pfgid = pbdev->pci_group->id;
+            return;
+        } else {
+            if (s->next_sim_grp == …
[PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups

2021-12-07 Thread Matthew Rosato
Let's use the reserved pool of simulated PCI groups to allow intercept
devices to have separate groups from interpreted devices as some group
values may be different. If we run out of simulated PCI groups, subsequent
intercept devices just get the default group.
Furthermore, if we encounter any PCI groups from hostdevs that are marked
as simulated, let's just assign them to the default group to avoid
conflicts between host simulated groups and our own simulated groups.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c         | 19 ++--
 hw/s390x/s390-pci-vfio.c        | 40 ++---
 include/hw/s390x/s390-pci-bus.h |  6 -
 3 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index ab442f17fb..8b0f3ef120 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -747,13 +747,14 @@ static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn)
     object_unref(OBJECT(iommu));
 }
 
-S390PCIGroup *s390_group_create(int id)
+S390PCIGroup *s390_group_create(int id, int host_id)
 {
     S390PCIGroup *group;
     S390pciState *s = s390_get_phb();
 
     group = g_new0(S390PCIGroup, 1);
     group->id = id;
+    group->host_id = host_id;
     QTAILQ_INSERT_TAIL(&s->zpci_groups, group, link);
     return group;
 }
@@ -771,12 +772,25 @@ S390PCIGroup *s390_group_find(int id)
     return NULL;
 }
 
+S390PCIGroup *s390_group_find_host_sim(int host_id)
+{
+    S390PCIGroup *group;
+    S390pciState *s = s390_get_phb();
+
+    QTAILQ_FOREACH(group, &s->zpci_groups, link) {
+        if (group->id >= ZPCI_SIM_GRP_START && group->host_id == host_id) {
+            return group;
+        }
+    }
+    return NULL;
+}
+
 static void s390_pci_init_default_group(void)
 {
     S390PCIGroup *group;
     ClpRspQueryPciGrp *resgrp;
 
-    group = s390_group_create(ZPCI_DEFAULT_FN_GRP);
+    group = s390_group_create(ZPCI_DEFAULT_FN_GRP, ZPCI_DEFAULT_FN_GRP);
     resgrp = &group->zpci_group;
     resgrp->fr = 1;
     resgrp->dasm = 0;
@@ -824,6 +838,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
                                            NULL, g_free);
     s->zpci_table = g_hash_table_new_full(g_int_hash, g_int_equal, NULL, NULL);
     s->bus_no = 0;
+    s->next_sim_grp = ZPCI_SIM_GRP_START;
     QTAILQ_INIT(&s->pending_sei);
     QTAILQ_INIT(&s->zpci_devs);
     QTAILQ_INIT(&s->zpci_dma_limit);
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index c9269683f5..bdc5892287 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -305,13 +305,17 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
 {
     struct vfio_info_cap_header *hdr;
     struct vfio_device_info_cap_zpci_group *cap;
+    S390pciState *s = s390_get_phb();
     ClpRspQueryPciGrp *resgrp;
     VFIOPCIDevice *vpci =  container_of(pbdev->pdev, VFIOPCIDevice, pdev);
 
     hdr = vfio_get_device_info_cap(info, VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
 
-    /* If capability not provided, just use the default group */
-    if (hdr == NULL) {
+    /*
+     * If capability not provided or the underlying hostdev is simulated, just
+     * use the default group.
+     */
+    if (hdr == NULL || pbdev->zpci_fn.pfgid >= ZPCI_SIM_GRP_START) {
         trace_s390_pci_clp_cap(vpci->vbasedev.name,
                                VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
         pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
@@ -320,11 +324,41 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
     }
     cap = (void *) hdr;
 
+    /*
+     * For an intercept device, let's use an existing simulated group if one
+     * was already created for other intercept devices in this group.
+     * If not, create a new simulated group if any are still available.
+     * If all else fails, just fall back on the default group.
+     */
+    if (!pbdev->interp) {
+        pbdev->pci_group = s390_group_find_host_sim(pbdev->zpci_fn.pfgid);
+        if (pbdev->pci_group) {
+            /* Use existing simulated group */
+            pbdev->zpci_fn.pfgid = pbdev->pci_group->id;
+            return;
+        } else {
+            if (s->next_sim_grp == ZPCI_DEFAULT_FN_GRP) {
+                /* All out of simulated groups, use default */
+                trace_s390_pci_clp_cap(vpci->vbasedev.name,
+                                       VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
+                pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
+                pbdev->pci_group = s390_group_find(ZPCI_DEFAULT_FN_GRP);
+                return;
+            } else {
+                /* We can assign a new simulated group */
+                pbdev->zpci_fn.pfgid = s->next_sim_grp;
+                s->next_sim_grp++;
+                /* Fall through to create the new sim group using CLP info */
+            }
+        }
+    }
+
     /* See if the PCI group is already defined, create if not */
     pbdev->pci_group = …
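For completeness: the include/hw/s390x/s390-pci-bus.h hunk from the
diffstat is not shown in this capture. Based on the code above, it
plausibly amounts to the following (the identifiers are taken from the
patch, but the 0xF0/0xFF constant values are assumptions):

/* Sketch of the header additions implied by the code above. */
#define ZPCI_DEFAULT_FN_GRP 0xFF    /* pre-existing default group */
#define ZPCI_SIM_GRP_START  0xF0    /* assumed start of the reserved
                                       simulated-group ID pool */

typedef struct S390PCIGroup {
    ClpRspQueryPciGrp zpci_group;
    int id;
    int host_id;    /* new: the host group this simulated group mirrors */
    QTAILQ_ENTRY(S390PCIGroup) link;
} S390PCIGroup;

S390PCIGroup *s390_group_create(int id, int host_id);
S390PCIGroup *s390_group_find(int id);
S390PCIGroup *s390_group_find_host_sim(int host_id);    /* new */

/* S390pciState additionally gains:  int next_sim_grp;  */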