On 2024/02/20 23:53, Kevin Wolf wrote:
On 20.02.2024 at 15:29, Kevin Wolf wrote:
On 20.02.2024 at 13:24, Akihiko Odaki wrote:
nvme_sriov_pre_write_ctrl() used to directly inspect the SR-IOV
configuration to determine the number of VFs being disabled by SR-IOV
configuration writes, but the logic was flawed and resulted in
out-of-bounds memory access.

It assumed PCI_SRIOV_NUM_VF always has the number of currently enabled
VFs, but it actually doesn't in the following cases:
- PCI_SRIOV_NUM_VF has been set but PCI_SRIOV_CTRL_VFE has never been.
- PCI_SRIOV_NUM_VF was written after PCI_SRIOV_CTRL_VFE was set.
- VFs were only partially enabled because of realization failure.

It is the responsibility of pcie_sriov to interpret the SR-IOV
configuration, and pcie_sriov does so correctly, so use
pcie_sriov_num_vfs(), which it provides, to get the number of enabled
VFs before and after SR-IOV configuration writes.
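
(For reference, pcie_sriov_num_vfs() is introduced earlier in this series;
presumably it just returns the VF count that pcie_sriov already tracks when
registering and unregistering VFs, rather than re-reading PCI_SRIOV_NUM_VF
from config space. Roughly:

    uint16_t pcie_sriov_num_vfs(PCIDevice *dev)
    {
        /* count of VFs actually realized by pcie_sriov, not the raw register */
        return dev->exp.sriov_pf.num_vfs;
    }
)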

Cc: qemu-sta...@nongnu.org
Fixes: CVE-2024-26328
Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command")
Suggested-by: Michael S. Tsirkin <m...@redhat.com>
Signed-off-by: Akihiko Odaki <akihiko.od...@daynix.com>
---
  hw/nvme/ctrl.c | 26 ++++++++------------------
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f026245d1e9e..7a56e7b79b4d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8466,36 +8466,26 @@ static void nvme_pci_reset(DeviceState *qdev)
      nvme_ctrl_reset(n, NVME_RESET_FUNCTION);
  }
-static void nvme_sriov_pre_write_ctrl(PCIDevice *dev, uint32_t address,
-                                      uint32_t val, int len)
+static void nvme_sriov_post_write_config(PCIDevice *dev, uint16_t old_num_vfs)
  {
      NvmeCtrl *n = NVME(dev);
      NvmeSecCtrlEntry *sctrl;
-    uint16_t sriov_cap = dev->exp.sriov_cap;
-    uint32_t off = address - sriov_cap;
-    int i, num_vfs;
+    int i;
-    if (!sriov_cap) {
-        return;
-    }
-
-    if (range_covers_byte(off, len, PCI_SRIOV_CTRL)) {
-        if (!(val & PCI_SRIOV_CTRL_VFE)) {
-            num_vfs = pci_get_word(dev->config + sriov_cap + PCI_SRIOV_NUM_VF);
-            for (i = 0; i < num_vfs; i++) {
-                sctrl = &n->sec_ctrl_list.sec[i];
-                nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
-            }
-        }
+    for (i = pcie_sriov_num_vfs(dev); i < old_num_vfs; i++) {
+        sctrl = &n->sec_ctrl_list.sec[i];
+        nvme_virt_set_state(n, le16_to_cpu(sctrl->scid), false);
      }
  }
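
(The hunk above only shows the new helper; the rest of the patch presumably
captures old_num_vfs in the config space write hook before the default write
handler runs, along these lines — a sketch with names assumed from the hunk,
not the exact remaining hunks:

    static void nvme_pci_write_config(PCIDevice *dev, uint32_t address,
                                      uint32_t val, int len)
    {
        /* number of VFs enabled before the guest's config space write */
        uint16_t old_num_vfs = pcie_sriov_num_vfs(dev);

        pci_default_write_config(dev, address, val, len);
        /* ... any other existing write-config handling stays as before ... */
        nvme_sriov_post_write_config(dev, old_num_vfs);
    }

so the loop above disables exactly the VFs that went away as a result of the
write.)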

Maybe I'm missing something, but if the concern is that 'i' could run
beyond the end of the array, I don't see anything that limits
pcie_sriov_num_vfs() to the static size of 127 that n->sec_ctrl_list.sec
has. register_vfs() seems to just take whatever 16-bit value the guest
wrote without imposing additional restrictions.

If there is some mechanism that makes register_vfs() fail if we exceed
the limit, maybe an assertion with a comment would be in order because
it doesn't seem obvious. I couldn't find any code that enforces it,
sriov_max_vfs only ever seems to be used as a hint for the guest.

If not, do we need another check that fails gracefully in the error
case?

Ok, I see now that patch 2 fixes this. But then the commit message is
wrong because it implies that this patch is the only thing you need to
fix the problem with nvme. You can't say "Fixes: CVE-2024-26328" if half
of the fix is still missing.

I didn't assign CVE-2024-26328 to the case where the value of PCI_SRIOV_NUM_VF is greater than PCI_SRIOV_TOTAL_VF; that is what CVE-2024-26327 deals with.

The problem I dealt with here is that the value of PCI_SRIOV_NUM_VF may not represent the actual number of enabled VFs, for example because the PCI_SRIOV_CTRL_VFE bit is not set.

If an assertion is to be added, I think it should be in pcie_sriov_num_vfs() and should ensure that the returned value does not exceed the value of PCI_SRIOV_TOTAL_VF (aka sriov_max_vfs in hw/nvme/ctrl.c), but I think it's fine without it.
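
Something along these lines, if we did want it (a sketch only; whether to
read PCI_SRIOV_TOTAL_VF back from config space here is an assumption):

    uint16_t pcie_sriov_num_vfs(PCIDevice *dev)
    {
        uint16_t sriov_cap = dev->exp.sriov_cap;
        uint16_t num_vfs = dev->exp.sriov_pf.num_vfs;

        /* never report more VFs than the device advertises */
        assert(!sriov_cap ||
               num_vfs <= pci_get_word(dev->config + sriov_cap +
                                       PCI_SRIOV_TOTAL_VF));
        return num_vfs;
    }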


Maybe asserting old_num_vfs < n->sec_ctrl_list.numcntl would still be a
good idea. But looking at this one, it seems to me that numcntl isn't
completely correct either:

     list->numcntl = cpu_to_le16(max_vfs);

Both list->numcntl and max_vfs are uint8_t, so I think this will always
be 0 on big endian hosts?

Indeed it looks wrong. Will you write a patch?
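
(Since numcntl is a single byte, the fix is presumably just to drop the byte
swap:

    list->numcntl = max_vfs;
)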

Regards,
Akihiko Odaki
