On Mon, Nov 25, 2019 at 09:16:10AM +0000, Denis Plotnikov wrote:
>
>
> On 06.11.2019 15:03, Michael S. Tsirkin wrote:
> > On Thu, Oct 24, 2019 at 11:34:34AM +0000, Denis Lunev wrote:
> >> On 10/24/19 12:28 AM, Michael S. Tsirkin wrote:
> >>> On Fri, Oct 18, 2019 at 02:55:47PM +0300, Denis Plotnikov wrote:
> >>>> From: "Denis V. Lunev" <d...@openvz.org>
> >>>>
> >>>> Linux guests submit IO requests no longer than PAGE_SIZE * max_seg,
> >>>> the field reported by the SCSI controller. Thus a typical sequential
> >>>> read of 1 MB results in the following pattern of IO from the guest:
> >>>>   8,16   1  15754  2.766095122  2071  D  R  2095104 + 1008 [dd]
> >>>>   8,16   1  15755  2.766108785  2071  D  R  2096112 + 1008 [dd]
> >>>>   8,16   1  15756  2.766113486  2071  D  R  2097120 + 32 [dd]
> >>>>   8,16   1  15757  2.767668961     0  C  R  2095104 + 1008 [0]
> >>>>   8,16   1  15758  2.768534315     0  C  R  2096112 + 1008 [0]
> >>>>   8,16   1  15759  2.768539782     0  C  R  2097120 + 32 [0]
> >>>> The IO was generated by
> >>>>   dd if=/dev/sda of=/dev/null bs=1024 iflag=direct
> >>>>
> >>>> This effectively means that on rotational disks we will observe 3 IOPS
> >>>> for each 2 MB processed. This definitely negatively affects both
> >>>> guest and host IO performance.
> >>>>
> >>>> The cure is relatively simple - we should report the lengthy
> >>>> scatter-gather ability of the SCSI controller. Fortunately the
> >>>> situation here is very good. The VirtIO transport layer can
> >>>> accommodate 1024 items in one request while we are using only 128.
> >>>> This situation has been present almost since the very beginning.
> >>>> 2 items are dedicated to request metadata, thus we should publish
> >>>> VIRTQUEUE_MAX_SIZE - 2 as max_seg.
> >>>>
> >>>> The following pattern is observed after the patch:
> >>>>   8,16   1   9921  2.662721340  2063  D  R  2095104 + 1024 [dd]
> >>>>   8,16   1   9922  2.662737585  2063  D  R  2096128 + 1024 [dd]
> >>>>   8,16   1   9923  2.665188167     0  C  R  2095104 + 1024 [0]
> >>>>   8,16   1   9924  2.665198777     0  C  R  2096128 + 1024 [0]
> >>>> which is much better.
> >>>>
> >>>> The dark side of this patch is that we are tweaking a guest-visible
> >>>> parameter, though this should be relatively safe, as the above
> >>>> transport layer support has been present in QEMU/host Linux for a
> >>>> very long time. The patch adds a configurable property for VirtIO
> >>>> SCSI with a new default, and a hardcoded option for VirtIO Block,
> >>>> which does not provide a good configurable framework.
> >>>>
> >>>> Unfortunately the commit can not be applied as is. For the real cure
> >>>> we need the guest to be fixed to accommodate that queue length,
> >>>> which is done only in the latest 4.14 kernel. Thus we are going to
> >>>> expose the property and tweak it at the machine type level.
> >>>>
> >>>> The problem with the old kernels is that they have the
> >>>> max_segments <= virtqueue_size restriction, the violation of which
> >>>> causes the guest to crash.
> >>> This isn't just in the guests: the virtio spec also seems to imply
> >>> this, or at least be vague on this point.
> >>>
> >>> So I think it'll need a feature bit.
> >>> Doing that in a safe way will also allow being compatible with old
> >>> guests.
> >>>
> >>> The only downside is it's a bit more work, as we need to
> >>> spec this out and add guest support.
> >>>
> >>>> To fix the case described above in the old kernels we can increase
> >>>> virtqueue_size to 256 and max_segments to 254. The pitfall here is
> >>>> that seabios limits the virtqueue size to 128; however, the seabios
> >>>> patch extending that value to 256 is pending.
> >>> And the fix here is just to limit large vq size to virtio 1.0.
> >>> In that mode it's fine I think:
> >>>
> >>>
> >>>     /* check if the queue is available */
> >>>     if (vp->use_modern) {
> >>>         num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
> >>>         if (num > MAX_QUEUE_NUM) {
> >>>             vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
> >>>                      MAX_QUEUE_NUM);
> >>>             num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
> >>>         }
> >>>     } else {
> >>>         num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
> >>>     }
> The same seabios snippet, but more detailed:
>
> vp_find_vq()
> {
>     ...
>     /* check if the queue is available */
>     if (vp->use_modern) {
>         num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>         if (num > MAX_QUEUE_NUM) {
>             vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
>                      MAX_QUEUE_NUM);
>             num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
So how about we drop this last line in the bios? That will fix things for
existing hypervisors. The spec does not say guests need to re-read it.

>         }
>     } else {
>         num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
>     }
>     if (!num) {
>         dprintf(1, "ERROR: queue size is 0\n");
>         goto fail;
>     }
>     if (num > MAX_QUEUE_NUM) {
>         dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>         goto fail;
>     }
>     ...
> }
>
> It turned out that the problem is here, but not because of the seabios
> code. The virtqueue size is written, and then an incorrect value is
> re-read. Thanks to Roman Kagan (rka...@virtuozzo.com) for investigating
> the root cause of the problem.
>
> As the code states, for modern devices seabios reads the queue size
> and, if it's greater than seabios can support, reduces the queue size
> to the max seabios-supported value.
>
> This doesn't work.
>
> The reason is that the size is read from the virtio device,
>
> virtio_pci_common_read()
> {
>     ...
>     case VIRTIO_PCI_COMMON_Q_SIZE:
>         val = virtio_queue_get_num(vdev, vdev->queue_sel);
>         break;
>     ...
> }
>
> but is written to the proxy
>
> virtio_pci_common_write()
> {
>     ...
>     case VIRTIO_PCI_COMMON_Q_SIZE:
>         proxy->vqs[vdev->queue_sel].num = val;
>         break;
>     ...
> }.

Yeah, that's a bug. Here's a hacky way to fix it. But I think really we
should just get rid of the two copies down the road.

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index c6b47a9c73..e5c759e19e 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1256,6 +1256,8 @@ static void virtio_pci_common_write(void *opaque, hwaddr addr,
         break;
     case VIRTIO_PCI_COMMON_Q_SIZE:
         proxy->vqs[vdev->queue_sel].num = val;
+        virtio_queue_set_num(vdev, vdev->queue_sel,
+                             proxy->vqs[vdev->queue_sel].num);
         break;
     case VIRTIO_PCI_COMMON_Q_MSIX:
         msix_vector_unuse(&proxy->pci_dev,

> The final stage of the size setting is to propagate it from the proxy
> to the device on virtqueue enabling:
>
> virtio_pci_common_write()
> {
>     ...
>     case VIRTIO_PCI_COMMON_Q_ENABLE:
>         virtio_queue_set_num(vdev, vdev->queue_sel,
>                              proxy->vqs[vdev->queue_sel].num);
>         virtio_queue_set_rings(vdev, vdev->queue_sel,
>                     ((uint64_t)proxy->vqs[vdev->queue_sel].desc[1]) << 32 |
>                     proxy->vqs[vdev->queue_sel].desc[0],
>                     ((uint64_t)proxy->vqs[vdev->queue_sel].avail[1]) << 32 |
>                     proxy->vqs[vdev->queue_sel].avail[0],
>                     ((uint64_t)proxy->vqs[vdev->queue_sel].used[1]) << 32 |
>                     proxy->vqs[vdev->queue_sel].used[0]);
>         proxy->vqs[vdev->queue_sel].enabled = 1;
>         break;
>     ...
> }.
>
> So we have the following workflow:
> suppose the device has virtqueue size = 256 and seabios MAX_QUEUE_NUM = 128.
> In that case seabios works like this:
>
> 1. vp_modern reads the size (256)
> 2. 256 > 128
> 3. write virtqueue size = 128
> 4. re-read virtqueue size = 256 !!!

The bios probably should not re-read the size; it's a waste of cpu
cycles anyway.

> 5. fail because of the check
>    if (num > MAX_QUEUE_NUM) {
>        dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>        goto fail;
>    }
>
> To fix the issue, we need to read and write the virtqueue size in the
> same place.
> Should we do it in the proxy?
> Is there any reason to read from the device and write to the proxy?
>
> Furthermore, the size setting has a few flaws:
>
> 1. The size being set should be a power of 2
> 2. The size being set should be less than or equal to the virtqueue
>    size (and be greater than 2?)

I think 1 is checked in virtio_queue_set_num. I guess we should check 2
as well?

>
> Denis
> >> you mean to put the code like this into virtio_pci_realize() inside QEMU?
> >>
> >> If no, can you pls clarify which component should be touched.
> >>
> >> Den
> > I mean:
> > - add an API to change the default queue size
> > - add a validate features callback; in there, check for the modern
> >   flag set in features and increase the queue size
> >
> > maybe all this is too much work; we could block this
> > for transitional devices, but your patch does not do it:
> > you need to check that legacy is enabled, not that modern
> > is not disabled.