On 05.12.2019 10:59, Denis Plotnikov wrote:
> Ping!
>
> On 25.11.2019 12:16, Denis Plotnikov wrote:
>>
>> On 06.11.2019 15:03, Michael S. Tsirkin wrote:
>>> On Thu, Oct 24, 2019 at 11:34:34AM +0000, Denis Lunev wrote:
>>>> On 10/24/19 12:28 AM, Michael S. Tsirkin wrote:
>>>>> On Fri, Oct 18, 2019 at 02:55:47PM +0300, Denis Plotnikov wrote:
>>>>>> From: "Denis V. Lunev" <d...@openvz.org>
>>>>>>
>>>>>> Linux guests submit IO requests no longer than PAGE_SIZE * max_seg
>>>>>> field reported by the SCSI controller. Thus a typical 1 MB sequential
>>>>>> read results in the following pattern of IO from the guest:
>>>>>>
>>>>>>   8,16   1  15754   2.766095122  2071  D  R  2095104 + 1008 [dd]
>>>>>>   8,16   1  15755   2.766108785  2071  D  R  2096112 + 1008 [dd]
>>>>>>   8,16   1  15756   2.766113486  2071  D  R  2097120 + 32 [dd]
>>>>>>   8,16   1  15757   2.767668961     0  C  R  2095104 + 1008 [0]
>>>>>>   8,16   1  15758   2.768534315     0  C  R  2096112 + 1008 [0]
>>>>>>   8,16   1  15759   2.768539782     0  C  R  2097120 + 32 [0]
>>>>>>
>>>>>> The IO was generated by
>>>>>>
>>>>>>   dd if=/dev/sda of=/dev/null bs=1024 iflag=direct
>>>>>>
>>>>>> This effectively means that on rotational disks we will observe
>>>>>> 3 IOPS for each 2 MB processed. This definitely affects both guest
>>>>>> and host IO performance negatively.
>>>>>>
>>>>>> The cure is relatively simple - we should report the long
>>>>>> scatter-gather ability of the SCSI controller. Fortunately the
>>>>>> situation here is very good: the VirtIO transport layer can
>>>>>> accommodate 1024 items in one request, while we are using only 128.
>>>>>> This has been the case since almost the very beginning. 2 items are
>>>>>> dedicated to request metadata, thus we should publish
>>>>>> VIRTQUEUE_MAX_SIZE - 2 as max_seg.
>>>>>>
>>>>>> The following pattern is observed after the patch:
>>>>>>
>>>>>>   8,16   1   9921   2.662721340  2063  D  R  2095104 + 1024 [dd]
>>>>>>   8,16   1   9922   2.662737585  2063  D  R  2096128 + 1024 [dd]
>>>>>>   8,16   1   9923   2.665188167     0  C  R  2095104 + 1024 [0]
>>>>>>   8,16   1   9924   2.665198777     0  C  R  2096128 + 1024 [0]
>>>>>>
>>>>>> which is much better.
>>>>>>
>>>>>> The dark side of this patch is that we are tweaking a guest-visible
>>>>>> parameter, though this should be relatively safe as the transport
>>>>>> layer support mentioned above has been present in QEMU/host Linux
>>>>>> for a very long time. The patch adds a configurable property for
>>>>>> VirtIO SCSI with a new default, and a hardcoded option for VirtBlock,
>>>>>> which does not provide a good configuration framework.
>>>>>>
>>>>>> Unfortunately the commit cannot be applied as is. For the real cure
>>>>>> we need the guest to be fixed to accommodate that queue length,
>>>>>> which is done only in the latest 4.14 kernel. Thus we are going to
>>>>>> expose the property and tweak it at the machine type level.
>>>>>>
>>>>>> The problem with the old kernels is that they have the
>>>>>> max_segments <= virtqueue_size restriction, which causes the guest
>>>>>> to crash in case of violation.
>>>>> This isn't just in the guests: the virtio spec also seems to imply
>>>>> this, or at least be vague on this point.
>>>>>
>>>>> So I think it'll need a feature bit.
>>>>> Doing that in a safe way will also allow being compatible with old
>>>>> guests.
>>>>>
>>>>> The only downside is it's a bit more work as we need to spec this
>>>>> out and add guest support.
>>>>>
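As an aside, regarding the "publish VIRTQUEUE_MAX_SIZE - 2 as max_seg" idea
quoted above: a rough sketch of what deriving seg_max from the queue size
could look like on the virtio-scsi side. This is an illustration only, not
the actual patch; in particular the s->conf.virtqueue_size property name is
an assumption.

    static void virtio_scsi_get_config(VirtIODevice *vdev, uint8_t *config)
    {
        VirtIOSCSIConfig *scsiconf = (VirtIOSCSIConfig *)config;
        VirtIOSCSICommon *s = VIRTIO_SCSI_COMMON(vdev);

        /* instead of the effectively hardcoded 128 - 2 discussed above,
         * derive seg_max from the (configurable) virtqueue size, keeping
         * 2 descriptors for the request/response headers */
        virtio_stl_p(vdev, &scsiconf->seg_max, s->conf.virtqueue_size - 2);
        ...
    }
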
>>>>>> To fix the case described above on the old kernels we can increase
>>>>>> virtqueue_size to 256 and max_segments to 254. The pitfall here is
>>>>>> that seabios only allows virtqueue sizes < 128; however, the seabios
>>>>>> patch extending that value to 256 is pending.
>>>>> And the fix here is just to limit large vq sizes to virtio 1.0.
>>>>> In that mode it's fine I think:
>>>>>
>>>>>     /* check if the queue is available */
>>>>>     if (vp->use_modern) {
>>>>>         num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>>>>>         if (num > MAX_QUEUE_NUM) {
>>>>>             vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
>>>>>                      MAX_QUEUE_NUM);
>>>>>             num = vp_read(&vp->common, virtio_pci_common_cfg,
>>>>>                           queue_size);
>>>>>         }
>>>>>     } else {
>>>>>         num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
>>>>>     }
>> The same seabios snippet, but more detailed:
>>
>> vp_find_vq()
>> {
>>     ...
>>     /* check if the queue is available */
>>     if (vp->use_modern) {
>>         num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>>         if (num > MAX_QUEUE_NUM) {
>>             vp_write(&vp->common, virtio_pci_common_cfg, queue_size,
>>                      MAX_QUEUE_NUM);
>>             num = vp_read(&vp->common, virtio_pci_common_cfg, queue_size);
>>         }
>>     } else {
>>         num = vp_read(&vp->legacy, virtio_pci_legacy, queue_num);
>>     }
>>     if (!num) {
>>         dprintf(1, "ERROR: queue size is 0\n");
>>         goto fail;
>>     }
>>     if (num > MAX_QUEUE_NUM) {
>>         dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>>         goto fail;
>>     }
>>     ...
>> }
>>
>> It turned out that the problem is here, but not because of the seabios
>> code. The virtqueue size is written and then an incorrect value is
>> re-read. Thanks to Roman Kagan (rka...@virtuozzo.com) for investigating
>> the root cause of the problem.
>>
>> As the code above shows, for modern devices seabios reads the queue
>> size and, if it is greater than what seabios can support, reduces the
>> queue size to the maximum value seabios supports.
>>
>> This doesn't work.
>>
>> The reason is that the size is read from the virtio device,
>>
>> virtio_pci_common_read()
>> {
>>     ...
>>     case VIRTIO_PCI_COMMON_Q_SIZE:
>>         val = virtio_queue_get_num(vdev, vdev->queue_sel);
>>         break;
>>     ...
>> }
>>
>> but is written to the proxy
>>
>> virtio_pci_common_write()
>> {
>>     ...
>>     case VIRTIO_PCI_COMMON_Q_SIZE:
>>         proxy->vqs[vdev->queue_sel].num = val;
>>         break;
>>     ...
>> }.
>>
>> In the final stage the size is propagated from the proxy to the device
>> when the virtqueue is enabled:
>>
>> virtio_pci_common_write()
>> {
>>     ...
>>     case VIRTIO_PCI_COMMON_Q_ENABLE:
>>         virtio_queue_set_num(vdev, vdev->queue_sel,
>>                              proxy->vqs[vdev->queue_sel].num);
>>         virtio_queue_set_rings(vdev, vdev->queue_sel,
>>             ((uint64_t)proxy->vqs[vdev->queue_sel].desc[1]) << 32 |
>>             proxy->vqs[vdev->queue_sel].desc[0],
>>             ((uint64_t)proxy->vqs[vdev->queue_sel].avail[1]) << 32 |
>>             proxy->vqs[vdev->queue_sel].avail[0],
>>             ((uint64_t)proxy->vqs[vdev->queue_sel].used[1]) << 32 |
>>             proxy->vqs[vdev->queue_sel].used[0]);
>>         proxy->vqs[vdev->queue_sel].enabled = 1;
>>         break;
>>     ...
>> }.
>>
>> So we have the following workflow: suppose the device has virtqueue
>> size = 256 and seabios MAX_QUEUE_NUM = 128. In that case seabios does
>> the following:
>>
>> 1. if vp->use_modern: read the queue size (256)
>> 2. 256 > 128
>> 3. write queue size = 128
>> 4. re-read queue size = 256 !!!
>> 5. fail because of the check
>>
>>     if (num > MAX_QUEUE_NUM) {
>>         dprintf(1, "ERROR: queue size %d > %d\n", num, MAX_QUEUE_NUM);
>>         goto fail;
>>     }
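The mismatch above (reads served by the device, writes landing only in the
proxy) is the crux. Purely to illustrate "read back what was written", one
possible shape of a change in virtio_pci_common_read() is sketched below.
This is a sketch against the structures quoted above, not a tested patch,
and whether the proxy is the right place is exactly the open question that
follows.

    case VIRTIO_PCI_COMMON_Q_SIZE:
        /* serve the read from where the write landed, falling back to
         * the device's default when the driver has not written a size */
        val = proxy->vqs[vdev->queue_sel].num ?
              proxy->vqs[vdev->queue_sel].num :
              virtio_queue_get_num(vdev, vdev->queue_sel);
        break;
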
>> To fix the issue, we need to read and write the virtqueue size in the
>> same place. Should we do that with the proxy?
>> Is there any reason to read from the device and write to the proxy?
>>
>> Furthermore, the size setting has a few flaws:
>>
>> 1. The size being set should be a power of 2
>> 2. The size being set should be less than or equal to the virtqueue
>>    size (and greater than 2?)
>>
>> Denis
>>>> Do you mean to put code like this into virtio_pci_realize() inside
>>>> QEMU?
>>>>
>>>> If not, can you please clarify which component should be touched.
>>>>
>>>> Den
>>> I mean:
>>> - add an API to change the default queue size
>>> - add a validate features callback; in there, check the features, and
>>>   if the modern flag is set, increase the queue size
>>>
>>> Maybe all this is too much work. We could block this for transitional
>>> devices, but your patch does not do that: you need to check that
>>> legacy is enabled, not that modern is not disabled.
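To make that last point concrete, a rough sketch of the distinction being
asked for follows. The virtio_pci_legacy()/virtio_pci_modern() helpers are
assumed here to reflect the proxy's disable-legacy/disable-modern settings,
and where exactly this check lives (realize vs. feature validation) is left
open.

    /* sketch: a transitional device may still be driven by a legacy
     * driver, so gate on "legacy is enabled", not on "modern is not
     * disabled" */
    if (virtio_pci_legacy(proxy)) {
        queue_size = 128;   /* conservative size while legacy is possible */
    } else {
        queue_size = 256;   /* modern-only: a larger virtqueue is fine */
    }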