On 4/20/26 12:33 PM, Stefan Hajnoczi wrote:
> On Fri, Apr 17, 2026 at 05:57:20PM -0500, Mike Christie wrote:
>> The following patches were made over Linus's and Martin's 7.1 trees.
>> They fix an issue where for virtio-scsi we export a lot of non-scsi
>> devices but are getting throttled by the cmd_per_lun_limit too early.
>> For example we export 1 or more NVMe or block devices and would like
>> to just pass command to them in way where virtio-scsi's hw queue
>> limits match the physical hardware. Or in some cases we are doing
>> cgroup based throttling on the host side, and we don't want the guest
>> to block IO when the host knows we have extra bandwidth.
>>
>> The patches add a new cmd_per_lun value so drivers can indicate
>> when to avoid tracking queueing at the device wide level. They
>> then rely on just the block layer hw queue limits. And the patches
>> convert virtio-scsi. They also fix some can_queue related issues
>> discovered while testing/reviewing.
>
> Hi Mike,
> Is there a difference between setting cmd_per_lun to U32_MAX with your
> patches versus setting cmd_per_lun to the virtqueue size without your
> patches (this can already be done today without code changes in the
> driver)?

The problem today is that cmd_per_lun doesn't take into account the
multiqueue queues (virtqueues in virtio), so we have a low limit of 1024
commands total per device. On a 32-128 vCPU VM we can easily hit that, as
there are lots of IO submission threads spread over lots of those CPUs.
CPUs are then mapped to block mq queues, which are mapped to virtqueues,
so we are hitting them hard.

That 1024 value comes from QEMU, which limits virtqueue_size to 1024. We
could increase that to 4096 or 32K or whatever. The problem is that we
would then be wasting a lot of memory, as we would be allocating lots of
really large virtqueues that would go underutilized (we are submitting
tens of thousands of total IOs, but not to just a single queue).

So a possibly good compromise that avoids both a magic number (U32_MAX)
and having to update the spec would be to:

1. Fix up scsi-ml and virtio-scsi so they allow cmd_per_lun to be greater
   than can_queue (virtqueue_size for virtio-scsi).
2. Increase the scsi-ml cmd_per_lun cap from 4096 to S16_MAX (scsi-ml
   uses a short for cmd_per_lun).

The only drawback to this would be that for each scsi_device we track
running IO with a sbitmap. For my cases we don't need it, so it would be
a waste of memory. For S16_MAX worth of commands I think it would be 128K
wasted, so not too bad for us, as we don't have lots of these types of
high perf devices per VM.
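To put rough numbers on the limits discussed above, here is a small back-of-the-envelope sketch. It is only a model of the arithmetic, not kernel code: the helper names are made up for illustration, and the figure of 32 request virtqueues (one per vCPU on a 32 vCPU VM) is an assumption.

```python
# Rough model of the per-device limit vs. total virtqueue capacity.
# Helper names and NR_HW_QUEUES are illustrative assumptions, not kernel APIs.

S16_MAX = 32767          # largest value a signed 16-bit short can hold
VIRTQUEUE_SIZE = 1024    # QEMU's virtqueue_size limit mentioned above
NR_HW_QUEUES = 32        # assumption: one request virtqueue per vCPU


def per_device_limit(cmd_per_lun, can_queue, allow_above_can_queue=False):
    """Commands a single scsi_device may have in flight across all queues."""
    if allow_above_can_queue:
        # Proposed behaviour: cmd_per_lun is no longer clamped to can_queue.
        return cmd_per_lun
    # Current behaviour as described in this thread: clamp to can_queue.
    return min(cmd_per_lun, can_queue)


def aggregate_queue_slots(nr_hw_queues, virtqueue_size):
    """Total request slots if every hw queue could be filled."""
    return nr_hw_queues * virtqueue_size


today = per_device_limit(VIRTQUEUE_SIZE, VIRTQUEUE_SIZE)
proposed = per_device_limit(S16_MAX, VIRTQUEUE_SIZE, allow_above_can_queue=True)
slots = aggregate_queue_slots(NR_HW_QUEUES, VIRTQUEUE_SIZE)

print("per-device limit today:   ", today)
print("per-device limit proposed:", proposed)
print("total virtqueue slots:    ", slots)
```

Under these assumptions the device-wide limit goes from a single queue's worth of commands to roughly the combined capacity of all 32 queues, without growing any individual virtqueue.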

