On 16/12/2014 12:07, Kevin Wolf wrote:
> On 11.12.2014 at 14:52, Paolo Bonzini wrote:
>> Keep a queue of requests that were not submitted; pass them to
>> the kernel when a completion is reported, unless the queue is
>> plugged.
>>
>> The array of iocbs is rebuilt every time from scratch. This
>> avoids keeping the iocbs array and list synchronized.
>>
>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>
> Just found out that in qemu-img bench, this patch seems to cost about
> 5-8% for me.

What execution?  Queue depth=1?  For me it was noisy but I couldn't see a
pessimization, and this patch should only add a handful of pointer accesses.
Also, does perf point at a culprit, and does patch 5 restore some of the
performance?

Weird guess: TLB misses from accessing iocbs[0] on the stack (using a
different coroutine stack every time)?  Perf would report that as a large
cost of this line:

    iocbs[len++] = &aiocb->iocb;

> An optimisation for the unplugged case would probably be easy, but that
> would be cheating, as the devices that we're really interested in always
> plug the queue (perhaps I should extend qemu-img bench to do that
> optionally, too).

If you want to do that, you also have to move the "refilling" of the queue
to a bottom half.  If you refill from the completion routine, you always
have a single empty slot and plugging doesn't do anything.

Paolo

> Anything clever that we can do about this? Or will we just have to live
> with the fact that sending a single request is now slower than it used
> to be before bdrv_plug?
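
For reference, a minimal sketch of the scheme under discussion: a list of
pending requests, a pointer array rebuilt from scratch on each submission,
and a plug counter. This is not the code from the patch. All names
(io_queue, ioq_submit, fake_iocb, MAX_IN_FLIGHT) are hypothetical, the
in-flight limit is an assumption, and nothing is actually passed to the
kernel.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IN_FLIGHT 4   /* hypothetical limit on submitted requests */

struct fake_iocb { int id; };   /* stand-in for a kernel iocb */

struct request {
    struct fake_iocb iocb;
    struct request *next;
};

struct io_queue {
    struct request *head, *tail;  /* requests not yet submitted */
    int plugged;                  /* plug nesting depth, 0 = unplugged */
    int in_flight;                /* submitted but not yet completed */
};

/* Rebuild the pointer array from the pending list and "submit" it.
 * Returns the size of the batch. */
int ioq_submit(struct io_queue *q)
{
    struct fake_iocb *iocbs[MAX_IN_FLIGHT];
    int len = 0;

    while (q->head != NULL && q->in_flight + len < MAX_IN_FLIGHT) {
        struct request *r = q->head;
        q->head = r->next;
        if (q->head == NULL) {
            q->tail = NULL;
        }
        iocbs[len++] = &r->iocb;
    }
    /* A real implementation would pass iocbs[0..len) to the kernel here. */
    (void)iocbs;
    q->in_flight += len;
    return len;
}

/* Queue a request; submit right away only if the queue is not plugged. */
int ioq_enqueue(struct io_queue *q, struct request *r)
{
    assert(r != NULL);
    r->next = NULL;
    if (q->tail != NULL) {
        q->tail->next = r;
    } else {
        q->head = r;
    }
    q->tail = r;
    return q->plugged ? 0 : ioq_submit(q);
}

void ioq_plug(struct io_queue *q)
{
    q->plugged++;
}

/* Drop one plug level; the outermost unplug flushes the pending list. */
int ioq_unplug(struct io_queue *q)
{
    assert(q->plugged > 0);
    return --q->plugged == 0 ? ioq_submit(q) : 0;
}

/* Completion path: one slot frees up and is refilled immediately. */
int ioq_complete(struct io_queue *q)
{
    assert(q->in_flight > 0);
    q->in_flight--;
    return q->plugged ? 0 : ioq_submit(q);
}
```

With eight requests queued while plugged, the unplug in this sketch submits
a batch of four, and every later call to ioq_complete() submits exactly one
more request. That is the "single empty slot" effect: refilling from the
completion path cannot build larger batches, which is why the refill would
have to be deferred (to a bottom half, in QEMU terms) for plugging to matter.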