On 16.12.2014 at 12:28, Paolo Bonzini wrote:
>
>
> On 16/12/2014 12:07, Kevin Wolf wrote:
> > On 11.12.2014 at 14:52, Paolo Bonzini wrote:
> >> Keep a queue of requests that were not submitted; pass them to
> >> the kernel when a completion is reported, unless the queue is
> >> plugged.
> >>
> >> The array of iocbs is rebuilt every time from scratch.  This
> >> avoids keeping the iocbs array and list synchronized.
> >>
> >> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >
> > Just found out that in qemu-img bench, this patch seems to cost about
> > 5-8% for me.
>
> What execution?  Queue depth=1?

My usual one:

    $ ./qemu-img bench -t none -c 10000000 -n /dev/loop0
    Sending 10000000 requests, 4096 bytes each, 64 in parallel

> For me it was noisy but I couldn't see a pessimization, and this patch
> should only add a handful of pointer accesses.  Also, does perf point at
> a culprit, and does patch 5 restore some of the performance?
>
> Weird guess: TLB misses from accessing iocbs[0] on the stack (using a
> different coroutine stack every time)?  Perf would report that as a
> large cost of this line:
>
>     iocbs[len++] = &aiocb->iocb;

No, I can't seem to read much from the perf results. The cost seems to
be spread fairly evenly across ioq_submit(), with the exception of the
instruction after the call to io_submit(). Not sure why the next
instruction always takes so much time (independent of what it is), but
it has been this way before.

I was surprised to see a "rep stos" scoring at 10% in laio_submit(),
apparently io_prep_*() do a memset on the iocb. Not sure if that is
necessary, but again, it has always been this way.

Patch 5 doesn't restore the performance, which makes sense, as qemu-img
only sends single requests.

Kevin