On 11/20/2012 08:21 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 20, 2012 at 10:02 AM, Asias He <as...@redhat.com> wrote:
>> Hello Stefan,
>>
>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>> This series adds the -device virtio-blk-pci,x-data-plane=on property
>>> that enables a high performance I/O codepath. A dedicated thread is
>>> used to process virtio-blk requests outside the global mutex and
>>> without going through the QEMU block layer.
>>>
>>> Khoa Huynh <k...@us.ibm.com> reported an increase from 140,000 IOPS
>>> to 600,000 IOPS for a single VM using virtio-blk-data-plane in July:
>>>
>>> http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>
>>> The virtio-blk-data-plane approach was originally presented at Linux
>>> Plumbers Conference 2010. The following slides contain a brief overview:
>>>
>>> http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>
>>> The basic approach is:
>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>    signalling when the guest kicks the virtqueue.
>>> 2. Requests are processed without going through the QEMU block layer,
>>>    using Linux AIO directly.
>>> 3. Completion interrupts are injected via irqfd from the dedicated
>>>    thread.
>>>
>>> To try it out:
>>>
>>> qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>      -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>
>> Is this the latest dataplane bits:
>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>
>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>> Author: Stefan Hajnoczi <stefa...@redhat.com>
>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>
>>     virtio-blk: add x-data-plane=on|off performance feature
>>
>> With this commit on a ramdisk based box, I am seeing about 10K IOPS with
>> x-data-plane on and 90K IOPS with x-data-plane off.
>>
>> Any ideas?
>>
>> Command line I used:
>>
>> IMG=/dev/ram0
>> x86_64-softmmu/qemu-system-x86_64 \
>>   -drive file=/root/img/sid.img,if=ide \
>>   -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 \
>>   -device virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>>   -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>>   -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm \
>>   -m 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>
> I was just about to send out the latest patch series, which addresses the
> review comments, so I have tested the latest code
> (61b70fef489ce51ecd18d69afb9622c110b9315c).
>
> I was unable to reproduce a ramdisk performance regression on Linux
> 3.6.6-3.fc18.x86_64 with an Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz and
> 8 GB RAM.
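To make the three-step approach quoted above more concrete: each device gets a
thread whose main loop ties an ioeventfd, Linux AIO, and an irqfd together. The
fragment below is only a minimal sketch of that flow, not QEMU's actual
data-plane code; the struct layout and names are invented for illustration,
vring processing is omitted, and a single hard-coded 4 KB read stands in for
real requests (link with -laio; no main() is shown).

/*
 * Illustrative sketch only -- NOT QEMU's data-plane implementation.
 * Shows the ioeventfd -> Linux AIO -> irqfd flow of the three steps above.
 */
#include <libaio.h>
#include <stdint.h>
#include <unistd.h>

struct dataplane {
    int ioeventfd;     /* signalled by KVM when the guest kicks the virtqueue */
    int irqfd;         /* writing 1 here injects the completion interrupt */
    int disk_fd;       /* image or block device opened with O_DIRECT */
    io_context_t ctx;  /* Linux AIO context created earlier with io_setup() */
};

static void *dataplane_thread(void *opaque)
{
    struct dataplane *s = opaque;
    static char buf[4096] __attribute__((aligned(4096))); /* O_DIRECT alignment */
    struct io_event events[64];
    uint64_t val;

    for (;;) {
        /* 1. Block until the guest kicks the virtqueue (ioeventfd). */
        if (read(s->ioeventfd, &val, sizeof(val)) != sizeof(val)) {
            break;
        }

        /* 2. Pop requests from the vring (omitted here) and submit them
         *    directly with Linux AIO, bypassing the QEMU block layer. */
        struct iocb iocb;
        struct iocb *iocbs[] = { &iocb };
        io_prep_pread(&iocb, s->disk_fd, buf, sizeof(buf), 0);
        if (io_submit(s->ctx, 1, iocbs) < 1) {
            break;
        }

        /* 3. Reap completions and inject the interrupt via irqfd. */
        int n = io_getevents(s->ctx, 1, 64, events, NULL);
        if (n > 0) {
            uint64_t one = 1;
            if (write(s->irqfd, &one, sizeof(one)) != sizeof(one)) {
                break;
            }
        }
    }
    return NULL;
}

The point of the design is that the whole request lifecycle, from guest kick to
completion interrupt, stays on this one thread, so the I/O path never takes the
QEMU global mutex or enters the QEMU block layer.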
I am using the latest upstream kernel.

> The ramdisk is 4 GB and I used your QEMU command-line with a RHEL 6.3 guest.
>
> Summary results:
>  x-data-plane-on:  iops=132856 aggrb=1039.1MB/s
>  x-data-plane-off: iops=126236 aggrb=988.40MB/s
>
> virtio-blk-data-plane is ~5% faster in this benchmark.
>
> fio jobfile:
> [global]
> filename=/dev/vda
> blocksize=8k
> ioengine=libaio
> direct=1
> iodepth=8
> runtime=120
> time_based=1
>
> [reads]
> readwrite=randread
> numjobs=4
>
> Perf top (data-plane-on):
>  3.71%  [kvm]               [k] kvm_arch_vcpu_ioctl_run
>  3.27%  [kernel]            [k] memset                      <--- ramdisk
>  2.98%  [kernel]            [k] do_blockdev_direct_IO
>  2.82%  [kvm_intel]         [k] vmx_vcpu_run
>  2.66%  [kernel]            [k] _raw_spin_lock_irqsave
>  2.06%  [kernel]            [k] put_compound_page
>  2.06%  [kernel]            [k] __get_page_tail
>  1.83%  [i915]              [k] __gen6_gt_force_wake_mt_get
>  1.75%  [kernel]            [k] _raw_spin_unlock_irqrestore
>  1.33%  qemu-system-x86_64  [.] vring_pop                   <--- virtio-blk-data-plane
>  1.19%  [kernel]            [k] compound_unlock_irqrestore
>  1.13%  [kernel]            [k] gup_huge_pmd
>  1.11%  [kernel]            [k] __audit_syscall_exit
>  1.07%  [kernel]            [k] put_page_testzero
>  1.01%  [kernel]            [k] fget
>  1.01%  [kernel]            [k] do_io_submit
>
> Since the ramdisk (memset and page-related functions) is so prominent
> in perf top, I also tried a 1-job 8k dd sequential write test on a
> Samsung 830 Series SSD where virtio-blk-data-plane was 9% faster than
> virtio-blk. Optimizing against ramdisk isn't a good idea IMO because
> it acts very differently from real hardware where the driver relies on
> mmio, DMA, and interrupts (vs synchronous memcpy/memset).

For the memset in the ramdisk, you can simply patch drivers/block/brd.c to do
a nop instead of the memset for testing. Yes, if you have a fast SSD (sometimes
you need several, which I do not have), it makes more sense to test on real
hardware. However, a ramdisk test is still useful: it gives rough performance
numbers, and if A and B are both tested against the same ramdisk, the
difference between A and B is still meaningful.
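Such a throwaway hack might look roughly like the fragment below. This is only
a sketch against a 3.6-era brd.c, not a tested patch: one of the memset() call
sites is the zero-fill in copy_from_brd() for sectors that have never been
written, and turning it into a nop means reads of untouched ramdisk sectors
return stale data, so it is usable for benchmarking only.

	/* Sketch of the brd.c test hack (untested, illustration only).
	 * In copy_from_brd(), reads of sectors without a backing page are
	 * zero-filled with memset(); nop'ing that out removes the memset
	 * from the profile at the cost of returning garbage data.
	 */
	page = brd_lookup_page(brd, sector);
	if (page) {
		src = kmap_atomic(page);
		memcpy(dst, src + offset, copy);
		kunmap_atomic(src);
	} else {
		/* memset(dst, 0, copy); */	/* nop for benchmarking */
	}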
> Full results:
> $ cat data-plane-off
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> ...
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> fio 1.57
> Starting 4 processes
>
> reads: (groupid=0, jobs=1): err= 0: pid=1851
>   read : io=29408MB, bw=250945KB/s, iops=31368 , runt=120001msec
>     slat (usec): min=2 , max=27829 , avg=11.06, stdev=78.05
>     clat (usec): min=1 , max=28028 , avg=241.41, stdev=388.47
>      lat (usec): min=33 , max=28035 , avg=253.17, stdev=396.66
>     bw (KB/s) : min=197141, max=335365, per=24.78%, avg=250797.02, stdev=29376.35
>   cpu : usr=6.55%, sys=31.34%, ctx=310932, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3764202/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 4=0.01%, 20=0.01%, 50=1.78%, 100=27.11%
>      lat (usec): 250=38.97%, 500=27.11%, 750=2.09%, 1000=0.71%
>      lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%, 50=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1852
>   read : io=29742MB, bw=253798KB/s, iops=31724 , runt=120001msec
>     slat (usec): min=2 , max=17007 , avg=10.61, stdev=67.51
>     clat (usec): min=1 , max=41531 , avg=239.00, stdev=379.03
>      lat (usec): min=32 , max=41547 , avg=250.33, stdev=385.21
>     bw (KB/s) : min=194336, max=347497, per=25.02%, avg=253204.25, stdev=31172.37
>   cpu : usr=6.66%, sys=32.58%, ctx=327250, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3806999/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 20=0.01%, 50=1.54%, 100=26.45%, 250=40.04%
>      lat (usec): 500=27.15%, 750=1.95%, 1000=0.71%
>      lat (msec): 2=1.29%, 4=0.68%, 10=0.18%, 20=0.01%, 50=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1853
>   read : io=29859MB, bw=254797KB/s, iops=31849 , runt=120001msec
>     slat (usec): min=2 , max=16821 , avg=11.35, stdev=76.54
>     clat (usec): min=1 , max=17659 , avg=237.25, stdev=375.31
>      lat (usec): min=31 , max=17673 , avg=249.27, stdev=383.62
>     bw (KB/s) : min=194864, max=345280, per=25.15%, avg=254534.63, stdev=30549.32
>   cpu : usr=6.52%, sys=31.84%, ctx=303763, majf=0, minf=39
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3821989/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 10=0.01%, 20=0.01%, 50=2.09%, 100=29.19%
>      lat (usec): 250=37.31%, 500=26.41%, 750=2.08%, 1000=0.71%
>      lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1854
>   read : io=29598MB, bw=252565KB/s, iops=31570 , runt=120001msec
>     slat (usec): min=2 , max=26413 , avg=11.21, stdev=78.32
>     clat (usec): min=16 , max=27993 , avg=239.56, stdev=381.67
>      lat (usec): min=34 , max=28006 , avg=251.49, stdev=390.13
>     bw (KB/s) : min=194256, max=369424, per=24.94%, avg=252462.86, stdev=29420.58
>   cpu : usr=6.57%, sys=31.33%, ctx=305623, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3788507/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.13%, 100=28.30%, 250=37.74%, 500=26.66%
>      lat (usec): 750=2.17%, 1000=0.75%
>      lat (msec): 2=1.35%, 4=0.70%, 10=0.19%, 20=0.01%, 50=0.01%
>
> Run status group 0 (all jobs):
>    READ: io=118607MB, aggrb=988.40MB/s, minb=256967KB/s, maxb=260912KB/s,
>          mint=120001msec, maxt=120001msec
>
> Disk stats (read/write):
>   vda: ios=15148328/0, merge=0/0, ticks=1550570/0, in_queue=1536232, util=96.56%
>
> $ cat data-plane-on
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> ...
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> fio 1.57
> Starting 4 processes
>
> reads: (groupid=0, jobs=1): err= 0: pid=1796
>   read : io=32081MB, bw=273759KB/s, iops=34219 , runt=120001msec
>     slat (usec): min=1 , max=20404 , avg=21.08, stdev=125.49
>     clat (usec): min=10 , max=135743 , avg=207.62, stdev=532.90
>      lat (usec): min=21 , max=136055 , avg=229.60, stdev=556.82
>     bw (KB/s) : min=56480, max=951952, per=25.49%, avg=271488.81, stdev=149773.57
>   cpu : usr=7.01%, sys=43.26%, ctx=336854, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=4106413/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.46%, 100=61.13%, 250=21.58%, 500=3.11%
>      lat (usec): 750=3.04%, 1000=3.88%
>      lat (msec): 2=4.50%, 4=0.13%, 10=0.11%, 20=0.06%, 50=0.01%
>      lat (msec): 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1797
>   read : io=30104MB, bw=256888KB/s, iops=32110 , runt=120001msec
>     slat (usec): min=1 , max=17595 , avg=22.20, stdev=120.29
>     clat (usec): min=13 , max=136264 , avg=221.21, stdev=528.19
>      lat (usec): min=22 , max=136280 , avg=244.35, stdev=551.73
>     bw (KB/s) : min=57312, max=838880, per=23.93%, avg=254798.51, stdev=139546.57
>   cpu : usr=6.82%, sys=41.87%, ctx=360348, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3853351/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.10%, 100=58.47%, 250=22.38%, 500=3.68%
>      lat (usec): 750=3.69%, 1000=4.52%
>      lat (msec): 2=4.87%, 4=0.14%, 10=0.11%, 20=0.05%, 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1798
>   read : io=31698MB, bw=270487KB/s, iops=33810 , runt=120001msec
>     slat (usec): min=1 , max=17457 , avg=20.93, stdev=125.33
>     clat (usec): min=16 , max=134663 , avg=210.19, stdev=535.77
>      lat (usec): min=21 , max=134671 , avg=232.02, stdev=559.27
>     bw (KB/s) : min=57248, max=841952, per=25.29%, avg=269330.21, stdev=148661.08
>   cpu : usr=6.92%, sys=42.81%, ctx=337799, majf=0, minf=39
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=4057340/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=1.98%, 100=62.00%, 250=20.70%, 500=3.22%
>      lat (usec): 750=3.23%, 1000=4.16%
>      lat (msec): 2=4.41%, 4=0.13%, 10=0.10%, 20=0.06%, 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1799
>   read : io=30913MB, bw=263789KB/s, iops=32973 , runt=120000msec
>     slat (usec): min=1 , max=17565 , avg=21.52, stdev=120.17
>     clat (usec): min=15 , max=136064 , avg=215.53, stdev=529.56
>      lat (usec): min=27 , max=136070 , avg=237.99, stdev=552.50
>     bw (KB/s) : min=57632, max=900896, per=24.74%, avg=263431.57, stdev=148379.15
>   cpu : usr=6.90%, sys=42.56%, ctx=348217, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3956830/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=1.76%, 100=59.96%, 250=22.21%, 500=3.45%
>      lat (usec): 750=3.35%, 1000=4.33%
>      lat (msec): 2=4.65%, 4=0.13%, 10=0.11%, 20=0.05%, 250=0.01%
>
> Run status group 0 (all jobs):
>    READ: io=124796MB, aggrb=1039.1MB/s, minb=263053KB/s, maxb=280328KB/s,
>          mint=120000msec, maxt=120001msec
>
> Disk stats (read/write):
>   vda: ios=15942789/0, merge=0/0, ticks=336240/0, in_queue=317832, util=97.47%

--
Asias