Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
>>> > > If virtio-blk and virtio-serial share an IRQ, the guest operating
>>> > > system has to check each virtqueue for activity. Maybe there is some
>>> > > inefficiency doing that.
>>> > > AFAIK virtio-serial registers 64 virtqueues (on 31 ports + console)
>>> > > even if everything is unused.
>>> >
>>> > That could be the case if MSI is disabled.
>>>
>>> Do the windows virtio drivers enable MSIs, in their inf file?
>>
>> It depends on the version of the drivers, but it is a reasonable guess
>> at what differs between Linux and Windows. Haoyu, can you give us the
>> output of lspci from a Linux guest?
>>
> I made a test with fio on rhel-6.5 guest, the same degradation happened too,
> this degradation can be reproduced on rhel6.5 guest 100%.
> virtio_console module installed:
> 64K-write-sequence: 285 MBPS, 4380 IOPS
> virtio_console module uninstalled:
> 64K-write-sequence: 370 MBPS, 5670 IOPS
>
> I use top -d 1 -H -p to monitor the cpu usage, and found that,
> virtio_console module installed:
> qemu main thread cpu usage: 98%
> virtio_console module uninstalled:
> qemu main thread cpu usage: 60%

I found that the statement "err = register_virtio_driver(&virtio_console);"
in the virtio_console module's init() function causes the degradation: if I
return directly before "err = register_virtio_driver(&virtio_console);", the
degradation disappears; if I return directly after that statement, the
degradation is still there.

I will try the test cases below:
1. Do not emulate the virtio-serial device, then install/uninstall the
   virtio_console driver in the guest, to see whether there is a difference
   in virtio-blk performance and cpu usage.
2. Do not emulate the virtio-serial device, then install the virtio_balloon
   driver (while also not emulating the virtio-balloon device), to see
   whether the virtio-blk performance degradation happens.
3. Emulate the virtio-balloon device instead of the virtio-serial device,
   then see whether virtio-blk performance is hampered.

Based on the test results, corresponding analysis will be performed.
Any ideas?

Thanks,
Zhang Haoyu
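The three test cases above amount to varying the QEMU command line. A sketch of what those invocations might look like; the image path, memory size and all options other than the virtio devices themselves are placeholders, not taken from this thread:

```shell
# Case 1: no virtio-serial device emulated at all; the virtio_console
# guest module is then loaded/unloaded manually to compare virtio-blk
# numbers. (test.qcow2 and the memory size are placeholders.)
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=test.qcow2,if=virtio

# Case 2 uses the same command line; only the guest side differs
# (modprobe virtio_balloon instead of virtio_console).

# Case 3: emulate virtio-balloon instead of virtio-serial:
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=test.qcow2,if=virtio \
    -device virtio-balloon-pci
```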
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
On 12.09.2014 14:38, Stefan Hajnoczi wrote:
> Max: Unrelated to this performance issue but I notice that the qcow2
> metadata overlap check is high in the host CPU profile. Have you had any
> thoughts about optimizing the check?
>
> Stefan

In fact, I have done so (albeit only briefly).

Instead of gathering all the information in the overlap function itself, we
could either have a generic list of typed ranges (e.g. "cluster 0: header",
"clusters 1 to 5: L1 table", etc.) or a not-really-bitmap (with 4 bits per
entry specifying the cluster type: header, L1 table, free or data cluster,
etc.).

The disadvantage of the former is that in its simplest form we would have to
run through the whole list to find out whether a cluster is already reserved
for metadata or not. We could easily optimize this by keeping the list
ordered and performing a binary search.

The disadvantage of the latter is obviously its memory size. For a 1 TB
image with 64 kB clusters, it would be 8 MB in size. That could be
considered acceptable, but I deem it too large. The advantage would be
constant access time, of course.

We could combine both approaches, that is, use the bitmap as a cache:
whenever a cluster is overlap-checked, the corresponding bitmap range (or
"bitmap window") is requested; if it is not available, it is generated from
the range list and then put into the cache.

The remaining question is how large the range list would be in memory.
Basically, its size would be comparable to an RLE version of the bitmap. In
contrast to a raw RLE version, however, we would have to add the start
cluster to each entry in order to be able to perform binary search, and we
would omit free and/or data clusters. So we would have 4 bits for the
cluster type, let's say 12 bits for the cluster count, and of course 64 bits
for the first cluster index. Or, for maximum efficiency, we would have
64 - 4 - 6 = 54 bits for the cluster index, 4 bits for the type and then
6 bits for the cluster count.

The first variant gives us 10 bytes per metadata range, the second 8.
Considering one refcount block can handle cluster_size / 2 entries and one
L2 table can handle cluster_size / 8 entries, we have (for images with a
cluster size of 64 kB) a ratio of about 1/32768 refcount blocks per cluster
and 1/8192 L2 tables per cluster. I guess we therefore have a metadata
ratio of about 1/6000. At the worst, each metadata cluster requires its own
range list entry, which at 10 bytes per entry means less than 30 kB for the
list of a 1 TB image with 64 kB clusters. I think that is acceptable.

We could compress that list even more by making it a real RLE version of
the bitmap, removing the cluster index from each entry; remember that for
this mixed range list/bitmap approach we no longer need to be able to
perform exact binary search, but only need to quickly seek to the beginning
of a bitmap window. This can be achieved by forcing breaks in the range
list at every window border and keeping track of those offsets along with
the corresponding bitmap window index. When we want to generate a bitmap
window, we look up the start offset in the range list (constant time), then
generate the window (linear in the window size), and can then perform
constant-time lookups for each overlap check in that window.

I think that could greatly speed things up and would also allow us to
always perform range checks on data structures not kept in memory (inactive
L1 and L2 tables). The only question remaining to me is whether that
caching is actually feasible, or whether binary search into the range list
(which would then have to include the cluster index for each entry) would
be faster than generating bitmap windows, which might suffer from
ping-pong effects.

Max
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
On Fri, Sep 12, 2014 at 11:21:37AM +0800, Zhang Haoyu wrote:
> >>> > > If virtio-blk and virtio-serial share an IRQ, the guest operating
> >>> > > system has to check each virtqueue for activity. Maybe there is some
> >>> > > inefficiency doing that.
> >>> > > AFAIK virtio-serial registers 64 virtqueues (on 31 ports + console)
> >>> > > even if everything is unused.
> >>> >
> >>> > That could be the case if MSI is disabled.
> >>>
> >>> Do the windows virtio drivers enable MSIs, in their inf file?
> >>
> >> It depends on the version of the drivers, but it is a reasonable guess
> >> at what differs between Linux and Windows. Haoyu, can you give us the
> >> output of lspci from a Linux guest?
> >>
> > I made a test with fio on rhel-6.5 guest, the same degradation happened too,
> > this degradation can be reproduced on rhel6.5 guest 100%.
> > virtio_console module installed:
> > 64K-write-sequence: 285 MBPS, 4380 IOPS
> > virtio_console module uninstalled:
> > 64K-write-sequence: 370 MBPS, 5670 IOPS
> >
> I use top -d 1 -H -p to monitor the cpu usage, and found that,
> virtio_console module installed:
> qemu main thread cpu usage: 98%
> virtio_console module uninstalled:
> qemu main thread cpu usage: 60%
>
> perf top -p result,
>
> virtio_console module installed:
>    PerfTop:    9868 irqs/sec  kernel:76.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
> --
>     11.80%  [kernel]                  [k] _raw_spin_lock_irqsave
>      8.42%  [kernel]                  [k] _raw_spin_unlock_irqrestore
>      7.33%  [kernel]                  [k] fget_light
>      6.28%  [kernel]                  [k] fput
>      3.61%  [kernel]                  [k] do_sys_poll
>      3.30%  qemu-system-x86_64        [.] qcow2_check_metadata_overlap
>      3.10%  [kernel]                  [k] __pollwait
>      2.15%  qemu-system-x86_64        [.] qemu_iohandler_poll
>      1.44%  libglib-2.0.so.0.3200.4   [.] g_array_append_vals
>      1.36%  libc-2.13.so              [.] 0x0011fc2a
>      1.31%  libpthread-2.13.so        [.] pthread_mutex_lock
>      1.24%  libglib-2.0.so.0.3200.4   [.] 0x0001f961
>      1.20%  libpthread-2.13.so        [.] __pthread_mutex_unlock_usercnt
>      0.99%  [kernel]                  [k] eventfd_poll
>      0.98%  [vdso]                    [.] 0x0771
>      0.97%  [kernel]                  [k] remove_wait_queue
>      0.96%  qemu-system-x86_64        [.] qemu_iohandler_fill
>      0.95%  [kernel]                  [k] add_wait_queue
>      0.69%  [kernel]                  [k] __srcu_read_lock
>      0.58%  [kernel]                  [k] poll_freewait
>      0.57%  [kernel]                  [k] _raw_spin_lock_irq
>      0.54%  [kernel]                  [k] __srcu_read_unlock
>      0.47%  [kernel]                  [k] copy_user_enhanced_fast_string
>      0.46%  [kvm_intel]               [k] vmx_vcpu_run
>      0.46%  [kvm]                     [k] vcpu_enter_guest
>      0.42%  [kernel]                  [k] tcp_poll
>      0.41%  [kernel]                  [k] system_call_after_swapgs
>      0.40%  libglib-2.0.so.0.3200.4   [.] g_slice_alloc
>      0.40%  [kernel]                  [k] system_call
>      0.38%  libpthread-2.13.so        [.] 0xe18d
>      0.38%  libglib-2.0.so.0.3200.4   [.] g_slice_free1
>      0.38%  qemu-system-x86_64        [.] address_space_translate_internal
>      0.38%  [kernel]                  [k] _raw_spin_lock
>      0.37%  qemu-system-x86_64        [.] phys_page_find
>      0.36%  [kernel]                  [k] get_page_from_freelist
>      0.35%  [kernel]                  [k] sock_poll
>      0.34%  [kernel]                  [k] fsnotify
>      0.31%  libglib-2.0.so.0.3200.4   [.] g_main_context_check
>      0.30%  [kernel]                  [k] do_direct_IO
>      0.29%  libpthread-2.13.so        [.] pthread_getspecific
>
> virtio_console module uninstalled:
>    PerfTop:    9138 irqs/sec  kernel:71.7%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
> --
>      5.72%  qemu-system-x86_64        [.] qcow2_check_metadata_overlap
>      4.51%  [kernel]                  [k] fget_light
>      3.98%  [kernel]                  [k] _raw_spin_lock_irqsave
>      2.55%  [kernel]                  [k] fput
>      2.48%  libpthread-2.13.so        [.] pthread_mutex_lock
>      2.46%  [kernel]                  [k] _raw_spin_unlock_irqrestore
>      2.21%  libpthread-2.13.so        [.] __pthread_mutex_unlock_usercnt
>      1.71%  [vdso]                    [.] 0x060c
>      1.68%  libc-2.13.so              [.] 0x000e751f
>      1.64%  libglib-2.0.so.0.3200.4   [.] 0x0004fca0
>      1.20%  [kernel]                  [k] __srcu_read_lock
>      1.14%  [kernel]                  [k] do_s
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
>>> > > If virtio-blk and virtio-serial share an IRQ, the guest operating
>>> > > system has to check each virtqueue for activity. Maybe there is some
>>> > > inefficiency doing that.
>>> > > AFAIK virtio-serial registers 64 virtqueues (on 31 ports + console)
>>> > > even if everything is unused.
>>> >
>>> > That could be the case if MSI is disabled.
>>>
>>> Do the windows virtio drivers enable MSIs, in their inf file?
>>
>> It depends on the version of the drivers, but it is a reasonable guess
>> at what differs between Linux and Windows. Haoyu, can you give us the
>> output of lspci from a Linux guest?
>>
> I made a test with fio on rhel-6.5 guest, the same degradation happened too,
> this degradation can be reproduced on rhel6.5 guest 100%.
> virtio_console module installed:
> 64K-write-sequence: 285 MBPS, 4380 IOPS
> virtio_console module uninstalled:
> 64K-write-sequence: 370 MBPS, 5670 IOPS

I use top -d 1 -H -p to monitor the cpu usage, and found that,
virtio_console module installed:
qemu main thread cpu usage: 98%
virtio_console module uninstalled:
qemu main thread cpu usage: 60%

perf top -p result,

virtio_console module installed:
   PerfTop:    9868 irqs/sec  kernel:76.4%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
--
    11.80%  [kernel]                  [k] _raw_spin_lock_irqsave
     8.42%  [kernel]                  [k] _raw_spin_unlock_irqrestore
     7.33%  [kernel]                  [k] fget_light
     6.28%  [kernel]                  [k] fput
     3.61%  [kernel]                  [k] do_sys_poll
     3.30%  qemu-system-x86_64        [.] qcow2_check_metadata_overlap
     3.10%  [kernel]                  [k] __pollwait
     2.15%  qemu-system-x86_64        [.] qemu_iohandler_poll
     1.44%  libglib-2.0.so.0.3200.4   [.] g_array_append_vals
     1.36%  libc-2.13.so              [.] 0x0011fc2a
     1.31%  libpthread-2.13.so        [.] pthread_mutex_lock
     1.24%  libglib-2.0.so.0.3200.4   [.] 0x0001f961
     1.20%  libpthread-2.13.so        [.] __pthread_mutex_unlock_usercnt
     0.99%  [kernel]                  [k] eventfd_poll
     0.98%  [vdso]                    [.] 0x0771
     0.97%  [kernel]                  [k] remove_wait_queue
     0.96%  qemu-system-x86_64        [.] qemu_iohandler_fill
     0.95%  [kernel]                  [k] add_wait_queue
     0.69%  [kernel]                  [k] __srcu_read_lock
     0.58%  [kernel]                  [k] poll_freewait
     0.57%  [kernel]                  [k] _raw_spin_lock_irq
     0.54%  [kernel]                  [k] __srcu_read_unlock
     0.47%  [kernel]                  [k] copy_user_enhanced_fast_string
     0.46%  [kvm_intel]               [k] vmx_vcpu_run
     0.46%  [kvm]                     [k] vcpu_enter_guest
     0.42%  [kernel]                  [k] tcp_poll
     0.41%  [kernel]                  [k] system_call_after_swapgs
     0.40%  libglib-2.0.so.0.3200.4   [.] g_slice_alloc
     0.40%  [kernel]                  [k] system_call
     0.38%  libpthread-2.13.so        [.] 0xe18d
     0.38%  libglib-2.0.so.0.3200.4   [.] g_slice_free1
     0.38%  qemu-system-x86_64        [.] address_space_translate_internal
     0.38%  [kernel]                  [k] _raw_spin_lock
     0.37%  qemu-system-x86_64        [.] phys_page_find
     0.36%  [kernel]                  [k] get_page_from_freelist
     0.35%  [kernel]                  [k] sock_poll
     0.34%  [kernel]                  [k] fsnotify
     0.31%  libglib-2.0.so.0.3200.4   [.] g_main_context_check
     0.30%  [kernel]                  [k] do_direct_IO
     0.29%  libpthread-2.13.so        [.] pthread_getspecific

virtio_console module uninstalled:
   PerfTop:    9138 irqs/sec  kernel:71.7%  exact:  0.0% [4000Hz cycles],  (target_pid: 88381)
--
     5.72%  qemu-system-x86_64        [.] qcow2_check_metadata_overlap
     4.51%  [kernel]                  [k] fget_light
     3.98%  [kernel]                  [k] _raw_spin_lock_irqsave
     2.55%  [kernel]                  [k] fput
     2.48%  libpthread-2.13.so        [.] pthread_mutex_lock
     2.46%  [kernel]                  [k] _raw_spin_unlock_irqrestore
     2.21%  libpthread-2.13.so        [.] __pthread_mutex_unlock_usercnt
     1.71%  [vdso]                    [.] 0x060c
     1.68%  libc-2.13.so              [.] 0x000e751f
     1.64%  libglib-2.0.so.0.3200.4   [.] 0x0004fca0
     1.20%  [kernel]                  [k] __srcu_read_lock
     1.14%  [kernel]                  [k] do_sys_poll
     0.96%  [kernel]                  [k] _raw_spin_lock_irq
     0.95%  [kernel]                  [k] __pollwait
     0.91%  [kernel]                  [k] __srcu_read_unlock
     0.78%  [kernel]                  [k] tcp_poll
     0.74%  [
Re: [Qemu-devel] [question] virtio-blk performance degradation happened with virtio-serial
On (Fri) 29 Aug 2014 [15:45:30], Zhang Haoyu wrote:
> Hi, all
>
> I start a VM with virtio-serial (default ports number: 31), and found that
> virtio-blk performance degradation happened, about 25%; this problem can
> be reproduced 100%.
> without virtio-serial:
> 4k-read-random 1186 IOPS
> with virtio-serial:
> 4k-read-random 871 IOPS
>
> but if I use the max_ports=2 option to limit the max number of
> virtio-serial ports, then the IO performance degradation is not so
> serious, about 5%.
>
> And, ide performance degradation does not happen with virtio-serial.

Pretty sure it's related to MSI vectors in use.  It's possible that the
virtio-serial device takes up all the available vectors in the guest,
leaving old-style irqs for the virtio-blk device.

If you restrict the number of vectors the virtio-serial device gets (using
the -device virtio-serial-pci,vectors= param), does that make things better
for you?

		Amit
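Amit's suggestion translates to something like the invocation below. Only the vectors= property comes from his mail; the image path, memory size, the particular vector count and the verification commands are illustrative assumptions:

```shell
# Cap the MSI-X vectors virtio-serial may claim, so virtio-blk can still
# get its own. (test.qcow2, -m 4096 and vectors=4 are placeholders.)
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive file=test.qcow2,if=virtio \
    -device virtio-serial-pci,vectors=4

# Then, inside a Linux guest, check which devices ended up with MSI-X:
#   grep virtio /proc/interrupts
#   lspci -vv | grep -A2 'MSI-X'
```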