On Sun, Mar 10, 2013 at 10:14 AM, Dietmar Maurer <diet...@proxmox.com> wrote:
>> The objection to this approach has been performance.  Exporting vmstate
>> and disk data over UNIX domain sockets to an external process incurs IPC
>> overhead.  This prototype shows that even Python code hacked up in a day
>> achieves decent performance.
>>
>> I'm leaving benchmarking as an exercise for the reader.  I tested a single
>> scenario with a 16 GB raw disk and 1 GB RAM, backup time was 28% longer
>> (+30 seconds) than Dietmar's series.  Below are a few starting points:
>
> I already have an implementation based on nbd.c, and that also shows
> considerable overhead.  A Backup task is extremely performance critical,
> so any additional overhead is bad.  I can see the advantage to move the
> code out of qemu, but I want to avoid that overhead by all means.  So are
> there any other ideas to avoid the overhead of a socket based IPC?
There are enough avenues to investigate better performance that I think the
solution is to just profile and optimize a little.  More ideas below...

>> * I moved the buffer_is_zero() check from the VMA writer into the block
>>   job.  We now skip writing zero clusters and the file contains no extents
>>   for them.
>
> With VMA we track zero blocks at 4KB level.  If you move that code, you
> need to test for zero regions two times, because NBD offers no way to pass
> that information (sending multiple discard message at 4KB level is no
> option because that adds to much overhead).

About 4 KB zero tracking:

The vma.py module does not check for zeros, the mask is always 0xffff.  There
is a pathological case of a disk with every 2nd 4 KB block zeroed; here
vma.py would create a fully-allocated file.  But the more common case is that
the whole 64 KB cluster is zero - think about the fact that qcow2/qed only do
discard (zero clusters) at the cluster level (64 KB).

Also, if you try hard to zero 4 KB blocks then two things can happen:

1. You fragment the file, saving space now but turning sequential access into
   random access once those zero regions get filled in.

2. The underlying file system ignores the holes to reduce fragmentation and
   all the effort is wasted - I guess this can be checked with
   ioctl(FIEMAP/FIBMAP) to see how ext4/xfs/btrfs handle it.

For these reasons I figure it is better to call buffer_is_zero() in the block
job and not make use of the VMA 4 KB zero feature.  We'll catch the big zero
regions anyway.

About NBD performance:

The trick to good NBD performance is to pipeline commands instead of issuing
them synchronously.  The protocol supports it; there is nothing stopping us
from sending a single buffer with 16 discard commands and getting a single
buffer back with 16 replies.  This is like setsockopt(TCP_CORK) or
sendmsg(MSG_MORE).

> NBD does not allow to pass additional infos, and it is quite slow.  So it
> would be even faster to write to a pipe and define our own backup protocol
> instead (less syscalls).  But I have not done any performance measurement
> for that.

We're not talking about 10x or even 2x overhead; this is within the range of
incremental optimization.  There are two performance factors that can be
controlled:

1. Frequency of writes - use a larger block job buffer like I suggested to
   reduce back and forth.

2. Asynchronous I/O like block/mirror.c.  Don't wait for the NBD write to
   complete.

I suspect that implementing either or both of these will reduce overhead to
<10%.

> Many thanks for your efforts on that topic.  It would be great if we can
> get at least the first patch "add basic backup support to block driver"
> into qemu.  This is the basic framework, and a starting point for anyone
> who want to play around with backup.

Feel free to take patches that are useful when you create your patch series.
For your Patch 2/6, you could even squash all or part of my backup.c changes.

Stefan
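P.S.  In case a concrete example helps, here is a rough, untested Python
sketch of the two ideas combined: skip all-zero 64 KB clusters in the block
job, and keep a batch of write requests in flight instead of waiting for each
reply.  The framing, the names (backup_image, drain_acks) and the 16-request
batch size are placeholders I made up for illustration - this is not the NBD
wire format and not vma.py code.

    #!/usr/bin/env python
    # Sketch only: pipelined writes that skip all-zero 64 KB clusters.
    # The request framing is made up for illustration, not a real protocol.

    import socket
    import struct

    CLUSTER_SIZE = 64 * 1024          # qcow2/qed cluster size
    BATCH_SIZE = 16                   # requests in flight before draining acks
    ZERO_CLUSTER = b'\0' * CLUSTER_SIZE

    def buffer_is_zero(buf):
        """Python stand-in for QEMU's buffer_is_zero()."""
        return buf == ZERO_CLUSTER[:len(buf)]

    def drain_acks(sock, count):
        """Read 'count' fixed-size (4-byte status) acks back to back."""
        need = 4 * count
        buf = b''
        while len(buf) < need:
            chunk = sock.recv(need - len(buf))
            if not chunk:
                raise IOError('server closed connection')
            buf += chunk

    def backup_image(image_path, sock):
        """Stream non-zero clusters to sock, pipelining BATCH_SIZE requests."""
        # TCP_CORK (Linux) holds back partial frames so a whole batch goes
        # out in as few segments as possible; ignore it on other hosts.
        try:
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
        except (AttributeError, OSError):
            pass

        outstanding = 0
        offset = 0
        with open(image_path, 'rb') as image:
            while True:
                cluster = image.read(CLUSTER_SIZE)
                if not cluster:
                    break
                if not buffer_is_zero(cluster):
                    # Placeholder framing: offset + length header, then data
                    sock.sendall(struct.pack('>QI', offset, len(cluster)))
                    sock.sendall(cluster)
                    outstanding += 1
                offset += len(cluster)

                # Drain acks once per batch instead of once per request.
                if outstanding == BATCH_SIZE:
                    drain_acks(sock, outstanding)
                    outstanding = 0
        if outstanding:
            drain_acks(sock, outstanding)

The point is just that the round-trip latency is paid once per batch rather
than once per cluster, which is where I expect most of the current overhead
to come from.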