Re: qcow2 overlay performance
On Thu 22 Oct 2020 10:56:46 PM CEST, Yoonho Park wrote:
> I am still seeing the performance degradation, but I did find something interesting (and promising) with qemu 5.1.50. Enabling the subcluster allocation support in qemu 5.1.50 (extended_l2=on) eliminates the performance degradation of adding an overlay. Without subcluster allocation enabled, I still see the performance degradation in qemu 5.1.50 when adding an overlay. For these experiments, I used 64K blocks and a 2M qcow2 cluster size.

Well, 2MB clusters have 64KB subclusters, so your request size is equal to the subcluster size. If the requests are aligned there should be no copy-on-write and therefore no performance degradation if you have backing images.

Berto
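For reference, creating an image with subcluster allocation enabled might look like this (a sketch assuming a qemu-img build with extended_l2 support, i.e. 5.1.50/5.2 or newer; file names and the 16G size are placeholders):

# 2M clusters with extended_l2=on give 32 subclusters of 64K each.
qemu-img create -f qcow2 -o cluster_size=2M,extended_l2=on disk.qcow2 16G

# The same options apply when creating an overlay by hand:
qemu-img create -f qcow2 -F qcow2 -b disk.qcow2 -o cluster_size=2M,extended_l2=on overlay.qcow2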
Re: qcow2 overlay performance
I am still seeing the performance degradation, but I did find something interesting (and promising) with qemu 5.1.50. Enabling the subcluster allocation support in qemu 5.1.50 (extended_l2=on) eliminates the performance degradation of adding an overlay. Without subcluster allocation enabled, I still see the performance degradation in qemu 5.1.50 when adding an overlay. For these experiments, I used 64K blocks and a 2M qcow2 cluster size.

On Mon, Oct 19, 2020 at 12:51 PM Alberto Garcia wrote:
> On Thu 27 Aug 2020 06:29:15 PM CEST, Yoonho Park wrote:
> > Below is the data with the cache disabled ("virsh attach-disk ... --cache none"). I added the previous data for reference. Overall, random read performance was not affected significantly. This makes sense because a cache is probably not going to help random read performance much. BTW, how big is the cache by default? Random write performance for 4K blocks seems more "sane" now. Random write performance for 64K blocks is interesting because base image (0 overlays) performance is 2X slower than with 1-5 overlays. We believe this is because the random writes to an overlay actually turn into sequential writes (appends to the overlay). Does this make sense?
> >
> > NO CACHE
> >
> >                   4K blocks                         64K blocks
> > olays  rd bw  rd iops  wr bw  wr iops   rd bw  rd iops  wr bw  wr iops
> > 0      4478   1119     4684   1171      57001  890      42050  657
> > 1      4490   1122     2503   625       56656  885      93483  1460
>
> I haven't been able to reproduce this (I tried the scenarios with 0 and 1 overlays), did you figure out anything new or what's the situation?
>
> Berto
Re: qcow2 overlay performance
On Thu 27 Aug 2020 06:29:15 PM CEST, Yoonho Park wrote:
> Below is the data with the cache disabled ("virsh attach-disk ... --cache none"). I added the previous data for reference. Overall, random read performance was not affected significantly. This makes sense because a cache is probably not going to help random read performance much. BTW, how big is the cache by default? Random write performance for 4K blocks seems more "sane" now. Random write performance for 64K blocks is interesting because base image (0 overlays) performance is 2X slower than with 1-5 overlays. We believe this is because the random writes to an overlay actually turn into sequential writes (appends to the overlay). Does this make sense?
>
> NO CACHE
>
>                   4K blocks                         64K blocks
> olays  rd bw  rd iops  wr bw  wr iops   rd bw  rd iops  wr bw  wr iops
> 0      4478   1119     4684   1171      57001  890      42050  657
> 1      4490   1122     2503   625       56656  885      93483  1460

I haven't been able to reproduce this (I tried the scenarios with 0 and 1 overlays), did you figure out anything new or what's the situation?

Berto
Re: qcow2 overlay performance
Below is the data with the cache disabled ("virsh attach-disk ... --cache none"). I added the previous data for reference. Overall, random read performance was not affected significantly. This makes sense because a cache is probably not going to help random read performance much. BTW, how big is the cache by default? Random write performance for 4K blocks seems more "sane" now. Random write performance for 64K blocks is interesting because base image (0 overlays) performance is 2X slower than with 1-5 overlays. We believe this is because the random writes to an overlay actually turn into sequential writes (appends to the overlay). Does this make sense?

NO CACHE

                  4K blocks                          64K blocks
olays  rd bw  rd iops  wr bw   wr iops   rd bw  rd iops  wr bw   wr iops
0      4478   1119     4684    1171      57001  890      42050   657
1      4490   1122     2503    625       56656  885      93483   1460
2      4385   1096     2425    606       56055  875      94445   1475
3      4334   1083     2307    576       55422  865      95826   1497
4      4356   1089     2168    542       56070  876      95957   1499
5      4234   1058     2308    577       54039  844      92936   1452

DEFAULT CACHE (WRITEBACK)

                  4K blocks                          64K blocks
olays  rd bw  rd iops  wr bw   wr iops   rd bw  rd iops  wr bw   wr iops
0      4510   1127     438028  109507    67854  1060     521808  8153
1      4692   1173     2924    731       66801  1043     104297  1629
2      4524   1131     2781    695       66801  1043     104297  1629
3      4573   1143     3034    758       65500  1023     95627   1494
4      4556   1139     2971    742       67973  1062     108099  1689
5      4471   1117     2937    734       66615  1040     98472   1538

On Wed, Aug 26, 2020 at 9:18 AM Kevin Wolf wrote:
> Am 26.08.2020 um 02:46 hat Yoonho Park geschrieben:
> > I have been measuring the performance of qcow2 overlays, and I am hoping to get some help in understanding the data I collected. In my experiments, I created a VM and attached a 16G qcow2 disk to it using "qemu-img create" and "virsh attach-disk". I use fio to fill it. I create some number of snapshots (overlays) using "virsh snapshot-create-as". To mimic user activity between taking snapshots, I use fio to randomly write to 10% of each overlay right after I create it. After creating the overlays, I use fio to measure random read performance and random write performance with 2 different block sizes, 4K and 64K. 64K is the qcow2 cluster size used by the 16G qcow2 disk and the overlays (verified with "qemu-img info"). fio is using the attached disk as a block device to avoid as much file system overhead as possible. The VM, 16G disk, and snapshots (overlays) all reside on local disk. Below are the measurements I collected for up to 5 overlays.
> >
> >                   4K blocks                          64K blocks
> > olays  rd bw  rd iops  wr bw   wr iops   rd bw  rd iops  wr bw   wr iops
> > 0      4510   1127     438028  109507    67854  1060     521808  8153
> > 1      4692   1173     2924    731       66801  1043     104297  1629
> > 2      4524   1131     2781    695       66801  1043     104297  1629
> > 3      4573   1143     3034    758       65500  1023     95627   1494
> > 4      4556   1139     2971    742       67973  1062     108099  1689
> > 5      4471   1117     2937    734       66615  1040     98472   1538
> >
> > Read performance is not affected by overlays. However, write performance drops even with a single overlay. My understanding is that writing 4K blocks requires a read-modify-write because you must fetch a complete cluster from deeper in the overlay chain before writing to the active overlay. However, this does not explain the drop in performance when writing 64K blocks. The performance drop is not as significant, but if the write block size matches the cluster size then it seems that there should not be any performance drop because the write can go directly to the active overlay.
>
> Can you share the QEMU command line you used?
>
> As you say, it is expected that layer 0 is a bit faster, however not to this degree. My guess would be that you use the default cache mode (which includes cache.direct=off), so your results are skewed because the first requests will only write to memory (the host page cache) and only later requests will actually hit the disk.
>
> For benchmarking, you should always use cache.direct=on (or an alias that contains it, such as cache=none).
>
> > Another issue I hit is that I cannot set or change the cluster size of overlays. Is this possible with "virsh snapshot-create-as"?
>
> That's a libvirt question. Peter, can you help?
>
> > I am using qemu-system-x86_64 version 4.2.0 and virsh version 6.0.0.
> >
> > Thank you for any insights or advice you have.
>
> Kevin
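For reference, a sketch of the cache-disabled attachment used for the NO CACHE runs (the domain, path, and target variables are placeholders in the same style as the creation commands elsewhere in the thread):

virsh attach-disk --domain ${domain} ${dir}/${disk}.qcow2 --target ${disk} --cache none --persistent --config --live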
Re: qcow2 overlay performance
I used strace to collect the writes, and as far as I can tell they are aligned to the cluster size (64K). Below are some examples...

1520414 pwritev(35, [{iov_base="\5\277z\314\24\305\177\r\340\327\f:e\222\10\33\374\232Q;FuN\t\0\0\325\275\0\0\0\0"..., iov_len=4096}, {iov_base="v\24\324\337\347\364i\26\216\202\277\361\262N\23\22Q\360\360\234\366\360J\1\0\20\325\275\0\0\0\0"..., iov_len=61440}], 2, 326696960
1520424 pwrite64(35, "\200\263\303V)\2779Zpv\30Q\203\2142\24\316\16\353.\4\251T\35\331av\376\252}J\33"..., 65536, 381353984
1520424 pwrite64(35, "\0\0\333\204\0\0\0\0\nT\0a\17JR\v\201\n\33\302\301\230(\16P\341\2\263\n2\253\22"..., 65536, 381419520
1520419 pwrite64(35, "\220\255\361.\v$C1\2625\26\217\10\315\22\17\266\306\t\367\25\22\274\16\3268\206\333\275\206F\22"..., 65536, 381485056
1520424 pwrite64(35, "Z\224\265`\325t0\2\0\0005\271\1\0\0\0Q\266\367\"\323:\247\f\0\0\205\207\1\0\0\0"..., 65536, 381550592
1520419 pwritev(35, [{iov_base="M=Fsw\234pd\0\09\32\1\0\0\0\365\310#V\\\303\356\25\36\371I\246Y\255\202\6"..., iov_len=61440}, {iov_base="\314\242k\370\345;w\22\0\3609\32\1\0\0\0\213n\210#\225\206)\31\321\215+m\16\237\347\36"..., iov_len=4096}], 2, 381616128

On Wed, Aug 26, 2020 at 2:00 PM Alberto Garcia wrote:
> On Wed 26 Aug 2020 03:18:32 PM CEST, Kevin Wolf wrote:
> >> My understanding is that writing 4K blocks requires a read-modify-write because you must fetch a complete cluster from deeper in the overlay chain before writing to the active overlay. However, this does not explain the drop in performance when writing 64K blocks.
>
> Your understanding is correct. Apart from what Kevin wrote, I assume that the write requests are always aligned to the cluster size, is that right?
>
> Berto
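A trace like the one above could be captured along these lines (a sketch; the process selection and output path are assumptions, not the exact command used):

# Follow the QEMU I/O threads and log the write syscalls issued against the image file.
strace -f -tt -e trace=pwrite64,pwritev -p $(pgrep -f qemu-system-x86_64) -o writes.log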
Re: qcow2 overlay performance
Great. I will give your patches a try. Also, your workaround suggestion and a discussion with a colleague gave me another idea for an experiment. Is it possible that some of the overhead I am seeing is from the operations necessary to increase the size of the overlay? Would it make sense to use the workaround to create a "full-sized" overlay and then measure fio performance?

On Wed, Aug 26, 2020 at 10:57 AM Peter Krempa wrote:
> On Wed, Aug 26, 2020 at 15:30:03 +0200, Peter Krempa wrote:
> > On Wed, Aug 26, 2020 at 15:18:32 +0200, Kevin Wolf wrote:
> > > Am 26.08.2020 um 02:46 hat Yoonho Park geschrieben:
> > > > Another issue I hit is that I cannot set or change the cluster size of overlays. Is this possible with "virsh snapshot-create-as"?
> > >
> > > That's a libvirt question. Peter, can you help?
>
> [...]
>
> > If you figure out that cluster size matters a lot and qemu will not want to change the default, please file a feature request for libvirt.
>
> After sending the previous mail I thought about it for a bit and IMO a majority of cases would be fixed by using the cluster size of the original image in the newly created image. I've created patches which do this with libvirt. You can get them here:
>
> https://www.redhat.com/archives/libvir-list/2020-August/msg00995.html
Re: qcow2 overlay performance
I create the attached disk with the following commands. The "qemu-img info" is to double-check the cluster size. I am running the same experiments now with "--cache none" added to the "virsh attach-disk" command. Is this sufficient to avoid the host page cache?

qemu-img create -f qcow2 ${dir}/${disk}.qcow2 ${size} -o preallocation=full,cluster_size=${cs}
qemu-img info ${dir}/${disk}.qcow2
virsh attach-disk --domain ${domain} ${dir}/${disk}.qcow2 --target ${disk} --persistent --config --live

On Wed, Aug 26, 2020 at 9:18 AM Kevin Wolf wrote:
> Am 26.08.2020 um 02:46 hat Yoonho Park geschrieben:
> > I have been measuring the performance of qcow2 overlays, and I am hoping to get some help in understanding the data I collected. In my experiments, I created a VM and attached a 16G qcow2 disk to it using "qemu-img create" and "virsh attach-disk". I use fio to fill it. I create some number of snapshots (overlays) using "virsh snapshot-create-as". To mimic user activity between taking snapshots, I use fio to randomly write to 10% of each overlay right after I create it. After creating the overlays, I use fio to measure random read performance and random write performance with 2 different block sizes, 4K and 64K. 64K is the qcow2 cluster size used by the 16G qcow2 disk and the overlays (verified with "qemu-img info"). fio is using the attached disk as a block device to avoid as much file system overhead as possible. The VM, 16G disk, and snapshots (overlays) all reside on local disk. Below are the measurements I collected for up to 5 overlays.
> >
> >                   4K blocks                          64K blocks
> > olays  rd bw  rd iops  wr bw   wr iops   rd bw  rd iops  wr bw   wr iops
> > 0      4510   1127     438028  109507    67854  1060     521808  8153
> > 1      4692   1173     2924    731       66801  1043     104297  1629
> > 2      4524   1131     2781    695       66801  1043     104297  1629
> > 3      4573   1143     3034    758       65500  1023     95627   1494
> > 4      4556   1139     2971    742       67973  1062     108099  1689
> > 5      4471   1117     2937    734       66615  1040     98472   1538
> >
> > Read performance is not affected by overlays. However, write performance drops even with a single overlay. My understanding is that writing 4K blocks requires a read-modify-write because you must fetch a complete cluster from deeper in the overlay chain before writing to the active overlay. However, this does not explain the drop in performance when writing 64K blocks. The performance drop is not as significant, but if the write block size matches the cluster size then it seems that there should not be any performance drop because the write can go directly to the active overlay.
>
> Can you share the QEMU command line you used?
>
> As you say, it is expected that layer 0 is a bit faster, however not to this degree. My guess would be that you use the default cache mode (which includes cache.direct=off), so your results are skewed because the first requests will only write to memory (the host page cache) and only later requests will actually hit the disk.
>
> For benchmarking, you should always use cache.direct=on (or an alias that contains it, such as cache=none).
>
> > Another issue I hit is that I cannot set or change the cluster size of overlays. Is this possible with "virsh snapshot-create-as"?
>
> That's a libvirt question. Peter, can you help?
>
> > I am using qemu-system-x86_64 version 4.2.0 and virsh version 6.0.0.
> >
> > Thank you for any insights or advice you have.
>
> Kevin
Re: qcow2 overlay performance
On Wed 26 Aug 2020 03:18:32 PM CEST, Kevin Wolf wrote:
>> My understanding is that writing 4K blocks requires a read-modify-write because you must fetch a complete cluster from deeper in the overlay chain before writing to the active overlay. However, this does not explain the drop in performance when writing 64K blocks.

Your understanding is correct. Apart from what Kevin wrote, I assume that the write requests are always aligned to the cluster size, is that right?

Berto
Re: qcow2 overlay performance
On Wed, Aug 26, 2020 at 15:30:03 +0200, Peter Krempa wrote:
> On Wed, Aug 26, 2020 at 15:18:32 +0200, Kevin Wolf wrote:
> > Am 26.08.2020 um 02:46 hat Yoonho Park geschrieben:
> > > Another issue I hit is that I cannot set or change the cluster size of overlays. Is this possible with "virsh snapshot-create-as"?
> >
> > That's a libvirt question. Peter, can you help?

[...]

> If you figure out that cluster size matters a lot and qemu will not want to change the default, please file a feature request for libvirt.

After sending the previous mail I thought about it for a bit and IMO a majority of cases would be fixed by using the cluster size of the original image in the newly created image. I've created patches which do this with libvirt. You can get them here:

https://www.redhat.com/archives/libvir-list/2020-August/msg00995.html
Re: qcow2 overlay performance
On Wed, Aug 26, 2020 at 15:18:32 +0200, Kevin Wolf wrote:
> Am 26.08.2020 um 02:46 hat Yoonho Park geschrieben:
> > Another issue I hit is that I cannot set or change the cluster size of overlays. Is this possible with "virsh snapshot-create-as"?
>
> That's a libvirt question. Peter, can you help?

Currently the libvirt snapshot API doesn't support configuring the cluster size of the new image (it should be straightforward to implement now that we create images via blockdev-create; the only open question is how to expose it in the XML).

There's a reasonably straightforward workaround (at least for testing), depending on how you want to approach it. You can create your own overlay file using qemu-img create (make sure to properly specify -F and -b) with your desired cluster size. You then use 'virsh snapshot-create --reuse-external', which expects that the images described by the snapshot XML already exist and uses them (without touching their metadata at all) rather than creating new files.

If you figure out that cluster size matters a lot and qemu will not want to change the default, please file a feature request for libvirt. Either way, the best performing option should be the default though.
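A rough sketch of that workaround, here using the snapshot-create-as convenience form of --reuse-external instead of an explicit snapshot XML (image paths, domain name, target, and the 2M cluster size are placeholders):

# Pre-create the overlay with the desired cluster size and an explicit backing file/format.
qemu-img create -f qcow2 -F qcow2 -b /var/lib/libvirt/images/base.qcow2 -o cluster_size=2M /var/lib/libvirt/images/overlay.qcow2

# Reuse the pre-created file instead of letting libvirt create one.
virsh snapshot-create-as --domain mydomain snap1 --disk-only --reuse-external --diskspec vda,file=/var/lib/libvirt/images/overlay.qcow2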
Re: qcow2 overlay performance
Am 26.08.2020 um 02:46 hat Yoonho Park geschrieben:
> I have been measuring the performance of qcow2 overlays, and I am hoping to get some help in understanding the data I collected. In my experiments, I created a VM and attached a 16G qcow2 disk to it using "qemu-img create" and "virsh attach-disk". I use fio to fill it. I create some number of snapshots (overlays) using "virsh snapshot-create-as". To mimic user activity between taking snapshots, I use fio to randomly write to 10% of each overlay right after I create it. After creating the overlays, I use fio to measure random read performance and random write performance with 2 different block sizes, 4K and 64K. 64K is the qcow2 cluster size used by the 16G qcow2 disk and the overlays (verified with "qemu-img info"). fio is using the attached disk as a block device to avoid as much file system overhead as possible. The VM, 16G disk, and snapshots (overlays) all reside on local disk. Below are the measurements I collected for up to 5 overlays.
>
>                   4K blocks                          64K blocks
> olays  rd bw  rd iops  wr bw   wr iops   rd bw  rd iops  wr bw   wr iops
> 0      4510   1127     438028  109507    67854  1060     521808  8153
> 1      4692   1173     2924    731       66801  1043     104297  1629
> 2      4524   1131     2781    695       66801  1043     104297  1629
> 3      4573   1143     3034    758       65500  1023     95627   1494
> 4      4556   1139     2971    742       67973  1062     108099  1689
> 5      4471   1117     2937    734       66615  1040     98472   1538
>
> Read performance is not affected by overlays. However, write performance drops even with a single overlay. My understanding is that writing 4K blocks requires a read-modify-write because you must fetch a complete cluster from deeper in the overlay chain before writing to the active overlay. However, this does not explain the drop in performance when writing 64K blocks. The performance drop is not as significant, but if the write block size matches the cluster size then it seems that there should not be any performance drop because the write can go directly to the active overlay.

Can you share the QEMU command line you used?

As you say, it is expected that layer 0 is a bit faster, however not to this degree. My guess would be that you use the default cache mode (which includes cache.direct=off), so your results are skewed because the first requests will only write to memory (the host page cache) and only later requests will actually hit the disk.

For benchmarking, you should always use cache.direct=on (or an alias that contains it, such as cache=none).

> Another issue I hit is that I cannot set or change the cluster size of overlays. Is this possible with "virsh snapshot-create-as"?

That's a libvirt question. Peter, can you help?

> I am using qemu-system-x86_64 version 4.2.0 and virsh version 6.0.0.
>
> Thank you for any insights or advice you have.

Kevin
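As a minimal illustration of that cache setting on a QEMU command line (a sketch, not the command line actually used here; the file name and interface are placeholders, and when going through libvirt the equivalent is cache='none' on the disk's <driver> element):

# Bypass the host page cache for the data disk when benchmarking.
qemu-system-x86_64 ... -drive file=/var/lib/libvirt/images/test.qcow2,format=qcow2,if=virtio,cache=none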
qcow2 overlay performance
I have been measuring the performance of qcow2 overlays, and I am hoping to get some help in understanding the data I collected. In my experiments, I created a VM and attached a 16G qcow2 disk to it using "qemu-img create" and "virsh attach-disk". I use fio to fill it. I create some number of snapshots (overlays) using "virsh snapshot-create-as". To mimic user activity between taking snapshots, I use fio to randomly write to 10% of each overlay right after I create it. After creating the overlays, I use fio to measure random read performance and random write performance with 2 different block sizes, 4K and 64K. 64K is the qcow2 cluster size used by the 16G qcow2 disk and the overlays (verified with "qemu-img info"). fio is using the attached disk as a block device to avoid as much file system overhead as possible. The VM, 16G disk, and snapshots (overlays) all reside on local disk. Below are the measurements I collected for up to 5 overlays.

                  4K blocks                          64K blocks
olays  rd bw  rd iops  wr bw   wr iops   rd bw  rd iops  wr bw   wr iops
0      4510   1127     438028  109507    67854  1060     521808  8153
1      4692   1173     2924    731       66801  1043     104297  1629
2      4524   1131     2781    695       66801  1043     104297  1629
3      4573   1143     3034    758       65500  1023     95627   1494
4      4556   1139     2971    742       67973  1062     108099  1689
5      4471   1117     2937    734       66615  1040     98472   1538

Read performance is not affected by overlays. However, write performance drops even with a single overlay. My understanding is that writing 4K blocks requires a read-modify-write because you must fetch a complete cluster from deeper in the overlay chain before writing to the active overlay. However, this does not explain the drop in performance when writing 64K blocks. The performance drop is not as significant, but if the write block size matches the cluster size then it seems that there should not be any performance drop because the write can go directly to the active overlay.

Another issue I hit is that I cannot set or change the cluster size of overlays. Is this possible with "virsh snapshot-create-as"?

I am using qemu-system-x86_64 version 4.2.0 and virsh version 6.0.0.

Thank you for any insights or advice you have.
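For reference, the random-write data points in this setup could be gathered with an fio job along these lines (a sketch under assumptions; the guest device path, runtime, and queue depth are placeholders, not the exact invocation used):

# Random 4K writes directly against the attached block device inside the guest, bypassing the guest page cache.
fio --name=randwrite-4k --filename=/dev/vdb --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based --group_reporting

# Repeat with --bs=64k, and with --rw=randread, for the other columns.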