Re: qcow2 overlay performance

2020-10-22 Thread Yoonho Park
I am still seeing the performance degradation, but I did find something
interesting (and promising) with qemu 5.1.50. Enabling subcluster allocation
(extended_l2=on) in qemu 5.1.50 eliminates the performance degradation of
adding an overlay. Without subcluster allocation enabled, I still see the
performance degradation when adding an overlay. For these experiments, I used
64K blocks and a 2M qcow2 cluster size.
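
For reference, extended_l2 is set at image creation time; the images for these
runs were created roughly along these lines (a sketch; paths and size are
placeholders):

qemu-img create -f qcow2 -o cluster_size=2M,extended_l2=on ${dir}/${disk}.qcow2 ${size}

With extended_l2=on and a 2M cluster, each cluster is split into 32 subclusters
of 64K, so a 64K random write can allocate a single subcluster instead of
triggering a full 2M copy-on-write from the backing chain, which presumably is
why the degradation disappears.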

On Mon, Oct 19, 2020 at 12:51 PM Alberto Garcia  wrote:

> On Thu 27 Aug 2020 06:29:15 PM CEST, Yoonho Park wrote:
> > Below is the data with the cache disabled ("virsh attach-disk ... --cache
> > none"). I added the previous data for reference. Overall, random read
> > performance was not affected significantly. This makes sense because a
> > cache is probably not going to help random read performance much. BTW, how
> > big is the cache by default? Random write performance for 4K blocks seems
> > more "sane" now. Random write performance for 64K blocks is interesting
> > because base image (0 overlay) performance is 2X slower than 1-5
> overlays.
> > We believe this is because the random writes to an overlay actually turn
> > into sequential writes (appends to the overlay). Does this make sense?
> >
> >
> > NO CACHE
> >
> >                   4K blocks                          64K blocks
> > olays   rd bw  rd iops   wr bw  wr iops    rd bw  rd iops   wr bw  wr iops
> >     0    4478     1119    4684     1171    57001      890   42050      657
> >     1    4490     1122    2503      625    56656      885   93483     1460
>
> I haven't been able to reproduce this (I tried the scenarios with 0 and
> 1 overlays), did you figure out anything new or what's the situation?
>
> Berto
>


Overlay limit bug

2020-10-12 Thread Yoonho Park
I stumbled on a bug in qemu 4.2.0 (virsh 6.0.0) with a large number of
overlays. I am using "qemu-img create" and "virsh snapshot-create-as" to
create each overlay. When I run "virsh snapshot-create-as" for the 42nd
overlay, I get "error: No complete monitor response found in 10485760
bytes: Numerical result out of range". However, I pulled down qemu 5.1.50
(still using virsh 6.0.0), and it looks like the problem has disappeared,
which is great. Does anyone know the patch set that addressed this bug?
Also, does anyone know the "official" limit on the number of overlays that
can be created, and is there a qemu test that exercises this? I could not
find an overlay limit test in tests/qemu-iotests.
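
For reference, the overlays are created in a loop roughly like the following
(names, the count, and the exact snapshot flags are illustrative):

top=${dir}/base.qcow2
for i in $(seq 1 50); do
    qemu-img create -f qcow2 -b ${top} -F qcow2 ${dir}/overlay${i}.qcow2
    virsh snapshot-create-as ${domain} snap${i} --disk-only \
        --diskspec ${disk},file=${dir}/overlay${i}.qcow2 --reuse-external --no-metadata
    top=${dir}/overlay${i}.qcow2
done

The error shows up on the 42nd iteration of this loop.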


Attached disk blockpull

2020-09-01 Thread Yoonho Park
I am trying to perform a blockpull on an attached disk, but I cannot do
this when the disk is attached with "virsh attach-disk" using a new, empty
overlay file created with "qemu-img create". The VM does not pick up the
backing_file path from the qcow2 overlay file, and there seems to be no way
to specify the backing path through the attach-disk command itself. Instead,
the backing path has to be specified in the XML file used to create the VM,
which precludes
attaching a new disk to a running VM. Is this a bug? Is it possible to
attach a disk to a running VM, and specify its backing path, using qemu
directly? This is with qemu 4.2.0 and libvirt 6.0.0.
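
For concreteness, the only place I have found where the backing chain can be
expressed is disk XML along these lines (a sketch; device name and paths are
placeholders), which as far as I can tell has to be part of the domain
definition rather than something attach-disk can take:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/images/overlay.qcow2'/>
  <backingStore type='file'>
    <format type='qcow2'/>
    <source file='/images/base.qcow2'/>
  </backingStore>
  <target dev='vdb' bus='virtio'/>
</disk>

Whether passing this XML to "virsh attach-device" on a running VM would be
honored by libvirt 6.0.0 is exactly what I have not been able to confirm.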


Re: qcow2 overlay performance

2020-08-27 Thread Yoonho Park
Below is the data with the cache disabled ("virsh attach-disk ... --cache
none"). I added the previous data for reference. Overall, random read
performance was not affected significantly. This makes sense because a
cache is probably not going to help random read performance much. BTW, how
big is the cache by default? Random write performance for 4K blocks seems
more "sane" now. Random write performance for 64K blocks is interesting
because base image (0 overlay) performance is 2X slower than 1-5 overlays.
We believe this is because the random writes to an overlay actually turn
into sequential writes (appends to the overlay). Does this make sense?


NO CACHE

                  4K blocks                          64K blocks
olays   rd bw  rd iops   wr bw  wr iops    rd bw  rd iops   wr bw  wr iops
    0    4478     1119    4684     1171    57001      890   42050      657
    1    4490     1122    2503      625    56656      885   93483     1460
    2    4385     1096    2425      606    56055      875   94445     1475
    3    4334     1083    2307      576    55422      865   95826     1497
    4    4356     1089    2168      542    56070      876   95957     1499
    5    4234     1058    2308      577    54039      844   92936     1452

(bw in KiB/s)


DEFAULT CACHE (WRITEBACK)

                  4K blocks                            64K blocks
olays   rd bw  rd iops    wr bw  wr iops    rd bw  rd iops    wr bw  wr iops
    0    4510     1127   438028   109507    67854     1060   521808     8153
    1    4692     1173     2924      731    66801     1043   104297     1629
    2    4524     1131     2781      695    66801     1043   104297     1629
    3    4573     1143     3034      758    65500     1023    95627     1494
    4    4556     1139     2971      742    67973     1062   108099     1689
    5    4471     1117     2937      734    66615     1040    98472     1538

(bw in KiB/s)
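
Regarding the append theory above, one way to check it (a sketch; the path is a
placeholder) would be to compare guest offsets with host offsets in the
overlay:

qemu-img map ${dir}/overlay1.qcow2

If randomly written guest offsets end up mapped to monotonically increasing
host offsets in the overlay file, that would support the idea that the random
writes are effectively appends.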

On Wed, Aug 26, 2020 at 9:18 AM Kevin Wolf  wrote:

> On 26.08.2020 at 02:46, Yoonho Park wrote:
> > I have been measuring the performance of qcow2 overlays, and I am hoping
> to
> > get some help in understanding the data I collected. In my experiments, I
> > created a VM and attached a 16G qcow2 disk to it using "qemu-img create"
> > and "virsh attach-disk". I use fio to fill it. I create some number of
> > snapshots (overlays) using "virsh snapshot-create-as". To mimic user
> > activity between taking snapshots, I use fio to randomly write to 10% of
> > each overlay right after I create it. After creating the overlays, I use
> > fio to measure random read performance and random write performance with
> 2
> > different block sizes, 4K and 64K. 64K is the qcow2 cluster size used by
> > the 16G qcow2 disk and the overlays (verified with "qemu-img info"). fio
> is
> > using the attached disk as a block device to avoid as much file system
> > overhead as possible. The VM, 16G disk, and snapshots (overlays) all
> reside
> > on local disk. Below are the measurements I collected for up to 5
> overlays.
> >
> >
> >                   4K blocks                            64K blocks
> > olays   rd bw  rd iops    wr bw  wr iops    rd bw  rd iops    wr bw  wr iops
> >     0    4510     1127   438028   109507    67854     1060   521808     8153
> >     1    4692     1173     2924      731    66801     1043   104297     1629
> >     2    4524     1131     2781      695    66801     1043   104297     1629
> >     3    4573     1143     3034      758    65500     1023    95627     1494
> >     4    4556     1139     2971      742    67973     1062   108099     1689
> >     5    4471     1117     2937      734    66615     1040    98472     1538
> >
> >
> > Read performance is not affected by overlays. However, write performance
> > drops even with a single overlay. My understanding is that writing 4K
> > blocks requires a read-modify-write because you must fetch a complete
> > cluster from deeper in the overlay chain before writing to the active
> > overlay. However, this does not explain the drop in performance when
> > writing 64K blocks. The performance drop is not as significant, but if
> the
> > write block size matches the cluster size then it seems that there should
> > not be any performance drop because the write can go directly to the
> active
> > overlay.
>
> Can you share the QEMU command line you used?
>
> As you say, it is expected that layer 0 is a bit faster, however not to
> this degree. My guess would be that you use the default cache mode
> (which includes cache.direct=off), so your results are skewed because
> the first requests will only write to memory (the host page cache) and
> only later requests will actually hit the disk.
>
> For benchmarking, you should always use cache.direct=on (or an alias
> that contains it, such as cache=none).
>
> > Another issue I hit is that I cannot set or change the cluster size of
> > overlays. Is this possible with "virsh snapshot-create-as"?
>
> That's a libvirt question. Peter, can you help?
>
> > I am using qemu-system-x86_64 version 4.2.0 and virsh version 6.0.0.
> >
> >
> > Thank you for any insights or advice you have.
>
> Kevin
>
>


Re: qcow2 overlay performance

2020-08-26 Thread Yoonho Park
I used strace to collect the writes, and as far as I can tell they are
aligned to the cluster size (64K). Below are some examples...

1520414 pwritev(35,
[{iov_base="\5\277z\314\24\305\177\r\340\327\f:e\222\10\33\374\232Q;FuN\t\0\0\325\275\0\0\0\0"...,
iov_len=4096},
{iov_base="v\24\324\337\347\364i\26\216\202\277\361\262N\23\22Q\360\360\234\366\360J\1\0\20\325\275\0\0\0\0"...,
iov_len=61440}], 2, 326696960 
1520424 pwrite64(35,
"\200\263\303V)\2779Zpv\30Q\203\2142\24\316\16\353.\4\251T\35\331av\376\252}J\33"...,
65536, 381353984 
1520424 pwrite64(35,
"\0\0\333\204\0\0\0\0\nT\0a\17JR\v\201\n\33\302\301\230(\16P\341\2\263\n2\253\22"...,
65536, 381419520 
1520419 pwrite64(35,
"\220\255\361.\v$C1\2625\26\217\10\315\22\17\266\306\t\367\25\22\274\16\3268\206\333\275\206F\22"...,
65536, 381485056 
1520424 pwrite64(35,
"Z\224\265`\325t0\2\0\0005\271\1\0\0\0Q\266\367\"\323:\247\f\0\0\205\207\1\0\0\0"...,
65536, 381550592 
1520419 pwritev(35,
[{iov_base="M=Fsw\234pd\0\09\32\1\0\0\0\365\310#V\\\303\356\25\36\371I\246Y\255\202\6"...,
iov_len=61440},
{iov_base="\314\242k\370\345;w\22\0\3609\32\1\0\0\0\213n\210#\225\206)\31\321\215+m\16\237\347\36"...,
iov_len=4096}], 2, 381616128 
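
(As a sanity check, the offsets above are all multiples of the 64K cluster
size, e.g. 326696960 = 4985 * 65536 and 381353984 = 5819 * 65536.)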

On Wed, Aug 26, 2020 at 2:00 PM Alberto Garcia  wrote:

> On Wed 26 Aug 2020 03:18:32 PM CEST, Kevin Wolf wrote:
> >> My understanding is that writing 4K blocks requires a
> >> read-modify-write because you must fetch a complete cluster from
> >> deeper in the overlay chain before writing to the active
> >> overlay. However, this does not explain the drop in performance when
> >> writing 64K blocks.
>
> Your understanding is correct. Apart from what Kevin wrote, I assume
> that the write requests are always aligned to the cluster size, is that
> right?
>
> Berto
>


Re: qcow2 overlay performance

2020-08-26 Thread Yoonho Park
Great. I will give your patches a try. Also, your workaround suggestion and
a discussion with a colleague gave me another idea for an experiment. Is it
possible that some of the overhead I am seeing is from the operations
necessary to increase the size of the overlay? Would it make sense to use
the workaround to create a "full-sized" overlay and then measure fio
performance?

On Wed, Aug 26, 2020 at 10:57 AM Peter Krempa  wrote:

> On Wed, Aug 26, 2020 at 15:30:03 +0200, Peter Krempa wrote:
> > On Wed, Aug 26, 2020 at 15:18:32 +0200, Kevin Wolf wrote:
> > > On 26.08.2020 at 02:46, Yoonho Park wrote:
> > > > Another issue I hit is that I cannot set or change the cluster size
> of
> > > > overlays. Is this possible with "virsh snapshot-create-as"?
> > >
> > > That's a libvirt question. Peter, can you help?
>
> [...]
>
> > If you figure out that cluster size matters a lot and qemu will not want
> > to change the default, please file a feature request for libvirt.
>
> After sending the previous mail I thought about it for a bit and IMO a
> majority of cases would be fixed by using the cluster size of the
> original image in the newly created image. I've created patches which do
> this with libvirt. You can get them here:
>
> https://www.redhat.com/archives/libvir-list/2020-August/msg00995.html
>
>


Re: qcow2 overlay performance

2020-08-26 Thread Yoonho Park
I create the attached disk with the following commands. The "qemu-img info"
is to double check the cluster size. I am running the same experiments now
with "--cache none" attached to the "virsh attach-disk". Is this sufficient
to avoid the host page cache?

qemu-img create -f qcow2 ${dir}/${disk}.qcow2 ${size} -o
preallocation=full,cluster_size=${cs}
qemu-img info ${dir}/${disk}.qcow2
virsh attach-disk --domain ${domain} ${dir}/${disk}.qcow2 --target ${disk}
--persistent --config --live
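
For the cache-disabled runs, the attach step becomes the following (assuming
--cache none is the right way to get cache=none, i.e. cache.direct=on, through
virsh):

virsh attach-disk --domain ${domain} ${dir}/${disk}.qcow2 --target ${disk} \
    --cache none --persistent --config --live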

On Wed, Aug 26, 2020 at 9:18 AM Kevin Wolf  wrote:

> On 26.08.2020 at 02:46, Yoonho Park wrote:
> > I have been measuring the performance of qcow2 overlays, and I am hoping
> to
> > get some help in understanding the data I collected. In my experiments, I
> > created a VM and attached a 16G qcow2 disk to it using "qemu-img create"
> > and "virsh attach-disk". I use fio to fill it. I create some number of
> > snapshots (overlays) using "virsh snapshot-create-as". To mimic user
> > activity between taking snapshots, I use fio to randomly write to 10% of
> > each overlay right after I create it. After creating the overlays, I use
> > fio to measure random read performance and random write performance with
> 2
> > different block sizes, 4K and 64K. 64K is the qcow2 cluster size used by
> > the 16G qcow2 disk and the overlays (verified with "qemu-img info"). fio
> is
> > using the attached disk as a block device to avoid as much file system
> > overhead as possible. The VM, 16G disk, and snapshots (overlays) all
> reside
> > on local disk. Below are the measurements I collected for up to 5
> overlays.
> >
> >
> >                   4K blocks                            64K blocks
> > olays   rd bw  rd iops    wr bw  wr iops    rd bw  rd iops    wr bw  wr iops
> >     0    4510     1127   438028   109507    67854     1060   521808     8153
> >     1    4692     1173     2924      731    66801     1043   104297     1629
> >     2    4524     1131     2781      695    66801     1043   104297     1629
> >     3    4573     1143     3034      758    65500     1023    95627     1494
> >     4    4556     1139     2971      742    67973     1062   108099     1689
> >     5    4471     1117     2937      734    66615     1040    98472     1538
> >
> >
> > Read performance is not affected by overlays. However, write performance
> > drops even with a single overlay. My understanding is that writing 4K
> > blocks requires a read-modify-write because you must fetch a complete
> > cluster from deeper in the overlay chain before writing to the active
> > overlay. However, this does not explain the drop in performance when
> > writing 64K blocks. The performance drop is not as significant, but if
> the
> > write block size matches the cluster size then it seems that there should
> > not be any performance drop because the write can go directly to the
> active
> > overlay.
>
> Can you share the QEMU command line you used?
>
> As you say, it is expected that layer 0 is a bit faster, however not to
> this degree. My guess would be that you use the default cache mode
> (which includes cache.direct=off), so your results are skewed because
> the first requests will only write to memory (the host page cache) and
> only later requests will actually hit the disk.
>
> For benchmarking, you should always use cache.direct=on (or an alias
> that contains it, such as cache=none).
>
> > Another issue I hit is that I cannot set or change the cluster size of
> > overlays. Is this possible with "virsh snapshot-create-as"?
>
> That's a libvirt question. Peter, can you help?
>
> > I am using qemu-system-x86_64 version 4.2.0 and virsh version 6.0.0.
> >
> >
> > Thank you for any insights or advice you have.
>
> Kevin
>
>


qcow2 overlay performance

2020-08-25 Thread Yoonho Park
I have been measuring the performance of qcow2 overlays, and I am hoping to
get some help in understanding the data I collected. In my experiments, I
created a VM and attached a 16G qcow2 disk to it using "qemu-img create"
and "virsh attach-disk". I use fio to fill it. I create some number of
snapshots (overlays) using "virsh snapshot-create-as". To mimic user
activity between taking snapshots, I use fio to randomly write to 10% of
each overlay right after I create it. After creating the overlays, I use
fio to measure random read performance and random write performance with two
different block sizes, 4K and 64K. 64K is the qcow2 cluster size used by
the 16G qcow2 disk and the overlays (verified with "qemu-img info"). fio is
using the attached disk as a block device to avoid as much file system
overhead as possible. The VM, 16G disk, and snapshots (overlays) all reside
on local disk. Below are the measurements I collected for up to 5 overlays.


                  4K blocks                            64K blocks
olays   rd bw  rd iops    wr bw  wr iops    rd bw  rd iops    wr bw  wr iops
    0    4510     1127   438028   109507    67854     1060   521808     8153
    1    4692     1173     2924      731    66801     1043   104297     1629
    2    4524     1131     2781      695    66801     1043   104297     1629
    3    4573     1143     3034      758    65500     1023    95627     1494
    4    4556     1139     2971      742    67973     1062   108099     1689
    5    4471     1117     2937      734    66615     1040    98472     1538

(bw in KiB/s)


Read performance is not affected by overlays. However, write performance
drops even with a single overlay. My understanding is that writing 4K
blocks requires a read-modify-write because you must fetch a complete
cluster from deeper in the overlay chain before writing to the active
overlay. However, this does not explain the drop in performance when
writing 64K blocks. The performance drop is not as significant, but if the
write block size matches the cluster size, then it seems there should not
be any performance drop, because the write can go directly to the active
overlay.
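
For reference, the random-write measurements use fio invocations roughly like
this (a sketch; the guest device path, iodepth, runtime, and the direct flag
shown here are illustrative):

fio --name=randwrite-64k --filename=/dev/vdb --rw=randwrite --bs=64k \
    --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based \
    --group_reporting

The 4K runs and the random-read runs only change --bs and --rw.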


Another issue I hit is that I cannot set or change the cluster size of
overlays. Is this possible with "virsh snapshot-create-as"?


I am using qemu-system-x86_64 version 4.2.0 and virsh version 6.0.0.


Thank you for any insights or advice you have.