On Mon, Jun 22, 2020 at 12:47 PM Max Reitz <mre...@redhat.com> wrote:
>
> On 22.06.20 00:25, Nir Soffer wrote:
> > On Fri, Jun 19, 2020 at 1:40 PM Max Reitz <mre...@redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> As discussed here:
> >>
> >> https://lists.nongnu.org/archive/html/qemu-block/2020-02/msg00644.html
> >> https://lists.nongnu.org/archive/html/qemu-block/2020-04/msg00329.html
> >> https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00240.html
> >>
> >> I think that qcow2 images with data-file-raw should always have
> >> preallocated 1:1 L1/L2 tables, so that the image always looks the same
> >> whether you respect or ignore the qcow2 metadata.
> >
> > I don't know the internals of qcow2 data_file, but are we really using
> > qcow2 metadata when accessing the data file?
>
> Yes.
>
> > This may have unwanted performance consequences.
>
> I don’t think so, because in practice normal lookups of L1/L2 mappings
> generally don’t cost that much performance.
>
> > If I understand correctly, qcow2 metadata is needed only for keeping
> > bitmaps (or maybe future extensions) for raw data file, and reading
> > from the qcow2 image should be read directly from the raw file without
> > any extra work.
> >
> > Writing to the data file should also bypass the qcow2 metadata, since
> > the bitmap is updated in memory.
>
> Well, with this series, writing would no longer update the metadata at
> least, because it would always be preallocated already.
>
> >> The easiest way to
> >> achieve that is to enforce at least metadata preallocation whenever
> >> data-file-raw is given.
> >
> > But preallocation is not free, even on file systems, it can be even
> > slow (NFS < 4.2).
>
> Metadata preallocation with an external data file should be the same
> speed on every file system. We only need to create the metadata
> structures, which, with the default cluster size (64k) take up a bit
> more than 1/8192 of the full image size.
>
> Sure, it’s not free. But if we decide we should indeed fully ignore the
> L1/L2 tables for data-file-raw images, the qcow2 spec must be amended.
> As I can read it, it currently doesn’t say so.
>
> (By the way, this is not a trivial change. Right now, data-file-raw is
> an autoclear flag: If a version of qemu that doesn’t support it accesses
> the image, it will automatically clear the flag, but the image stays
> valid. If we decide to completely ignore the L1/L2 tables (i.e. not
> even create them), then this can no longer be an autoclear flag. We’d
> need a new incompatible flag. (Because without L1/L2 tables, the image
> becomes useless to older qemu versions.))
>
> > With block storage this means you need to allocate the entire image
> > size on storage for writing the metadata.
> >
> > While oVirt does not use qcow2 with data_file, having preallocated qcow2
> > will make this very hard to use, for example for 500 GiB disk we will
> > have to allocate 500 GiB disk for the raw data file and 500 GiB disk
> > for the qcow2 metadata disk which will be 99% unused.
>
> I don’t understand this. When you use an external data file, the qcow2
> file will only contain the metadata:
>
> $ qemu-img create -f qcow2 \
>     -o data_file=foo.data,data_file_raw=on,preallocation=metadata \
>     foo.qcow2 8G
> Formatting 'foo.qcow2', fmt=qcow2 size=8589934592 data_file=foo.data
> data_file_raw=on cluster_size=65536 preallocation=metadata
> lazy_refcounts=off refcount_bits=16
> $ ls -l foo.qcow2
> ... 1310720 ... foo.qcow2
> $ ls -l foo.data
> ... 8589934592 ... foo.data

When allocating metadata in regular qcow2, we need to allocate the entire
device (+ extra space for metadata overhead):

# qemu-img create -f qcow2 -o preallocation=metadata foo.qcow2 500g
Formatting 'foo.qcow2', fmt=qcow2 size=536870912000 cluster_size=65536 preallocation=metadata lazy_refcounts=off refcount_bits=16

# qemu-img check foo.qcow2
No errors were found on the image.
8192000/8192000 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 536953094144

But I see that with an external data file we allocate much less:

# qemu-img create -f qcow2 -o data_file=foo.data,data_file_raw=on,preallocation=metadata foo.qcow2 500g
Formatting 'foo.qcow2', fmt=qcow2 size=536870912000 data_file=foo.data data_file_raw=on cluster_size=65536 preallocation=metadata lazy_refcounts=off refcount_bits=16

# qemu-img check foo.qcow2
No errors were found on the image.
8192000/8192000 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 65798144

I also tested this with a block device:

# lvcreate --size 500g --name foo.data test
  Logical volume "foo.data" created.

# lvcreate --size 128m --name foo.qcow2 test
  Logical volume "foo.qcow2" created.

# time qemu-img create -f qcow2 -o data_file=/dev/test/foo.data,data_file_raw=on,preallocation=metadata /dev/test/foo.qcow2 500g
Formatting '/dev/test/foo.qcow2', fmt=qcow2 size=536870912000 data_file=/dev/test/foo.data data_file_raw=on cluster_size=65536 preallocation=metadata lazy_refcounts=off refcount_bits=16

real    0m4.263s
user    0m0.149s
sys     0m0.387s

# qemu-img info /dev/test/foo.qcow2
image: /dev/test/foo.qcow2
file format: qcow2
virtual size: 500 GiB (536870912000 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    data file: /dev/test/foo.data
    data file raw: true
    corrupt: false

# qemu-img check /dev/test/foo.qcow2
No errors were found on the image.
8192000/8192000 = 100.00% allocated, 0.00% fragmented, 0.00% compressed clusters
Image end offset: 65798144

The overhead of about 63 MiB per 500 GiB seems reasonable, and
preallocating the metadata is not that bad.

> > I don't think that kubevirt is planning to use this either, but if
> > they decide to use this it may be a problem for them as well when
> > using block storage.
> >
> > It looks like we abuse preallocation for getting the side effect that
> > the backing file will be rejected, instead of adding the validation
> > rejecting backing file in this case.
>
> That isn’t the case.
>
> I want to use preallocation because I interpret the spec such that it
> requires metadata preallocation. It says when accessing a qcow2 file
> with data-file-raw, you can ignore the L1/L2 tables. To me, that means
> that the L1/L2 tables must give a 1:1 mapping so that you get the same
> result whether you interpret them or not.

I agree that this is reasonable, and we will be able to use this if we
need to. Not having to allocate metadata at all and never using the 1:1
mapping would be even better.

Nir
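
P.S. A rough sketch of where the "a bit more than 1/8192" figure and the
end offsets above seem to come from. It assumes 8-byte L2 entries and one
cluster per L2 table; the function name and constants are made up for
illustration and are not taken from qemu.

```python
# Estimate the space taken by qcow2 L2 tables when metadata is preallocated.
# Assumptions (not taken from the thread): each L2 entry is 8 bytes and each
# L2 table occupies exactly one cluster.

L2_ENTRY_SIZE = 8  # bytes per L2 entry (assumed)
GiB = 1024 ** 3


def l2_tables_bytes(virtual_size, cluster_size=65536):
    """Return the bytes needed for L2 tables mapping every guest cluster."""
    clusters = -(-virtual_size // cluster_size)        # ceiling division
    entries_per_table = cluster_size // L2_ENTRY_SIZE  # 8192 with 64k clusters
    tables = -(-clusters // entries_per_table)         # ceiling division
    return tables * cluster_size


for size in (8 * GiB, 500 * GiB):
    l2 = l2_tables_bytes(size)
    print(f"{size // GiB} GiB image: {l2} bytes of L2 tables "
          f"(1/{size // l2} of the image size)")
```

For 500 GiB this gives 65536000 bytes, which is 262144 bytes (four 64k
clusters) less than the 65798144 end offset reported by qemu-img check.
For the 8G example it gives 1048576 bytes, again four clusters less than
the 1310720 bytes shown by ls. The remainder is presumably the header, the
L1 table and the refcount structures, but I did not verify that.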