Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
On Wed 21 Feb 2018 05:59:58 PM CET, Eric Blake wrote:
> But as Berto has convinced me that an externally produced image can
> convince us to read up to 4M (even though we don't need that much to
> decompress),

A (harmless but funny) consequence of the way this works is that for any
valid compressed cluster you should be able to increase the value of the
size field as much as you want without causing any user-visible effect.

So if you're working with 2MB clusters but for a particular compressed
cluster the size field is 0x0006 (7 sectors), you can still increase it
to the maximum (0x1fff, or 8192 sectors) and it should work just the
same. QEMU will read 4MB instead of ~4KB, but since decompression stops
once the original cluster has been restored there's no harm.

I think I'll write a test case for this; it can be useful to verify that
QEMU can handle this kind of scenario.

Berto
Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
On 02/21/2018 11:39 AM, Kevin Wolf wrote:
>> See my commit message comment - we have other spots in the code base
>> that blindly g_malloc(2 * s->cluster_size).
>
> Though is that a reason to do the same in new code or to phase out
> such allocations whenever you touch them?

Touché. And I intended (but sent the email without amending my commit)
to use g_malloc(). But as Berto has convinced me that an externally
produced image can convince us to read up to 4M (even though we don't
need that much to decompress), I suppose that the try_ variant plus
checking is reasonable (and care in NULL'ing out if one but not both
allocations succeed).

> Sounds good.
>
> Another thought I had is whether we should do per-request allocation
> for compressed clusters, too, instead of having per-BDS buffers.

The only benefit of a per-BDS buffer is that we cache things - multiple
sub-cluster reads in a row all from the same compressed cluster benefit
from decompressing only once.

The drawback of a per-BDS buffer: we can't do things in parallel
(everything else in qcow2 drops the lock around bdrv_co_pread[v]), so
the initial read prevents anything else in the qcow2 layer from
progressing.

I also wonder - since we ARE allowing multiple parallel readers in
other parts of qcow2 (without a patch, decompression is not in this
boat, but decryption and even bounce buffers due to lower-layer
alignment constraints are) - what sort of mechanisms do we have for
using a pool of reusable buffers, rather than having each cluster
access that requires a buffer malloc and free it on a per-access basis?
I don't know how much time the malloc/free per-transaction overhead
adds, or if it is already much smaller than the actual I/O time.
But note that while reusable buffers from a pool would cut down on the
per-I/O malloc/free overhead if we switch decompression away from a
per-BDS buffer, it would still not solve the fact that we only get the
caching ability where multiple sub-cluster requests from the same
compressed cluster require only one decompression, since that's only
possible at a per-BDS caching level.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
Am 21.02.2018 um 17:59 hat Eric Blake geschrieben:
> On 02/21/2018 10:51 AM, Kevin Wolf wrote:
> > Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
> > > When reading a compressed image, we were allocating s->cluster_data
> > > to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> > > with 2M clusters). Let's check out the history:
> > >
> > > Much later, in commit de82815d (v2.2), we noticed that a 64M
> > > allocation is prone to failure, so we switched over to a graceful
> > > memory allocation error message. But note that elsewhere in the
> > > code, we do g_malloc(2 * cluster_size) without ever checking for
> > > failure.
> > >
> > > -    }
> > > -    if (!s->cluster_cache) {
> > > -        s->cluster_cache = g_malloc(s->cluster_size);
> > > +        assert(!s->cluster_cache);
> >
> > Wouldn't it be better to assert (!!s->cluster_cache ==
> > !!s->cluster_data) unconditionally?
>
> Sure.
>
> > > +        s->cluster_data = g_try_malloc(s->cluster_size);
> >
> > Why are you going from qemu_try_blockalign() to simple malloc here?
> > This buffer is used with bdrv_read() (or bdrv_pread() after patch 1),
> > so we should avoid unnecessary use of a bounce buffer.
>
> But does bdrv_pread() actually need to use a bounce buffer if we don't
> have an aligned buffer to read into? Either the underlying protocol
> already supports byte-aligned reads (direct into our buffer, regardless
> of alignment, no bouncing required), or it already has to do a full
> sector read into a bounce buffer anyway (and it doesn't matter whether
> we aligned our buffer). blockalign() made sense when we had multiple
> clients for the buffer, but ever since v1.1, when we have only a single
> client, and that single client is most likely not going to read
> sector-aligned data in the first place, aligning the buffer doesn't buy
> us anything.

Good point. To be honest, I don't even analyse each caller, but just
consistently use qemu_try_blockalign() whenever a buffer is used for
I/O.
It's a simple rule of thumb that generally makes sense. So as you say,
in this case it's unlikely, but possible that we can benefit from an
aligned buffer. I guess my point is more about consistency than actual
functionality then. But it's okay either way.

> > > +        s->cluster_cache = g_try_malloc(s->cluster_size);
> >
> > As you already said, either g_malloc() or check the result. I
> > actually think that g_try_malloc() and checking the result is nicer,
> > we still allocate up to 2 MB here.
>
> See my commit message comment - we have other spots in the code base
> that blindly g_malloc(2 * s->cluster_size).

Though is that a reason to do the same in new code or to phase out such
allocations whenever you touch them?

> And I intended (but sent the email without amending my commit) to use
> g_malloc(). But as Berto has convinced me that an externally produced
> image can convince us to read up to 4M (even though we don't need that
> much to decompress), I suppose that the try_ variant plus checking is
> reasonable (and care in NULL'ing out if one but not both allocations
> succeed).

Sounds good.

Another thought I had is whether we should do per-request allocation
for compressed clusters, too, instead of having per-BDS buffers.

Kevin
Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
On 02/21/2018 10:51 AM, Kevin Wolf wrote:
> Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
>> When reading a compressed image, we were allocating s->cluster_data
>> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
>> with 2M clusters). Let's check out the history:
>>
>> Much later, in commit de82815d (v2.2), we noticed that a 64M
>> allocation is prone to failure, so we switched over to a graceful
>> memory allocation error message. But note that elsewhere in the
>> code, we do g_malloc(2 * cluster_size) without ever checking for
>> failure.
>>
>> -    }
>> -    if (!s->cluster_cache) {
>> -        s->cluster_cache = g_malloc(s->cluster_size);
>> +        assert(!s->cluster_cache);
>
> Wouldn't it be better to assert (!!s->cluster_cache ==
> !!s->cluster_data) unconditionally?

Sure.

>> +        s->cluster_data = g_try_malloc(s->cluster_size);
>
> Why are you going from qemu_try_blockalign() to simple malloc here?
> This buffer is used with bdrv_read() (or bdrv_pread() after patch 1),
> so we should avoid unnecessary use of a bounce buffer.

But does bdrv_pread() actually need to use a bounce buffer if we don't
have an aligned buffer to read into? Either the underlying protocol
already supports byte-aligned reads (direct into our buffer, regardless
of alignment, no bouncing required), or it already has to do a full
sector read into a bounce buffer anyway (and it doesn't matter whether
we aligned our buffer). blockalign() made sense when we had multiple
clients for the buffer, but ever since v1.1, when we have only a single
client, and that single client is most likely not going to read
sector-aligned data in the first place, aligning the buffer doesn't buy
us anything.

>> +        s->cluster_cache = g_try_malloc(s->cluster_size);
>
> As you already said, either g_malloc() or check the result. I actually
> think that g_try_malloc() and checking the result is nicer, we still
> allocate up to 2 MB here.

See my commit message comment - we have other spots in the code base
that blindly g_malloc(2 * s->cluster_size).
And I intended (but sent the email without amending my commit) to use
g_malloc(). But as Berto has convinced me that an externally produced
image can convince us to read up to 4M (even though we don't need that
much to decompress), I suppose that the try_ variant plus checking is
reasonable (and care in NULL'ing out if one but not both allocations
succeed).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
> When reading a compressed image, we were allocating s->cluster_data
> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> with 2M clusters). Let's check out the history:
>
> Back when qcow2 was first written, we used s->cluster_data for
> everything, including copy_sectors() and encryption, where we want
> to operate on more than one cluster at once. Obviously, at that
> point, the buffer had to be aligned for other users, even though
> compression itself doesn't require any alignment.
>
> But commit 1b9f1491 (v1.1!) changed things to allocate parallel
> buffers on demand rather than sharing a single buffer, for encryption
> and COW, leaving compression as the final client of s->cluster_data.
> That use was still preserved, because if a single compressed cluster
> is read more than once, we reuse the cache instead of decompressing
> it a second time (I'm not sure how often this optimization actually
> fires, or if it penalizes us from being able to decompress multiple
> clusters in parallel even though we can now decrypt clusters in
> parallel; the XXX comment in qcow2_co_preadv for
> QCOW2_CLUSTER_COMPRESSED is telling).
>
> Much later, in commit de82815d (v2.2), we noticed that a 64M
> allocation is prone to failure, so we switched over to a graceful
> memory allocation error message. But note that elsewhere in the
> code, we do g_malloc(2 * cluster_size) without ever checking for
> failure.
>
> Then even later, in 3e4c7052 (2.11), we realized that allocating
> a large buffer up front for every qcow2 image is expensive, and
> switched to lazy allocation only for images that actually had
> compressed clusters. But in the process, we never even bothered
> to check whether what we were allocating still made sense in its
> new context!
>
> So, it's time to cut back on the waste.
> A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it). And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.
>
> Signed-off-by: Eric Blake

My comments feel a bit trivial compared to Berto's, but anyway:

> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 85be7d5e340..8c4b26ceaf2 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1603,15 +1603,9 @@ int qcow2_decompress_cluster(BlockDriverState *bs,
> uint64_t cluster_offset)
>       * are freed in .bdrv_close().
>       */
>      if (!s->cluster_data) {
> -        /* one more sector for decompressed data alignment */
> -        s->cluster_data = qemu_try_blockalign(bs->file->bs,
> -            QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> -        if (!s->cluster_data) {
> -            return -ENOMEM;
> -        }
> -    }
> -    if (!s->cluster_cache) {
> -        s->cluster_cache = g_malloc(s->cluster_size);
> +        assert(!s->cluster_cache);

Wouldn't it be better to assert (!!s->cluster_cache ==
!!s->cluster_data) unconditionally?

> +        s->cluster_data = g_try_malloc(s->cluster_size);

Why are you going from qemu_try_blockalign() to simple malloc here?
This buffer is used with bdrv_read() (or bdrv_pread() after patch 1),
so we should avoid unnecessary use of a bounce buffer.

> +        s->cluster_cache = g_try_malloc(s->cluster_size);

As you already said, either g_malloc() or check the result. I actually
think that g_try_malloc() and checking the result is nicer, we still
allocate up to 2 MB here.

Kevin
Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
On Wed 21 Feb 2018 04:00:54 PM CET, Eric Blake wrote:
>> - Solution b: the width of the 'compressed cluster size' field is
>>   (cluster_bits - 8), that's (cluster_size / 256) sectors.
>
> Not true. It is (cluster_bits - 9) or (cluster_size / 512).

It's not, it's (cluster_bits - 8); the documentation is wrong, see the
patch that I sent earlier. Here's qcow2_do_open():

    s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;

For a 64k cluster (16 bits) the width is 16 - 8 = 8 bits. That's
2^8 = 256 sectors, and 256 * 512 = 128k, exactly two clusters.

Coming back to your patch: since we know where the compressed data
starts and we know that it's not going to be larger than cluster_size,
we can read [coffset, coffset + cluster_size) and skip the final bytes
of the last sector because we know that we don't need that data. Or we
can read as much as the size field indicates (up to twice the cluster
size) and let the decompression code handle the rest.

Berto
Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
On 02/21/2018 04:04 AM, Alberto Garcia wrote:
> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>
> I was also preparing a patch to change this, but you arrived first :-)
>
>> So, it's time to cut back on the waste. A compressed cluster
>> will NEVER occupy more than an uncompressed cluster (okay, gzip
>> DOES document that because the compression stream adds metadata,
>> and because of the pigeonhole principle, there are worst case
>> scenarios where attempts to compress will actually inflate an
>> image - but in those cases, we would just write the cluster
>> uncompressed instead of inflating it). And as that is a smaller
>> amount of memory, we can get by with the simpler g_malloc.
>>
>> -    if (!s->cluster_cache) {
>> -        s->cluster_cache = g_malloc(s->cluster_size);
>> +        assert(!s->cluster_cache);
>> +        s->cluster_data = g_try_malloc(s->cluster_size);
>> +        s->cluster_cache = g_try_malloc(s->cluster_size);

Shoot - I made edits that I forgot to commit before sending; I meant
for these to be g_malloc() rather than g_try_malloc().

> There's a few things here:
>
> - QEMU won't write compressed data if the size is >= s->cluster_size
>   (there's an explicit check for that in qcow2_co_pwritev_compressed())

Correct, we never cause inflation (and even if we wanted to, we can't,
because the qcow2 format doesn't have enough bits for us to record that
many sectors for a compressed stream that occupies more space than the
original cluster).

> - The size field of the compressed cluster descriptor *does* allow
>   larger sizes, so you can't simply read csize bytes into
>   s->cluster_data because you could cause a buffer overflow.

Let's step through this:

    nb_csectors = ((cluster_offset >> s->csize_shift) & s->csize_mask) + 1;

Since s->csize_mask is determined by s->cluster_size / 512, then after
this assignment, the most nb_csectors can be is exactly
s->cluster_size / 512.

    sector_offset = coffset & 511;
    csize = nb_csectors * 512 - sector_offset;

And here, csize can only get smaller than the length picked by
nb_csectors.
Therefore, csize is GUARANTEED to be <= s->cluster_size.

> - Solution a: check that csize < s->cluster_size and return an error
>   if it's not. However! although QEMU won't produce an image with a
>   compressed cluster that is larger than the uncompressed one, the
>   qcow2 on-disk format in principle allows for that, so arguably we
>   should accept it.

No, the qcow2 on-disk format does not have enough bits reserved for
that. It is impossible to store an inflated cluster, because you don't
have enough bits to record it.

That said, we MAY have a bug, more likely to be visible with 512-byte
clusters but possible at any size. While the 'number sectors' field IS
sufficient for any compressed cluster starting at offset 0 in the
cluster, we run into issues if the starting offset is later in the
cluster. That is, even though the compressed length of a 512-byte
cluster is <= 512 (or we wouldn't compress it), if we start at offset
510, we NEED to read the next cluster to get the rest of the compressed
stream - but at 512-byte clusters, there are 0 bits reserved for
'number sectors'. Per the above calculations with the example offset of
510, nb_csectors would be 1 (it can't be anything else for a 512-byte
cluster image), and csize would then be 2 bytes, which is insufficient
for reading back enough data to reconstruct the cluster.

We probably need to clarify in the spec that if the starting offset of
a compressed cluster falls mid-sector, then the compressed size has to
be smaller than cluster_size - (offset % 512) (this requirement is
already there implicitly due to the field widths, but being explicit
can't hurt). We probably also need to fix our compression code to
actually do the right thing, particularly for 512-byte clusters, where
we are most likely to run into a compressed size that overflows the
space available for nb_csectors.

> - Solution b: the width of the 'compressed cluster size' field is
>   (cluster_bits - 8), that's (cluster_size / 256) sectors.

Not true.
It is (cluster_bits - 9) or (cluster_size / 512). Remember,
x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61. The
'number sectors' field is then bits x+1 - 61 (but you can't have a
bitfield occupying bits 62 up to 61, especially since bit 62 is the bit
for compressed cluster).

> Since the size of each sector is 512 bytes, the maximum possible size
> that the field can store is (cluster_size * 2) bytes. So allocate
> that amount of space for s->cluster_data, read the data as it is on
> disk and let the decompression code return an error if the data is
> indeed corrupted or it doesn't fit in the output buffer.

Again, I argue that the qcow2 spec says that the maximum size for a
compressed cluster is cluster_size, even if it spills over a host
cluster boundary. But if in practice, we HAVE allowed a spillover
beyond the 'number fields'
Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:

I was also preparing a patch to change this, but you arrived first :-)

> So, it's time to cut back on the waste. A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it). And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.
>
> -    if (!s->cluster_cache) {
> -        s->cluster_cache = g_malloc(s->cluster_size);
> +        assert(!s->cluster_cache);
> +        s->cluster_data = g_try_malloc(s->cluster_size);
> +        s->cluster_cache = g_try_malloc(s->cluster_size);

There's a few things here:

- QEMU won't write compressed data if the size is >= s->cluster_size
  (there's an explicit check for that in qcow2_co_pwritev_compressed())

- The size field of the compressed cluster descriptor *does* allow
  larger sizes, so you can't simply read csize bytes into
  s->cluster_data because you could cause a buffer overflow.

- Solution a: check that csize < s->cluster_size and return an error if
  it's not. However! although QEMU won't produce an image with a
  compressed cluster that is larger than the uncompressed one, the
  qcow2 on-disk format in principle allows for that, so arguably we
  should accept it.

- Solution b: the width of the 'compressed cluster size' field is
  (cluster_bits - 8), that's (cluster_size / 256) sectors. Since the
  size of each sector is 512 bytes, the maximum possible size that the
  field can store is (cluster_size * 2) bytes. So allocate that amount
  of space for s->cluster_data, read the data as it is on disk and let
  the decompression code return an error if the data is indeed
  corrupted or it doesn't fit in the output buffer.

Berto