Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-22 Thread Alberto Garcia
On Wed 21 Feb 2018 05:59:58 PM CET, Eric Blake wrote:
> But as Berto has convinced me that an externally produced image can
> convince us to read up to 4M (even though we don't need that much to
> decompress),

A (harmless but funny) consequence of the way this works is that for any
valid compressed cluster you should be able to increase the value of the
size field as much as you want without causing any user-visible effect.

So if you're working with 2MB clusters but for a particular compressed
cluster the size field is 0x0006 (7 sectors) you can still increase it
to the maximum (0x1fff, or 8192 sectors) and it should work just the
same. QEMU will read 4MB instead of ~4KB but since decompression stops
once the original cluster has been restored there's no harm.
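
To see why the trailing bytes are harmless, here is a tiny standalone
program (plain zlib with raw deflate at windowBits -12, which is what the
qcow2 code uses if I recall correctly; this is not QEMU code): inflate()
returns Z_STREAM_END as soon as the real stream ends and never touches the
garbage that follows.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *text = "hello qcow2 compressed cluster";
    unsigned char compressed[256], padded[4096], out[64];
    z_stream s;
    size_t clen;
    int ret;

    /* Compress with raw deflate, roughly the parameters qcow2 uses */
    memset(&s, 0, sizeof(s));
    deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -12, 9,
                 Z_DEFAULT_STRATEGY);
    s.next_in = (unsigned char *)text;
    s.avail_in = strlen(text) + 1;
    s.next_out = compressed;
    s.avail_out = sizeof(compressed);
    deflate(&s, Z_FINISH);
    clen = sizeof(compressed) - s.avail_out;
    deflateEnd(&s);

    /* Pretend the size field was inflated: pad the stream with garbage */
    memset(padded, 0xaa, sizeof(padded));
    memcpy(padded, compressed, clen);

    /* Decompress, offering far more input than the stream needs */
    memset(&s, 0, sizeof(s));
    inflateInit2(&s, -12);
    s.next_in = padded;
    s.avail_in = sizeof(padded);
    s.next_out = out;
    s.avail_out = sizeof(out);
    ret = inflate(&s, Z_FINISH);
    printf("ret=%d (Z_STREAM_END=%d), out=\"%s\"\n", ret, Z_STREAM_END, out);
    inflateEnd(&s);
    return 0;
}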

I think I'll write a test case for this; it can be useful to verify that
QEMU can handle this kind of scenario.

Berto



Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-21 Thread Eric Blake

On 02/21/2018 11:39 AM, Kevin Wolf wrote:

See my commit message comment - we have other spots in the code base that
blindly g_malloc(2 * s->cluster_size).


Though is that a reason to do the same in new code or to phase out such
allocations whenever you touch them?


Touché.




And I intended (but sent the email without amending my commit) to use
g_malloc().  But as Berto has convinced me that an externally produced
image can convince us to read up to 4M (even though we don't need that
much to decompress), I suppose that the try_ variant plus checking is
reasonable (and care in NULL'ing out if one but not both allocations
succeed).


Sounds good.

Another thought I had is whether we should do per-request allocation for
compressed clusters, too, instead of having per-BDS buffers.


The only benefit of a per-BDS buffer is that we cache things - multiple 
sub-cluster reads in a row all from the same compressed cluster benefit 
from decompressing only once.  The drawbacks of a per-BDS buffer: we 
can't do things in parallel (everything else in qcow2 drops the lock 
around bdrv_co_pread[v]), so the initial read prevents anything else in 
the qcow2 layer from progressing.


I also wonder - since we ARE allowing multiple parallel readers in other 
parts of qcow2 (without a patch, decompression is not in this boat, but 
decryption and even bounce buffers due to lower-layer alignment 
constraints are), what sort of mechanisms do we have for using a pool of 
reusable buffers, rather than having each cluster access that needs a 
buffer do a malloc and free on a per-access basis?  I don't know how much 
overhead the per-transaction malloc/free adds, or whether it is already 
much smaller than the actual I/O time.
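
Just to illustrate the kind of thing I mean by a pool - a toy sketch with
plain pthreads and made-up names (buf_pool, pool_get, pool_put), nothing
like the shape the real QEMU code would take:

#include <pthread.h>
#include <stdlib.h>

#define POOL_SIZE 8

typedef struct {
    pthread_mutex_t lock;
    void *bufs[POOL_SIZE];
    int nfree;
    size_t buf_size;
} buf_pool;

void pool_init(buf_pool *p, size_t buf_size)
{
    pthread_mutex_init(&p->lock, NULL);
    p->nfree = 0;
    p->buf_size = buf_size;
}

/* Hand out a cached buffer if one is available, else malloc a fresh one */
void *pool_get(buf_pool *p)
{
    void *buf = NULL;

    pthread_mutex_lock(&p->lock);
    if (p->nfree > 0) {
        buf = p->bufs[--p->nfree];
    }
    pthread_mutex_unlock(&p->lock);
    return buf ? buf : malloc(p->buf_size);
}

/* Give a buffer back; keep it for reuse unless the pool is already full */
void pool_put(buf_pool *p, void *buf)
{
    pthread_mutex_lock(&p->lock);
    if (p->nfree < POOL_SIZE) {
        p->bufs[p->nfree++] = buf;
        buf = NULL;
    }
    pthread_mutex_unlock(&p->lock);
    free(buf);                        /* no-op if the pool kept the buffer */
}

int main(void)
{
    buf_pool p;
    void *a, *b;

    pool_init(&p, 2 * 1024 * 1024);   /* e.g. one 2M cluster per buffer */
    a = pool_get(&p);
    b = pool_get(&p);
    pool_put(&p, a);
    pool_put(&p, b);
    a = pool_get(&p);                 /* reuses a cached buffer, no malloc */
    pool_put(&p, a);
    return 0;
}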


But note that while reusable buffers from a pool would cut down on the 
per-I/O malloc/free overhead if we switch decompression away from the 
per-BDS buffer, they would still not preserve the caching behavior where 
multiple sub-cluster requests from the same compressed cluster need only 
one decompression, since that is only possible with a per-BDS cache.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-21 Thread Kevin Wolf
Am 21.02.2018 um 17:59 hat Eric Blake geschrieben:
> On 02/21/2018 10:51 AM, Kevin Wolf wrote:
> > Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
> > > When reading a compressed image, we were allocating s->cluster_data
> > > to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> > > with 2M clusters).  Let's check out the history:
> > > 
> 
> > > Much later, in commit de82815d (v2.2), we noticed that a 64M
> > > allocation is prone to failure, so we switched over to a graceful
> > > memory allocation error message.  But note that elsewhere in the
> > > code, we do g_malloc(2 * cluster_size) without ever checking for
> > > failure.
> > > 
> 
> > > -        }
> > > -        if (!s->cluster_cache) {
> > > -            s->cluster_cache = g_malloc(s->cluster_size);
> > > +            assert(!s->cluster_cache);
> > 
> > Wouldn't it be better to assert (!!s->cluster_cache ==
> > !!s->cluster_data) unconditionally?
> > 
> 
> Sure.
> 
> > > +            s->cluster_data = g_try_malloc(s->cluster_size);
> > 
> > Why are you going from qemu_try_blockalign() to simple malloc here? This
> > buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
> > should avoid unnecessary use of a bounce buffer.
> 
> But does bdrv_pread() actually need to use a bounce buffer if we don't have
> an aligned buffer to read into?  Either the underlying protocol already
> supports byte-aligned reads (direct into our buffer, regardless of
> alignment, no bouncing required), or it already has to do a full sector read
> into a bounce buffer anyways (and it doesn't matter whether we aligned our
> buffer).  blockalign() made sense when we had multiple clients for the
> buffer, but ever since v1.1, when we have only a single client, and that
> single client is most likely not going to read sector-aligned data in the
> first place, aligning the buffer doesn't buy us anything.

Good point.

To be honest, I don't even analyse each caller, but just consistently use
qemu_try_blockalign() whenever a buffer is used for I/O. It's a simple
rule of thumb that generally makes sense.

So as you say, in this case it's unlikely, but possible that we can
benefit from an aligned buffer. I guess my point is more about
consistency than actual functionality then. But it's okay either way.

> > 
> > > +            s->cluster_cache = g_try_malloc(s->cluster_size);
> > 
> > As you already said, either g_malloc() or check the result. I actually
> > think that g_try_malloc() and checking the result is nicer, we still
> > allocate up to 2 MB here.
> 
> See my commit message comment - we have other spots in the code base that
> blindly g_malloc(2 * s->cluster_size).

Though is that a reason to do the same in new code or to phase out such
allocations whenever you touch them?

> And I intended (but sent the email without amending my commit) to use
> g_malloc().  But as Berto has convinced me that an externally produced
> image can convince us to read up to 4M (even though we don't need that
> much to decompress), I suppose that the try_ variant plus checking is
> reasonable (and care in NULL'ing out if one but not both allocations
> succeed).

Sounds good.

Another thought I had is whether we should do per-request allocation for
compressed clusters, too, instead of having per-BDS buffers.

Kevin



Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-21 Thread Eric Blake

On 02/21/2018 10:51 AM, Kevin Wolf wrote:

Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:

When reading a compressed image, we were allocating s->cluster_data
to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
with 2M clusters).  Let's check out the history:




Much later, in commit de82815d (v2.2), we noticed that a 64M
allocation is prone to failure, so we switched over to a graceful
memory allocation error message.  But note that elsewhere in the
code, we do g_malloc(2 * cluster_size) without ever checking for
failure.




-        }
-        if (!s->cluster_cache) {
-            s->cluster_cache = g_malloc(s->cluster_size);
+            assert(!s->cluster_cache);


Wouldn't it be better to assert (!!s->cluster_cache ==
!!s->cluster_data) unconditionally?



Sure.


+            s->cluster_data = g_try_malloc(s->cluster_size);


Why are you going from qemu_try_blockalign() to simple malloc here? This
buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
should avoid unnecessary use of a bounce buffer.


But does bdrv_pread() actually need to use a bounce buffer if we don't 
have an aligned buffer to read into?  Either the underlying protocol 
already supports byte-aligned reads (direct into our buffer, regardless 
of alignment, no bouncing required), or it already has to do a full 
sector read into a bounce buffer anyways (and it doesn't matter whether 
we aligned our buffer).  blockalign() made sense when we had multiple 
clients for the buffer, but ever since v1.1, when we have only a single 
client, and that single client is most likely not going to read 
sector-aligned data in the first place, aligning the buffer doesn't buy 
us anything.





+            s->cluster_cache = g_try_malloc(s->cluster_size);


As you already said, either g_malloc() or check the result. I actually
think that g_try_malloc() and checking the result is nicer, we still
allocate up to 2 MB here.


See my commit message comment - we have other spots in the code base 
that blindly g_malloc(2 * s->cluster_size).  And I intended (but sent 
the email without amending my commit) to use g_malloc().  But as Berto 
has convinced me that an externally produced image can convince us to 
read up to 4M (even though we don't need that much to decompress), I 
suppose that the try_ variant plus checking is reasonable (and care in 
NULL'ing out if one but not both allocations succeed).
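
For illustration, the NULL'ing-out could look something like this at that 
spot in qcow2_decompress_cluster() (sketch only, not the respun patch):

    if (!s->cluster_data) {
        assert(!s->cluster_cache);
        s->cluster_data = g_try_malloc(s->cluster_size);
        s->cluster_cache = g_try_malloc(s->cluster_size);
        if (!s->cluster_data || !s->cluster_cache) {
            /* don't leave a half-initialized pair behind */
            g_free(s->cluster_data);
            g_free(s->cluster_cache);
            s->cluster_data = NULL;
            s->cluster_cache = NULL;
            return -ENOMEM;
        }
    }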


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-21 Thread Kevin Wolf
Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
> When reading a compressed image, we were allocating s->cluster_data
> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> with 2M clusters).  Let's check out the history:
> 
> Back when qcow2 was first written, we used s->cluster_data for
> everything, including copy_sectors() and encryption, where we want
> to operate on more than one cluster at once.  Obviously, at that
> point, the buffer had to be aligned for other users, even though
> compression itself doesn't require any alignment.
> 
> But commit 1b9f1491 (v1.1!) changed things to allocate parallel
> buffers on demand rather than sharing a single buffer, for encryption
> and COW, leaving compression as the final client of s->cluster_data.
> That use was still preserved, because if a single compressed cluster
> is read more than once, we reuse the cache instead of decompressing
> it a second time (I'm not sure how often this optimization actually
> fires, or if it penalizes us from being able to decompress multiple
> clusters in parallel even though we can now decrypt clusters in
> parallel; the XXX comment in qcow2_co_preadv for
> QCOW2_CLUSTER_COMPRESSED is telling).
> 
> Much later, in commit de82815d (v2.2), we noticed that a 64M
> allocation is prone to failure, so we switched over to a graceful
> memory allocation error message.  But note that elsewhere in the
> code, we do g_malloc(2 * cluster_size) without ever checking for
> failure.
> 
> Then even later, in 3e4c7052 (2.11), we realized that allocating
> a large buffer up front for every qcow2 image is expensive, and
> switched to lazy allocation only for images that actually had
> compressed clusters.  But in the process, we never even bothered
> to check whether what we were allocating still made sense in its
> new context!
> 
> So, it's time to cut back on the waste.  A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it).  And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.
> 
> Signed-off-by: Eric Blake 

My comments feel a bit trivial compared to Berto's, but anyway:

> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 85be7d5e340..8c4b26ceaf2 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1603,15 +1603,9 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
>           * are freed in .bdrv_close().
>           */
>          if (!s->cluster_data) {
> -            /* one more sector for decompressed data alignment */
> -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
> -                QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> -            if (!s->cluster_data) {
> -                return -ENOMEM;
> -            }
> -        }
> -        if (!s->cluster_cache) {
> -            s->cluster_cache = g_malloc(s->cluster_size);
> +            assert(!s->cluster_cache);

Wouldn't it be better to assert (!!s->cluster_cache ==
!!s->cluster_data) unconditionally?

> +            s->cluster_data = g_try_malloc(s->cluster_size);

Why are you going from qemu_try_blockalign() to simple malloc here? This
buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
should avoid unnecessary use of a bounce buffer.

> +            s->cluster_cache = g_try_malloc(s->cluster_size);

As you already said, either g_malloc() or check the result. I actually
think that g_try_malloc() and checking the result is nicer, we still
allocate up to 2 MB here.

Kevin



Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-21 Thread Alberto Garcia
On Wed 21 Feb 2018 04:00:54 PM CET, Eric Blake wrote:

>> - Solution b: the width of the 'compressed cluster size' field is
>>(cluster_bits - 8), that's (cluster_size / 256) sectors.
>
> Not true.  It is (cluster_bits - 9) or (cluster_size / 512).

It's not, it's (cluster_bits - 8), the documentation is wrong, see the
patch that I sent earlier.

Here's qcow2_do_open():

s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;

For a 64k cluster (16 bits) the width is 16 - 8 = 8 bits.

That's 2^8 = 256 sectors, and 256 * 512 = 128k, exactly two clusters.
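
A quick toy program (not QEMU code, just mirroring the shift/mask
arithmetic from qcow2_do_open() above) shows the worst case:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int cluster_bits = 16;                            /* 64k clusters */
    uint64_t cluster_size = 1ULL << cluster_bits;
    int csize_shift = 62 - (cluster_bits - 8);
    uint64_t csize_mask = (1ULL << (cluster_bits - 8)) - 1;

    /* Worst case: every bit of the size field set, compressed data
     * starting exactly on a sector boundary */
    uint64_t l2_entry = csize_mask << csize_shift;
    uint64_t coffset = l2_entry & ((1ULL << csize_shift) - 1);
    int nb_csectors = ((l2_entry >> csize_shift) & csize_mask) + 1;
    int sector_offset = coffset & 511;
    int csize = nb_csectors * 512 - sector_offset;

    printf("cluster_size=%llu max csize=%d (%.1f clusters)\n",
           (unsigned long long)cluster_size, csize,
           (double)csize / cluster_size);
    /* prints: cluster_size=65536 max csize=131072 (2.0 clusters) */
    return 0;
}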

Coming back to your patch, since we know where the compressed data
starts and we know that it's not going to be larger than cluster_size,
we can read [coffset, coffset + cluster_size) and skip the final bytes
of the last sector because we know that we don't need that data.

Or we can read as much as the size field indicates (up to twice the
cluster size) and let the decompression code handle the rest.

Berto



Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-21 Thread Eric Blake

On 02/21/2018 04:04 AM, Alberto Garcia wrote:

On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:

I was also preparing a patch to change this, but you arrived first :-)


So, it's time to cut back on the waste.  A compressed cluster
will NEVER occupy more than an uncompressed cluster (okay, gzip
DOES document that because the compression stream adds metadata,
and because of the pigeonhole principle, there are worst case
scenarios where attempts to compress will actually inflate an
image - but in those cases, we would just write the cluster
uncompressed instead of inflating it).  And as that is a smaller
amount of memory, we can get by with the simpler g_malloc.



-        if (!s->cluster_cache) {
-            s->cluster_cache = g_malloc(s->cluster_size);
+            assert(!s->cluster_cache);
+            s->cluster_data = g_try_malloc(s->cluster_size);
+            s->cluster_cache = g_try_malloc(s->cluster_size);


Shoot - I made edits that I forgot to commit before sending; I meant for 
these to be g_malloc() rather than g_try_malloc().




There are a few things here:

- QEMU won't write compressed data if the size is >= s->cluster_size
   (there's an explicit check for that in qcow2_co_pwritev_compressed())


Correct, we never cause inflation (and even if we wanted to, we can't, 
because the qcow2 format doesn't have enough bits for us to record that 
many sectors for a compressed stream that occupies more space than the 
original cluster).




- The size field of the compressed cluster descriptor *does* allow
   larger sizes, so you can't simply read csize bytes into
   s->cluster_data because you could cause a buffer overflow.


Let's step through this:

  nb_csectors = ((cluster_offset >> s->csize_shift) & s->csize_mask) + 1;

Since s->csize_mask is determined by s->cluster_size / 512, then after 
this assignment, the most nb_csectors can be is exactly s->cluster_size 
/ 512.


  sector_offset = coffset & 511;
  csize = nb_csectors * 512 - sector_offset;

And here, csize can only get smaller than the length picked by 
nb_csectors.  Therefore, csize is GUARANTEED to be <= s->cluster_size.




- Solution a: check that csize < s->cluster_size and return an error if
   it's not. However! although QEMU won't produce an image with a
   compressed cluster that is larger than the uncompressed one, the qcow2
   on-disk format in principle allows for that, so arguably we should
   accept it.


No, the qcow2 on-disk format does not have enough bits reserved for 
that.  It is impossible to store an inflated cluster, because you don't 
have enough bits to record it.


That said, we MAY have a bug, more likely to be visible in 512-byte 
clusters but possible on any size.  While the 'number sectors' field IS 
sufficient for any compressed cluster starting at offset 0 in the 
cluster, we run into issues if the starting offset is later in the 
cluster.  That is, even though the compressed length of a 512-byte 
cluster is <= 512 (or we wouldn't compress it), if we start at offset 
510, we NEED to read the next cluster to get the rest of the compressed 
stream - but at 512-byte clusters, there are 0 bits reserved for 'number 
sectors'.  Per the above calculations with the example offset of 510, 
nb_csectors would be 1 (it can't be anything else for a 512-byte cluster 
image), and csize would then be 2 bytes, which is insufficient for 
reading back enough data to reconstruct the cluster.


We probably need to clarify in the spec that if the starting offset of a 
compressed cluster falls mid-sector, then the compressed size has to be 
smaller than cluster size - (offset % 512) (this requirement is already 
there implicitly due to the field widths, but being explicit can't 
hurt).  We probably also need to fix our compression code to actually do 
the right thing, particularly for 512-byte clusters, where we are most 
likely to run into a compressed size that overflows the space available 
for nb_csectors.
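
Spelled out as a check, the constraint would be something like this (my
reading only, not spec text; the helper name is made up):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A compressed cluster starting at host offset 'coffset' must fit in what
 * the 'number sectors' field can express; per the argument above, that
 * means the compressed size has to stay below
 * cluster_size - (coffset % 512). */
bool compressed_size_ok(uint64_t coffset, size_t csize, unsigned cluster_bits)
{
    size_t cluster_size = (size_t)1 << cluster_bits;
    unsigned sector_offset = coffset & 511;

    return csize < cluster_size - sector_offset;
}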




- Solution b: the width of the 'compressed cluster size' field is
   (cluster_bits - 8), that's (cluster_size / 256) sectors.


Not true.  It is (cluster_bits - 9) or (cluster_size / 512).  Remember, 
x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61.  The 
'number sectors' field is then bits x+1 to 61 (but you can't have a 
bitfield occupying bits 62 up to 61, especially since bit 62 is the bit 
for compressed cluster).



Since the
   size of each sector is 512 bytes, the maximum possible size that the
   field can store is (cluster_size * 2) bytes. So allocate that amount
   of space for s->cluster_data, read the data as it is on disk and let
   the decompression code return an error if the data is indeed
   corrupted or it doesn't fit in the output buffer.


Again, I argue that the qcow2 spec says that the maximum size for a 
compressed cluster is cluster_size, even if it spills over a host 
cluster boundary.  But if in practice, we HAVE allowed a spillover 
beyond the 'number fields' 

Re: [Qemu-block] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

2018-02-21 Thread Alberto Garcia
On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:

I was also preparing a patch to change this, but you arrived first :-)

> So, it's time to cut back on the waste.  A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it).  And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.

> -        if (!s->cluster_cache) {
> -            s->cluster_cache = g_malloc(s->cluster_size);
> +            assert(!s->cluster_cache);
> +            s->cluster_data = g_try_malloc(s->cluster_size);
> +            s->cluster_cache = g_try_malloc(s->cluster_size);

There are a few things here:

- QEMU won't write compressed data if the size is >= s->cluster_size
  (there's an explicit check for that in qcow2_co_pwritev_compressed())

- The size field of the compressed cluster descriptor *does* allow
  larger sizes, so you can't simply read csize bytes into
  s->cluster_data because you could cause a buffer overflow.

- Solution a: check that csize < s->cluster_size and return an error if
  it's not. However! although QEMU won't produce an image with a
  compressed cluster that is larger than the uncompressed one, the qcow2
  on-disk format in principle allows for that, so arguably we should
  accept it.

- Solution b: the width of the 'compressed cluster size' field is
  (cluster_bits - 8), that's (cluster_size / 256) sectors. Since the
  size of each sector is 512 bytes, the maximum possible size that the
  field can store is (cluster_size * 2) bytes. So allocate that amount
  of space for s->cluster_data, read the data as it is on disk and let
  the decompression code return an error if the data is indeed
  corrupted or it doesn't fit in the output buffer.

Berto