On 05/22/2017 10:00 PM, Eric Blake wrote:
> On 05/19/2017 04:34 AM, Anton Nefedov wrote:
>> From: "Denis V. Lunev" <d...@openvz.org>
>>
>> Currently each single write operation can result in 3 write operations
>> if guest offsets are not cluster aligned. One write is performed for the
>> real payload and two for COW-ed areas. Thus the data possibly lays
>> non-contiguously on the host filesystem. This will reduce further
>> sequential read performance significantly.
>>
>> The patch allocates the space in the file with cluster granularity,
>> ensuring
>> 1. better host offset locality
>> 2. less space allocation operations
>>    (which can be expensive on distributed storages)
>
> s/storages/storage/
>
>> Signed-off-by: Denis V. Lunev <d...@openvz.org>
>> Signed-off-by: Anton Nefedov <anton.nefe...@virtuozzo.com>
>> ---
>>  block/qcow2.c | 32 +++++++++++++++++++++++++++++++-
>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index a8d61f0..2e6a0ec 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1575,6 +1575,32 @@ fail:
>>      return ret;
>>  }
>>
>> +static void handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    BlockDriverState *file = bs->file->bs;
>> +    QCowL2Meta *m;
>> +    int ret;
>> +
>> +    for (m = l2meta; m != NULL; m = m->next) {
>> +        uint64_t bytes = m->nb_clusters << s->cluster_bits;
>> +
>> +        if (m->cow_start.nb_bytes == 0 && m->cow_end.nb_bytes == 0) {
>> +            continue;
>> +        }
>> +
>> +        /* try to alloc host space in one chunk for better locality */
>> +        ret = file->drv->bdrv_co_pwrite_zeroes(file, m->alloc_offset,
>> +                                               bytes, 0);
>
> Are we guaranteed that this is a fast operation? (That is, it either
> results in a hole or an error, and doesn't waste time tediously writing
> actual zeroes)

This is why we call the driver directly: we expect that the fallback to
writing actual zeroes happens only in the generic function.
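To illustrate the point at the plain syscall level (a minimal sketch
only, with a made-up file name and cluster size, not the actual
block-layer code): Linux fallocate() behaves the way we want from the
driver callback here. It either allocates the whole range cheaply or
fails with EOPNOTSUPP, and never falls back to slowly writing real
zeroes:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024)        /* made-up cluster size */

int main(void)
{
    int fd = open("image.raw", O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t cluster_off = 0;              /* host offset of the new cluster */

    /* Reserve the whole cluster in one cheap operation, analogous to the
     * single driver-level write-zeroes call in handle_alloc_space().
     * This is best effort: on failure we simply fall through to plain
     * writes, exactly as the patch does when the driver returns an
     * error. */
    if (fallocate(fd, 0, cluster_off, CLUSTER_SIZE) != 0) {
        perror("fallocate (falling back to plain writes)");
    }

    /* The payload and both COW pieces now land in space that was
     * allocated contiguously above. */
    char payload[4096];
    memset(payload, 0xaa, sizeof(payload));
    if (pwrite(fd, payload, sizeof(payload), cluster_off + 8192) < 0) {
        perror("pwrite");
    }

    close(fd);
    return 0;
}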
>> +
>> +        if (ret != 0) {
>> +            continue;
>> +        }
>
> Supposing we are using a file system that doesn't support holes, then
> ret will not be zero, and we ended up not allocating anything after all.
> Is that a problem that we are just blindly continuing the loop as our
> reaction to the error?
>
> /reads further
>
> I guess not - you aren't reacting to any error call, but merely using
> the side effect that an allocation happened for speed when it worked,
> and ignoring failure (you get the old behavior of the write() now
> causing the allocation) when it didn't.

Good point.

>> +
>> +        file->total_sectors = MAX(file->total_sectors,
>> +                                  (m->alloc_offset + bytes) / BDRV_SECTOR_SIZE);
>> +    }
>> +}
>> +
>>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>                                           uint64_t bytes, QEMUIOVector *qiov,
>>                                           int flags)
>> @@ -1656,8 +1682,12 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>>          if (ret < 0) {
>>              goto fail;
>>          }
>> -
>>          qemu_co_mutex_unlock(&s->lock);
>> +
>> +        if (bs->file->bs->drv->bdrv_co_pwrite_zeroes != NULL) {
>> +            handle_alloc_space(bs, l2meta);
>> +        }
>
> Is it really a good idea to be modifying the underlying protocol image
> outside of the mutex?

Yes! This is by design, not accidental. The area is covered by the
in-flight L2 meta, so writes to the not-yet-allocated clusters are
properly serialized against it.

> At any rate, it looks like your patch is doing a best-effort write
> zeroes as an attempt to trigger consecutive allocation of the entire
> cluster in the underlying protocol right after a cluster has been
> allocated at the qcow2 format layer. Which means there are more
> syscalls now than there were previously, but now when we do three
> write() calls at offsets B, A, C, those three calls are into file space
> that was allocated earlier by the write zeroes, rather than fresh calls
> into unallocated space that is likely to trigger up to three disjoint
> allocations.
>
> As a discussion point, wouldn't we achieve the same effect of less
> fragmentation if we instead collect our data into a bounce buffer, and
> only then do a single write() (or more likely, a writev() where the iov
> is set up to reconstruct a single buffer on the syscall, but where the
> source data is still at different offsets)? We'd be avoiding the extra
> syscalls of pre-allocating the cluster, and while our write() call is
> still causing allocations, at least it is now one cluster-aligned
> write() rather than three sub-cluster out-of-order write()s.

The bigger the cluster, the more significant that difference becomes.
We actually did exactly that in the original patchset, but those changes
are a bit controversial in terms of performance: they work much better
on SSD and worse on HDD. Maybe we did something wrong, but this series
includes only the simple things that are unquestionable performance-wise.

Den
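P.S. For reference, Eric's writev() alternative would look roughly like
this at the plain syscall level (a sketch with made-up names and sizes,
not QEMU code, which would go through QEMUIOVector instead): gather the
COW head, the guest payload and the COW tail into one iovec and issue a
single cluster-aligned pwritev() instead of three sub-cluster writes:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write one full cluster with a single syscall: COW head + payload +
 * COW tail, each taken from a different buffer. */
static ssize_t write_cluster_gathered(int fd, off_t cluster_off,
                                      void *cow_head, size_t head_len,
                                      void *payload, size_t payload_len,
                                      void *cow_tail, size_t tail_len)
{
    struct iovec iov[3] = {
        { .iov_base = cow_head, .iov_len = head_len },
        { .iov_base = payload,  .iov_len = payload_len },
        { .iov_base = cow_tail, .iov_len = tail_len },
    };

    /* One cluster-aligned, in-order write instead of three out-of-order
     * sub-cluster writes; the host file system then allocates the space
     * in a single contiguous chunk. */
    return pwritev(fd, iov, 3, cluster_off);
}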