Re: [Qemu-devel] [PATCH v4 19/21] block: Make bdrv_is_allocated() byte-based

Eric Blake Thu, 06 Jul 2017 09:26:25 -0700

On 07/06/2017 11:02 AM, Kevin Wolf wrote:
> Am 05.07.2017 um 23:08 hat Eric Blake geschrieben:
>> We are gradually moving away from sector-based interfaces, towards
>> byte-based.  In the common case, allocation is unlikely to ever use
>> values that are not naturally sector-aligned, but it is possible
>> that byte-based values will let us be more precise about allocation
>> at the end of an unaligned file that can do byte-based access.
>>
>> Changing the signature of the function to use int64_t *pnum ensures
>> that the compiler enforces that all callers are updated.  For now,
>> the io.c layer still assert()s that all callers are sector-aligned
>> on input and that *pnum is sector-aligned on return to the caller,
>> but that can be relaxed when a later patch implements byte-based
>> block status.  Therefore, this code adds usages like
>> DIV_ROUND_UP(,BDRV_SECTOR_SIZE) to callers that still want aligned
>> values, where the call might reasonbly give non-aligned results
>> in the future; on the other hand, no rounding is needed for callers
>> that should just continue to work with byte alignment.
>>
>> For the most part this patch is just the addition of scaling at the
>> callers followed by inverse scaling at bdrv_is_allocated().  But
>> some code, particularly bdrv_commit(), gets a lot simpler because it
>> no longer has to mess with sectors; also, it is now possible to pass
>> NULL if the caller does not care how much of the image is allocated
>> beyond the initial offset.
>>
>> For ease of review, bdrv_is_allocated_above() will be tackled
>> separately.
>>


>> @@ -1036,14 +1036,15 @@ static int coroutine_fn 
>> bdrv_aligned_preadv(BdrvChild *child,

>>
>> -        if (!ret || pnum != nb_sectors) {
>> +        if (!ret || pnum != nb_sectors << BDRV_SECTOR_BITS) {
>>              ret = bdrv_co_do_copy_on_readv(child, offset, bytes, qiov);
>>              goto out;
>>          }
> 
> This code would become a lot simpler if you just removed the local
> variables start_sector/end_sector and used the byte offsets instead that
> are available in this function. (And even simpler once sector alignment
> isn't required any more for bdrv_is_allocated(). Should we leave a
> comment here so we won't forget the conversion?)

Hmm - that cleanup is not (yet) done in my bdrv_get_block_status series
(where I finally drop the need for sector alignment in
bdrv_is_allocated).  Yes, I can add a temporary comment here (that aids
with the understanding of intermediate states), as well as a new
followup patch that actually does the cleanup.

>> -int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num,
>> -                                   int nb_sectors, int *pnum)
>> +int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
>> +                                   int64_t bytes, int64_t *pnum)
>>  {
>>      BlockDriverState *file;
>> -    int64_t ret = bdrv_get_block_status(bs, sector_num, nb_sectors, pnum,
>> -                                        &file);
>> +    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
>> +    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
>> +    int64_t ret;
>> +    int psectors;
>> +
>> +    assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
>> +    assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) &&
>> +           bytes < INT_MAX * BDRV_SECTOR_SIZE);
> 
> I think we can actually assert bytes <= INT_MAX. Both ways to write the
> assertion are only true because of BDRV_REQUEST_MAX_SECTORS, which
> ensures that the byte counts fit in an int.

That's a tighter assertion, but if it passes, I'm fine using it.

> 
> Some of the places that call bdrv_is_allocated() actually depend on this
> because your patches tend to use nb_sectors << BDRV_SECTOR_BITS (which
> truncates) rather than nb_sectors * BDRV_SECTOR_BYTES (which is always a
> 64 bit calculation).

Actually, for this series of cleanups, I tried really hard to use *
_BYTES everywhere that I widen sectors to bytes, and limit myself to >>
_BITS for going from bytes back to sectors (after also checking if I
needed DIV_ROUND_UP instead of straight division).  If you spot places
where I'm still doing a 32-bit << _BITS, please call it out (as I do
remember that I caused bugs that way back when I converted pdiscard to
byte-based, back near commit d2cb36af).


>> +++ b/block/stream.c
>> @@ -137,6 +137,7 @@ static void coroutine_fn stream_run(void *opaque)
>>
>>      for ( ; offset < s->common.len; offset += n * BDRV_SECTOR_SIZE) {
>>          bool copy;
>> +        int64_t count = 0;
>>
>>          /* Note that even when no rate limit is applied we need to yield
>>           * with no pending I/O here so that bdrv_drain_all() returns.
>> @@ -148,8 +149,8 @@ static void coroutine_fn stream_run(void *opaque)
>>
>>          copy = false;
>>
>> -        ret = bdrv_is_allocated(bs, offset / BDRV_SECTOR_SIZE,
>> -                                STREAM_BUFFER_SIZE / BDRV_SECTOR_SIZE, &n);
>> +        ret = bdrv_is_allocated(bs, offset, STREAM_BUFFER_SIZE, &count);
>> +        n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
>>          if (ret == 1) {
>>              /* Allocated in the top, no need to copy.  */
>>          } else if (ret >= 0) {
> 
> I'm not sure if I agree with rounding up here. If ret == 0, it could
> mean that we skip some data that is actually allocated. So I'd like an
> assertion better.
> 
> The last patch of the series gets rid of this by switching the whole
> function to bytes, so not that important.

A temporary assertion is fine (and you're right that the assertion goes
away once the function uses bytes everywhere).

Sounds like we're racking up enough cleanups that I'll spin a v5, but
also sounds like we're real close to getting this in.

>> @@ -274,9 +275,9 @@ static int mig_save_device_bulk(QEMUFile *f, 
>> BlkMigDevState *bmds)
>>          /* Skip unallocated sectors; intentionally treats failure as
>>           * an allocated sector */
>>          while (cur_sector < total_sectors &&
>> -               !bdrv_is_allocated(blk_bs(bb), cur_sector,
>> -                                  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>> -            cur_sector += nr_sectors;
>> +               !bdrv_is_allocated(blk_bs(bb), cur_sector * BDRV_SECTOR_SIZE,
>> +                                  MAX_IS_ALLOCATED_SEARCH, &count)) {
>> +            cur_sector += DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
>>          }
>>          aio_context_release(blk_get_aio_context(bb));
>>          qemu_mutex_unlock_iothread();
> 
> This one stays and it has the same problem. I think you need to round
> down if you don't want to accidentally skip migrating allocated data.

I haven't (yet) tackled trying to rewrite this one to do byte-base
iteration, but that may help.  I'm also worried (in general) that if
something truncates down a sub-sector result in code that is still
sector-based, then we could get stuck in an inf-loop of querying the
same sector offset over and over again instead of making progress.

But for this particular code, I see what you are saying - we are trying
to optimize the migration by skipping unallocated sectors, but if there
is nothing to skip, we still make migration progress as long as I
remember to break from the loop if I ever get a sub-sector answer (thus
migrating the whole sector, including the [unlikely] initial unallocated
portion).  Okay, I'll fix it for v5.

> 
>> diff --git a/qemu-img.c b/qemu-img.c
>> index d5a1560..5271b41 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -3229,6 +3229,7 @@ static int img_rebase(int argc, char **argv)
>>          int64_t new_backing_num_sectors = 0;
>>          uint64_t sector;
>>          int n;
>> +        int64_t count;
>>          float local_progress = 0;
>>
>>          buf_old = blk_blockalign(blk, IO_BUF_SIZE);
>> @@ -3276,12 +3277,14 @@ static int img_rebase(int argc, char **argv)
>>              }
>>
>>              /* If the cluster is allocated, we don't need to take action */
>> -            ret = bdrv_is_allocated(bs, sector, n, &n);
>> +            ret = bdrv_is_allocated(bs, sector << BDRV_SECTOR_BITS,
>> +                                    n << BDRV_SECTOR_BITS, &count);
>>              if (ret < 0) {
>>                  error_report("error while reading image metadata: %s",
>>                               strerror(-ret));
>>                  goto out;
>>              }
>> +            n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
>>              if (ret) {
>>                  continue;
>>              }
> 
> Same thing. Sectors that are half allocated and half free must be
> considered completely allocated if the caller can't do byte alignment.
> Just rounding up without looking at ret isn't correct.
> 
>> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
>> index b0ea327..e529b8f 100644
>> --- a/qemu-io-cmds.c
>> +++ b/qemu-io-cmds.c
>> @@ -1894,15 +1889,14 @@ static int map_f(BlockBackend *blk, int argc, char 
>> **argv)
>>          }
>>
>>          retstr = ret ? "    allocated" : "not allocated";
>> -        cvtstr(num << BDRV_SECTOR_BITS, s1, sizeof(s1));
>> -        cvtstr(offset << BDRV_SECTOR_BITS, s2, sizeof(s2));
>> +        cvtstr(num, s1, sizeof(s1));
>> +        cvtstr(offset, s2, sizeof(s2));
>>          printf("%s (0x%" PRIx64 ") bytes %s at offset %s (0x%" PRIx64 ")\n",
>> -               s1, num << BDRV_SECTOR_BITS, retstr,
>> -               s2, offset << BDRV_SECTOR_BITS);
>> +               s1, num, retstr, s2, offset);
>>
>>          offset += num;
>> -        nb_sectors -= num;
>> -    } while (offset < total_sectors);
>> +        bytes -= num;
>> +    };
> 
> This semicolon isn't needed.

Indeed. I converted do/while to plain while, but not quite correctly ;)


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH v4 19/21] block: Make bdrv_is_allocated() byte-based

Reply via email to