On 10/11/2015 07:14, Fam Zheng wrote: > On Mon, 11/09 17:29, Kevin Wolf wrote: >> Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben: >>> >>> >>> On 09/11/2015 17:04, Kevin Wolf wrote: >>>> Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben: >>>>> The "pnum < nb_sectors" condition in deciding whether to actually copy >>>>> data is unnecessarily strict, and the qiov initialization is >>>>> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard >>>>> branches. >>>>> >>>>> Reorganize mirror_iteration flow so that we: >>>>> >>>>> 1) Find the contiguous zero/discarded sectors with >>>>> bdrv_get_block_status_above() before deciding what to do. We query >>>>> s->buf_size sized blocks at a time. >>>>> >>>>> 2) If the sectors in question are zeroed/discarded and aligned to >>>>> target cluster, issue zero write or discard accordingly. It's done >>>>> in mirror_do_zero_or_discard, where we don't add buffer to qiov. >>>>> >>>>> 3) Otherwise, do the same loop as before in mirror_do_read. >>>>> >>>>> Signed-off-by: Fam Zheng <f...@redhat.com> >>>> >>>> I'm not sure where in the patch to comment on this, so I'll just do it >>>> here right in the beginning. >>>> >>>> I'm concerned that we need to be more careful about races in this patch, >>>> in particular regarding the bitmaps. I think the conditions for the two >>>> bitmaps are: >>>> >>>> * Dirty bitmap: We must clear the bit after finding the next piece of >>>> data to be mirrored, but before we yield after getting information >>>> that we use for the decision which kind of operation we need. >>>> >>>> In other words, we need to clear the dirty bitmap bit before calling >>>> bdrv_get_block_status_above(), because that's both the function that >>>> retrieves information about the next chunk and also a function that >>>> can yield. >>>> >>>> If after this point the data is written to, we need to mirror it >>>> again. >>> >>> With Fam's patch, that's not trivial for two reasons: >>> >>> 1) bdrv_get_block_status_above() can return a smaller amount than what >>> is asked. >>> >>> 2) the "read and write" case can handle s->granularity sectors per >>> iteration (many of them can be coalesced, but still that's how the >>> iteration works). >>> >>> The simplest solution is to perform the query with s->granularity size >>> rather than s->buf_size. >> >> Then we end up with many small operations, that's not what we want. >> >> Why can't we mark up to s->buf_size dirty clusters as clean first, then >> query the status, and mark all of those that we can't handle dirty >> again? > > Then we may end up marking more clusters as dirty than it should be.
You're both right. > Because all bdrv_set_dirty() and bdrv_set_dirty_bitmap() callers are > coroutine, > we can introduce a CoMutex to let bitmap reader block bdrv_set_dirty and > bdrv_set_dirty_bitmap. I think this is not necessary. I think the following is safe: 1) before calling bdrv_get_block_status_above(), find out how many consecutive bits in the dirty bitmap are 1 2) zero all those bits in the dirty bitmap 3) call bdrv_get_block_status_above() with a size equivalent to the number of dirty bits 4) if bdrv_get_block_status_above() only returns a partial result, loop step (3) until all the dirty bits are processed For full mirroring, this strategy will probably make the first incremental iteration more expensive. Paolo