On 12/23/2015 05:47 PM, Stefan Hajnoczi wrote: > On Wed, Dec 02, 2015 at 01:37:25PM +0800, Wen Congyang wrote: >> + /* >> + * Only write to active disk if the sectors have >> + * already been allocated in active disk/hidden disk. >> + */ >> + qemu_iovec_init(&hd_qiov, qiov->niov); >> + while (remaining_sectors > 0) { >> + ret = bdrv_is_allocated_above(top, base, sector_num, >> + remaining_sectors, &n); > > There is a race condition here since multiple I/O requests can be in > flight at the same time. If two requests touch the same cluster > between top->base then the result of these checks could be unreliable.
I don't think so. When we come here, primary qemu is gone, and failover is done. We only write to active disk if the sectors have already been allocated in active disk/hidden disk before failover. So it two requests touch the same cluster, it is OK, because the function bdrv_is_allocated_above()'s return value is not changed. > > The simple but slow solution is to use a CoMutex to serialize requests. > >> + if (ret < 0) { >> + return ret; >> + } >> + >> + qemu_iovec_reset(&hd_qiov); >> + qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512); >> + >> + target = ret ? top : base; >> + ret = bdrv_co_writev(target, sector_num, n, &hd_qiov); >> + if (ret < 0) { >> + return ret; >> + } >> + >> + remaining_sectors -= n; >> + sector_num += n; >> + bytes_done += n * BDRV_SECTOR_SIZE; >> + } > > I think this can be replaced with an active commit block job that copies > data down from the hidden/active disk to the secondary disk. It is okay > to keep writing to the secondary disk while the block job is running and > then switch over to the secondary disk once it completes. Yes, active commit is another choice. IIRC, I don't use it because mirror job has some problem. It is fixed now(see bdrv_drained_begin()/bdrv_drained_end() in the mirror job). We will use mirror job in the next version. > >> + >> + return 0; >> +} >> + >> +static coroutine_fn int replication_co_discard(BlockDriverState *bs, >> + int64_t sector_num, >> + int nb_sectors) >> +{ >> + BDRVReplicationState *s = bs->opaque; >> + int ret; >> + >> + ret = replication_get_io_status(s); >> + if (ret < 0) { >> + return ret; >> + } >> + >> + if (ret == 1) { >> + /* It is secondary qemu and we are after failover */ >> + ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors); > > What if the clusters are still allocated in the hidden/active disk? > What does discard do? Drop the data that allocated in the disk? If so, I think I make a misunderstand. I will fix it in the next version. Thanks Wen Congyang