On 02/12/2015 06:26 PM, f...@redhat.com wrote:
> On Thu, 02/12 18:11, Wen Congyang wrote:
>> On 02/12/2015 05:44 PM, Fam Zheng wrote:
>>> On Thu, 02/12 17:33, Wen Congyang wrote:
>>>> On 02/12/2015 04:44 PM, Fam Zheng wrote:
>>>>> On Thu, 02/12 15:40, Wen Congyang wrote:
>>>>>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>>>>>> Hi Congyang,
>>>>>>>
>>>>>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>>>>>> +== Workflow ==
>>>>>>>> +The following is the image of block replication workflow:
>>>>>>>> +
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>>>>>>> +        +----------------------+            +------------------------+
>>>>>>>> +                  |                                      |
>>>>>>>> +                  |                                     (4)
>>>>>>>> +                  |                                      V
>>>>>>>> +                  |                               /-------------\
>>>>>>>> +                  |      Copy and Forward         |             |
>>>>>>>> +                  |---------(1)----------+        | Disk Buffer |
>>>>>>>> +                  |                      |        |             |
>>>>>>>> +                  |                     (3)       \-------------/
>>>>>>>> +                  |                 speculative          ^
>>>>>>>> +                  |                write through        (2)
>>>>>>>> +                  |                      |               |
>>>>>>>> +                  V                      V               |
>>>>>>>> +           +--------------+             +----------------+
>>>>>>>> +           | Primary Disk |             | Secondary Disk |
>>>>>>>> +           +--------------+             +----------------+
>>>>>>>> +
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>>>
>>>>>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>>>>>> reading them as "s/will be/are/g"
>>>>>>>
>>>>>>> Why do you need this buffer?
>>>>>>
>>>>>> We only sync the disk till next checkpoint. Before next checkpoint, secondary
>>>>>> vm write to the buffer.
>>>>>>
>>>>>>>
>>>>>>> If both primary and secondary write to the same sector, what is saved in the
>>>>>>> buffer?
>>>>>>
>>>>>> The primary content will be written to the secondary disk, and the secondary
>>>>>> content is saved in the buffer.
>>>>>
>>>>> I wonder if alternatively this is possible with an imaginary "writable backing
>>>>> image" feature, as described below.
>>>>>
>>>>> When we have a normal backing chain,
>>>>>
>>>>>               {virtio-blk dev 'foo'}
>>>>>                          |
>>>>>                          |
>>>>>                          |
>>>>>     [base] <- [mid] <- (foo)
>>>>>
>>>>> Where [base] and [mid] are read only, (foo) is writable. When we add an overlay
>>>>> to an existing image on top,
>>>>>
>>>>>               {virtio-blk dev 'foo'}        {virtio-blk dev 'bar'}
>>>>>                          |                             |
>>>>>                          |                             |
>>>>>                          |                             |
>>>>>     [base] <- [mid] <- (foo) <---------------------- (bar)
>>>>>
>>>>> It's important to make sure that writes to 'foo' don't break data for 'bar'.
>>>>> We can utilize an automatic hidden drive-backup target:
>>>>>
>>>>>               {virtio-blk dev 'foo'}                                    {virtio-blk dev 'bar'}
>>>>>                          |                                                         |
>>>>>                          |                                                         |
>>>>>                          v                                                         v
>>>>>
>>>>>     [base] <- [mid] <- (foo) <----------------- (hidden target) <--------------- (bar)
>>>>>
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          v                              ^
>>>>>                          >>>> drive-backup sync=none >>>>
>>>>>
>>>>> So when guest writes to 'foo', the old data is moved to (hidden target), which
>>>>> remains unchanged from (bar)'s PoV.
>>>>>
>>>>> The drive in the middle is called hidden because QEMU creates it automatically,
>>>>> the naming is arbitrary.
>>>>>
>>>>> It is interesting because it is a more generalized case of image fleecing,
>>>>> where the (hidden target) is exposed via NBD server for data scanning (read
>>>>> only) purpose.
>>>>>
>>>>> More interestingly, with above facility, it is also possible to create a guest
>>>>> visible live snapshot (disk 'bar') of an existing device (disk 'foo') very
>>>>> cheaply. Or call it shadow copy if you will.
>>>>>
>>>>> Back to the COLO case, the configuration will be very similar:
>>>>>
>>>>>
>>>>>                         {primary wr}                                               {secondary vm}
>>>>>                              |                                                            |
>>>>>                              |                                                            |
>>>>>                              |                                                            |
>>>>>                              v                                                            v
>>>>>
>>>>>     [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
>>>>>
>>>>>                              v                              ^
>>>>>                              v                              ^
>>>>>                              v                              ^
>>>>>                              v                              ^
>>>>>                              >>>> drive-backup sync=none >>>>
>>>>
>>>> What is active disk? There are two disk images?
>>>
>>> It starts as an empty image with (hidden buf disk) as backing file, which in
>>> turn has (nbd target) as backing file.
>>
>> It's too complicated..., and I don't understand it.
>> 1. What is active disk? Use raw or a new block driver?
>
> It is an empty qcow2 image with the same length as your Secondary Disk.

I tested qcow2_make_empty()'s performance. The result shows that it may take
about 100ms (on a normal SATA disk), which is not acceptable for COLO. So I
think the disk buffer is necessary (just use it to replace qcow2).

Thanks
Wen Congyang

>
>> 2. Hidden buf disk use new block driver?
>
> It is an empty qcow2 image with the same length as your Secondary Disk, too.
>
>> 3. nbd target is hidden buf disk's backing image? If it is opened read-only,
>>    we will export a nbd with read-only BlockDriverState, but nbd server needs
>>    to write it.
>
> NBD target is your Secondary Disk. It is opened read-write.
>
> The patches to enable opening it as read-write, and starting drive-backup
> between it and hidden buf disk, are all work in progress (the core concept) of
> image fleecing.
>
> Fam
>
>>>>>
>>>>> The workflow analogue is:
>>>>>
>>>>>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>>>>>> +       QEMU.
>>>>>
>>>>> Primary write requests are forwarded to secondary QEMU as well.
>>>>>
>>>>>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>>>>>> +       original sector content will be read from Secondary disk and
>>>>>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>>>>>> +       sector content in the Disk buffer.
>>>>>
>>>>> Before Primary write requests are written to (nbd target), aka the Secondary
>>>>> disk, the original sector content is read from it and copied to (hidden buf
>>>>> disk) by drive-backup. It obviously will not overwrite the data in (active
>>>>> disk).
>>>>>
>>>>>>>> +    3) Primary write requests will be written to Secondary disk.
>>>>>
>>>>> Primary write requests are written to (nbd target).
>>>>>
>>>>>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>>>>>> +       will overwrite the existing sector content in the buffer.
>>>>>
>>>>> Secondary write request will be written in (active disk) as usual.
>>>>>
>>>>> Finally, when checkpoint arrives, if you want to sync with primary, just drop
>>>>> data in (hidden buf disk) and (active disk); when failover happens, if you
>>>>> want to promote secondary vm, you can commit (active disk) to (nbd target), and
>>>>> drop data in (hidden buf disk).
>>>>>
>>>>> Fam
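
For readers following the thread, the layering proposed above can be modelled in a few lines of Python. This is only a toy sketch under stated assumptions, not QEMU code: the class `SecondaryStack`, its method names, the dict-per-layer representation and the 512-byte `ZERO` sector are all hypothetical and exist purely to illustrate the semantics of (nbd target) <- (hidden buf disk) <- (active disk) as described in the quoted mails.

```python
ZERO = b"\0" * 512  # assumed content of a never-written sector (illustrative)


class SecondaryStack:
    """Toy model of (nbd target) <- (hidden buf disk) <- (active disk)."""

    def __init__(self, secondary_disk):
        self.nbd_target = dict(secondary_disk)  # stands in for the Secondary Disk
        self.hidden_buf = {}  # old sectors saved before the primary overwrites them
        self.active = {}      # sectors written by the secondary VM

    def primary_write(self, sector, data):
        # Copy-before-write, in the spirit of drive-backup sync=none: save the
        # original content once and never overwrite an already saved sector.
        if sector not in self.hidden_buf:
            self.hidden_buf[sector] = self.nbd_target.get(sector, ZERO)
        self.nbd_target[sector] = data

    def secondary_write(self, sector, data):
        # Secondary writes only ever touch the top (active) layer.
        self.active[sector] = data

    def secondary_read(self, sector):
        # Reads fall through the backing chain from top to bottom.
        for layer in (self.active, self.hidden_buf, self.nbd_target):
            if sector in layer:
                return layer[sector]
        return ZERO

    def checkpoint(self):
        # Sync with primary: drop data in both upper layers.
        self.hidden_buf.clear()
        self.active.clear()

    def failover(self):
        # Promote the secondary: commit the active layer, drop the hidden one.
        self.nbd_target.update(self.active)
        self.active.clear()
        self.hidden_buf.clear()


if __name__ == "__main__":
    stack = SecondaryStack({0: b"old0", 1: b"old1"})
    stack.primary_write(0, b"new0")
    print(stack.secondary_read(0))  # secondary still sees the pre-write content
    stack.checkpoint()
    print(stack.secondary_read(0))  # after the checkpoint it sees the primary's data
```

In this model a checkpoint is just two dictionary clears. The cost measured above for qcow2_make_empty() concerns the real on-disk format and is not something the sketch can capture.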