On 04/12/2017 12:55 PM, Denis V. Lunev wrote:
> Let me rephrase a bit.
>
> The proposal is looking very close to the following case:
> - raw sparse file
>
> In this case all writes are very-very-very fast and from the
> guest point of view all is OK. Sequential data is really sequential.
> Though once we are starting to perform any sequential IO, we
> have real pain. Each sequential operation becomes random
> on the host file system and the IO becomes very slow. This
> will not be observed with the test, but the performance will
> degrade very soon.
>
> This is why raw sparse files are not used in the real life.
> Hypervisor must maintain guest OS invariants and the data,
> which is nearby from the guest point of view should be kept
> nearby in host.
>
> This is why actually that 64kb data blocks are extremely
> small :) OK. This is off-topic.
Not necessarily. Using subclusters may allow you to ramp up to larger
cluster sizes. We can also set up our allocation (and pre-allocation)
schemes so that we always reserve an entire cluster on the host at the
time we allocate the cluster, even if we only plan to write to
particular subclusters within that cluster. In fact, 32 subclusters in
a 2M cluster results in 64k subclusters, where you are still writing
64k data chunks but could now have guaranteed 2M locality, compared to
the current qcow2 with 64k clusters, which also writes 64k data chunks
but with no locality guarantee. Just because we don't write the entire
cluster up front does not mean that we can't allocate (or have a mode
that allocates) the entire cluster at the time of the first subcluster
use.

> One can easily recreate this case using the following simple
> test:
> - write each even 4kb page of the disk, one by one
> - write each odd 4kb page of the disk
> - run sequential read with f.e. 1 MB data block
>
> Normally we should still have native performance, but
> with raw sparse files and (as far as understand the
> proposal) sub-clusters we will have the host IO pattern
> exactly like random.

Only if we don't pre-allocate entire clusters at the point that we
first touch the cluster.

> This seems like a big and inevitable problem of the approach
> for me. We still have the potential to improve current
> algorithms and not introduce non-compatible changes.
>
> Sorry if this is too emotional. We have learned above in a
> very hard way.

And your experience is useful as a way to fine-tune this proposal, but
it doesn't mean we should ditch the proposal entirely. I also
appreciate that you have patches in the works to reduce bottlenecks
(such as turning sub-cluster writes into 3 IOPs rather than 5, by
doing read-head, read-tail, write-cluster, instead of the current
read-head, write-head, write-body, read-tail, write-tail), but I think
the two approaches are complementary, not orthogonal.
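To make the locality argument concrete, here is a toy model (my own
illustration, not qcow2 code; all names are made up) of the even/odd
write test, modeled at 64k granularity since a 4k guest write still
allocates a whole 64k unit. It compares on-demand 64k allocation
against a hypothetical mode that reserves the full 2M cluster on the
first subcluster write, counting how many contiguous host extents a
guest-sequential read would then cross:

```python
CHUNK = 64 * 1024          # write/allocation unit (one subcluster)
CLUSTER = 2 * 1024 * 1024  # 32 subclusters per cluster
DISK = 16 * 1024 * 1024    # small 16M guest disk for the demo

chunks = DISK // CHUNK
# Denis's test: touch every even chunk first, then every odd chunk.
order = list(range(0, chunks, 2)) + list(range(1, chunks, 2))

def on_demand():
    """Each 64k unit gets the next free host offset when first written
    (the raw-sparse-file / no-reservation behavior)."""
    host = {}
    tail = 0
    for c in order:
        host[c] = tail
        tail += CHUNK
    return host

def cluster_reserving():
    """First touch of any subcluster reserves its whole 2M cluster as
    one contiguous host extent; later subcluster writes land inside it."""
    cluster_base = {}
    host = {}
    tail = 0
    for c in order:
        cl = (c * CHUNK) // CLUSTER
        if cl not in cluster_base:
            cluster_base[cl] = tail
            tail += CLUSTER
        host[c] = cluster_base[cl] + (c * CHUNK) % CLUSTER
    return host

def extents(host):
    """Contiguous host extents seen by a guest-sequential read."""
    n = 1
    for c in range(1, chunks):
        if host[c] != host[c - 1] + CHUNK:
            n += 1
    return n

print(extents(on_demand()))          # 256: every unit is a seek
print(extents(cluster_reserving()))  # 1: fully contiguous on the host
```

In this model the even/odd pattern fragments the on-demand layout into
one extent per 64k unit, while reserving whole clusters keeps the host
file contiguous, which is the "guaranteed 2M locality" claim above.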
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org