On 09/09/2010 01:59 AM, Avi Kivity wrote:
> On 09/08/2010 06:07 PM, Stefan Hajnoczi wrote:
>>>> uint32_t table_size; /* table size, in clusters */
>>> Presumably the L1 table size? Or any table size?
>>> Hm. It would be nicer not to require contiguous sectors anywhere. How
>>> about a variable- or fixed-height tree?
>> Both extents and fancier trees don't fit the philosophy, which is to
>> keep things straightforward and fast by doing less. With extents and
>> trees you've got something that looks much more like a full-blown
>> filesystem. Is there an essential feature or characteristic that QED
>> cannot provide in its current design?
> Not using extents means that random workloads on very large disks will
> continuously need to page in L2s (which are quite large; at 256KB you
> need to account for read time, not just seek time). Keeping it to two
> levels means that the image size is limited, which is not very good
> for an image format designed in 2010.
Define "very large disks".
My target for VM images is 100GB-1TB. Practically speaking, that at
least covers us for the next 5 years.
Since QED has rich support for features, we can continue to evolve the
format over time in a backwards-compatible way. I'd rather delay
supporting massively huge disks until we better understand the true
nature of the problem.
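
Assuming the defaults in the current draft (64KB clusters, a table_size
of 4 clusters, i.e. the 256KB tables mentioned above, and 8-byte table
entries), the back-of-the-envelope math looks like this. Treat it as a
sketch; all three knobs can change:

#include <inttypes.h>
#include <stdio.h>

/* Maximum image size for a two-level table. Assumptions (draft
 * defaults, not fixed in stone): 64KB clusters, table_size = 4
 * clusters per table, 8-byte table entries. */
int main(void)
{
    uint64_t cluster_size = 64 * 1024;
    uint64_t table_size = 4; /* clusters per L1/L2 table */
    uint64_t entries = table_size * cluster_size / sizeof(uint64_t);
    uint64_t l2_coverage = entries * cluster_size; /* data mapped by one L2 */
    uint64_t max_image = entries * l2_coverage;    /* data mapped by the L1 */

    printf("entries per table: %" PRIu64 "\n", entries);              /* 32768 */
    printf("one L2 covers: %" PRIu64 " GB\n", l2_coverage >> 30);     /* 2 */
    printf("max image size: %" PRIu64 " TB\n", max_image >> 40);      /* 64 */
    return 0;
}

So with those defaults a single L2 covers 2GB and the format tops out
at 64TB, well beyond the 1TB target, and table_size can push that
further at the cost of bigger tables.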
>>> Is the physical image size always derived from the host file
>>> metadata? Is this always safe?
>> In my email summarizing crash scenarios and recovery we cover the
>> bases, and I think it is safe to rely on file size as the physical
>> image size. The drawback is that you need a host filesystem and
>> cannot directly use a bare block device. I think that is acceptable
>> for a sparse format; otherwise we'd be using raw.
> Hm, we do have a use case for qcow2-over-lvm. I can't say it's
> something I like, but a point to consider.
We specifically are not supporting that use case in QED today, and
there's a good reason for it. For cluster allocation, we achieve good
performance because we can avoid synchronous metadata updates when
updating L2 tables (only L1 updates require them).

We get the effect of synchronous metadata updates by leveraging the
underlying filesystem's own metadata. The underlying filesystems are
much smarter about their metadata updates: they keep a journal to
delay synchronous updates and do other fancy things.
If we tried to represent the disk size in the header, we would have to
do an fsync() on every cluster allocation.
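
As a sketch of the difference (qed_alloc_cluster() is a made-up helper
here, not the actual code), cluster allocation amounts to extending the
file at EOF, so the filesystem's own size metadata records the
allocation for us:

#include <stdint.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch only: the physical image size is just the file size, so
 * allocating a cluster means "extend the file by one cluster at EOF".
 * The filesystem journals its own size metadata, so no header update
 * and no fsync() are needed here. */
static int64_t qed_alloc_cluster(int fd, uint64_t cluster_size)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    int64_t offset = st.st_size;    /* current EOF = next free cluster */
    if (ftruncate(fd, offset + cluster_size) < 0) {
        return -1;
    }
    return offset;                  /* caller writes data or an L2 here */
}

/* By contrast, if the header recorded the physical size, every
 * allocation would need a header write plus an fsync() so the size on
 * disk could never run behind the clusters it describes. */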
I can only imagine the use case for qcow2-over-LVM is performance. But
the performance of QED on a filesystem is so much better than qcow2's
that you can safely just use a filesystem and avoid the complexity of
qcow2-over-LVM.
Regards,
Anthony Liguori