Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-25 Thread Alberto Garcia
On Mon 25 Sep 2017 10:15:49 PM CEST, John Snow wrote:

>>- We need a proper name for these sub-tables that we are loading
>>  now. I'm actually still struggling with this :-) I can't think of
>>  any name that is clear enough and not too cumbersome to use (L2
>>  subtables? => Confusing. L3 tables? => they're not really that).
>
> L2  .
>
> "slice" might be the nicest as it's in common usage elsewhere in
> software engineering and properly intuits that it's a division /
> portion of a whole.
>
> "l2 slice" ?

Doesn't sound bad! Thanks for the suggestion.

Berto



Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-25 Thread John Snow


On 09/19/2017 11:07 AM, Alberto Garcia wrote:
> Hi everyone,

>- We need a proper name for these sub-tables that we are loading
>  now. I'm actually still struggling with this :-) I can't think of
>  any name that is clear enough and not too cumbersome to use (L2
>  subtables? => Confusing. L3 tables? => they're not really that).
> 

L2  .

"slice" might be the nicest as it's in common usage elsewhere in
software engineering and properly intuits that it's a division / portion
of a whole.

"l2 slice" ?

--js





Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-20 Thread Alberto Garcia
On Wed 20 Sep 2017 09:06:20 AM CEST, Kevin Wolf wrote:
>> |-----------+--------------+-------------+---------------+--------------|
>> | Disk size | Cluster size | L2 cache    | Standard QEMU | Patched QEMU |
>> |-----------+--------------+-------------+---------------+--------------|
>> |     16 GB |        64 KB | 1 MB [8 GB] |     5000 IOPS | 12700 IOPS   |
>> |      2 TB |         2 MB | 4 MB [1 TB] |      576 IOPS | 11000 IOPS   |
>> |-----------+--------------+-------------+---------------+--------------|
>> 
>> The improvements are clearly visible, but it's important to point out
>> a couple of things:
>> 
>>- L2 cache size is always < total L2 metadata on disk (otherwise
>>  this wouldn't make sense). Increasing the L2 cache size improves
>>  performance a lot (and makes the effect of these patches
>>  disappear), but it requires more RAM.
>
> Do you have the numbers for the two cases above if the L2 tables
> covered the whole image?

Yeah, sorry, it's around 6 IOPS in both cases (more or less what I
also get with a raw image).

>>- Doing random reads over the whole disk is probably not a very
>>  realistic scenario. During normal usage only certain areas of the
>>  disk need to be accessed, so performance should be much better
>>  with the same amount of cache.
>>- I wrote a best-case scenario test (several I/O jobs each accessing
>>  a part of the disk that requires loading its own L2 table) and my
>>  patched version is 20x faster even with 64KB clusters.
>
> I suppose you chose the scenario so that the number of jobs is larger
> than the number of cached L2 tables without the patch, but smaller than
> the number of cache entries with the patch?

Exactly, I should have made that explicit :) I had 32 jobs, each one of
them limited to a small area (32MB), so with 4K pages you only need
128KB of cache memory (vs 2MB with the current code).
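
As a rough sketch of that arithmetic (assuming 8-byte L2 entries and each
job's 32MB area landing in its own L2 table / cache entry; the helper name
is mine, not QEMU code):

L2_ENTRY_SIZE = 8  # bytes per L2 entry (qcow2 spec)

def cache_needed(jobs, area_bytes, cluster_size, cache_entry_size):
    # L2 metadata each job touches, rounded up to whole cache entries
    entries = area_bytes // cluster_size
    metadata = entries * L2_ENTRY_SIZE
    per_job = -(-metadata // cache_entry_size) * cache_entry_size
    return jobs * per_job

KB, MB = 1024, 1024**2
print(cache_needed(32, 32 * MB, 64 * KB, 4 * KB))    # 131072  -> 128KB (patched)
print(cache_needed(32, 32 * MB, 64 * KB, 64 * KB))   # 2097152 -> 2MB (current)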

> We will probably need to do some more benchmarking to find a good
> default value for the cached chunks. 4k is nice and small, so we can
> cover many parallel jobs without using too much memory. But if we have
> a single sequential job, we may end up doing the metadata updates in
> small 4k chunks instead of doing a single larger write.

Right, although a 4K table can already hold pointers to 512 data
clusters, so even if you do sequential I/O you don't need to update the
metadata so often, do you?

I guess the default value should probably depend on the cluster size.
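
For instance, one purely hypothetical rule for such a cluster-size-dependent
default (not something the thread settles on) would be to keep loading small
L2 tables whole and only split the larger ones into 4KB pieces:

def default_cache_entry_size(cluster_size):
    # Hypothetical heuristic: small clusters mean small L2 tables, so load
    # them whole; for larger clusters fall back to 4KB pieces, each of
    # which still holds 4096 / 8 = 512 L2 entries.
    return min(cluster_size, 4096)

for cs in (512, 4096, 64 * 1024, 2 * 1024 * 1024):
    print(cs, default_cache_entry_size(cs))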

>>- We need a proper name for these sub-tables that we are loading
>>  now. I'm actually still struggling with this :-) I can't think of
>>  any name that is clear enough and not too cumbersome to use (L2
>>  subtables? => Confusing. L3 tables? => they're not really that).
>
> L2 table chunk? Or just L2 cache entry?

Yeah, something like that, but let's see how variables end up being
named :)

Berto



Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-20 Thread Kevin Wolf
On 19.09.2017 at 17:07, Alberto Garcia wrote:
> Hi everyone,
> 
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
> [...]

Thanks for working on this, Berto! I think this is essential for large
cluster sizes and have been meaning to make a change like this for a
long time, but I never found the time for it.

> Some results from my tests (using an SSD drive and random 4K reads):
> 
> |-----------+--------------+-------------+---------------+--------------|
> | Disk size | Cluster size | L2 cache    | Standard QEMU | Patched QEMU |
> |-----------+--------------+-------------+---------------+--------------|
> |     16 GB |        64 KB | 1 MB [8 GB] |     5000 IOPS | 12700 IOPS   |
> |      2 TB |         2 MB | 4 MB [1 TB] |      576 IOPS | 11000 IOPS   |
> |-----------+--------------+-------------+---------------+--------------|
> 
> The improvements are clearly visible, but it's important to point out
> a couple of things:
> 
>- L2 cache size is always < total L2 metadata on disk (otherwise
>  this wouldn't make sense). Increasing the L2 cache size improves
>  performance a lot (and makes the effect of these patches
>  disappear), but it requires more RAM.

Do you have the numbers for the two cases above if the L2 tables covered
the whole image?

>- Doing random reads over the whole disk is probably not a very
>  realistic scenario. During normal usage only certain areas of the
>  disk need to be accessed, so performance should be much better
>  with the same amount of cache.
>- I wrote a best-case scenario test (several I/O jobs each accessing
>  a part of the disk that requires loading its own L2 table) and my
>  patched version is 20x faster even with 64KB clusters.

I suppose you chose the scenario so that the number of jobs is larger
than the number of cached L2 tables without the patch, but smaller than
the number of cache entries with the patch?

We will probably need to do some more benchmarking to find a good
default value for the cached chunks. 4k is nice and small, so we can
cover many parallel jobs without using too much memory. But if we have a
single sequential job, we may end up doing the metadata updates in
small 4k chunks instead of doing a single larger write.

Of course, if this starts becoming a problem (maybe unlikely?), we can
always change the cache code to gather any adjacent dirty chunks in the
cache when writing out something. Same thing for readahead, if we can
find a policy for when to evict old entries for readahead.
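
A minimal sketch of that gathering idea (the Chunk type and function below
are mine, not the existing qcow2 cache API): when one dirty chunk has to be
written back, extend the write to the longest run of dirty chunks that are
contiguous on disk.

from dataclasses import dataclass

@dataclass
class Chunk:
    offset: int   # disk offset of this cached piece of an L2 table
    size: int     # e.g. 4096
    dirty: bool

def coalesce_dirty(chunks, i):
    # chunks is sorted by disk offset; return the range [lo, hi] of dirty
    # chunks contiguous on disk around chunks[i], to flush in one write
    lo = hi = i
    while lo > 0 and chunks[lo - 1].dirty and \
            chunks[lo - 1].offset + chunks[lo - 1].size == chunks[lo].offset:
        lo -= 1
    while hi + 1 < len(chunks) and chunks[hi + 1].dirty and \
            chunks[hi].offset + chunks[hi].size == chunks[hi + 1].offset:
        hi += 1
    return lo, hi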

>- We need a proper name for these sub-tables that we are loading
>  now. I'm actually still struggling with this :-) I can't think of
>  any name that is clear enough and not too cumbersome to use (L2
>  subtables? => Confusing. L3 tables? => they're not really that).

L2 table chunk? Or just L2 cache entry?

> I think I haven't forgotten anything. As I said I have a working
> prototype of this and if you like the idea I'd like to publish it
> soon. Any questions or comments will be appreciated.

Please do post it!

Kevin



Re: [Qemu-block] RFC: Reducing the size of entries in the qcow2 L2 cache

2017-09-19 Thread Denis V. Lunev
On 09/19/2017 06:07 PM, Alberto Garcia wrote:
> Hi everyone,
>
> over the past few weeks I have been testing the effects of reducing
> the size of the entries in the qcow2 L2 cache. This was briefly
> mentioned by Denis in the same thread where we discussed subcluster
> allocation back in April, but I'll describe here the problem and the
> proposal in detail.
>
> === Problem ===
>
> In the qcow2 file format guest addresses are mapped to host addresses
> using the so-called L1 and L2 tables. The size of an L2 table is the
> same as the cluster size, therefore a larger cluster means more L2
> entries in a table, and because of that an L2 table can map a larger
> portion of the address space (not only because it contains more
> entries, but also because the data cluster that each one of those
> entries points at is larger).
>
> There are two consequences of this:
>
>1) If you double the cluster size of a qcow2 image then the maximum
>   space needed for all L2 tables is divided by two (i.e. you need
>   half the metadata).
>
>2) If you double the cluster size of a qcow2 image then each one of
>   the L2 tables will map four times as much disk space.
>
> With the default cluster size of 64KB, each L2 table maps 512MB of
> contiguous disk space. This table shows what happens when you change
> the cluster size:
>
>  |--------------+------------------|
>  | Cluster size | An L2 table maps |
>  |--------------+------------------|
>  |        512 B |            32 KB |
>  |         1 KB |           128 KB |
>  |         2 KB |           512 KB |
>  |         4 KB |             2 MB |
>  |         8 KB |             8 MB |
>  |        16 KB |            32 MB |
>  |        32 KB |           128 MB |
>  |        64 KB |           512 MB |
>  |       128 KB |             2 GB |
>  |       256 KB |             8 GB |
>  |       512 KB |            32 GB |
>  |         1 MB |           128 GB |
>  |         2 MB |           512 GB |
>  |--------------+------------------|
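
For reference, those figures follow from the 8-byte size of an L2 entry:
an L2 table occupies one cluster, so it holds cluster_size / 8 entries and
each entry maps one data cluster. A quick sketch (helper name is mine):

L2_ENTRY_SIZE = 8  # bytes per entry in an L2 table (qcow2 spec)

def l2_table_coverage(cluster_size):
    # one table = one cluster = cluster_size / 8 entries, each mapping
    # one data cluster
    return (cluster_size // L2_ENTRY_SIZE) * cluster_size

KB, MB, GB = 1024, 1024**2, 1024**3
print(l2_table_coverage(512) // KB)       # 32  -> 32 KB
print(l2_table_coverage(64 * KB) // MB)   # 512 -> 512 MB (default cluster size)
print(l2_table_coverage(2 * MB) // GB)    # 512 -> 512 GB
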
>
> When QEMU wants to convert a guest address into a host address, it
> needs to read the entry from the corresponding L2 table. The qcow2
> driver doesn't read those entries directly, it does it by loading the
> tables in the L2 cache so they can be kept in memory in case they are
> needed later.
>
> The problem here is that the L2 cache (and the qcow2 driver in
> general) always works with complete L2 tables: if QEMU needs a
> particular L2 entry then the whole cluster containing the L2 table is
> read from disk, and if the cache is full then a cluster worth of
> cached data has to be discarded.
>
> The consequences of this are worse the larger the cluster size is, not
> only because we're reading (and discarding) larger amounts of data,
> but also because we're using that memory in a very inefficient way.
>
> Example: with 1MB clusters each L2 table maps 128GB of contiguous
> virtual disk, so that's the granularity of our cache. If we're
> performing I/O in a 4GB area that overlaps two of those 128GB chunks,
> we need to have in the cache two complete L2 tables (2MB) even when in
> practice we're only using 32KB of those 2MB (32KB contain enough L2
> entries to map the 4GB that we're using).
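
Spelling out the arithmetic of that example (again assuming 8-byte L2
entries):

MB, GB = 1024**2, 1024**3
cluster_size = 1 * MB
working_set = 4 * GB                           # the 4GB area being accessed

entries_needed = working_set // cluster_size   # 4096 L2 entries
metadata_needed = entries_needed * 8           # 32768 bytes = 32KB
cached_today = 2 * cluster_size                # two whole L2 tables = 2MB
print(metadata_needed, cached_today)           # 32768 2097152
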
>
> === The proposal ===
>
> One way to solve the problems described above is to decouple the L2
> table size (which is equal to the cluster size) from the cache entry
> size.
>
> The qcow2 cache doesn't actually know anything about the data that
> it's loading, it just receives a disk offset and checks that it is
> properly aligned. It's perfectly possible to make it load data blocks
> smaller than a cluster.
>
> I already have a working prototype, and I was doing tests using a 4KB
> cache entry size. 4KB is small enough, it allows us to make a more
> flexible use of the cache, it's also a common file system block size
> and it can hold enough L2 entries to cover substantial amounts of disk
> space (especially with large clusters).
>
>  |--------------+-----------------------|
>  | Cluster size | 4KB of L2 entries map |
>  |--------------+-----------------------|
>  |        64 KB |                 32 MB |
>  |       128 KB |                 64 MB |
>  |       256 KB |                128 MB |
>  |       512 KB |                256 MB |
>  |         1 MB |                512 MB |
>  |         2 MB |                  1 GB |
>  |--------------+-----------------------|
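
The rows above come out of the same 8-byte entry size, now applied to a
fixed 4KB cache entry instead of a whole table (sketch, helper name is
mine):

def slice_coverage(entry_size, cluster_size):
    # a 4KB cache entry holds entry_size / 8 L2 entries, each mapping one
    # data cluster of the given size
    return (entry_size // 8) * cluster_size

KB, MB, GB = 1024, 1024**2, 1024**3
print(slice_coverage(4 * KB, 64 * KB) // MB)   # 32 -> 32 MB
print(slice_coverage(4 * KB, 2 * MB) // GB)    # 1  -> 1 GB
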
>
> Some results from my tests (using an SSD drive and random 4K reads):
>
> |---+--+-+---+--|
> | Disk size | Cluster size | L2 cache| Standard QEMU | Patched QEMU |
> |---+--+-+---+--|
> | 16 GB | 64 KB| 1 MB [8 GB] | 5000 IOPS | 12700 IOPS   |
> |  2 TB |  2 MB| 4 MB [1 TB] |  576 IOPS | 11000 IOPS   |
>