On Thu, Dec 10, 2020 at 08:29:21PM +0100, Miklos Szeredi wrote:
> On Thu, Dec 10, 2020 at 5:11 PM Vivek Goyal <vgo...@redhat.com> wrote:
>
> > Conclusion
> > -----------
> > - virtiofs DAX seems to help a lot in many workloads.
> >
> > Note, DAX performs well only if data fits in the cache window. My total
> > data is 16G and cache window size is 16G as well. If data is larger
> > than the DAX cache window, then performance of DAX suffers a lot. The
> > overhead of reclaiming an old mapping and setting up a new one is very high.
>
> Which begs the question: what is the optimal window size?
Yep. I will need to run some more tests with the data size kept constant
and the DAX window size varying. For now, I would say the optimal window
size is the same as the data size. But knowing the data size in advance
might be hard, so a rough guideline could be to make the window the same
size as the amount of RAM given to the guest.

>
> What is the cost per GB of window to the host and guest?

Inside the guest, I think two primary structures are allocated. There
will be a "struct page" allocated per 4K page; size of struct page seems
to be 64 bytes. And then there will be a "struct fuse_dax_mapping"
allocated per 2MB; size of "struct fuse_dax_mapping" is 112 bytes. This
means the memory needed in the guest is:

memory per 2MB of DAX window = 112 + 64 * 512 = 32880 bytes
memory per 1GB of DAX window = 32880 * 512 = 16834560 bytes (16MB approx)

I think the "struct page" allocation is the biggest one, and that's
roughly 1.56% (64/4096) of the DAX window size, which also comes to
about 16MB of memory per GB of DAX window. So if a guest has 4G RAM and
a 4G DAX window, then 64MB will be consumed in DAX window struct pages.
I would say not too bad.

I am looking at the qemu code and it's not obvious to me what memory
allocation will be needed on the host side per 1GB of DAX window. Looks
like it just stores the cache window location and size, and when a
mapping request comes, it simply adds the offset to the cache window
start. So it might not be allocating memory per page of the DAX window.

mmap(cache_host + sm->c_offset[i], sm->len[i]....

David, you most likely have a better idea about this.

>
> Could we measure at what point does a large window size actually make
> performance worse?

Will do. Will run tests with varying window sizes (small to large) and
see how that impacts performance for the same workload with the same
guest memory.

>
> >
> > NAME                 WORKLOAD                Bandwidth       IOPS
> > 9p-none              seqread-psync           98.6mb          24.6k
> > 9p-mmap              seqread-psync           97.5mb          24.3k
> > 9p-loose             seqread-psync           91.6mb          22.9k
> > vtfs-none            seqread-psync           98.4mb          24.6k
> > vtfs-none-dax        seqread-psync           660.3mb         165.0k
> > vtfs-auto            seqread-psync           650.0mb         162.5k
> > vtfs-auto-dax        seqread-psync           703.1mb         175.7k
> > vtfs-always          seqread-psync           671.3mb         167.8k
> > vtfs-always-dax      seqread-psync           687.2mb         171.8k
> >
> > 9p-none              seqread-psync-multi     397.6mb         99.4k
> > 9p-mmap              seqread-psync-multi     382.7mb         95.6k
> > 9p-loose             seqread-psync-multi     350.5mb         87.6k
> > vtfs-none            seqread-psync-multi     360.0mb         90.0k
> > vtfs-none-dax        seqread-psync-multi     2281.1mb        570.2k
> > vtfs-auto            seqread-psync-multi     2530.7mb        632.6k
> > vtfs-auto-dax        seqread-psync-multi     2423.9mb        605.9k
> > vtfs-always          seqread-psync-multi     2535.7mb        633.9k
> > vtfs-always-dax      seqread-psync-multi     2406.1mb        601.5k
>
> Seems like in all the -multi tests 9p-none performs consistently
> better than vtfs-none. Could that be due to the single queue?

Not sure. In the past I had run -multi tests with a shared thread pool
(cache=auto) and a single thread seemed to perform better. I can try the
shared pool, run the -multi tests again and see if that helps.

Thanks
Vivek
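P.S. For completeness, here is a tiny userspace sketch that just redoes
the guest-side overhead arithmetic above. It assumes the struct sizes
measured in this thread (64 bytes per struct page, 112 bytes per struct
fuse_dax_mapping); those sizes depend on kernel version and config, so
treat the output as an estimate rather than a guarantee. The macro and
function names are mine, not anything from the kernel or qemu source.

/* dax-overhead.c: estimate guest memory cost per GB of DAX window */
#include <stdio.h>

#define GUEST_PAGE_SIZE		4096UL		/* guest page size */
#define DAX_EXTENT_SIZE		(2UL << 20)	/* fuse dax maps 2MB ranges */
#define STRUCT_PAGE_SIZE	64UL		/* as measured in this thread */
#define FUSE_DAX_MAPPING_SIZE	112UL		/* as measured in this thread */

static unsigned long guest_overhead(unsigned long window_bytes)
{
	/* one struct page per 4K page + one fuse_dax_mapping per 2MB */
	return (window_bytes / GUEST_PAGE_SIZE) * STRUCT_PAGE_SIZE +
	       (window_bytes / DAX_EXTENT_SIZE) * FUSE_DAX_MAPPING_SIZE;
}

int main(void)
{
	unsigned long gb = 1UL << 30;
	unsigned long bytes = guest_overhead(gb);

	/* prints 16834560 bytes, i.e. ~16MB or ~1.57% per GB of window */
	printf("overhead per 1GB of DAX window: %lu bytes (%.2f%%)\n",
	       bytes, 100.0 * bytes / gb);
	return 0;
}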