On Thu, Dec 10, 2020 at 08:29:21PM +0100, Miklos Szeredi wrote:
> On Thu, Dec 10, 2020 at 5:11 PM Vivek Goyal <vgo...@redhat.com> wrote:
>
> > Conclusion
> > -----------
> > - virtiofs DAX seems to help a lot in many workloads.
> >
> > Note, DAX performs well only if data fits in the cache window. My total
> > data is 16G and cache window size is 16G as well. If data is larger
> > than the DAX cache window, then performance of DAX suffers a lot. The
> > overhead of reclaiming an old mapping and setting up a new one is very high.
>
> Which begs the question: what is the optimal window size?
Yep. I will need to run some more tests with the data size kept constant
and the DAX window size varying. For now, I would say the optimal window
size is the same as the data size. But knowing the data size in advance
might be hard, so a rough guideline could be to make the window the same
size as the amount of RAM given to the guest.

>
> What is the cost per GB of window to the host and guest?

Inside the guest, I think two primary structures are allocated. There
will be a "struct page" allocated per 4K page; size of struct page seems
to be 64 bytes. And then there will be a "struct fuse_dax_mapping"
allocated per 2MB; size of "struct fuse_dax_mapping" is 112 bytes. This
means the memory needed in the guest is:

memory per 2MB of DAX window = 112 + 64 * 512 = 32880 bytes
memory per 1GB of DAX window = 32880 * 512 = 16834560 bytes (16MB approx)

I think the "struct page" allocation is the biggest one, and that's
roughly 1.56% (64/4096) of the DAX window size, which also comes to
about 16MB of memory per GB of DAX window. So if a guest has 4G RAM and
a 4G DAX window, then 64MB will be consumed in DAX window struct pages.
I would say not too bad.

I am looking at the qemu code and it's not obvious to me what memory
allocation will be needed on the host side per 1GB of DAX window. Looks
like it just stores the cache window location and size, and when a
mapping request comes, it simply adds the offset to the cache window
start. So it might not be allocating memory per page of the DAX window.

mmap(cache_host + sm->c_offset[i], sm->len[i]....

David, you most likely have a better idea about this.

>
> Could we measure at what point does a large window size actually make
> performance worse?

Will do. Will run tests with varying window sizes (small to large) and
see how that impacts performance for the same workload with the same
guest memory.

>
> >
> > NAME                 WORKLOAD                Bandwidth       IOPS
> > 9p-none              seqread-psync           98.6mb          24.6k
> > 9p-mmap              seqread-psync           97.5mb          24.3k
> > 9p-loose             seqread-psync           91.6mb          22.9k
> > vtfs-none            seqread-psync           98.4mb          24.6k
> > vtfs-none-dax        seqread-psync           660.3mb         165.0k
> > vtfs-auto            seqread-psync           650.0mb         162.5k
> > vtfs-auto-dax        seqread-psync           703.1mb         175.7k
> > vtfs-always          seqread-psync           671.3mb         167.8k
> > vtfs-always-dax      seqread-psync           687.2mb         171.8k
> >
> > 9p-none              seqread-psync-multi     397.6mb         99.4k
> > 9p-mmap              seqread-psync-multi     382.7mb         95.6k
> > 9p-loose             seqread-psync-multi     350.5mb         87.6k
> > vtfs-none            seqread-psync-multi     360.0mb         90.0k
> > vtfs-none-dax        seqread-psync-multi     2281.1mb        570.2k
> > vtfs-auto            seqread-psync-multi     2530.7mb        632.6k
> > vtfs-auto-dax        seqread-psync-multi     2423.9mb        605.9k
> > vtfs-always          seqread-psync-multi     2535.7mb        633.9k
> > vtfs-always-dax      seqread-psync-multi     2406.1mb        601.5k
>
> Seems like in all the -multi tests 9p-none performs consistently
> better than vtfs-none. Could that be due to the single queue?

Not sure. In the past I had run -multi tests with a shared thread pool
(cache=auto) and a single thread seemed to perform better. I can try the
shared pool, run the -multi tests again and see if that helps.

Thanks
Vivek
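P.S. For completeness, here is a tiny userspace sketch that just redoes
the guest-side overhead arithmetic above. It assumes the struct sizes
measured in this thread (64 bytes per struct page, 112 bytes per struct
fuse_dax_mapping); those sizes depend on kernel version and config, so
treat the output as an estimate rather than a guarantee. The macro and
function names are mine, not anything from the kernel or qemu source.

/* dax-overhead.c: estimate guest memory cost per GB of DAX window */
#include <stdio.h>

#define GUEST_PAGE_SIZE		4096UL		/* guest page size */
#define DAX_EXTENT_SIZE		(2UL << 20)	/* fuse dax maps 2MB ranges */
#define STRUCT_PAGE_SIZE	64UL		/* as measured in this thread */
#define FUSE_DAX_MAPPING_SIZE	112UL		/* as measured in this thread */

static unsigned long guest_overhead(unsigned long window_bytes)
{
	/* one struct page per 4K page + one fuse_dax_mapping per 2MB */
	return (window_bytes / GUEST_PAGE_SIZE) * STRUCT_PAGE_SIZE +
	       (window_bytes / DAX_EXTENT_SIZE) * FUSE_DAX_MAPPING_SIZE;
}

int main(void)
{
	unsigned long gb = 1UL << 30;
	unsigned long bytes = guest_overhead(gb);

	/* prints 16834560 bytes, i.e. ~16MB or ~1.57% per GB of window */
	printf("overhead per 1GB of DAX window: %lu bytes (%.2f%%)\n",
	       bytes, 100.0 * bytes / gb);
	return 0;
}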