Re: [PATCH v5 03/13] mm/shmem: Support memfile_notifier

2022-03-10 Thread Dave Chinner
On Thu, Mar 10, 2022 at 10:09:01PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" 
> 
> It maintains a memfile_notifier list in shmem_inode_info structure and
> implements memfile_pfn_ops callbacks defined by memfile_notifier. It
> then exposes them to memfile_notifier via
> shmem_get_memfile_notifier_info.
> 
> We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be
> allocated by userspace for private memory. If there is no pages
> allocated at the offset then error should be returned so KVM knows that
> the memory is not private memory.
> 
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---
>  include/linux/shmem_fs.h |  4 +++
>  mm/shmem.c   | 76 
>  2 files changed, 80 insertions(+)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 2dde843f28ef..7bb16f2d2825 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/memfile_notifier.h>
>  
>  /* inode in-kernel data */
>  
> @@ -28,6 +29,9 @@ struct shmem_inode_info {
>   struct simple_xattrs    xattrs;         /* list of xattrs */
>   atomic_t                stop_eviction;  /* hold when working on inode */
>   unsigned int            xflags;         /* shmem extended flags */
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> + struct memfile_notifier_list memfile_notifiers;
> +#endif
>   struct inode            vfs_inode;
>  };
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9b31a7056009..7b43e274c9a2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>   return page ? page_folio(page) : NULL;
>  }
>  
> +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> +#ifdef CONFIG_MEMFILE_NOTIFIER
> + struct shmem_inode_info *info = SHMEM_I(inode);
> +
> + memfile_notifier_fallocate(&info->memfile_notifiers, start, end);
> +#endif
> +}

*notify_populate(), not fallocate.  This is a notification that a
range has been populated, not that the fallocate() syscall was run
to populate the backing store of a file.

i.e.  fallocate is the name of a userspace filesystem API that can
be used to manipulate the backing store of a file in various ways.
It can both populate and punch away the backing store of a file, and
some operations that fallocate() can run will do both (e.g.
FALLOC_FL_ZERO_RANGE) and so could generate both
notify_invalidate() and notify_populate() events.

Hence "fallocate" as an internal mm namespace or operation does not
belong anywhere in core MM infrastructure - it should never get used
anywhere other than the VFS/filesystem layers that implement the
fallocate() syscall or use it directly.
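
A minimal sketch of the rename being argued for here, assuming the rest
of the posted memfile_notifier API stays as-is (memfile_notifier_populate()
is the hypothetically renamed helper, not a function in the posted series):

static void notify_populate(struct inode *inode, pgoff_t start, pgoff_t end)
{
#ifdef CONFIG_MEMFILE_NOTIFIER
	struct shmem_inode_info *info = SHMEM_I(inode);

	/* A range of the file has been populated with backing pages. */
	memfile_notifier_populate(&info->memfile_notifiers, start, end);
#endif
}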

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-23 Thread Dave Chinner
On Fri, Aug 21, 2020 at 08:59:44AM -0400, Brian Foster wrote:
> On Fri, Aug 21, 2020 at 01:42:52PM +0200, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  
> > wrote:
> > And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
> > more IOPS.
> > 
> > I just ran the tests with aio=native and with a raw image instead of
> > qcow2, here are the results:
> > 
> > qcow2:
> > |----------------------+-------------+------------|
> > | preallocation        | aio=threads | aio=native |
> > |----------------------+-------------+------------|
> > | off                  |        8139 |       7649 |
> > | off (w/o ZERO_RANGE) |        2965 |       2779 |
> > | metadata             |        7768 |       8265 |
> > | falloc               |        7742 |       7956 |
> > | full                 |       41389 |      56668 |
> > |----------------------+-------------+------------|
> > 
> 
> So this seems like Dave's suggestion to use native aio produced more
> predictable results with full file prealloc being a bit faster than per
> cluster prealloc. Not sure why that isn't the case with aio=threads. I

That will be the context switch overhead with aio=threads becoming a
performance limiting factor at higher IOPS. The "full" workload
there is probably running at 80-120k context switches/s while the
aio=native one is probably under 10k ctxsw/s because it doesn't switch
threads for every IO that has to be submitted/completed.

For all the other results, I'd consider the difference to be noise -
it's just not significant enough to draw any conclusions from at
all.

FWIW, the other thing that aio=native gives us is plugging across
batch IO submission. This allows bio merging before dispatch and
that can greatly increase performance of AIO when the IO being
submitted has some mergable submissions. That's not the case for
pure random IO like this, but there are relatively few pure random
IO workloads out there... :P

> was wondering if perhaps the threading affects something indirectly like
> the qcow2 metadata allocation itself, but I guess that would be
> inconsistent with ext4 showing a notable jump from (1) to (4) (assuming
> the previous ext4 numbers were with aio=threads).

> > raw:
> > |---------------+-------------+------------|
> > | preallocation | aio=threads | aio=native |
> > |---------------+-------------+------------|
> > | off           |        7647 |       7928 |
> > | falloc        |        7662 |       7856 |
> > | full          |       45224 |      58627 |
> > |---------------+-------------+------------|
> > 
> > A qcow2 file with preallocation=metadata is more or less similar to a
> > sparse raw file (and the numbers are indeed similar).
> > 
> > preallocation=off on qcow2 does not have an equivalent on raw files.
> > 
> 
> It sounds like preallocation=off for qcow2 would be roughly equivalent
> to a raw file with a 64k extent size hint (on XFS).

Yes, the effect should be close to identical, the only difference is
that qcow2 adds new clusters to the end of the file (i.e. the file
itself is not sparse), while the extent size hint will just add 64kB
extents into the file around the write offset. That demonstrates the
other behavioural advantage of extent size hints: they avoid needing
to extend the file, which is yet another way to serialise concurrent
IO and create IO pipeline stalls...
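
For reference, an application can set such an extent size hint itself
through the generic FS_IOC_FSSETXATTR ioctl; a sketch (error handling
trimmed, 64k chosen only to match the qcow2 cluster size discussed here,
and on XFS the hint has to be set while the file still has no extents):

#include <sys/ioctl.h>
#include <linux/fs.h>		/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_EXTSIZE */

/* Ask the filesystem to allocate in extsize-byte chunks around each write. */
static int set_extsize_hint(int fd, unsigned int extsize)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = extsize;		/* e.g. 64 * 1024 */
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

This is essentially what the xfs_io "extsize" command does under the
covers.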

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-23 Thread Dave Chinner
 a combination of all three things.
Watching a set of fast sampling metrics that tell you what the
device and filesystem are doing in real time (e.g. I use PCP for this
and visualise the behaviour in real time via pmchart) gives a lot
of insight into exactly what is changing during transient workload
changes like starting a benchmark...

> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> preallocation=off the performance is stable during the whole test.

What does "preallocation=off" mean again? Is that using
fallocate(ZERO_RANGE) prior to the data write rather than
preallocating the metadata/entire file? If so, I would expect the
limiting factor is the rate at which IO can be issued because of the
fallocate() triggered pipeline bubbles. That leaves idle device time
so you're not pushing the limits of the hardware and hence none of
the behaviours above will be evident...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-20 Thread Dave Chinner
 question!)

However, if you just use (4) you get:

falloc(64k)
  
  
<4k io>
  
falloc(64k)
  
  
  <4k IO completes, converts 4k to written>
  
<4k io>
falloc(64k)
  
  
  <4k IO completes, converts 4k to written>
  
<4k io>
  

until all the clusters in the qcow2 file are initialised. IOWs, each
fallocate() call serialises all IO in flight. Compare that to using
extent size hints on a raw sparse image file for the same thing:


<4k IO>
  
  
<4k IO>
  
  
<4k IO>
  
  
...
  <4k IO completes, converts 4k to written>
  <4k IO completes, converts 4k to written>
  <4k IO completes, converts 4k to written>


See the difference in IO pipelining here? You get the same "64kB
cluster initialised at a time" behaviour as qcow2, but you don't get
the IO pipeline stalls caused by fallocate() having to drain all the
IO in flight before it does the allocation.
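
To make that concrete, the per-cluster allocation on the qcow2 side
boils down to something like the following before the guest's 4k write
can be issued (sketch only; the exact call site in the qcow2 driver
doesn't matter for the argument):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/*
 * Zero/allocate one 64k cluster in the image file. On XFS this has to
 * wait for all IO already in flight on the file before it can modify
 * the extent map - the pipeline stall shown in the diagram above.
 */
static int allocate_cluster(int fd, off_t cluster_start, off_t cluster_size)
{
	return fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster_start, cluster_size);
}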

>- Why is (5) so much faster than everything else?

The full file allocation in (5) means the IO doesn't have to modify
the extent map, hence all extent mapping uses shared locking and
the entire IO path can run concurrently without serialisation at
all.

Thing is, once your writes into sparse image files regularly start
hitting written extents, the performance of (1), (2) and (4) will
trend towards (5) as writes hit already allocated ranges of the file
and the serialisation of extent mapping changes goes away. This
occurs with guest filesystems that perform overwrite in place (such
as XFS) and hence overwrites of existing data will hit allocated
space in the image file and not require further allocation.

IOWs, typical "write once" benchmark testing indicates the *worst*
performance you are going to see. As the guest filesytsem ages and
initialises more of the underlying image file, it will get faster,
not slower.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v6 6/6] xfs: disable map_sync for async flush

2019-04-23 Thread Dave Chinner
On Tue, Apr 23, 2019 at 01:36:12PM +0530, Pankaj Gupta wrote:
> Dont support 'MAP_SYNC' with non-DAX files and DAX files
> with asynchronous dax_device. Virtio pmem provides
> asynchronous host page cache flush mechanism. We don't
> support 'MAP_SYNC' with virtio pmem and xfs.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  fs/xfs/xfs_file.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f2e2845eb76..0e59be018511 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1196,11 +1196,13 @@ xfs_file_mmap(
>   struct file *filp,
>   struct vm_area_struct *vma)
>  {
> - /*
> -  * We don't support synchronous mappings for non-DAX files. At least
> -  * until someone comes with a sensible use case.
> + struct dax_device *dax_dev = xfs_find_daxdev_for_inode
> + (file_inode(filp));
> +
> + /* We don't support synchronous mappings for non-DAX files and
> +  * for DAX files if underneath dax_device is not synchronous.
>*/

/*
 * This is the correct multi-line comment format. Please
 * update the patch to maintain the existing comment format.
 */

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v4 5/5] xfs: disable map_sync for async flush

2019-04-03 Thread Dave Chinner
On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> Virtio pmem provides asynchronous host page cache flush
> mechanism. we don't support 'MAP_SYNC' with virtio pmem 
> and xfs.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  fs/xfs/xfs_file.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 1f2e2845eb76..dced2eb8c91a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1203,6 +1203,14 @@ xfs_file_mmap(
>   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
>   return -EOPNOTSUPP;
>  
> + /* We don't support synchronous mappings with DAX files if
> +  * dax_device is not synchronous.
> +  */
> + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> + (vma->vm_flags & VM_SYNC))
> + return -EOPNOTSUPP;
> +
>   file_accessed(filp);
>   vma->vm_ops = &xfs_file_vm_ops;
>   if (IS_DAX(file_inode(filp)))

All this ad hoc IS_DAX conditional logic is getting pretty nasty.

xfs_file_mmap(
	struct file		*filp,
	struct vm_area_struct	*vma)
{
	struct inode		*inode = file_inode(filp);

	if (vma->vm_flags & VM_SYNC) {
		if (!IS_DAX(inode))
			return -EOPNOTSUPP;
		if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
			return -EOPNOTSUPP;
	}

	file_accessed(filp);
	vma->vm_ops = &xfs_file_vm_ops;
	if (IS_DAX(inode))
		vma->vm_flags |= VM_HUGEPAGE;
	return 0;
}


Even better, factor out all the "MAP_SYNC supported" checks into a
helper so that the filesystem code just doesn't have to care about
the details of checking for DAX+MAP_SYNC support...
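
A sketch of what such a helper could look like (the name and placement
are illustrative only; the point is that the filesystem's ->mmap method
shrinks to a single check):

static inline bool dax_mapping_supported(struct vm_area_struct *vma,
					 struct dax_device *dax_dev)
{
	if (!(vma->vm_flags & VM_SYNC))
		return true;		/* nothing special requested */
	if (!IS_DAX(file_inode(vma->vm_file)))
		return false;		/* MAP_SYNC requires DAX */
	return dax_synchronous(dax_dev);
}

	/* ... and xfs_file_mmap() becomes: */
	if (!dax_mapping_supported(vma, xfs_find_daxdev_for_inode(inode)))
		return -EOPNOTSUPP;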

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] security implications of caching with virtio pmem (was Re: [PATCH v3 0/5] kvm "virtio pmem" device)

2019-02-11 Thread Dave Chinner
On Mon, Feb 11, 2019 at 02:29:46AM -0500, Pankaj Gupta wrote:
> Hello Dave,
> Are we okay with this?

Sure.

I'm not sure I agree with all the analysis presented, but, well, I
haven't looked any deeper because I'm tired of being shouted at and
being called argumentative for daring to ask hard questions about
this topic

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-15 Thread Dave Chinner
On Tue, Jan 15, 2019 at 12:35:06AM -0500, Pankaj Gupta wrote:
> 
> > > > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> > > > >
> > > > > > > Until you have images (and hence host page cache) shared between
> > > > > > > multiple guests. People will want to do this, because it means 
> > > > > > > they
> > > > > > > only need a single set of pages in host memory for executable
> > > > > > > binaries rather than a set of pages per guest. Then you have
> > > > > > > multiple guests being able to detect residency of the same set of
> > > > > > > pages. If the guests can then, in any way, control eviction of the
> > > > > > > pages from the host cache, then we have a guest-to-guest
> > > > > > > information
> > > > > > > leak channel.
> > > > > >
> > > > > > I don't think we should ever be considering something that would
> > > > > > allow a
> > > > > > guest to evict page's from the host's pagecache [1].  The guest
> > > > > > should
> > > > > > be able to kick its own references to the host's pagecache out of 
> > > > > > its
> > > > > > own pagecache, but not be able to influence whether the host or
> > > > > > another
> > > > > > guest has a read-only mapping cached.
> > > > > >
> > > > > > [1] Unless the guest is allowed to modify the host's file; obviously
> > > > > > truncation, holepunching, etc are going to evict pages from the
> > > > > > host's
> > > > > > page cache.
> > > > >
> > > > > This is so correct. Guest does not not evict host page cache pages
> > > > > directly.
> > > >
> > > > They don't right now.
> > > >
> > > > But someone is going to end up asking for discard to work so that
> > > > the guest can free unused space in the underlying sparse image (i.e.
> > > > make use of fstrim or mount -o discard) because they have workloads
> > > > that have bursts of space usage and they need to trim the image
> > > > files afterwards to keep their overall space usage under control.
> > > >
> > > > And then
> > > 
> > > ...we reject / push back on that patch citing the above concern.
> > 
> > So at what point do we draw the line?
> > 
> > We're allowing writable DAX mappings, but as I've pointed out that
> > means we are going to be allowing  a potential information leak via
> > files with shared extents to be directly mapped and written to.
> > 
> > But we won't allow useful admin operations that allow better
> > management of host side storage space similar to how normal image
> > files are used by guests because it's an information leak vector?
> 
> First of all Thank you for all the useful discussions. 
> I am summarizing here:
> 
> - We have to live with the limitation to not support fstrim and 
>   mount -o discard options with virtio-pmem as they will evict 
>   host page cache pages. We cannot allow this for virtio-pmem
>   for security reasons. These filesystem commands will just zero out 
>   unused pages currently.

Not sure I follow you here - what pages are going to be zeroed and
when will they be zeroed? If discard is not allowed, filesystems
just don't issue such commands and the underlying device will never
see them.

> - If alot of space is unused and not freed guest can request host 
>   Administrator for truncating the host backing image. 

You can't use truncate to free space in a disk image file. The only
way to do it safely in a generic, filesystem agnostic way is to
mount the disk image (e.g. on loopback) and run fstrim on it. The
loopback device will punch holes in the file where all the free
space is reported by the filesystem via discard requests.

Which is kinda my point - this could only be done if the guest is
shut down, which makes it very difficult for admins to manage. 
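
For reference, the trim that fstrim issues on such a loopback-mounted
image is just the FITRIM ioctl on the mountpoint; a sketch (the mount
path is illustrative, the loop device has to be set up with discard
support, and the guest must not be using the image while this runs):

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */

/*
 * Equivalent of "fstrim /mnt/image-fs": the filesystem issues discards
 * for its free space, which the loop device turns into hole punches on
 * the backing image file.
 */
static int trim_mounted_image(const char *mntpoint)
{
	struct fstrim_range range;
	int fd, ret;

	memset(&range, 0, sizeof(range));
	range.len = ULLONG_MAX;		/* trim all free space */

	fd = open(mntpoint, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, FITRIM, &range);
	close(fd);
	return ret;
}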

>   We are also planning to support qcow2 sparse image format at 
>   host side with virtio-pmem.

So you're going to be remapping a huge number of disjoint regions
into a linear pmem mapping? ISTR discussions about similar things
for virtio+fuse+dax that came up against "large numbers of mapped
regions don't scale" and so it wasn't a practical solution compared
to just using raw sparse files...

> - There is no existing solution for Qemu persistent memory 
>   emulation with write support currently. This solution provides 
>   us the paravartualized way of emulating persistent memory.

Sure, but the question is why do you need to create an emulation
that doesn't actually perform like pmem? The whole point of pmem is
performance, and emulating pmem by mmap() of a file on spinning
disks is going to be horrible for performance. Even on SSDs it's
going to be orders of magnitudes slower than real pmem.

So exactly what problem are you trying to solve with this driver?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-14 Thread Dave Chinner
On Mon, Jan 14, 2019 at 01:35:57PM -0800, Dan Williams wrote:
> On Mon, Jan 14, 2019 at 1:25 PM Dave Chinner  wrote:
> >
> > On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> > >
> > > > > Until you have images (and hence host page cache) shared between
> > > > > multiple guests. People will want to do this, because it means they
> > > > > only need a single set of pages in host memory for executable
> > > > > binaries rather than a set of pages per guest. Then you have
> > > > > multiple guests being able to detect residency of the same set of
> > > > > pages. If the guests can then, in any way, control eviction of the
> > > > > pages from the host cache, then we have a guest-to-guest information
> > > > > leak channel.
> > > >
> > > > I don't think we should ever be considering something that would allow a
> > > > guest to evict page's from the host's pagecache [1].  The guest should
> > > > be able to kick its own references to the host's pagecache out of its
> > > > own pagecache, but not be able to influence whether the host or another
> > > > guest has a read-only mapping cached.
> > > >
> > > > [1] Unless the guest is allowed to modify the host's file; obviously
> > > > truncation, holepunching, etc are going to evict pages from the host's
> > > > page cache.
> > >
> > > This is so correct. Guest does not not evict host page cache pages 
> > > directly.
> >
> > They don't right now.
> >
> > But someone is going to end up asking for discard to work so that
> > > the guest can free unused space in the underlying sparse image (i.e.
> > make use of fstrim or mount -o discard) because they have workloads
> > that have bursts of space usage and they need to trim the image
> > files afterwards to keep their overall space usage under control.
> >
> > And then
> 
> ...we reject / push back on that patch citing the above concern.

So at what point do we draw the line?

We're allowing writable DAX mappings, but as I've pointed out that
means we are going to be allowing  a potential information leak via
files with shared extents to be directly mapped and written to.

But we won't allow useful admin operations that allow better
management of host side storage space similar to how normal image
files are used by guests because it's an information leak vector?

That's splitting some really fine hairs there...

> > > In case of virtio-pmem & DAX, guest clears guest page cache exceptional 
> > > entries.
> > > Its solely decision of host to take action on the host page cache pages.
> > >
> > > In case of virtio-pmem, guest does not modify host file directly i.e don't
> > > perform hole punch & truncation operation directly on host file.
> >
> > ... this will no longer be true, and the nuclear landmine in this
> > driver interface will have been armed
> 
> I agree with the need to be careful when / if explicit cache control
> is added, but that's not the case today.

"if"?

I expect it to be "when", not if. Expect the worst, plan for it now.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-14 Thread Dave Chinner
On Mon, Jan 14, 2019 at 02:15:40AM -0500, Pankaj Gupta wrote:
> 
> > > Until you have images (and hence host page cache) shared between
> > > multiple guests. People will want to do this, because it means they
> > > only need a single set of pages in host memory for executable
> > > binaries rather than a set of pages per guest. Then you have
> > > multiple guests being able to detect residency of the same set of
> > > pages. If the guests can then, in any way, control eviction of the
> > > pages from the host cache, then we have a guest-to-guest information
> > > leak channel.
> > 
> > I don't think we should ever be considering something that would allow a
> > guest to evict page's from the host's pagecache [1].  The guest should
> > be able to kick its own references to the host's pagecache out of its
> > own pagecache, but not be able to influence whether the host or another
> > guest has a read-only mapping cached.
> > 
> > [1] Unless the guest is allowed to modify the host's file; obviously
> > truncation, holepunching, etc are going to evict pages from the host's
> > page cache.
> 
> This is so correct. Guest does not not evict host page cache pages directly. 

They don't right now.

But someone is going to end up asking for discard to work so that
the guest can free unused space in the underlying sparse image (i.e.
make use of fstrim or mount -o discard) because they have workloads
that have bursts of space usage and they need to trim the image
files afterwards to keep their overall space usage under control.

And then

> In case of virtio-pmem & DAX, guest clears guest page cache exceptional 
> entries.
> Its solely decision of host to take action on the host page cache pages.
> 
> In case of virtio-pmem, guest does not modify host file directly i.e don't
> perform hole punch & truncation operation directly on host file.  

... this will no longer be true, and the nuclear landmine in this
driver interface will have been armed

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-13 Thread Dave Chinner
On Sun, Jan 13, 2019 at 03:38:21PM -0800, Matthew Wilcox wrote:
> On Mon, Jan 14, 2019 at 10:29:02AM +1100, Dave Chinner wrote:
> > Until you have images (and hence host page cache) shared between
> > multiple guests. People will want to do this, because it means they
> > only need a single set of pages in host memory for executable
> > binaries rather than a set of pages per guest. Then you have
> > multiple guests being able to detect residency of the same set of
> > pages. If the guests can then, in any way, control eviction of the
> > pages from the host cache, then we have a guest-to-guest information
> > leak channel.
> 
> I don't think we should ever be considering something that would allow a
> guest to evict page's from the host's pagecache [1].  The guest should
> be able to kick its own references to the host's pagecache out of its
> own pagecache, but not be able to influence whether the host or another
> guest has a read-only mapping cached.
> 
> [1] Unless the guest is allowed to modify the host's file; obviously
> truncation, holepunching, etc are going to evict pages from the host's
> page cache.

Right, and that's exactly what I mean by "we need to be real careful
with functionality like this".

To be honest, I really don't think I've even touched the surface
here.

e.g. Filesystems and storage can share logical and physical extents.
Which means that image files that share storage (e.g.  because they
are all cloned from the same master image and/or there's in-line
deduplication running on the storage) and can be directly accessed
by guests may very well be susceptible to detection of host side
deduplication and subsequent copy-on-write operations.

This really doesn't seem much different to me from the guest being
able to infer host side KSM page deduplication and COW operation in
the guest side page cache.  The only difference is that DAX is being
used to probe the host side page cache and storage rather than the
guest side.

IOWs, I suspect there's a world of pain waiting for us if we punch
huge holes through the virtual machine abstractions like this.

Improving performance is a laudible goal, but at what price?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-13 Thread Dave Chinner
On Fri, Jan 11, 2019 at 02:45:04AM -0500, Pankaj Gupta wrote:
> 
> > 
> > On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote:
> > >  This patch series has implementation for "virtio pmem".
> > >  "virtio pmem" is fake persistent memory(nvdimm) in guest
> > >  which allows to bypass the guest page cache. This also
> > >  implements a VIRTIO based asynchronous flush mechanism.
> > 
> > Hmmm. Sharing the host page cache direct into the guest VM. Sounds
> > like a good idea, but....
> > 
> > This means the guest VM can now run timing attacks to observe host
> > side page cache residency, and depending on the implementation I'm
> > guessing that the guest will be able to control host side page
> > cache eviction, too (e.g. via discard or hole punch operations).
> 
> Not sure how? this is similar to mmapping virtual memory by any userspace 
> process. Any host userspace process can do such attack on host page cache
> using mincore & mmap shared file. 

Mincore is for monitoring, not cache eviction. And it's not
required to observe cache residency, either. That's a wide open
field containing an uncountable number of moles...
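
For context, the mincore() residency probe being alluded to is trivial
to write; a sketch (it can only observe residency of pages the process
has mapped itself, it cannot evict anything):

#include <unistd.h>
#include <sys/mman.h>

/* Count how many pages of an mmap()ed region are currently resident. */
static long count_resident_pages(void *addr, size_t length)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t npages = (length + page_size - 1) / page_size;
	unsigned char vec[npages];
	long resident = 0;

	if (mincore(addr, length, vec) < 0)
		return -1;
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;
	return resident;
}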

> But i don't think guest can do this alone. For virtio-pmem usecase
> guest won't be using page cache so timing attack from only guest
> side is not possible unless host userspace can run checks on page
> cache eviction state using mincore etc.  As rightly described by
> Rik, guest will only access its own page cache pages and if guest
> page cache is managed directly by host, this saves alot of effort
> for guest in transferring guest state of page cache.  

Until you have images (and hence host page cache) shared between
multiple guests. People will want to do this, because it means they
only need a single set of pages in host memory for executable
binaries rather than a set of pages per guest. Then you have
multiple guests being able to detect residency of the same set of
pages. If the guests can then, in any way, control eviction of the
pages from the host cache, then we have a guest-to-guest information
leak channel.

i.e. it's something we need to be aware of and really careful about
enabling infrastructure that /will/ be abused if guests can find a
way to influence the host side cache residency.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] [PATCH v3 0/5] kvm "virtio pmem" device

2019-01-09 Thread Dave Chinner
On Wed, Jan 09, 2019 at 08:17:31PM +0530, Pankaj Gupta wrote:
>  This patch series has implementation for "virtio pmem". 
>  "virtio pmem" is fake persistent memory(nvdimm) in guest 
>  which allows to bypass the guest page cache. This also
>  implements a VIRTIO based asynchronous flush mechanism.  

Hmmm. Sharing the host page cache direct into the guest VM. Sounds
like a good idea, but....

This means the guest VM can now run timing attacks to observe host
side page cache residency, and depending on the implementation I'm
guessing that the guest will be able to control host side page
cache eviction, too (e.g. via discard or hole punch operations).

Which means this functionality looks to me like a new vector for
information leakage into and out of the guest VM via guest
controlled host page cache manipulation.

https://arxiv.org/pdf/1901.01161

I might be wrong, but if I'm not we're going to have to be very
careful about how guest VMs can access and manipulate host side
resources like the page cache.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)

2016-07-22 Thread Dave Chinner
On Thu, Jul 21, 2016 at 10:23:48AM -0400, Paolo Bonzini wrote:
> > > 1) avoid copying zero data, to keep the copy process efficient.  For this,
> > > SEEK_HOLE/SEEK_DATA are enough.
> > > 
> > > 2) copy file contents while preserving the allocation state of the file's
> > > extents.
> > 
> > Which is /very difficult/ to do safely and reliably.
> > i.e. the use of fiemap to duplicate the exact layout of a file
> > from userspace is only possible if you can /guarantee/ the source
> > file has not changed in any way during the copy operation at the
> > point in time you finalise the destination data copy.
> 
> We don't do exactly that, exactly because it's messy when you have
> concurrent accesses (which shouldn't be done but you never know).

Which means you *cannot make the assumption it won't happen*.

FIEMAP is not guaranteed to tell you exactly where the data you
need to copy is in the file, nor that nothing you can do from
userspace changes that. I can't say it any clearer than that.

> When
> doing a copy, we use(d to use) FIEMAP the same way as you'd use lseek,
> querying one extent at a time.  If you proceed this way, all of these
> can cause the same races:
> 
> - pread(ofs=10MB, len=10MB) returns all zeroes, so the 10MB..20MB is
> not copied
> 
> - pread(ofs=10MB, len=10MB) returns non-zero data, so the 10MB..20MB is
> copied
> 
> - lseek(SEEK_DATA, 10MB) returns 20MB, so the 10MB..20MB area is not
> copied
> 
> - lseek(SEEK_HOLE, 10MB) returns 20MB, so the 10MB..20MB area is
> copied
> 
> - ioctl(FIEMAP at 10MB) returns an extent starting at 20MB, so
> the 10MB..20MB area is not copied

No, FIEMAP is not guaranteed to behave like this. What is returned
is filesystem dependent. Filesystems that don't support holes will
return data extents. Filesystems that support compression might
return a compressed data extent rather than a hole. Encrypted files
might not expose holes at all, so people can't easily find known
plain text regions in the encrypted data. Filesystems could report
holes as deduplicated data, etc.  What do you do when FIEMAP returns
"OFFLINE" to indicate that the data is located elsewhere and will
need to be retrieved by the HSM operating on top of the filesystem
before layout can be determined?

All of the above are *valid* and *correct*, because the filesystem
defines what FIEMAP returns for a given file offset. Just because
ext4 and XFS have mostly the same behaviour, it doesn't mean that
every other filesystem behaves the same way.

The assumptions being made about FIEMAP behaviour will only lead to
user data corruption, as they already have several times in the past.

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com



Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)

2016-07-21 Thread Dave Chinner
On Thu, Jul 21, 2016 at 01:31:21PM +0100, Pádraig Brady wrote:
> On 21/07/16 12:43, Dave Chinner wrote:
> > On Wed, Jul 20, 2016 at 03:35:17PM +0200, Niels de Vos wrote:
> >> Oh... And I was surprised to learn that "cp" does use FIEMAP and not
> >> SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
> >>   http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.c#n87
> > 
> > My understanding was that FIEMAP was disabled almost immediately
> > after the data corruption problems were understood, and it hasn't
> > been turned back on since. I don't see FIEMAP calls in strace when I
> > copy sparse files, even when I use --sparse=auto which is supposed
> > to trigger the sparse source file checks using FIEMAP.
> > 
> > So, yeah, while there's FIEMAP code present, that doesn't mean it's
> > actually used. And, yes, there are comments in coreutils commits in
> > 2011 saying that the FIEMAP workarounds are temporary until
> > SEEK_HOLE/DATA are supported, which were added to the kernel in 2011
> > soon after the 'cp-corrupts-data-with-fiemap' debacle came to light.
> > Five years later, and the fiemap code is still there?
> 
> Yes it's still there, and hasn't caused issues.
> It's used only for sparse files:
> 
>   $ truncate -s1G t.fiemap
>   $ strace -e ioctl cp t.fiemap t2.fiemap
>   ioctl(3, FS_IOC_FIEMAP, 0x7007c910) = 0

So, my tests were done with coreutils 8.25, and a sparse 100M file with
a layout like this:

$ xfs_bmap -vvp testfile
testfile:
 EXT: FILE-OFFSET   BLOCK-RANGEAG AG-OFFSETTOTAL FLAGS
   0: [0..20479]:   71768912..71789391  3 (11216720..11237199) 20480 0
   1: [20480..40959]:   hole   20480
   2: [40960..61439]:   71809872..71830351  3 (11257680..11278159) 20480 0
   3: [61440..81919]:   71830352..71850831  3 (11278160..11298639) 20480 1
   4: [81920..96223]:   71850832..71865135  3 (11298640..11312943) 14304 0
   5: [96224..190471]:  60457944..60552191  2 (20089816..20184063) 94248 0
   6: [190472..204799]: 73101864..73116191  3 (12549672..12563999) 14328 0
 FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end   on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end   on stripe width

i.e. a 10MB hole @ 10MB, a 10MB unwritten extent @ 30MB. And, yes, I
do see a fiemap call with your example above.

IOWs, it seems the "is sparse" heuristic that cp uses is not very
reliable.  That doesn't inspire confidence, and explains why none of
my cp tests on sparse files over the past 5 years have ever shown
FIEMAP calls...

/me is now somewhat afraid to use cp for copies that require data
integrity.

> It's a pity fiemap is racy (on some file systems?) wrt reporting
> of data not yet written out, and is thus fairly useless IMHO
> without the FIEMAP_FLAG_SYNC workaround.

It's racy on *all filesystems* - all the FIEMAP_FLAG_SYNC flag does
is make the race window smaller. The extent map is only coherent
with the data in memory while the inode locks are held by the
kernel. The moment the inode locks are dropped by the kernel in the
syscall, the extent map held in the fiemap structure is considered
stale and no longer coherent with the extent map held on the inode.

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com



Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)

2016-07-21 Thread Dave Chinner
On Wed, Jul 20, 2016 at 09:40:06AM -0400, Paolo Bonzini wrote:
> > > 1) is it expected that SEEK_HOLE skips unwritten extents?
> > 
> > There are multiple answers to this, all of which are correct depending
> > on current context and state:
> > 
> > 1. No - some filesystems will report clean unwritten extents as holes.
> > 
> > 2. Yes - some filesystems will report clean unwritten extents as data.
> > 
> > 3.  Maybe - if there is written data in memory over the unwritten
> > extent on disk (i.e. it hasn't been flushed to disk), it will be
> > considered a data region with non-zero data. (FIEMAP will still
> > report it as unwritten.)
> 
> Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
> in this case (not FIEMAP_EXTENT_UNWRITTEN).

No. FIEMAP only returns the known extent state at the given file
offset.  "delalloc" extents exist in memory, indicating the space
has already been accounted for over that offset, but the extent has
not been physically allocated. Like all other types of extents,
there may or may not be valid data over a delayed allocation extent. 

IOWs, fiemap only gives you a snapshot of extent state, not the
ranges of valid data in the file.
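
For reference, a minimal FIEMAP query looks like the following (sketch;
note that nothing stops the extent map changing the instant the ioctl
returns, which is why the result is only a snapshot):

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>

#define NEXTENTS	32

static void dump_extents(int fd)
{
	size_t sz = sizeof(struct fiemap) + NEXTENTS * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, sz);

	if (!fm)
		return;
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_extent_count = NEXTENTS;
	/* fm->fm_flags = FIEMAP_FLAG_SYNC;	only narrows the race window */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
		for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent *fe = &fm->fm_extents[i];

			printf("logical %llu len %llu%s%s\n",
			       (unsigned long long)fe->fe_logical,
			       (unsigned long long)fe->fe_length,
			       (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ? " unwritten" : "",
			       (fe->fe_flags & FIEMAP_EXTENT_DELALLOC) ? " delalloc" : "");
		}
	}
	free(fm);
}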

> > > If not, would
> > > it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> > > would be similar to what SEEK_HOLE/SEEK_DATA do now?
> > 
> > To solve what problem? You haven't explained what problem you are
> > trying to solve yet.
> > 
> > > 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
> > > filesystems and kernel releases is it really not needed?
> > 
> > I can't answer this question, either, because I don't know what
> > you want the fiemap information for.
> 
> The answer is the same no matter if we use both lseek and FIEMAP, so
> I'll answer just once.  We want to do two things:
> 
> 1) avoid copying zero data, to keep the copy process efficient.  For this,
> SEEK_HOLE/SEEK_DATA are enough.
> 
> 2) copy file contents while preserving the allocation state of the file's 
> extents.

Which is /very difficult/ to do safely and reliably.

We do actually do reliable, safe, exact hole and preallocation
layout duplication with xfs_fsr, but that uses kernel provided
cookies (from XFS_IOC_BULKSTAT) to detect that data in the source
file has not changed while it was being copied before executing the
final defrag operation in the kernel (XFS_IOC_SWAPEXT) that makes
the new copy of the data user visible.

i.e. the use of fiemap to duplicate the exact layout of a file
from userspace is only possible if you can /guarantee/ the source
file has not changed in any way during the copy operation at the
point in time you finalise the destination data copy.

> There can be various reasons why the user has preallocated the file (because 
> they
> don't want an ENOSPC to happen while the VM runs; on some filesystems, to
> minimize cases where io_submit is very un-asynchronous; or just because 
> someone
> had a reason to do a BLKZEROOUT ioctl on the virtual disk).  We want to 
> preserve
> these while converting or otherwise moving the file around.

Sure, there's many reasons for using prealloc/punch/zero. The real
difference to other file operations is that they interface with low
level filesystem structure, not the data contained within the
extents. That's what makes them problematic for duplication -
userspace cannot serialise against low level filesystem structure
modifications.

Optimising file copies safely is one of the reasons the
copy_file_range() syscall has been introduced (in 4.5). While we
haven't implemented anything special in XFS yet, it will internally
use splice to do a zero-copy data transfer from source to
destination file. Optimising for exact layout copies is precisely
the sort of thing this syscall is intended for.

It's also intended to enable applications to take advantage of
hardware acceleration of data copying (e.g. server side copies to
avoid round trips as has been implemented for NFS, or storage array
offload of data copying) when such support is provided by the kernel.

IOWs, I think you should be looking to optimise file copies by using
copy_file_range() and getting filesystems to do exactly what you
need. Using FIEMAP, fallocate and moving data through userspace
won't ever be reliable without special filesystem help (that only
exists for XFS right now), nor will it enable the application to
transparently use smart storage protocols and hardware when it is
present on user systems...
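
A sketch of the suggested direction, using the copy_file_range()
interface referred to above (glibc only gained a wrapper for it later;
older userspace would go through syscall(2)). No layout duplication is
attempted in userspace, because that is precisely the part only the
kernel can do safely:

#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy src to dst, letting the kernel/filesystem/storage move the data. */
static int clone_contents(int src_fd, int dst_fd)
{
	struct stat st;
	loff_t off_in = 0, off_out = 0;

	if (fstat(src_fd, &st) < 0)
		return -1;

	while (off_in < st.st_size) {
		ssize_t n = copy_file_range(src_fd, &off_in, dst_fd, &off_out,
					    st.st_size - off_in, 0);
		if (n < 0)
			return -1;
		if (n == 0)
			break;		/* source shrank underneath us */
	}
	return 0;
}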

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com



Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)

2016-07-21 Thread Dave Chinner
On Wed, Jul 20, 2016 at 03:35:17PM +0200, Niels de Vos wrote:
> On Wed, Jul 20, 2016 at 10:30:25PM +1000, Dave Chinner wrote:
> > On Wed, Jul 20, 2016 at 05:19:37AM -0400, Paolo Bonzini wrote:
> > > Adding ext4 and XFS guys (Lukas and Dave respectively).  As a quick 
> > > recap, the
> > > issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we 
> > > use in
> > > "qemu-img map".  This command prints metadata about a virtual disk 
> > > image---which
> > > in the case of a raw image amounts to detecting holes and unwritten 
> > > extents.
> > > 
> > > First, it seems like SEEK_HOLE and SEEK_DATA really should be 
> > > "SEEK_NONZERO" and
> > > "SEEK_ZERO", on both ext4 and XFS.You can see that unwritten extents 
> > > are
> > > reported by "qemu-img map" as holes:
> > 
> > Correctly so. seek hole/data knows nothing about the underlying
> > filesystem storage implementation. A "hole" is defined as a region
> > of zeros, and the filesystem is free to call anything it knows for
> > certain contains zeros as a hole.
> > 
> > FYI, SEEK_HOLE/DATA were named after the pre-existing Solaris
> > seek API that uses these definitions for hole and data
> > 
> > > $ dd if=/dev/urandom of=test.img bs=1M count=100
> > > $ fallocate -z -o 10M -l 10M test.img
> > > $ du -h test.img
> > > $ qemu-img map --output=json test.img
> > > [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": 
> > > true, "offset": 0},
> > > { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, 
> > > "data": false, "offset": 10485760},
> > > { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, 
> > > "data": true, "offset": 20971520}]
> > 
> > > 
> > > On the second line, zero=true data=false identifies a hole.  The right 
> > > output
> > > would either have zero=true data=true (unwritten extent) or just
> > > 
> > > [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, 
> > > "data": true, "offset": 0},
> > >
> > > since the zero flag is advisory (it doesn't try to detect zeroes beyond 
> > > what the
> > > filesystem says).
> > 
> > Not from SEEK_HOLE/SEEK_DATA. All it conveys is "this is the start
> > of a range of zeros" and "this is the start of a range of data". And
> > for filesystems that don't specifically implement these seek
> > operations, zeros are considered data. i.e. SEEK_HOLE will take you
> > to the end of file, SEEK_DATA returns the current position
> > 
> > i.e. unwritten extents contain no data, so they are semantically
> > identical to holes for the purposes of seeking and hence SEEK_DATA
> > can skip over them.
> > 
> > > The reason why we disabled FIEMAP was a combination of a corruption and 
> > > performance
> > > issue.  The data corruption bug was at 
> > > https://bugs.launchpad.net/qemu/+bug/1368815
> > > and it was reported on Ubuntu Trusty (kernel 3.13 based on the release 
> > > notes at
> > > https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes).  We corrected that by 
> > > using
> > > FIEMAP_FLAG_SYNC, based on a similar patch to coreutils.  This turned out 
> > > to be too
> > > slow, so we dropped FIEMAP altogether.
> > 
> > Yes, because FIEMAP output is only useful for diagnostic purposes as
> > it can be stale even before the syscall returns to userspace. i.e.
> > it can't be used in userspace for optimising file copies
> 
> Oh... And I was surprised to learn that "cp" does use FIEMAP and not
> SEEK_HOLE/SEEK_DATA. You give a strong suggestion to fix that.
>   http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/extent-scan.c#n87

My understanding was that FIEMAP was disabled almost immediately
after the data corruption problems were understood, and it hasn't
been turned back on since. I don't see FIEMAP calls in strace when I
copy sparse files, even when I use --sparse=auto which is supposed
to trigger the sparse source file checks using FIEMAP.

So, yeah, while there's FIEMAP code present, that doesn't mean it's
actually used. And, yes, there are comments in coreutils commits in
2011 saying that the FIEMAP workarounds are temporary until
SEEK_HOLE/DATA are supported, which were added to the kernel in 2011
soon after the 'cp-corrupts-data-with-fiemap' debacle came to light.
Five years later, and the fiemap code is still there?

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com



Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)

2016-07-20 Thread Dave Chinner
1. No - some filesystems will report clean unwritten extents as holes.

2. Yes - some filesystems will report clean unwritten extents as
data.

3.  Maybe - if there is written data in memory over the unwritten
extent on disk (i.e. it hasn't been flushed to disk), it will be
considered a data region with non-zero data. (FIEMAP will still
report it as unwritten.)
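
For completeness, a sketch of what a copy tool actually gets from the
seek interface - just the start of the next data or hole region, with
clean unwritten extents landing on either side of that boundary
depending on which of the answers above applies:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

/* Print the data regions of a file as the filesystem chooses to report them. */
static void walk_data_regions(int fd)
{
	off_t data = 0, hole;

	for (;;) {
		data = lseek(fd, data, SEEK_DATA);
		if (data < 0)
			break;				/* ENXIO: no data past here */
		hole = lseek(fd, data, SEEK_HOLE);	/* EOF counts as a hole */
		printf("data: %lld..%lld\n", (long long)data, (long long)hole);
		data = hole;
	}
}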

> If not, would
> it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> would be similar to what SEEK_HOLE/SEEK_DATA do now?

To solve what problem? You haven't explained what problem you are
trying to solve yet.

> 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC?  And if not, for what
> filesystems and kernel releases is it really not needed?

I can't answer this question, either, because I don't know what
you want the fiemap information for.

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com



Re: [Qemu-devel] ext4 error when testing virtio-scsi & vhost-scsi

2016-07-17 Thread Dave Chinner
On Fri, Jul 15, 2016 at 03:55:20PM +0800, Zhangfei Gao wrote:
> Dear Dave
> 
> On Wed, Jul 13, 2016 at 7:03 AM, Dave Chinner <da...@fromorbit.com> wrote:
> > On Tue, Jul 12, 2016 at 12:43:24PM -0400, Theodore Ts'o wrote:
> >> On Tue, Jul 12, 2016 at 03:14:38PM +0800, Zhangfei Gao wrote:
> >> > Some update:
> >> >
> >> > If test with ext2, no problem in iblock.
> >> > If test with ext4, ext4_mb_generate_buddy reported error in the
> >> > removing files after reboot.
> >> >
> >> >
> >> > root@(none)$ rm test
> >> > [   21.006549] EXT4-fs error (device sda): ext4_mb_generate_buddy:758: 
> >> > group 18
> >> > , block bitmap and bg descriptor inconsistent: 26464 vs 25600 free 
> >> > clusters
> >> > [   21.008249] JBD2: Spotted dirty metadata buffer (dev = sda, blocknr = 
> >> > 0). Th
> >> > ere's a risk of filesystem corruption in case of system crash.
> >> >
> >> > Any special notes of using ext4 in qemu?
> >>
> >> Ext4 has more runtime consistency checking than ext2.  So just because
> >> ext4 complains doesn't mean that there isn't a problem with the file
> >> system; it just means that ext4 is more likely to notice before you
> >> lose user data.
> >>
> >> So if you test with ext2, try running e2fsck afterwards, to make sure
> >> the file system is consistent.
> >>
> >> Given that I'm reguarly testing ext4 using kvm, and I haven't seen
> >> anything like this in a very long time, I suspect the problemb is with
> >> your SCSI code, and not with ext4.
> >
> > It's the same error I reported yesterday for ext3 on 4.7-rc6 when
> > rebooting a VM after it hung.
> 
> 
> Any link of this error?

http://article.gmane.org/gmane.comp.file-systems.ext4/53792

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Qemu-devel] ext4 error when testing virtio-scsi & vhost-scsi

2016-07-12 Thread Dave Chinner
On Tue, Jul 12, 2016 at 12:43:24PM -0400, Theodore Ts'o wrote:
> On Tue, Jul 12, 2016 at 03:14:38PM +0800, Zhangfei Gao wrote:
> > Some update:
> > 
> > If test with ext2, no problem in iblock.
> > If test with ext4, ext4_mb_generate_buddy reported error in the
> > removing files after reboot.
> > 
> > 
> > root@(none)$ rm test
> > [   21.006549] EXT4-fs error (device sda): ext4_mb_generate_buddy:758: 
> > group 18
> > , block bitmap and bg descriptor inconsistent: 26464 vs 25600 free clusters
> > [   21.008249] JBD2: Spotted dirty metadata buffer (dev = sda, blocknr = 
> > 0). Th
> > ere's a risk of filesystem corruption in case of system crash.
> > 
> > Any special notes of using ext4 in qemu?
> 
> Ext4 has more runtime consistency checking than ext2.  So just because
> ext4 complains doesn't mean that there isn't a problem with the file
> system; it just means that ext4 is more likely to notice before you
> lose user data.
> 
> So if you test with ext2, try running e2fsck afterwards, to make sure
> the file system is consistent.
> 
> Given that I'm reguarly testing ext4 using kvm, and I haven't seen
> anything like this in a very long time, I suspect the problemb is with
> your SCSI code, and not with ext4.

It's the same error I reported yesterday for ext3 on 4.7-rc6 when
rebooting a VM after it hung.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com