Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-23 Thread Dan Williams
On Mon, Jan 23, 2017 at 10:03 AM, Christoph Hellwig  wrote:
> On Mon, Jan 23, 2017 at 09:14:04AM -0800, Dan Williams wrote:
>> The use case that we have now is distinguishing volatile vs persistent
>> memory (brd vs pmem).
>
> brd is a development tool, so until we have other reasons for this
> abstraction (which I'm pretty sure will show up rather sooner than later)
> I would not worry about it too much.

By "volatile" I also meant cases where pmem is fronting volatile
memory, or more importantly when the platform has otherwise arranged
for cpu caches to be flushed on a power loss event like I believe some
existing storage appliances do.

>> I took a look at the mtd layering approach and the main difference is that
>> layers above the block layer do not appear to know anything about mtd
>> specifics.
>
> Or the block layer itself for that matter.  And that's exactly where
> I want DAX to be in the future.
>
>> For fs/dax.c we currently need some path to retrieve a dax
>> anchor object through the block device.
>
> We have a need to retrieve the anchor object.  We currently do it
> through the block layer for historical reasons, but it doesn't have
> to be that way.
>
>> > In the longer run I like your dax_operations, but they need to be
>> > separate from the block layer.
>>
>> I'll move them from block_device_operations to dax data hanging off of
>> the bdev_inode, or is there a better way to go from bdev-to-dax?
>
> I don't think that's any better.  What we really want is a way
> to find the underlying persistent memory / DAX / whatever-we-call-it
> node without going through a block device.  E.g., a library function
> to give that object for a given path name, where the path name could
> be either that of the /dev/pmemN or the /dev/daxN device.
>
> If the file system for now still needs a block device as well, it
> will only accept the /dev/pmemN name, and open both the low-level
> pmem device and the block device.  Once that file system doesn't
> need block code (and I think we could do that easily for XFS,
> never mind any new FS) it won't have to deal with the block
> device at all.
>
> pmem.c then becomes a consumer of the dax_ops just like the file system.

Ah ok, I'll take a look at a dax_by_path() capability.
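
Roughly, I'm picturing something along the lines of the sketch below.
struct dax_device, dax_get_by_inode() and the overall shape are
placeholder assumptions for discussion; nothing like this exists yet:

/*
 * Sketch only: resolve a path name (/dev/pmemN or /dev/daxN) to the
 * underlying dax object without going through the block layer.
 */
#include <linux/namei.h>
#include <linux/fs.h>
#include <linux/err.h>

struct dax_device;
struct dax_device *dax_get_by_inode(struct inode *inode); /* assumed helper */

struct dax_device *dax_by_path(const char *name)
{
	struct path path;
	struct inode *inode;
	struct dax_device *dax_dev = ERR_PTR(-ENODEV);
	int rc;

	rc = kern_path(name, LOOKUP_FOLLOW, &path);
	if (rc)
		return ERR_PTR(rc);

	inode = d_inode(path.dentry);
	/* accept either the block (pmem) or char (device-dax) node */
	if (S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode))
		dax_dev = dax_get_by_inode(inode);

	path_put(&path);
	return dax_dev;
}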



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-23 Thread Christoph Hellwig
On Mon, Jan 23, 2017 at 09:14:04AM -0800, Dan Williams wrote:
> The use case that we have now is distinguishing volatile vs persistent
> memory (brd vs pmem).

brd is a development tool, so until we have other reasons for this
abstraction (which I'm pretty sure will show up rather sooner than later)
I would not worry about it too much.

> I took a look at the mtd layering approach and the main difference is that
> layers above the block layer do not appear to know anything about mtd
> specifics.

Or the block layer itself for that matter.  And that's exactly where
I want DAX to be in the future.

> For fs/dax.c we currently need some path to retrieve a dax
> anchor object through the block device.

We have a need to retrieve the anchor object.  We currently do it
through the block layer for historical reasons, but it doesn't have
to be that way.

> > In the longer run I like your dax_operations, but they need to be
> > separate from the block layer.
> 
> I'll move them from block_device_operations to dax data hanging off of
> the bdev_inode, or is there a better way to go from bdev-to-dax?

I don't think that's any better.  What we really want is a way
to find the underlying persistent memory / DAX / whatever-we-call-it
node without going through a block device.  E.g., a library function
to give that object for a given path name, where the path name could
be either that of the /dev/pmemN or the /dev/daxN device.

If the file system for now still needs a block device as well, it
will only accept the /dev/pmemN name, and open both the low-level
pmem device and the block device.  Once that file system doesn't
need block code (and I think we could do that easily for XFS,
never mind any new FS) it won't have to deal with the block
device at all.

pmem.c then becomes a consumer of the dax_ops just like the file system.
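
As a strawman, the driver-provided ops table could be shaped roughly like
the sketch below; the member set and signatures are guesses for the sake
of discussion, not merged code:

/*
 * Strawman dax_operations: provided by the low-level driver (pmem, brd,
 * dm, ...), consumed by fs/dax.c and by pmem.c's own I/O path.
 */
#include <linux/types.h>
#include <linux/pfn_t.h>
#include <linux/uio.h>

struct dax_device;

struct dax_operations {
	/* translate a page offset range into a kernel address and pfn */
	long (*direct_access)(struct dax_device *dax_dev, pgoff_t pgoff,
			      long nr_pages, void **kaddr, pfn_t *pfn);
	/* copy data in, using whatever method suits the media */
	size_t (*copy_from_iter)(struct dax_device *dax_dev, void *addr,
				 size_t bytes, struct iov_iter *i);
	/* make previously written data durable */
	void (*flush)(struct dax_device *dax_dev, void *addr, size_t size);
};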



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-23 Thread Dan Williams
On Mon, Jan 23, 2017 at 8:00 AM, Christoph Hellwig  wrote:
> On Sun, Jan 22, 2017 at 11:10:04PM -0800, Dan Williams wrote:
>> How about we solve the copy_from_user() abuse first before we hijack
>> this thread for some future feature that afaics has no patches posted
>> yet.
>
> Solving copy_from_user abuse first sounds perfectly fine to me.  But
> please do so without abusing the block layer for persistent memory
> access.  Given that we don't have use cases for different pmem access
> methods in a single OS image yet let's avoid introducing new ops
> for now and just remove the copy_from_user abuse.

The use case that we have now is distinguishing volatile vs persistent
memory (brd vs pmem).

I took a look at the mtd layering approach and the main difference is that
layers above the block layer do not appear to know anything about mtd
specifics. For fs/dax.c we currently need some path to retrieve a dax
anchor object through the block device.

> In the longer run I like your dax_operations, but they need to be
> separate from the block layer.

I'll move them from block_device_operations to dax data hanging off of
the bdev_inode, or is there a better way to go from bdev-to-dax?



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-23 Thread Christoph Hellwig
On Sun, Jan 22, 2017 at 09:30:23AM -0800, Dan Williams wrote:
> So are you saying we need a way to go from a block_device inode to a
> dax_device inode and then look up the dax_operations from there?
> 
> A filesystem, if it so chooses, could mount on top of the dax_device
> inode directly?

Sentence 1: maybe if we have to.  Sentence 2: absolutely.

> I did add a dax_superblock for the device-dax character device
> representation.  I could refactor that so that the block_device presentation
> of a namespace and the character device presentation are just different
> layers on top of the base-level dax inode.

That's a good start.



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-23 Thread Matthew Wilcox
From: Christoph Hellwig [mailto:h...@lst.de]
> On Sun, Jan 22, 2017 at 03:43:09PM +, Matthew Wilcox wrote:
> > In the case of a network filesystem being used to communicate with
> > a different VM on the same physical machine, there is no backing
> > device, just a network protocol.
> 
> Again, do you mean block device?  For a filesystem that does not do any
> pagecache writeback we already don't need a backing device, so I don't
> really see an issue there to start with.

No, I mean a network filesystem like 9p or cifs or nfs.  If the memcpy is 
supposed to be performed by the backing device and there is no backing device, 
then it's going to need to be done by the network filesystem.

(Also, the network filesystem might have a command, like RDMA has/will have, to 
ensure that the write has reached persistence)



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Matthew Wilcox
From: Dan Williams [mailto:dan.j.willi...@intel.com]
> A couple weeks back, in the course of reviewing the memcpy_nocache()
> proposal from Brian, Linus subtly suggested that the pmem specific
> memcpy_to_pmem() routine be moved to be implemented at the driver
> level [1]:

Of course, there may not be a backing device either!  That will depend on the 
filesystem.
I see two possible routes here:

1. Add a new address_space_operation:

const struct dax_operations *(*get_dax_ops)(struct address_space *);

2. Add two of the dax_operations to address_space_operations:

size_t (*copy_from_iter)(struct address_space *, void *, size_t, struct iov_iter *);
void (*flush)(struct address_space *, void *, size_t);
(we won't need ->direct_access as an address_space op because that'll be 
handled a different way in the brave new world that supports non-bdev-based 
filesystems)

Obviously in either case we'd have generic bdev versions for ext4, xfs and 
other block based filesystems, but filesystems with a character device or a 
network protocol behind them would do whatever it is they need to do.
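
For concreteness, a minimal sketch of what the generic bdev-backed
versions for option 2 might look like; the helper names are made up, and
the flush body is a placeholder for whatever cache-writeback primitive
the driver ends up exporting:

#include <linux/fs.h>
#include <linux/uio.h>

static size_t dax_generic_copy_from_iter(struct address_space *mapping,
					 void *addr, size_t bytes,
					 struct iov_iter *i)
{
	/* pmem today: non-temporal copy so stores bypass the cpu cache */
	return copy_from_iter_nocache(addr, bytes, i);
}

static void dax_generic_flush(struct address_space *mapping, void *addr,
			      size_t size)
{
	/* placeholder: write back cpu cache lines covering [addr, addr + size) */
}

A block-based filesystem would simply point the new ops at these generic
helpers.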

I kind of prefer the second option, but does anyone else have a preference?



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Matthew Wilcox
From: Christoph Hellwig [mailto:h...@lst.de]
> On Sat, Jan 21, 2017 at 04:28:52PM +, Matthew Wilcox wrote:
> > Of course, there may not be a backing device either!
> 
> s/backing device/block device/ ?  If so fully agreed.  I like the dax_ops
> scheme, but we should go all the way and detangle it from the block
> device.  I already brought up this issue with the "fallback to direct I/O
> on I/O error" series.

In the case of a network filesystem being used to communicate with a different 
VM on the same physical machine, there is no backing device, just a network 
protocol.
 
> And both of them are wrong.  The write_begin/write_end mistake
> notwithstanding, address_space ops are operations the VM can call without
> knowing things like fs locking contexts.  The above on the other hand
> are device operations provided by the low-level driver, similar to
> block_device operations.  So what we need is to have a way to mount
> a dax device as a file system, similar to how we support that for block
> or MTD devices and can then call methods on it.  For now this will
> be a bit complicated because all current DAX-aware file systems also
> still need a block device for the metadata path, so we can't just say
> you mount either a DAX or block device.  But I think we should aim
> for mounting a DAX device as the primary use case, and then deal
> with block device emulation as a generic DAX layer thing, similarly to
> how we implement (bad in the rw case) block devices on top of MTD.

I'm not terribly enthusiastic about creating a fake block device to sit on top 
of a network filesystem, but I suppose we could go that way if we had to.



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Matthew Wilcox
From: Christoph Hellwig [mailto:h...@lst.de]
> On Sun, Jan 22, 2017 at 06:39:28PM +, Matthew Wilcox wrote:
> > Two guests on the same physical machine (or a guest and a host) have access
> > to the same set of physical addresses.  This might be an NV-DIMM, or it 
> > might
> > just be DRAM (for the purposes of reducing guest overhead).  The network
> > filesystem has been enhanced with a call to allow the client to ask the 
> > server
> > "What is the physical address for this range of bytes in this file?"
> >
> > We don't want to use the guest pagecache here.  That's antithetical to the
> > second usage, and it's inefficient for the first usage.
> 
> And the answer is that you need a dax device for whatever memory is exposed
> in this way, as it needs to show up in the memory map, for example.

Wow, DAX devices look painful and awful.  I certainly don't want to be exposing 
the memory fronted by my network filesystem to userspace to access.  That just 
seems like a world of pain and bad experiences.  Absolutely the filesystem (or 
perhaps better, the ACPI tables) need to mark that chunk of memory as reserved, 
but it's definitely not available for anyone to access without the filesystem 
being aware.

Even if we let the filesystem create a DAX device that doesn't show up in /dev 
(for example), Dan's patches don't give us a way to go from a file on the 
filesystem to a set of dax_ops.  And it does need to be a per-file operation, 
e.g. to support a file on an XFS volume which might be on an RT device or a normal 
device.  That was why I leaned towards an address_space operation, but I'd be 
happy to see an inode_operation instead. 



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Matthew Wilcox
From: Christoph Hellwig [mailto:h...@lst.de]
> On Sun, Jan 22, 2017 at 06:19:24PM +, Matthew Wilcox wrote:
> > No, I mean a network filesystem like 9p or cifs or nfs.  If the memcpy
> > is supposed to be performed by the backing device
> 
> struct backing_dev has no relation to the DAX code.  Even more so what's
> the point of doing a DAXish memcpy in that case?  If we buffer in
> memory for network I/O we should just use the page cache.

Oh, I didn't mean a 'struct backing_dev'.  I meant that, conceptually, there is 
no driver for the filesystem to call.  Here's the architecture that I'm trying 
to work with:

Two guests on the same physical machine (or a guest and a host) have access to 
the same set of physical addresses.  This might be an NV-DIMM, or it might just 
be DRAM (for the purposes of reducing guest overhead).  The network filesystem 
has been enhanced with a call to allow the client to ask the server "What is 
the physical address for this range of bytes in this file?"

We don't want to use the guest pagecache here.  That's antithetical to the 
second usage, and it's inefficient for the first usage.



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Dan Williams
On Sun, Jan 22, 2017 at 10:37 PM, Matthew Wilcox  wrote:
> From: Christoph Hellwig [mailto:h...@lst.de]
>> On Sun, Jan 22, 2017 at 06:39:28PM +, Matthew Wilcox wrote:
>> > Two guests on the same physical machine (or a guest and a host) have access
>> > to the same set of physical addresses.  This might be an NV-DIMM, or it 
>> > might
>> > just be DRAM (for the purposes of reducing guest overhead).  The network
>> > filesystem has been enhanced with a call to allow the client to ask the 
>> > server
>> > "What is the physical address for this range of bytes in this file?"
>> >
>> > We don't want to use the guest pagecache here.  That's antithetical to the
>> > second usage, and it's inefficient for the first usage.
>>
>> And the answer is that you need a dax device for whatever memory is exposed
>> in this way, as it needs to show up in the memory map, for example.
>
> Wow, DAX devices look painful and awful.  I certainly don't want to be 
> exposing the memory fronted by my network filesystem to userspace to access.  
> That just seems like a world of pain and bad experiences.  Absolutely the 
> filesystem (or perhaps better, the ACPI tables) need to mark that chunk of 
> memory as reserved, but it's definitely not available for anyone to access 
> without the filesystem being aware.
>
> Even if we let the filesystem create a DAX device that doesn't show up in 
> /dev (for example), Dan's patches don't give us a way to go from a file on 
> the filesystem to a set of dax_ops.  And it does need to be a per-file 
> operation, e.g. to support a file on an XFS volume which might be on an RT 
> device or a normal device.  That was why I leaned towards an address_space 
> operation, but I'd be happy to see an inode_operation instead.

How about we solve the copy_from_user() abuse first before we hijack
this thread for some future feature that afaics has no patches posted
yet.

An incremental step towards disentangling filesystem-dax from
block_devices is a lookup mechanism to go from a block_device to a dax
object that holds dax_ops. When this brave new filesystem enabling
appears, it can grow a mechanism to look up, or mount on, the dax object
directly.

One idea is to just hang a pointer to this dax object off of
bdev_inode, set at bdev open() time.
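
Something along these lines, purely as a sketch (bdev_inode is private to
fs/block_dev.c today; the dax_dev member and bdev_to_dax() helper are
made-up names):

#include <linux/fs.h>
#include <linux/kernel.h>

struct dax_device;	/* whatever we end up calling the dax object */

struct bdev_inode {
	struct block_device bdev;
	struct dax_device *dax_dev;	/* set at bdev open() time */
	struct inode vfs_inode;
};

static inline struct dax_device *bdev_to_dax(struct block_device *bdev)
{
	return container_of(bdev, struct bdev_inode, bdev)->dax_dev;
}

fs/dax.c could then reach the dax_ops via bdev_to_dax() without adding
anything to block_device_operations.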



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Christoph Hellwig
On Sun, Jan 22, 2017 at 06:39:28PM +, Matthew Wilcox wrote:
> Two guests on the same physical machine (or a guest and a host) have access 
> to the same set of physical addresses.  This might be an NV-DIMM, or it might 
> just be DRAM (for the purposes of reducing guest overhead).  The network 
> filesystem has been enhanced with a call to allow the client to ask the 
> server "What is the physical address for this range of bytes in this file?"
> 
> We don't want to use the guest pagecache here.  That's antithetical to the 
> second usage, and it's inefficient for the first usage.

And the answer is that you need a dax device for whatever memory is exposed
in this way, as it needs to show up in the memory map, for example.



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Christoph Hellwig
On Sun, Jan 22, 2017 at 06:19:24PM +, Matthew Wilcox wrote:
> No, I mean a network filesystem like 9p or cifs or nfs.  If the memcpy
> is supposed to be performed by the backing device

struct backing_dev has no relation to the DAX code.  Even more so what's
the point of doing a DAXish memcpy in that case?  If we buffer in
memory for network I/O we should just use the page cache.

> (Also, the network filesystem might have a command, like RDMA has/will have, 
> to ensure that the write has reached persistence)

I know very well due to my work on a DAX-backed pNFS layout.  But that
is mostly transparent to the NFS frontend code and won't use DAX
on the client at all.  Just pagecache as a source for RDMA READ/WRITE.



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Dan Williams
On Sat, Jan 21, 2017 at 9:52 AM, Christoph Hellwig  wrote:
> On Sat, Jan 21, 2017 at 04:28:52PM +, Matthew Wilcox wrote:
>> Of course, there may not be a backing device either!
>
> s/backing device/block device/ ?  If so fully agreed.  I like the dax_ops
> scheme, but we should go all the way and detangle it from the block
> device.  I already brought up this issue with the "fallback to direct I/O
> on I/O error" series.
>
>> I see two possible routes here:
>>
>> 1. Add a new address_space_operation:
>>
>>   const struct dax_operations *(*get_dax_ops)(struct address_space *);
>>
>> 2. Add two of the dax_operations to address_space_operations:
>>
>>   size_t (*copy_from_iter)(struct address_space *, void *, size_t, 
>> struct iov_iter *);
>>   void (*flush)(struct address_space *, void *, size_t);
>> (we won't need ->direct_access as an address_space op because that'll be 
>> handled a different way in the brave new world that supports non-bdev-based 
>> filesystems)
>
> And both of them are wrong.  The write_begin/write_end mistake
> notwithstanding, address_space ops are operations the VM can call without
> knowing things like fs locking contexts.  The above on the other hand
> are device operations provided by the low-level driver, similar to
> block_device operations.  So what we need is to have a way to mount
> a dax device as a file system, similar to how we support that for block
> or MTD devices and can then call methods on it.  For now this will
> be a bit complicated because all current DAX-aware file systems also
> still need a block device for the metadata path, so we can't just say
> you mount either a DAX or block device.  But I think we should aim
> for mounting a DAX device as the primary use case, and then deal
> with block device emulation as a generic DAX layer thing, similarly to
> how we implement (bad in the rw case) block devices on top of MTD.

So are you saying we need a way to go from a block_device inode to a
dax_device inode and then look up the dax_operations from there?

A filesystem, if it so chooses, could mount on top of the dax_device
inode directly?

I did add a dax_superblock for the device-dax character device
representation.  I could refactor that so that the block_device presentation
of a namespace and the character device presentation are just different
layers on top of the base-level dax inode.

...or am I not tracking what you are suggesting?



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-22 Thread Christoph Hellwig
On Sun, Jan 22, 2017 at 03:43:09PM +, Matthew Wilcox wrote:
> In the case of a network filesystem being used to communicate with
> a different VM on the same physical machine, there is no backing
> device, just a network protocol.

Again, do you mean block device?  For a filesystem that does not do any
pagecache writeback we already don't need a backing device, so I don't
really see an issue there to start with.

> I'm not terribly enthusiastic about creating a fake block device to
> sit on top of a network filesystem, but I suppose we could go that
> way if we had to.

I see no need for a new network filesystem to have a fake block device.
We do need a fake block device for an unchanged or only partially DAX-aware
file system.  And those are the only ones we have at the moment, although
XFS could be converted to do direct calls bypassing the block layer
fairly trivially if needed.  For ext2 and ext4 that would be much harder
due to the buffer cache dependency.



Re: [dm-devel] [PATCH 00/13] dax, pmem: move cpu cache maintenance to libnvdimm

2017-01-21 Thread Christoph Hellwig
On Sat, Jan 21, 2017 at 04:28:52PM +, Matthew Wilcox wrote:
> Of course, there may not be a backing device either!

s/backing device/block device/ ?  If so fully agreed.  I like the dax_ops
scheme, but we should go all the way and detangle it from the block
device.  I already brought up this issue with the "fallback to direct I/O
on I/O error" series.

> I see two possible routes here:
> 
> 1. Add a new address_space_operation:
> 
>   const struct dax_operations *(*get_dax_ops)(struct address_space *);
> 
> 2. Add two of the dax_operations to address_space_operations:
> 
>   size_t (*copy_from_iter)(struct address_space *, void *, size_t, struct 
> iov_iter *);
>   void (*flush)(struct address_space *, void *, size_t);
> (we won't need ->direct_access as an address_space op because that'll be 
> handled a different way in the brave new world that supports non-bdev-based 
> filesystems)

And both of them are wrong.  The write_begin/write_end mistake
notwithstanding, address_space ops are operations the VM can call without
knowing things like fs locking contexts.  The above on the other hand
are device operations provided by the low-level driver, similar to
block_device operations.  So what we need is to have a way to mount
a dax device as a file system, similar to how we support that for block
or MTD devices and can then call methods on it.  For now this will
be a bit complicated because all current DAX-aware file systems also
still need a block device for the metadata path, so we can't just say
you mount either a DAX or block device.  But I think we should aim
for mounting a DAX device as the primary use case, and then deal
with block device emulation as a generic DAX layer thing, similarly to
how we implement (bad in the rw case) block devices on top of MTD.
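
Concretely, the direction described above suggests a mount helper in the
spirit of mount_bdev()/mount_mtd(); the prototype below is only a sketch
of the idea, not a proposal for the actual signature:

#include <linux/fs.h>

/*
 * Hypothetical dax counterpart to mount_bdev()/mount_mtd(): resolve
 * dev_name to the dax object and fill the superblock from that, never
 * touching a block_device.
 */
struct dax_device;
struct dentry *mount_dax(struct file_system_type *fs_type, int flags,
			 const char *dev_name, void *data,
			 int (*fill_super)(struct super_block *, void *, int));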
