[ndctl PATCH] test, multi-pmem: check namespace deletion

2017-01-12 Thread Dan Williams
The initial implementation of multi-pmem support neglected to enable
deletion of pmem namespaces. Add a test to check for this condition.

Signed-off-by: Dan Williams 
---
 test/multi-pmem.c |   28 
 1 file changed, 28 insertions(+)

diff --git a/test/multi-pmem.c b/test/multi-pmem.c
index 8f76e02d2fd1..dd1269ced667 100644
--- a/test/multi-pmem.c
+++ b/test/multi-pmem.c
@@ -54,6 +54,28 @@ static void destroy_namespace(struct ndctl_namespace *ndns)
cmd_destroy_namespace(argc, argv, ctx);
 }
 
+/* Check that the namespace device is gone (if it wasn't the seed) */
+static int check_deleted(struct ndctl_region *region, const char *devname,
+   struct ndctl_test *test)
+{
+   struct ndctl_namespace *ndns;
+
+   if (!ndctl_test_attempt(test, KERNEL_VERSION(4, 10, 0)))
+   return 0;
+
+   ndctl_namespace_foreach(region, ndns) {
+   if (strcmp(devname, ndctl_namespace_get_devname(ndns)))
+   continue;
+   if (ndns == ndctl_region_get_namespace_seed(region))
+   continue;
+   fprintf(stderr, "multi-pmem: expected %s to be deleted\n",
+   devname);
+   return -ENXIO;
+   }
+
+   return 0;
+}
+
 static int do_multi_pmem(struct ndctl_ctx *ctx, struct ndctl_test *test)
 {
int i;
@@ -190,6 +212,9 @@ static int do_multi_pmem(struct ndctl_ctx *ctx, struct ndctl_test *test)
devname, blk_avail, blk_avail_orig);
return -ENXIO;
}
+
+   if (check_deleted(target, devname, test) != 0)
+   return -ENXIO;
}
 
ndns = namespaces[NUM_NAMESPACES - 1];
@@ -204,6 +229,9 @@ static int do_multi_pmem(struct ndctl_ctx *ctx, struct ndctl_test *test)
return -ENXIO;
}
 
+   if (check_deleted(target, devname, test) != 0)
+   return -ENXIO;
+
ndctl_bus_foreach(ctx, bus) {
if (strncmp(ndctl_bus_get_provider(bus), "nfit_test", 9) != 0)
continue;

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH] libnvdimm, namespace: fix pmem namespace leak, delete when size set to zero

2017-01-12 Thread Dan Williams
Commit 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple
pmem-namespaces per region") added support for establishing additional
pmem namespace beyond the seed device, similar to blk namespaces.
However, it neglected to delete the namespace when the size is set to
zero.

Fixes: 98a29c39dc68 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
Cc: 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/namespace_devs.c |   23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 6307088b375f..a518cb1b59d4 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -957,6 +957,7 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 {
resource_size_t allocated = 0, available = 0;
struct nd_region *nd_region = to_nd_region(dev->parent);
+   struct nd_namespace_common *ndns = to_ndns(dev);
struct nd_mapping *nd_mapping;
struct nvdimm_drvdata *ndd;
struct nd_label_id label_id;
@@ -964,7 +965,7 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
u8 *uuid = NULL;
int rc, i;
 
-   if (dev->driver || to_ndns(dev)->claim)
+   if (dev->driver || ndns->claim)
return -EBUSY;
 
if (is_namespace_pmem(dev)) {
@@ -1034,20 +1035,16 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
 
nd_namespace_pmem_set_resource(nd_region, nspm,
val * nd_region->ndr_mappings);
-   } else if (is_namespace_blk(dev)) {
-   struct nd_namespace_blk *nsblk = to_nd_namespace_blk(dev);
-
-   /*
-* Try to delete the namespace if we deleted all of its
-* allocation, this is not the seed device for the
-* region, and it is not actively claimed by a btt
-* instance.
-*/
-   if (val == 0 && nd_region->ns_seed != dev
-   && !nsblk->common.claim)
-   nd_device_unregister(dev, ND_ASYNC);
}
 
+   /*
+* Try to delete the namespace if we deleted all of its
+* allocation, this is not the seed device for the region, and
+* it is not actively claimed by a btt instance.
+*/
+   if (val == 0 && nd_region->ns_seed != dev && !ndns->claim)
+   nd_device_unregister(dev, ND_ASYNC);
+
return rc;
 }
 



Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Dan Williams
On Thu, Jan 12, 2017 at 3:14 PM, Jerome Glisse  wrote:
> On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
>> Back when we were first attempting to support DMA for DAX mappings of
>> persistent memory the plan was to forgo 'struct page' completely and
>> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
>> effort died in this thread:
>>
>> https://lkml.org/lkml/2015/8/14/3
>>
>> ...where we learned that the dependencies on struct page for dma
>> mapping are deeper than a PFN_PHYS() conversion for some
>> architectures. That was the moment we pivoted to ZONE_DEVICE and
>> arranged for a 'struct page' to be available for any persistent memory
>> range that needs to be the target of DMA. ZONE_DEVICE enables any
>> device-driver that can target "System RAM" to also be able to target
>> persistent memory through a DAX mapping.
>>
>> Since that time the "page-less" DAX path has continued to mature [1]
>> without growing new dependencies on struct page, but at the same time
>> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
>>
>> Peer-to-peer DMA appears to be evolving from a niche embedded use case
>> to something general purpose platforms will need to comprehend. The
>> "map_peer_resource" [2] approach looks to be headed to the same
>> destination as the pfn-to-scatterlist effort. It's difficult to avoid
>> 'struct page' for describing DMA operations without custom driver
>> code.
>>
>> With that background, a statement and a question to discuss at LSF/MM:
>>
>> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
>> requires pfn_to_page() support across the entire physical address
>> range mapped.
>
> Note that in my case it is even worse. The pfn of the page does not
> correspond to anything, so it needs to go through a special function
> to find whether a page can be mapped for another device and to provide
> a valid pfn at which the page can be accessed by the other device.

I still haven't quite wrapped my head around how these pfn ranges are
created. Would this be a use case for a new pfn_t flag? It doesn't
sound like something we'd want to risk describing with raw 'unsigned
long' pfns.


Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Jerome Glisse
On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of
> persistent memory the plan was to forgo 'struct page' completely and
> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
> effort died in this thread:
> 
> https://lkml.org/lkml/2015/8/14/3
> 
> ...where we learned that the dependencies on struct page for dma
> mapping are deeper than a PFN_PHYS() conversion for some
> architectures. That was the moment we pivoted to ZONE_DEVICE and
> arranged for a 'struct page' to be available for any persistent memory
> range that needs to be the target of DMA. ZONE_DEVICE enables any
> device-driver that can target "System RAM" to also be able to target
> persistent memory through a DAX mapping.
> 
> Since that time the "page-less" DAX path has continued to mature [1]
> without growing new dependencies on struct page, but at the same time
> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
> 
> Peer-to-peer DMA appears to be evolving from a niche embedded use case
> to something general purpose platforms will need to comprehend. The
> "map_peer_resource" [2] approach looks to be headed to the same
> destination as the pfn-to-scatterlist effort. It's difficult to avoid
> 'struct page' for describing DMA operations without custom driver
> code.
> 
> With that background, a statement and a question to discuss at LSF/MM:
> 
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
> requires pfn_to_page() support across the entire physical address
> range mapped.

Note that in my case it is even worse. The pfn of the page does not
correspond to anything, so it needs to go through a special function
to find whether a page can be mapped for another device and to provide
a valid pfn at which the page can be accessed by the other device.

Basically, the PCIe BAR is like a window into the device memory that is
dynamically remapped to specific pages of the device memory. Not all
device memory can be exposed through the PCIe BAR because of PCIe issues.

> 
> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
> collides with platform alignment assumptions [3], and if there's a
> wider effort to rework memory hotplug [4] it seems DMA support should
> be part of the discussion.

Obviously I would like to join this discussion :)

Cheers,
Jérôme


[LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Dan Williams
Back when we were first attempting to support DMA for DAX mappings of
persistent memory the plan was to forgo 'struct page' completely and
develop a pfn-to-scatterlist capability for the dma-mapping-api. That
effort died in this thread:

https://lkml.org/lkml/2015/8/14/3

...where we learned that the dependencies on struct page for dma
mapping are deeper than a PFN_PHYS() conversion for some
architectures. That was the moment we pivoted to ZONE_DEVICE and
arranged for a 'struct page' to be available for any persistent memory
range that needs to be the target of DMA. ZONE_DEVICE enables any
device-driver that can target "System RAM" to also be able to target
persistent memory through a DAX mapping.

Since that time the "page-less" DAX path has continued to mature [1]
without growing new dependencies on struct page, but at the same time
continuing to rely on ZONE_DEVICE to satisfy get_user_pages().

Peer-to-peer DMA appears to be evolving from a niche embedded use case
to something general purpose platforms will need to comprehend. The
"map_peer_resource" [2] approach looks to be headed to the same
destination as the pfn-to-scatterlist effort. It's difficult to avoid
'struct page' for describing DMA operations without custom driver
code.

With that background, a statement and a question to discuss at LSF/MM:

General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
requires pfn_to_page() support across the entire physical address
range mapped.

Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
collides with platform alignment assumptions [3], and if there's a
wider effort to rework memory hotplug [4] it seems DMA support should
be part of the discussion.

---

This topic focuses on the mechanism to enable pfn_to_page() for an
arbitrary physical address range, and the proposed peer-to-peer DMA
topic [5] touches on the userspace presentation of this mechanism. It
might be good to combine these topics if there's interest? In any
event, I'm interested in both, as well as Michal's concern about memory
hotplug in general.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-November/007672.html
[2]: http://www.spinics.net/lists/linux-pci/msg44560.html
[3]: https://lkml.org/lkml/2016/12/1/740
[4]: http://www.spinics.net/lists/linux-mm/msg119369.html
[5]: http://marc.info/?l=linux-mm&m=148156541804940&w=2


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-12 Thread Logan Gunthorpe


On 11/01/17 09:54 PM, Stephen Bates wrote:
> The iopmem patchset addressed all the use cases above and while it is not
> an in kernel API it could have been modified to be one reasonably easily.
> As Logan states the driver can then choose to pass the VMAs to user-space
> in a manner that makes sense.

Just to clarify: the iopmem patchset had one patch that allowed for
slightly more flexible zone device mappings which ought to be useful for
everyone.

The other patch (which was iopmem proper) was more of an example of how
the zone_device memory _could_ be exposed to userspace with "iopmem"
hardware that looks similar to nvdimm hardware. Iopmem was not really
useful, in itself, for NVMe devices and it was never expected to be
useful for GPUs.

Logan




Re: [PATCH v5] DAX: enable iostat for read/write

2017-01-12 Thread Kani, Toshimitsu
On Thu, 2017-01-12 at 10:02 -0800, Joe Perches wrote:
> On Thu, 2017-01-12 at 11:38 -0700, Toshi Kani wrote:
> > DAX IO path does not support iostat, but its metadata IO path does.
> > Therefore, iostat shows metadata IO statistics only, which has been
> > confusing to users.
> 
> []
> > diff --git a/fs/dax.c b/fs/dax.c
> 
> []
> > @@ -1058,12 +1058,24 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  {
> >     struct address_space *mapping = iocb->ki_filp->f_mapping;
> >     struct inode *inode = mapping->host;
> > +   struct gendisk *disk = inode->i_sb->s_bdev->bd_disk;
> >     loff_t pos = iocb->ki_pos, ret = 0, done = 0;
> >     unsigned flags = 0;
> > +   unsigned long start = 0;
> > +   int do_acct = blk_queue_io_stat(disk->queue);
> >  
> >     if (iov_iter_rw(iter) == WRITE)
> >     flags |= IOMAP_WRITE;
> >  
> > +   if (do_acct) {
> > +   sector_t sec = iov_iter_count(iter) >> 9;
> > +
> > +   start = jiffies;
> > +   generic_start_io_acct(iov_iter_rw(iter),
> > +     min_t(unsigned long, 1, sec),
> 
> I believe I misled you with a thinko.
> 
> Your original code was
>   (!sec) ? 1 : sec
> and I suggested incorrectly using min_t
> 
> It should of course be max_t.  Sorry.

My bad. I should have caught it.

> Also, as sec is now sector_t (u64), perhaps this
> unsigned long cast is incorrect.

I see. Since iov_iter_count() returns a size_t value, I will use
'size_t' for 'sec' as you originally suggested. 

Thanks,
-Toshi


Re: [PATCH v5] DAX: enable iostat for read/write

2017-01-12 Thread Joe Perches
On Thu, 2017-01-12 at 11:38 -0700, Toshi Kani wrote:
> DAX IO path does not support iostat, but its metadata IO path does.
> Therefore, iostat shows metadata IO statistics only, which has been
> confusing to users.
[]
> diff --git a/fs/dax.c b/fs/dax.c
[]
> @@ -1058,12 +1058,24 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  {
>   struct address_space *mapping = iocb->ki_filp->f_mapping;
>   struct inode *inode = mapping->host;
> + struct gendisk *disk = inode->i_sb->s_bdev->bd_disk;
>   loff_t pos = iocb->ki_pos, ret = 0, done = 0;
>   unsigned flags = 0;
> + unsigned long start = 0;
> + int do_acct = blk_queue_io_stat(disk->queue);
>  
>   if (iov_iter_rw(iter) == WRITE)
>   flags |= IOMAP_WRITE;
>  
> + if (do_acct) {
> + sector_t sec = iov_iter_count(iter) >> 9;
> +
> + start = jiffies;
> + generic_start_io_acct(iov_iter_rw(iter),
> +   min_t(unsigned long, 1, sec),

I believe I misled you with a thinko.

Your original code was
(!sec) ? 1 : sec
and I suggested incorrectly using min_t

It should of course be max_t.  Sorry.

Also, as sec is now sector_t (u64), perhaps this
unsigned long cast is incorrect.




[PATCH v5] DAX: enable iostat for read/write

2017-01-12 Thread Toshi Kani
DAX IO path does not support iostat, but its metadata IO path does.
Therefore, iostat shows metadata IO statistics only, which has been
confusing to users.

Add iostat support to the DAX read/write path.

Note, iostat still does not support the DAX mmap path, as it allows
user applications to access the media directly.

Signed-off-by: Toshi Kani 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Alexander Viro 
Cc: Dave Chinner 
Cc: Ross Zwisler 
Cc: Joe Perches 
---
v5:
 - Add a flag in case 'start' is 0 after 'jiffies' rolls over.
   (Dan Williams)
 - Fix a signed/unsigned conversion. (Joe Perches)
---
 fs/dax.c |   15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 5c74f60..a3e406a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1058,12 +1058,24 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 {
struct address_space *mapping = iocb->ki_filp->f_mapping;
struct inode *inode = mapping->host;
+   struct gendisk *disk = inode->i_sb->s_bdev->bd_disk;
loff_t pos = iocb->ki_pos, ret = 0, done = 0;
unsigned flags = 0;
+   unsigned long start = 0;
+   int do_acct = blk_queue_io_stat(disk->queue);
 
if (iov_iter_rw(iter) == WRITE)
flags |= IOMAP_WRITE;
 
+   if (do_acct) {
+   sector_t sec = iov_iter_count(iter) >> 9;
+
+   start = jiffies;
+   generic_start_io_acct(iov_iter_rw(iter),
+ min_t(unsigned long, 1, sec),
+ &disk->part0);
+   }
+
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, dax_iomap_actor);
@@ -1073,6 +1085,9 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
done += ret;
}
 
+   if (do_acct)
+   generic_end_io_acct(iov_iter_rw(iter), &disk->part0, start);
+
iocb->ki_pos += done;
return done ? done : ret;
 }


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-12 Thread Jason Gunthorpe
On Thu, Jan 12, 2017 at 10:11:29AM -0500, Jerome Glisse wrote:
> On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
> > > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> > > (ie. at least those backed with ZONE_DEVICE memory). Then
> > > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> > > (using whatever interface is most appropriate) and userspace can do what
> > > it pleases with them. This makes _so_ much sense and actually largely
> > > already works today (as demonstrated by iopmem).

> So I say let's solve the IOMMU issue first and let everyone use it in
> their own way with their device. I do not think we can share much more
> than that.

Solve it for the easy ZONE_DIRECT/etc case then.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-12 Thread Jerome Glisse
On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
> On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
> >
> >
> > On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
> >
> >
> >> Make a generic API for all of this and you'd have my vote..
> >>
> >>
> >> IMHO, you must support basic pinning semantics - that is necessary to
> >> support generic short lived DMA (eg filesystem, etc). That hardware can
> >> clearly do that if it can support ODP.
> >
> > I agree completely.
> >
> >
> > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> > (ie. at least those backed with ZONE_DEVICE memory). Then
> > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> > (using whatever interface is most appropriate) and userspace can do what
> > it pleases with them. This makes _so_ much sense and actually largely
> > already works today (as demonstrated by iopmem).
> 
> +1 for iopmem ;-)
> 
> I feel like we are going around and around on this topic. I would like to
> see something that is upstream that enables P2P even if it is only the
> minimum viable useful functionality to begin with. I think aiming for the
> moon (which is what HMM and things like it are) is simply going to take
> more time, if it ever gets there.
> 
> There is a use case for in-kernel P2P PCIe transfers between two NVMe
> devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
> BARs on the NIC). I am even seeing users who now want to move data P2P
> between FPGAs and NVMe SSDs and the upstream kernel should be able to
> support these users or they will look elsewhere.
> 
> The iopmem patchset addressed all the use cases above and while it is not
> an in kernel API it could have been modified to be one reasonably easily.
> As Logan states the driver can then choose to pass the VMAs to user-space
> in a manner that makes sense.
> 
> Earlier in the thread someone mentioned LSF/MM. There is already a
> proposal to discuss this topic so if you are interested please respond to
> the email letting the committee know this topic is of interest to you [1].
> 
> Also earlier in the thread someone discussed the issues around the IOMMU.
> Given the known issues around P2P transfers in certain CPU root complexes
> [2] it might just be a case of only allowing P2P when a PCIe switch
> connects the two EPs. Another option is just to use CONFIG_EXPERT and make
> sure people are aware of the pitfalls if they invoke the P2P option.


iopmem is not applicable to GPUs. What I propose is to split the issue in
two, so that everyone can reuse the part that needs to be common, namely
the DMA API part, where you have to create an IOMMU mapping for one
device to point at the other device's memory.

We can have a DMA API that is agnostic to how the device memory is
managed (so it does not matter whether the device memory has struct page
or not). This is what I have been arguing in this thread. To make
progress on this issue we need to stop conflating different use cases.

So I say let's solve the IOMMU issue first and let everyone use it in
their own way with their device. I do not think we can share much more
than that.

Cheers,
Jérôme