Re: Questions about vNVDIMM on qemu/KVM

2018-05-24 Thread Yasunori Goto

> >
> > But, I would like to understand one more thing.
> > In the following mail, it seems that an e820 bus will be used for fake DAX.
> >
> > https://lists.01.org/pipermail/linux-nvdimm/2018-January/013926.html
> >
> > Could you tell me what the relationship is between "fake DAX" in this mail
> > and Guest DAX?
> > Why is e820 necessary in this case?
> >
> 
> It was proposed as a starting template for writing a new nvdimm bus
> driver. All we need is a way to communicate both the address range and
> the flush interface. This could be done with a new SPA Range GUID with
> the NFIT, or a custom virtio-pci device that registers a special
> nvdimm region with this property. My preference is whichever approach
> minimizes the code duplication, because the pmem driver should be
> re-used as much as possible.

Ok, I see.
Thank you very much for your explanation.

Bye,
---
Yasunori Goto


___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 1/7] fs: allow per-device dax status checking for filesystems

2018-05-24 Thread Darrick J. Wong
On Thu, May 24, 2018 at 08:55:12PM -0600, Ross Zwisler wrote:
> From: "Darrick J. Wong" 
> 
> Remove __bdev_dax_supported and change to bdev_dax_supported that takes a
> bdev parameter.  This enables multi-device filesystems like xfs to check
> that a dax device can work for the particular filesystem.  Once that's
> in place, actually fix all the parts of XFS where we need to be able to
> distinguish between datadev and rtdev.
> 
> This patch fixes the problem where we screw up the dax support checking
> in xfs if the datadev and rtdev have different dax capabilities.
> 
> Signed-off-by: Darrick J. Wong 
> Signed-off-by: Ross Zwisler 

Reviewed-by: Darr...oh, I'm not allowed to do that, am I?

Would you mind (re)sending this to the xfs list so that someone else can
review it?

--D

> ---
>  drivers/dax/super.c | 30 +++---
>  fs/ext2/super.c |  2 +-
>  fs/ext4/super.c |  2 +-
>  fs/xfs/xfs_ioctl.c  |  3 ++-
>  fs/xfs/xfs_iops.c   | 30 +-
>  fs/xfs/xfs_super.c  | 10 --
>  include/linux/dax.h | 10 +++---
>  7 files changed, 55 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 2b2332b605e4..9206539c8330 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -73,8 +73,8 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>  #endif
>  
>  /**
> - * __bdev_dax_supported() - Check if the device supports dax for filesystem
> - * @sb: The superblock of the device
> + * bdev_dax_supported() - Check if the device supports dax for filesystem
> + * @bdev: block device to check
>   * @blocksize: The block size of the device
>   *
>   * This is a library function for filesystems to check if the block device
> @@ -82,33 +82,33 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>   *
>   * Return: negative errno if unsupported, 0 if supported.
>   */
> -int __bdev_dax_supported(struct super_block *sb, int blocksize)
> +int bdev_dax_supported(struct block_device *bdev, int blocksize)
>  {
> - struct block_device *bdev = sb->s_bdev;
>   struct dax_device *dax_dev;
>   pgoff_t pgoff;
>   int err, id;
>   void *kaddr;
>   pfn_t pfn;
>   long len;
> + char buf[BDEVNAME_SIZE];
>  
>   if (blocksize != PAGE_SIZE) {
> - pr_debug("VFS (%s): error: unsupported blocksize for dax\n",
> - sb->s_id);
> + pr_debug("%s: error: unsupported blocksize for dax\n",
> + bdevname(bdev, buf));
>   return -EINVAL;
>   }
>  
>   err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
>   if (err) {
> - pr_debug("VFS (%s): error: unaligned partition for dax\n",
> - sb->s_id);
> + pr_debug("%s: error: unaligned partition for dax\n",
> + bdevname(bdev, buf));
>   return err;
>   }
>  
>   dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
>   if (!dax_dev) {
> - pr_debug("VFS (%s): error: device does not support dax\n",
> - sb->s_id);
> + pr_debug("%s: error: device does not support dax\n",
> + bdevname(bdev, buf));
>   return -EOPNOTSUPP;
>   }
>  
> @@ -119,8 +119,8 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
>   put_dax(dax_dev);
>  
>   if (len < 1) {
> - pr_debug("VFS (%s): error: dax access failed (%ld)\n",
> - sb->s_id, len);
> + pr_debug("%s: error: dax access failed (%ld)\n",
> + bdevname(bdev, buf), len);
>   return len < 0 ? len : -EIO;
>   }
>  
> @@ -137,14 +137,14 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
>   } else if (pfn_t_devmap(pfn)) {
>   /* pass */;
>   } else {
> - pr_debug("VFS (%s): error: dax support not enabled\n",
> - sb->s_id);
> + pr_debug("%s: error: dax support not enabled\n",
> + bdevname(bdev, buf));
>   return -EOPNOTSUPP;
>   }
>  
>   return 0;
>  }
> -EXPORT_SYMBOL_GPL(__bdev_dax_supported);
> +EXPORT_SYMBOL_GPL(bdev_dax_supported);
>  #endif
>  
>  enum dax_device_flags {
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index de1694512f1f..9627c3054b5c 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -961,7 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>   blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
>  
>   if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
> - err = bdev_dax_supported(sb, blocksize);
> + err = bdev_dax_supported(sb->s_bdev, blocksize);
>   if (err) {
>   ext2_msg(sb, KERN_ERR,
> 
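
The quoted patch is trimmed by the digest here, and the XFS hunks do not
appear in it at all.  As illustration only, a minimal hypothetical sketch of
the kind of per-device check the new bdev parameter enables; the helper name
below is invented for this example and is not from the patch:

/*
 * A multi-device filesystem can now test each device separately,
 * e.g. both its data device and its realtime device, instead of
 * always checking sb->s_bdev.
 */
static bool example_bdev_supports_dax(struct super_block *sb,
				      struct block_device *bdev)
{
	/* bdev_dax_supported() still returns 0 / -errno at this stage */
	return bdev_dax_supported(bdev, sb->s_blocksize) == 0;
}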

Re: [PATCH 2/7] dax: change bdev_dax_supported() to support boolean returns

2018-05-24 Thread Darrick J. Wong
On Thu, May 24, 2018 at 08:55:13PM -0600, Ross Zwisler wrote:
> From: Dave Jiang 
> 
> The function return values are confusing with the way the function is
> named. We expect a true or false return value but it actually returns
> 0/-errno.  This makes the code very confusing.  Change the return value to
> a bool: return true if DAX is supported and false otherwise.
> 
> Signed-off-by: Dave Jiang 
> Signed-off-by: Ross Zwisler 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  drivers/dax/super.c | 16 
>  fs/ext2/super.c |  3 +--
>  fs/ext4/super.c |  3 +--
>  fs/xfs/xfs_ioctl.c  |  4 ++--
>  fs/xfs/xfs_super.c  | 12 ++--
>  include/linux/dax.h |  6 +++---
>  6 files changed, 21 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9206539c8330..e5447eddecf8 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -80,9 +80,9 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>   * This is a library function for filesystems to check if the block device
>   * can be mounted with dax option.
>   *
> - * Return: negative errno if unsupported, 0 if supported.
> + * Return: true if supported, false if unsupported
>   */
> -int bdev_dax_supported(struct block_device *bdev, int blocksize)
> +bool bdev_dax_supported(struct block_device *bdev, int blocksize)
>  {
>   struct dax_device *dax_dev;
>   pgoff_t pgoff;
> @@ -95,21 +95,21 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
>   if (blocksize != PAGE_SIZE) {
>   pr_debug("%s: error: unsupported blocksize for dax\n",
>   bdevname(bdev, buf));
> - return -EINVAL;
> + return false;
>   }
>  
>   err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
>   if (err) {
>   pr_debug("%s: error: unaligned partition for dax\n",
>   bdevname(bdev, buf));
> - return err;
> + return false;
>   }
>  
>   dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
>   if (!dax_dev) {
>   pr_debug("%s: error: device does not support dax\n",
>   bdevname(bdev, buf));
> - return -EOPNOTSUPP;
> + return false;
>   }
>  
>   id = dax_read_lock();
> @@ -121,7 +121,7 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
>   if (len < 1) {
>   pr_debug("%s: error: dax access failed (%ld)\n",
>   bdevname(bdev, buf), len);
> - return len < 0 ? len : -EIO;
> + return false;
>   }
>  
>   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
> @@ -139,10 +139,10 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
>   } else {
>   pr_debug("%s: error: dax support not enabled\n",
>   bdevname(bdev, buf));
> - return -EOPNOTSUPP;
> + return false;
>   }
>  
> - return 0;
> + return true;
>  }
>  EXPORT_SYMBOL_GPL(bdev_dax_supported);
>  #endif
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 9627c3054b5c..c09289a42dc5 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -961,8 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
>   blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
>  
>   if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
> - err = bdev_dax_supported(sb->s_bdev, blocksize);
> - if (err) {
> + if (!bdev_dax_supported(sb->s_bdev, blocksize)) {
>   ext2_msg(sb, KERN_ERR,
>   "DAX unsupported by block device. Turning off DAX.");
>   sbi->s_mount_opt &= ~EXT2_MOUNT_DAX;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 089170e99895..2e1622907f4a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3732,8 +3732,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>   " that may contain inline data");
>   sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
>   }
> - err = bdev_dax_supported(sb->s_bdev, blocksize);
> - if (err) {
> + if (!bdev_dax_supported(sb->s_bdev, blocksize)) {
>   ext4_msg(sb, KERN_ERR,
>   "DAX unsupported by block device. Turning off DAX.");
>   sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 0effd46b965f..2c70a0a4f59f 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1103,8 +1103,8 @@ xfs_ioctl_setattr_dax_invalidate(
>   if (fa->fsx_xflags & FS_XFLAG_DAX) {
>  


[PATCH 0/7] Fix DM DAX handling

2018-05-24 Thread Ross Zwisler
This series fixes a few issues that I found with DM's handling of DAX
devices.  Here are some of the issues I found:

 * We can create a dm-stripe or dm-linear device which is made up of an
   fsdax PMEM namespace and a raw PMEM namespace but which can hold a
   filesystem mounted with the -o dax mount option.  DAX operations to
   the raw PMEM namespace part lack struct page and can fail in
   interesting/unexpected ways when doing things like fork(), examining
   memory with gdb, etc.

 * We can create a dm-stripe or dm-linear device which is made up of an
   fsdax PMEM namespace and a BRD ramdisk which can hold a filesystem
   mounted with the -o dax mount option.  All I/O to this filesystem
   will fail.

 * In DM you can't transition a dm target which could possibly support
   DAX (mode DM_TYPE_DAX_BIO_BASED) to one which can't support DAX
   (mode DM_TYPE_BIO_BASED), even if you never use DAX.

The first 2 patches in this series are prep work from Darrick and Dave
which improve bdev_dax_supported().  The last 5 patches fix the
above-mentioned problems in DM.  I feel that this series simplifies the
handling of DAX devices in DM, and the last 5 DM-related patches have a
net code reduction of 50 lines.


Darrick J. Wong (1):
  fs: allow per-device dax status checking for filesystems

Dave Jiang (1):
  dax: change bdev_dax_supported() to support boolean returns

Ross Zwisler (5):
  dm: fix test for DAX device support
  dm: prevent DAX mounts if not supported
  dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode
  dm-snap: remove unnecessary direct_access() stub
  dm-error: remove unnecessary direct_access() stub

 drivers/dax/super.c   | 44 +--
 drivers/md/dm-ioctl.c | 16 ++--
 drivers/md/dm-snap.c  |  8 
 drivers/md/dm-table.c | 29 +++-
 drivers/md/dm-target.c|  7 ---
 drivers/md/dm.c   |  7 ++-
 fs/ext2/super.c   |  3 +--
 fs/ext4/super.c   |  3 +--
 fs/xfs/xfs_ioctl.c|  3 ++-
 fs/xfs/xfs_iops.c | 30 -
 fs/xfs/xfs_super.c| 10 --
 include/linux/dax.h   | 12 
 include/linux/device-mapper.h |  8 ++--
 13 files changed, 88 insertions(+), 92 deletions(-)

-- 
2.14.3

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 4/7] dm: prevent DAX mounts if not supported

2018-05-24 Thread Ross Zwisler
Currently the code in dm_dax_direct_access() only checks whether the target
type has a direct_access() operation defined, not whether the underlying
block devices all support DAX.  This latter property can be seen by looking
at whether we set the QUEUE_FLAG_DAX request queue flag when creating the
DM device.

This is problematic if we have, for example, a dm-linear device made up of
a PMEM namespace in fsdax mode followed by a ramdisk from BRD.
QUEUE_FLAG_DAX won't be set on the dm-linear device's request queue, but
we have a working direct_access() entry point and the first member of the
dm-linear set *does* support DAX.

This allows the user to create a filesystem on the dm-linear device, and
then mount it with DAX.  The filesystem's bdev_dax_supported() test will
pass because it'll operate on the first member of the dm-linear device,
which happens to be a fsdax PMEM namespace.

All DAX I/O will then fail to that dm-linear device because the lack of
QUEUE_FLAG_DAX prevents fs_dax_get_by_bdev() from working.  This means that
the struct dax_device isn't ever set in the filesystem, so
dax_direct_access() will always return -EOPNOTSUPP.
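
For reference, fs_dax_get_by_bdev() at this point looks roughly like the
sketch below (paraphrased from drivers/dax/super.c of this era; treat it as
an approximation rather than a quotation):

struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
{
	/* no QUEUE_FLAG_DAX on the queue means no dax_device for the fs */
	if (!blk_queue_dax(bdev->bd_queue))
		return NULL;
	return fs_dax_get_by_host(bdev->bd_disk->disk_name);
}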

By failing out of dm_dax_direct_access() if QUEUE_FLAG_DAX isn't set we let
the filesystem know we don't support DAX at mount time.  The filesystem
will then silently fall back and remove the dax mount option, causing it to
work properly.

Signed-off-by: Ross Zwisler 
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
---
 drivers/md/dm.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 0a7b0107ca78..9728433362d1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1050,14 +1050,13 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
 
if (!ti)
goto out;
-   if (!ti->type->direct_access)
+   if (!blk_queue_dax(md->queue))
goto out;
len = max_io_len(sector, ti) / PAGE_SECTORS;
if (len < 1)
goto out;
nr_pages = min(len, nr_pages);
-   if (ti->type->direct_access)
-   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
+   ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn);
 
  out:
dm_put_live_table(md, srcu_idx);
-- 
2.14.3

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 1/7] fs: allow per-device dax status checking for filesystems

2018-05-24 Thread Ross Zwisler
From: "Darrick J. Wong" 

Remove __bdev_dax_supported and change to bdev_dax_supported that takes a
bdev parameter.  This enables multi-device filesystems like xfs to check
that a dax device can work for the particular filesystem.  Once that's
in place, actually fix all the parts of XFS where we need to be able to
distinguish between datadev and rtdev.

This patch fixes the problem where we screw up the dax support checking
in xfs if the datadev and rtdev have different dax capabilities.

Signed-off-by: Darrick J. Wong 
Signed-off-by: Ross Zwisler 
---
 drivers/dax/super.c | 30 +++---
 fs/ext2/super.c |  2 +-
 fs/ext4/super.c |  2 +-
 fs/xfs/xfs_ioctl.c  |  3 ++-
 fs/xfs/xfs_iops.c   | 30 +-
 fs/xfs/xfs_super.c  | 10 --
 include/linux/dax.h | 10 +++---
 7 files changed, 55 insertions(+), 32 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..9206539c8330 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -73,8 +73,8 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
 #endif
 
 /**
- * __bdev_dax_supported() - Check if the device supports dax for filesystem
- * @sb: The superblock of the device
+ * bdev_dax_supported() - Check if the device supports dax for filesystem
+ * @bdev: block device to check
  * @blocksize: The block size of the device
  *
  * This is a library function for filesystems to check if the block device
@@ -82,33 +82,33 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
  *
  * Return: negative errno if unsupported, 0 if supported.
  */
-int __bdev_dax_supported(struct super_block *sb, int blocksize)
+int bdev_dax_supported(struct block_device *bdev, int blocksize)
 {
-   struct block_device *bdev = sb->s_bdev;
struct dax_device *dax_dev;
pgoff_t pgoff;
int err, id;
void *kaddr;
pfn_t pfn;
long len;
+   char buf[BDEVNAME_SIZE];
 
if (blocksize != PAGE_SIZE) {
-   pr_debug("VFS (%s): error: unsupported blocksize for dax\n",
-   sb->s_id);
+   pr_debug("%s: error: unsupported blocksize for dax\n",
+   bdevname(bdev, buf));
return -EINVAL;
}
 
err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
if (err) {
-   pr_debug("VFS (%s): error: unaligned partition for dax\n",
-   sb->s_id);
+   pr_debug("%s: error: unaligned partition for dax\n",
+   bdevname(bdev, buf));
return err;
}
 
dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
if (!dax_dev) {
-   pr_debug("VFS (%s): error: device does not support dax\n",
-   sb->s_id);
+   pr_debug("%s: error: device does not support dax\n",
+   bdevname(bdev, buf));
return -EOPNOTSUPP;
}
 
@@ -119,8 +119,8 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
put_dax(dax_dev);
 
if (len < 1) {
-   pr_debug("VFS (%s): error: dax access failed (%ld)\n",
-   sb->s_id, len);
+   pr_debug("%s: error: dax access failed (%ld)\n",
+   bdevname(bdev, buf), len);
return len < 0 ? len : -EIO;
}
 
@@ -137,14 +137,14 @@ int __bdev_dax_supported(struct super_block *sb, int blocksize)
} else if (pfn_t_devmap(pfn)) {
/* pass */;
} else {
-   pr_debug("VFS (%s): error: dax support not enabled\n",
-   sb->s_id);
+   pr_debug("%s: error: dax support not enabled\n",
+   bdevname(bdev, buf));
return -EOPNOTSUPP;
}
 
return 0;
 }
-EXPORT_SYMBOL_GPL(__bdev_dax_supported);
+EXPORT_SYMBOL_GPL(bdev_dax_supported);
 #endif
 
 enum dax_device_flags {
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index de1694512f1f..9627c3054b5c 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -961,7 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
-   err = bdev_dax_supported(sb, blocksize);
+   err = bdev_dax_supported(sb->s_bdev, blocksize);
if (err) {
ext2_msg(sb, KERN_ERR,
"DAX unsupported by block device. Turning off 
DAX.");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index eb104e8476f0..089170e99895 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3732,7 +3732,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

[PATCH 3/7] dm: fix test for DAX device support

2018-05-24 Thread Ross Zwisler
Currently device_supports_dax() just checks to see if the QUEUE_FLAG_DAX
flag is set on the device's request queue to decide whether or not the
device supports filesystem DAX.  This is insufficient because there are
devices like PMEM namespaces in raw mode which have QUEUE_FLAG_DAX set but
which don't actually support DAX.

This means that you could create a dm-linear device, for example, where the
first part of the dm-linear device was a PMEM namespace in fsdax mode and
the second part was a PMEM namespace in raw mode.  Both DM and the
filesystem you put on that dm-linear device would think the whole device
supports DAX, which would lead to bad behavior once your raw PMEM namespace
part using DAX needed struct page for something.

Fix this by using bdev_dax_supported() like filesystems do at mount time.
This checks for raw mode and also performs other tests like checking to
make sure the dax_direct_access() path works.

Signed-off-by: Ross Zwisler 
Fixes: commit 545ed20e6df6 ("dm: add infrastructure for DAX support")
---
 drivers/md/dm-table.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 0589a4da12bb..5bb994b012ca 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -885,9 +885,7 @@ EXPORT_SYMBOL_GPL(dm_table_set_type);
 static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
   sector_t start, sector_t len, void *data)
 {
-   struct request_queue *q = bdev_get_queue(dev->bdev);
-
-   return q && blk_queue_dax(q);
+   return bdev_dax_supported(dev->bdev, PAGE_SIZE);
 }
 
 static bool dm_table_supports_dax(struct dm_table *t)
-- 
2.14.3

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 6/7] dm-snap: remove unnecessary direct_access() stub

2018-05-24 Thread Ross Zwisler
This stub was added so that we could use dm-snap with DM_TYPE_DAX_BIO_BASED
mode devices.  That mode and the transition issues associated with it no
longer exist, so we can remove this dead code.

Signed-off-by: Ross Zwisler 
---
 drivers/md/dm-snap.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 216035be5661..0143b158d52d 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -2305,13 +2305,6 @@ static int origin_map(struct dm_target *ti, struct bio 
*bio)
return do_origin(o->dev, bio);
 }
 
-static long origin_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
-   long nr_pages, void **kaddr, pfn_t *pfn)
-{
-   DMWARN("device does not support dax.");
-   return -EIO;
-}
-
 /*
  * Set the target "max_io_len" field to the minimum of all the snapshots'
  * chunk sizes.
@@ -2371,7 +2364,6 @@ static struct target_type origin_target = {
.postsuspend = origin_postsuspend,
.status  = origin_status,
.iterate_devices = origin_iterate_devices,
-   .direct_access = origin_dax_direct_access,
 };
 
 static struct target_type snapshot_target = {
-- 
2.14.3

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 5/7] dm: remove DM_TYPE_DAX_BIO_BASED dm_queue_mode

2018-05-24 Thread Ross Zwisler
The DM_TYPE_DAX_BIO_BASED dm_queue_mode was introduced to prevent DM
devices that could possibly support DAX from transitioning into DM devices
that cannot support DAX.

For example, the following transition will currently fail:

 dm-linear: [fsdax pmem][fsdax pmem] => [fsdax pmem][fsdax raw]
            DM_TYPE_DAX_BIO_BASED       DM_TYPE_BIO_BASED

but these will both succeed:

 dm-linear: [fsdax pmem][brd ramdisk] => [fsdax pmem][fsdax raw]
            DM_TYPE_DAX_BIO_BASED        DM_TYPE_BIO_BASED

 dm-linear: [fsdax pmem][fsdax raw] => [fsdax pmem][fsdax pmem]
            DM_TYPE_BIO_BASED          DM_TYPE_DAX_BIO_BASED

This seems arbitrary, as really the choice on whether to use DAX happens at
filesystem mount time.  There's no guarantee that in the first case
(double fsdax pmem) we were using the dax mount option with our file
system.

Instead, get rid of DM_TYPE_DAX_BIO_BASED and all the special casing around
it, and instead make the request queue's QUEUE_FLAG_DAX be our one source
of truth.  If this is set, we can use DAX, and if not, not.  We keep this
up to date in table_load() as the table changes.  As with regular block
devices the filesystem will then know at mount time whether DAX is a
supported mount option or not.

Signed-off-by: Ross Zwisler 
---
 drivers/md/dm-ioctl.c | 16 ++--
 drivers/md/dm-table.c | 25 ++---
 drivers/md/dm.c   |  2 --
 include/linux/device-mapper.h |  8 ++--
 4 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 5acf77de5945..d1f86d0bb2d0 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1292,15 +1292,6 @@ static int populate_table(struct dm_table *table,
return dm_table_complete(table);
 }
 
-static bool is_valid_type(enum dm_queue_mode cur, enum dm_queue_mode new)
-{
-   if (cur == new ||
-   (cur == DM_TYPE_BIO_BASED && new == DM_TYPE_DAX_BIO_BASED))
-   return true;
-
-   return false;
-}
-
static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_size)
 {
int r;
@@ -1343,12 +1334,17 @@ static int table_load(struct file *filp, struct dm_ioctl *param, size_t param_si
DMWARN("unable to set up device queue for new table.");
goto err_unlock_md_type;
}
-   } else if (!is_valid_type(dm_get_md_type(md), dm_table_get_type(t))) {
+   } else if (dm_get_md_type(md) != dm_table_get_type(t)) {
DMWARN("can't change device type after initial table load.");
r = -EINVAL;
goto err_unlock_md_type;
}
 
+   if (dm_table_supports_dax(t))
+   blk_queue_flag_set(QUEUE_FLAG_DAX, md->queue);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_DAX, md->queue);
+
dm_unlock_md_type(md);
 
/* stage inactive table */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 5bb994b012ca..ea5c4a1e6f5b 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -866,7 +866,6 @@ EXPORT_SYMBOL(dm_consume_args);
 static bool __table_type_bio_based(enum dm_queue_mode table_type)
 {
return (table_type == DM_TYPE_BIO_BASED ||
-   table_type == DM_TYPE_DAX_BIO_BASED ||
table_type == DM_TYPE_NVME_BIO_BASED);
 }
 
@@ -888,7 +887,7 @@ static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
return bdev_dax_supported(dev->bdev, PAGE_SIZE);
 }
 
-static bool dm_table_supports_dax(struct dm_table *t)
+bool dm_table_supports_dax(struct dm_table *t)
 {
struct dm_target *ti;
unsigned i;
@@ -907,6 +906,7 @@ static bool dm_table_supports_dax(struct dm_table *t)
 
return true;
 }
+EXPORT_SYMBOL_GPL(dm_table_supports_dax);
 
 static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
 
@@ -944,7 +944,6 @@ static int dm_table_determine_type(struct dm_table *t)
/* possibly upgrade to a variant of bio-based */
goto verify_bio_based;
}
-   BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
BUG_ON(t->type == DM_TYPE_NVME_BIO_BASED);
goto verify_rq_based;
}
@@ -981,18 +980,14 @@ static int dm_table_determine_type(struct dm_table *t)
 verify_bio_based:
/* We must use this table as bio-based */
t->type = DM_TYPE_BIO_BASED;
-   if (dm_table_supports_dax(t) ||
-   (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
-   t->type = DM_TYPE_DAX_BIO_BASED;
-   } else {
-   /* Check if upgrading to NVMe bio-based is valid or required */
-   tgt = dm_table_get_immutable_target(t);
-   if (tgt && !tgt->max_io_len && 

[PATCH 2/7] dax: change bdev_dax_supported() to support boolean returns

2018-05-24 Thread Ross Zwisler
From: Dave Jiang 

The function return values are confusing with the way the function is
named. We expect a true or false return value but it actually returns
0/-errno.  This makes the code very confusing.  Change the return value to
a bool: return true if DAX is supported and false otherwise.

Signed-off-by: Dave Jiang 
Signed-off-by: Ross Zwisler 
---
 drivers/dax/super.c | 16 
 fs/ext2/super.c |  3 +--
 fs/ext4/super.c |  3 +--
 fs/xfs/xfs_ioctl.c  |  4 ++--
 fs/xfs/xfs_super.c  | 12 ++--
 include/linux/dax.h |  6 +++---
 6 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9206539c8330..e5447eddecf8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -80,9 +80,9 @@ EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
  * This is a library function for filesystems to check if the block device
  * can be mounted with dax option.
  *
- * Return: negative errno if unsupported, 0 if supported.
+ * Return: true if supported, false if unsupported
  */
-int bdev_dax_supported(struct block_device *bdev, int blocksize)
+bool bdev_dax_supported(struct block_device *bdev, int blocksize)
 {
struct dax_device *dax_dev;
pgoff_t pgoff;
@@ -95,21 +95,21 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
if (blocksize != PAGE_SIZE) {
pr_debug("%s: error: unsupported blocksize for dax\n",
bdevname(bdev, buf));
-   return -EINVAL;
+   return false;
}
 
err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
if (err) {
pr_debug("%s: error: unaligned partition for dax\n",
bdevname(bdev, buf));
-   return err;
+   return false;
}
 
dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
if (!dax_dev) {
pr_debug("%s: error: device does not support dax\n",
bdevname(bdev, buf));
-   return -EOPNOTSUPP;
+   return false;
}
 
id = dax_read_lock();
@@ -121,7 +121,7 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
if (len < 1) {
pr_debug("%s: error: dax access failed (%ld)\n",
bdevname(bdev, buf), len);
-   return len < 0 ? len : -EIO;
+   return false;
}
 
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
@@ -139,10 +139,10 @@ int bdev_dax_supported(struct block_device *bdev, int blocksize)
} else {
pr_debug("%s: error: dax support not enabled\n",
bdevname(bdev, buf));
-   return -EOPNOTSUPP;
+   return false;
}
 
-   return 0;
+   return true;
 }
 EXPORT_SYMBOL_GPL(bdev_dax_supported);
 #endif
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 9627c3054b5c..c09289a42dc5 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -961,8 +961,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
-   err = bdev_dax_supported(sb->s_bdev, blocksize);
-   if (err) {
+   if (!bdev_dax_supported(sb->s_bdev, blocksize)) {
ext2_msg(sb, KERN_ERR,
"DAX unsupported by block device. Turning off 
DAX.");
sbi->s_mount_opt &= ~EXT2_MOUNT_DAX;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 089170e99895..2e1622907f4a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3732,8 +3732,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
" that may contain inline data");
sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
}
-   err = bdev_dax_supported(sb->s_bdev, blocksize);
-   if (err) {
+   if (!bdev_dax_supported(sb->s_bdev, blocksize)) {
ext4_msg(sb, KERN_ERR,
"DAX unsupported by block device. Turning off 
DAX.");
sbi->s_mount_opt &= ~EXT4_MOUNT_DAX;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0effd46b965f..2c70a0a4f59f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1103,8 +1103,8 @@ xfs_ioctl_setattr_dax_invalidate(
if (fa->fsx_xflags & FS_XFLAG_DAX) {
if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
return -EINVAL;
-   if (bdev_dax_supported(xfs_find_bdev_for_inode(VFS_I(ip)),
-   sb->s_blocksize) < 0)
+   if 
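
The xfs_ioctl.c hunk is cut off by the digest at this point.  Based on the
ext2/ext4 hunks above, the call site presumably becomes a plain boolean
test; the following is a hypothetical reconstruction, not the patch text:

		/* sketch of the likely post-conversion check */
		if (!bdev_dax_supported(xfs_find_bdev_for_inode(VFS_I(ip)),
				sb->s_blocksize))
			return -EINVAL;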

Re: [PATCH 07/11] mm, madvise_inject_error: fix page count leak

2018-05-24 Thread Dan Williams
On Tue, May 22, 2018 at 9:19 PM, Naoya Horiguchi
 wrote:
> On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
>> The madvise_inject_error() routine uses get_user_pages() to lookup the
>> pfn and other information for injected error, but it fails to release
>> that pin.
>>
>> The dax-dma-vs-truncate warning catches this failure with the following
>> signature:
>>
>>  Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d0
>>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>>  CPU: 37 PID: 9566 Comm: umount Tainted: GW  OE 4.17.0-rc6+ #1900
>>  [..]
>>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>>  RSP: 0018:c9000a9b3b30 EFLAGS: 00010002
>>  RAX: ea0008224000 RBX: 00208a00 RCX: 00208900
>>  RDX: 0001 RSI: 8804058c6160 RDI: 0008
>>  RBP: 0822000a R08: 0002 R09: 00208800
>>  R10:  R11: 00208801 R12: 8804058c6168
>>  R13:  R14: 0002 R15: 0001
>>  FS:  7f4548027fc0() GS:880431d4() knlGS:
>>  CS:  0010 DS:  ES:  CR0: 80050033
>>  CR2: 56316d5f8988 CR3: 0004298cc000 CR4: 000406e0
>>  Call Trace:
>>   __dax_invalidate_mapping_entry+0xab/0xe0
>>   dax_delete_mapping_entry+0xf/0x20
>>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>>   truncate_inode_pages_range+0x291/0x920
>>   ? kmem_cache_free+0x1f8/0x300
>>   ? lock_acquire+0x9f/0x200
>>   ? truncate_inode_pages_final+0x31/0x50
>>   ext4_evict_inode+0x69/0x740
>>
>> Cc: 
>> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
>> Cc: Michal Hocko 
>> Cc: Andi Kleen 
>> Cc: Wu Fengguang 
>> Signed-off-by: Dan Williams 
>> ---
>>  mm/madvise.c |   11 ---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d3c922ea1a1..246fa4d4eee2 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>>
>>
>>   for (; start < end; start += PAGE_SIZE << order) {
>> + unsigned long pfn;
>>   int ret;
>>
>>   ret = get_user_pages_fast(start, 1, 0, &page);
>>   if (ret != 1)
>>   return ret;
>> + pfn = page_to_pfn(page);
>>
>>   /*
>>* When soft offlining hugepages, after migrating the page
>> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>>
>>   if (behavior == MADV_SOFT_OFFLINE) {
>>   pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>> - page_to_pfn(page), start);
>> + pfn, start);
>>
>>   ret = soft_offline_page(page, MF_COUNT_INCREASED);
>> + put_page(page);
>>   if (ret)
>>   return ret;
>>   continue;
>>   }
>> + put_page(page);
>
> We keep the page count pinned after the isolation of the error page
> in order to make sure that the error page is disabled and never reused.
> This seems not explicit enough, so some comment should be helpful.

As far as I can see this extra reference count to keep the page from
being reused should be taken internal to memory_failure(), not assumed from
the inject error path. I might be overlooking something, but I do not
see who is responsible for taking this extra reference in the case
where memory_failure() is called by the machine check code rather than
madvise_inject_error()?

>
> BTW, looking at the kernel message like "Memory failure: 0x208900:
> reserved kernel page still referenced by 1 users", memory_failure()
> considers dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL).
> If the memory error handler recovers a dev_pagemap page in its special way,
> we can define a new action_page_types entry like MF_MSG_DAX.
> Reporting like "Memory failure: 0xX: recovery action for dax page:
> Failed" might be helpful for end user's perspective.

Sounds good, I'll take a look at this.
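
A rough sketch of what that could look like, modeled on the existing
mf_action_page_type machinery in mm/memory-failure.c; MF_MSG_DAX and the
"dax page" string are illustrative only, since no such entry exists at the
time of this thread:

	/* new enum mf_action_page_type value (illustrative) */
	MF_MSG_DAX,

	/* matching entry in the action_page_types[] string table */
	[MF_MSG_DAX]		= "dax page",

	/* reported from the handler once it recognizes a dev_pagemap page */
	action_result(pfn, MF_MSG_DAX, MF_FAILED);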
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: Questions about vNVDIMM on qemu/KVM

2018-05-24 Thread Dan Williams
On Thu, May 24, 2018 at 12:19 AM, Yasunori Goto  wrote:
>> On Tue, May 22, 2018 at 10:08 PM, Yasunori Goto  
>> wrote:
>> > Hello,
>> >
>> > I'm investigating the status of vNVDIMM on qemu/KVM,
>> > and I have some questions about it.  I'd be glad if anyone could answer them.
>> >
>> > In my understanding, qemu/KVM has a feature to show an NFIT to the guest,
>> > and it is still being updated for platform capabilities with this patch set.
>> > https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04756.html
>> >
>> > And libvirt also supports this feature with 
>> > https://libvirt.org/formatdomain.html#elementsMemory
>> >
>> >
>> > However, virtio-pmem is being developed now, and it is better for
>> > architectures without ACPI (like s390x) to detect NVDIMM regions.
>>
>> I think you are confusing virtio-pmem (patches from Pankaj) and
>> virtio-mem (patches from David)? ...or I'm confused.
>
> Probably, "I" am confusing.
> So, your clarification is very helpful for me.
>
>
>>
>> > In addition, it is also necessary to flush guest contents on a vNVDIMM
>> > that has a backing file.
>>
>> virtio-pmem is a mechanism to use host page cache as pmem in a guest.
>> It does not support high performance memory applications because it
>> requires fsync/msync.  I.e. it is not DAX; it is the traditional mmap
>> I/O model, but moving page cache management to the host rather than
>> duplicating it in guests.
>
> Ah, ok.
>
>
>>
>> > Q1) Does the ACPI.NFIT bus of qemu/kvm remain with virtio-pmem?
>> > What role does each play if both NFIT and virtio-pmem are available?
>> > If my understanding is correct, both NFIT and virtio-pmem are used to
>> > detect vNVDIMM regions, but only one seems to be necessary
>>
>> We need both because they are different. Guest DAX should not be using
>> virtio-pmem.
>
> Hmm. Ok.
>
> But, I would like to understand one more thing.
> In the following mail, it seems that an e820 bus will be used for fake DAX.
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-January/013926.html
>
> Could you tell me what the relationship is between "fake DAX" in this mail
> and Guest DAX?
> Why is e820 necessary in this case?
>

It was proposed as a starting template for writing a new nvdimm bus
driver. All we need is a way to communicate both the address range and
the flush interface. This could be done with a new SPA Range GUID with
the NFIT, or a custom virtio-pci device that registers a special
nvdimm region with this property. My preference is whichever approach
minimizes the code duplication, because the pmem driver should be
re-used as much as possible.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [patch 4/4] dm-writecache: use new API for flushing

2018-05-24 Thread Mikulas Patocka


On Tue, 22 May 2018, Dan Williams wrote:

> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer  wrote:
> > On Tue, May 22 2018 at  2:39am -0400,
> > Christoph Hellwig  wrote:
> >
> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> > Use new API for flushing persistent memory.
> >>
> >> The sentence doesn't make much sense.  'A new API', 'A better
> >> abstraction' maybe?
> >>
> >> >
> >> > The problem is this:
> >> > * on X86-64, non-temporal stores have the best performance
> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >   should flush cache as late as possible, because it performs better this
> >> >   way.
> >> >
> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> > data persistently, all three functions must be called.
> >> >
> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written 
> >> > atomically.
> >> >
> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> > pmem_commit is wmb.
> >> >
> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> > pmem_commit is empty.
> >>
> >> All these should be provided by the pmem layer, and be properly
> >> documented.  And be sorted before adding your new target that uses
> >> them.
> >
> > I don't see that as a hard requirement.  Mikulas did the work to figure
> > out what is more optimal on x86_64 vs arm64.  It makes a difference for
> > his target and that is sufficient to carry it locally until/when it is
> > either elevated to pmem.
> >
> > We cannot even get x86 and swait maintainers to reply to repeat requests
> > for review.  Stacking up further deps on pmem isn't high on my list.
> >
> 
> Except I'm being responsive. I agree with Christoph that we should
> build pmem helpers at an architecture level and not per-driver. Let's
> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> I don't see an unreasonable barrier to ask the same of ARM. This patch
> is not even cc'd to linux-arm-kernel. Has the subject been broached
> with them?

The ARM code can't "catch-up" with X86.

On X86 - non-temporal stores (i.e. memcpy_flushcache) are faster than 
cached write and cache flushing.

The ARM architecture doesn't have non-temporal stores. So, 
memcpy_flushcache on ARM does memcpy (that writes data to the cache) and 
then flushes the cache. But this eager cache flushing is slower than late
cache flushing.

The optimal code sequence on ARM to write to persistent memory is to call 
memcpy, then do something else, and then call arch_wb_cache_pmem as late 
as possible. And this ARM-optimized code sequence is just horribly slow on 
X86.

This issue can't be "fixed" in ARM-specific source code. The ARM processors
have the characteristic that eager cache flushing is slower than late
cache flushing - and that's it - you can't change processor behavior.

If you don't want '#if defined(CONFIG_X86_64)' in the dm-writecache 
driver, then just take the functions that are in this conditional block 
and move them to some generic linux header.
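
Putting the mapping from the patch description above into one place, the
abstraction is roughly the following sketch (the idea only, not the actual
dm-writecache code):

/* committing data persistently = pmem_memcpy() + pmem_flush() + pmem_commit() */
#if defined(CONFIG_X86_64)
/* non-temporal stores: no explicit flush, just a store fence at commit */
#define pmem_memcpy(dst, src, n)	memcpy_flushcache((dst), (src), (n))
static inline void pmem_flush(void *addr, size_t size) { }
#define pmem_commit()			wmb()
#else	/* e.g. ARM64: write through the cache, flush as late as possible */
#define pmem_memcpy(dst, src, n)	memcpy((dst), (src), (n))
static inline void pmem_flush(void *addr, size_t size)
{
	arch_wb_cache_pmem(addr, size);
}
#define pmem_commit()			do { } while (0)
#endif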

Mikulas
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: Questions about vNVDIMM on qemu/KVM

2018-05-24 Thread Yasunori Goto
> On Tue, May 22, 2018 at 10:08 PM, Yasunori Goto  wrote:
> > Hello,
> >
> > I'm investigating the status of vNVDIMM on qemu/KVM,
> > and I have some questions about it.  I'd be glad if anyone could answer them.
> >
> > In my understanding, qemu/KVM has a feature to show an NFIT to the guest,
> > and it is still being updated for platform capabilities with this patch set.
> > https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04756.html
> >
> > And libvirt also supports this feature with 
> > https://libvirt.org/formatdomain.html#elementsMemory
> >
> >
> > However, virtio-pmem is being developed now, and it is better for
> > architectures without ACPI (like s390x) to detect NVDIMM regions.
> 
> I think you are confusing virtio-pmem (patches from Pankaj) and
> virtio-mem (patches from David)? ...or I'm confused.

Probably, "I" am confusing.
So, your clarification is very helpful for me.


> 
> > In addition, it is also necessary to flush guest contents on a vNVDIMM
> > that has a backing file.
> 
> virtio-pmem is a mechanism to use host page cache as pmem in a guest.
> It does not support high performance memory applications because it
> requires fsync/msync.  I.e. it is not DAX; it is the traditional mmap
> I/O model, but moving page cache management to the host rather than
> duplicating it in guests.

Ah, ok.


> 
> > Q1) Does the ACPI.NFIT bus of qemu/kvm remain with virtio-pmem?
> > What role does each play if both NFIT and virtio-pmem are available?
> > If my understanding is correct, both NFIT and virtio-pmem are used to
> > detect vNVDIMM regions, but only one seems to be necessary
> 
> We need both because they are different. Guest DAX should not be using
> virtio-pmem.

Hmm. Ok.

But, I would like to understand one more thing.
In the following mail, it seems that an e820 bus will be used for fake DAX.

https://lists.01.org/pipermail/linux-nvdimm/2018-January/013926.html

Could you tell me what the relationship is between "fake DAX" in this mail
and Guest DAX?
Why is e820 necessary in this case?

(Probably, this is one of the reasons why I'm confused.)


> 
> > Otherwise, is the NFIT bus just for keeping compatibility,
> > and virtio-pmem the promising way?
> >
> >
> > Q2) What bus is (or will be) created for virtio-pmem?
> > I could confirm the NFIT bus is created with ndctl as below,
> > and I heard another bus will be created for virtio-pmem, but I could not
> > find what bus is created concretely.
> > ---
> >   # ndctl list -B
> >   {
> >  "provider":"ACPI.NFIT",
> >  "dev":"ndbus0"
> >   }
> > ---
> >
> > I think it affects what operations the user will be able to perform, and what
> > notifications are necessary for vNVDIMM.
> > ACPI defines some operations like namespace control, and notifications
> > for NVDIMM health status and others.
> > (I suppose that other status notifications might be necessary for vNVDIMM,
> >  but I'm not sure yet...)
> >
> > If my understanding is wrong, please correct me.
> 
> The current plan, per my understanding, is a virtio-pmem SPA UUID
> added to the virtual NFIT so that the guest driver can load the pmem
> driver but also hook up the virtio command ring for forwarding
> WRITE_{FUA,FLUSH} commands as host fsync operations.

Ok.

Thank you very much for your answer!

---
Yasunori Goto



___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm