Re: [PATCH] dax: fix build warnings with FS_DAX and !FS_IOMAP

2017-01-11 Thread Jan Kara
On Tue 10-01-17 15:29:43, Ross Zwisler wrote:
> As reported by Arnd:
> 
> https://lkml.org/lkml/2017/1/10/756
> 
> Compiling with the following configuration:
> 
>   # CONFIG_EXT2_FS is not set
>   # CONFIG_EXT4_FS is not set
>   # CONFIG_XFS_FS is not set
>   # CONFIG_FS_IOMAP depends on the above filesystems, as is not set
>   CONFIG_FS_DAX=y
> 
> generates build warnings about unused functions in fs/dax.c:
> 
> fs/dax.c:878:12: warning: ‘dax_insert_mapping’ defined but not used 
> [-Wunused-function]
>  static int dax_insert_mapping(struct address_space *mapping,
> ^~
> fs/dax.c:572:12: warning: ‘copy_user_dax’ defined but not used 
> [-Wunused-function]
>  static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t 
> size,
> ^
> fs/dax.c:542:12: warning: ‘dax_load_hole’ defined but not used 
> [-Wunused-function]
>  static int dax_load_hole(struct address_space *mapping, void **entry,
> ^
> fs/dax.c:312:14: warning: ‘grab_mapping_entry’ defined but not used 
> [-Wunused-function]
>  static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
>   ^~
> 
> Now that the struct buffer_head based DAX fault paths and I/O path have
> been removed we really depend on iomap support being present for DAX.  Make
> this explicit by selecting FS_IOMAP if we compile in DAX support.
> 
> This allows us to remove conditional selections of FS_IOMAP when FS_DAX was
> present for ext2 and ext4, and to remove an #ifdef in fs/dax.c.
> 
> Signed-off-by: Ross Zwisler 
> Reported-by: Arnd Bergmann 
> Cc: Jan Kara 
> Cc: Christoph Hellwig 

Looks good. I agree that DAX without FS_IOMAP does not make sense. You can
add:

Reviewed-by: Jan Kara 

Honza
> ---
>  fs/Kconfig  | 1 +
>  fs/dax.c| 2 --
>  fs/ext2/Kconfig | 1 -
>  fs/ext4/Kconfig | 1 -
>  4 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index c2a377c..83eab52 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -38,6 +38,7 @@ config FS_DAX
>   bool "Direct Access (DAX) support"
>   depends on MMU
>   depends on !(ARM || MIPS || SPARC)
> + select FS_IOMAP
>   help
> Direct Access (DAX) can be used on memory-backed block devices.
> If the block device supports DAX and the filesystem supports DAX,
> diff --git a/fs/dax.c b/fs/dax.c
> index 5c74f60..083b950 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -969,7 +969,6 @@ int __dax_zero_page_range(struct block_device *bdev, 
> sector_t sector,
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
> -#ifdef CONFIG_FS_IOMAP
>  static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
>  {
>   return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> @@ -1407,4 +1406,3 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, 
> unsigned long address,
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
>  #endif /* CONFIG_FS_DAX_PMD */
> -#endif /* CONFIG_FS_IOMAP */
> diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
> index 36bea5a..c634874e 100644
> --- a/fs/ext2/Kconfig
> +++ b/fs/ext2/Kconfig
> @@ -1,6 +1,5 @@
>  config EXT2_FS
>   tristate "Second extended fs support"
> - select FS_IOMAP if FS_DAX
>   help
> Ext2 is a standard Linux file system for hard disks.
>  
> diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
> index 7b90691..e38039f 100644
> --- a/fs/ext4/Kconfig
> +++ b/fs/ext4/Kconfig
> @@ -37,7 +37,6 @@ config EXT4_FS
>   select CRC16
>   select CRYPTO
>   select CRYPTO_CRC32C
> - select FS_IOMAP if FS_DAX
>   help
> This is the next generation of the ext3 filesystem.
>  
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 1/2] ext4: fix DAX write locking

2017-01-11 Thread Jan Kara
On Tue 10-01-17 16:48:07, Christoph Hellwig wrote:
> Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
> application, so we'll have to provide the traditional synchronіzation
> of overlapping writes as we do for buffered writes.
> 
> This was broken historically for DAX, but got fixed for ext2 and XFS
> as part of the iomap conversion.  Fix up ext4 as well.
> 
> Signed-off-by: Christoph Hellwig 

Makes sense. You can add:

Reviewed-by: Jan Kara 

Ted, can you please pick it up? Thanks!

Honza

> ---
>  fs/ext4/file.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index d663d3d..a1e88aa 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -175,7 +175,6 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter 
> *from)
>  {
>   struct inode *inode = file_inode(iocb->ki_filp);
>   ssize_t ret;
> - bool overwrite = false;
>  
>   inode_lock(inode);
>   ret = ext4_write_checks(iocb, from);
> @@ -188,16 +187,9 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter 
> *from)
>   if (ret)
>   goto out;
>  
> - if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
> - overwrite = true;
> - downgrade_write(&inode->i_rwsem);
> - }
>   ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>  out:
> - if (!overwrite)
> - inode_unlock(inode);
> - else
> - inode_unlock_shared(inode);
> + inode_unlock(inode);
>   if (ret > 0)
>   ret = generic_write_sync(iocb, ret);
>   return ret;
> -- 
> 2.1.4
> 
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] dax: fix build warnings with FS_DAX and !FS_IOMAP

2017-01-11 Thread Christoph Hellwig
Looks fine:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/2] dax: assert that i_rwsem is held exclusive for writes

2017-01-11 Thread Jan Kara
On Tue 10-01-17 16:48:08, Christoph Hellwig wrote:
> Make sure all callers follow the same locking protocol, given that DAX
> transparantly replaced the normal buffered I/O path.
> 
> Signed-off-by: Christoph Hellwig 

Looks good. You can add:

Reviewed-by: Jan Kara 

Probably also for Ted since it depends on the ext4 fix...

Honza
> ---
>  fs/dax.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5c74f60..04734da 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1061,8 +1061,12 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>   loff_t pos = iocb->ki_pos, ret = 0, done = 0;
>   unsigned flags = 0;
>  
> - if (iov_iter_rw(iter) == WRITE)
> + if (iov_iter_rw(iter) == WRITE) {
> + lockdep_assert_held_exclusive(&inode->i_rwsem);
>   flags |= IOMAP_WRITE;
> + } else {
> + lockdep_assert_held(&inode->i_rwsem);
> + }
>  
>   while (iov_iter_count(iter)) {
>   ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
> -- 
> 2.1.4
> 
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v6] x86: fix kaslr and memmap collision

2017-01-11 Thread Thomas Gleixner
On Tue, 10 Jan 2017, Dave Jiang wrote:
> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
> +long simple_strtol(const char *cp, char **endp, unsigned int base);

What are those functions for? They are not used in that patch at all.

> +static void mem_avoid_memmap(void)
> +{
> + char arg[128];
> + int rc = 0;

What's the point of defining this variable here and zero initializing it?

> + /* see if we have any memmap areas */

Sentences start with an upper case letter. Everywhere!

> + if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {

You can spare an indentation level by just returning when the search fails.

> + int i = 0;
> + char *str = arg;
> +
> + while (str && (i < MAX_MEMMAP_REGIONS)) {

for (i = 0; str && (i < MAX_MEMMAP_REGIONS); i++) {

Please.

> + unsigned long long start, size;
> + char *k = strchr(str, ',');
> +
> + if (k)
> + *k++ = 0;
> +
> + rc = parse_memmap(str, &start, &size);
> + if (rc < 0)
> + break;
> + str = k;
> + /* a usable region that should not be skipped */
> + if (size == 0)
> + continue;
> +
> + mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].start = start;
> + mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].size = size;

So this makes no sense. You parse start/size as unsigned long long and
then store it in an unsigned long. Works on 64bit, but on 32bit this is
just wrong:

Assume a memmap above 4G, which then will create a avoid region below 4G
due to truncation to unsigned long.

Thanks,

tglx
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4] DAX: enable iostat for read/write

2017-01-11 Thread Kani, Toshimitsu
On Tue, 2017-01-10 at 20:36 -0800, Joe Perches wrote:
> > 
On Tue, 2017-01-10 at 17:11 -0700, Toshi Kani wrote:
> > DAX IO path does not support iostat, but its metadata IO path does.
> > Therefore, iostat shows metadata IO statistics only, which has been
> > confusing to users.
> 
> []
> > diff --git a/fs/dax.c b/fs/dax.c
> 
> []
> > @@ -1058,12 +1058,22 @@ dax_iomap_rw(struct kiocb *iocb, struct
> > iov_iter *iter,
> 
> []
> > +   if (blk_queue_io_stat(disk->queue)) {
> > +   int sec = iov_iter_count(iter) >> 9;
> > +
> > +   start = jiffies;
> > +   generic_start_io_acct(iov_iter_rw(iter),
> > +     (!sec) ? 1 : sec, &disk-
> > >part0);
> > +   }
> 
> There is a signed/unsigned conversion of sec
> It may be better to use something like:
> 
>   size_t sec  = iov_iter_count(iter) >> 9;
>   [...]
>   generic_start_io_acct(iov_iter_rw(iter),
>     min_t(unsigned long, 1, sec),
>     &disk->part0);

Good catch. I will change as you suggested, and use 'sector_t' for
'sec'. 

Thanks,
-Toshi
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v6] x86: fix kaslr and memmap collision

2017-01-11 Thread Dave Jiang


On 01/11/2017 05:00 AM, Thomas Gleixner wrote:
> On Tue, 10 Jan 2017, Dave Jiang wrote:
>> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int 
>> base);
>> +long simple_strtol(const char *cp, char **endp, unsigned int base);
> 
> What are those functions for? They are not used in that patch at all.

My mistake. Those were used for something in earlier versions of the
patch and I forgot to take them out now they aren't being used. Will remove.

> 
>> +static void mem_avoid_memmap(void)
>> +{
>> +char arg[128];
>> +int rc = 0;
> 
> What's the point of defining this variable here and zero initializing it?

It was necessary when the function was returning int instead of void. I
will move.

> 
>> +/* see if we have any memmap areas */
> 
> Sentences start with an upper case letter. Everywhere!

Will fix.

> 
>> +if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
> 
> You can spare an indentation level by just returning when the search fails.

Will update.

> 
>> +int i = 0;
>> +char *str = arg;
>> +
>> +while (str && (i < MAX_MEMMAP_REGIONS)) {
> 
>   for (i = 0; str && (i < MAX_MEMMAP_REGIONS); i++) {
> 
> Please.

The reason it's a while is so that when we hit an instance of
memmap=nn@ss we can conditionally skip it without having to increment
the counter 'i'.

> 
>> +unsigned long long start, size;
>> +char *k = strchr(str, ',');
>> +
>> +if (k)
>> +*k++ = 0;
>> +
>> +rc = parse_memmap(str, &start, &size);
>> +if (rc < 0)
>> +break;
>> +str = k;
>> +/* a usable region that should not be skipped */
>> +if (size == 0)
>> +continue;
>> +
>> +mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].start = start;
>> +mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].size = size;
> 
> So this makes no sense. You parse start/size as unsigned long long and
> then store it in an unsigned long. Works on 64bit, but on 32bit this is
> just wrong:
> 
> Assume a memmap above 4G, which then will create a avoid region below 4G
> due to truncation to unsigned long.

Ok so it looks like an issue for 32bit on the original code? I'm
presuming none of the previous mem avoid regions are over 4G so that
wasn't an issue? I will update the data structure to use unsigned long
long.

Thanks for the review!

> 
> Thanks,
> 
>   tglx
> 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v7] x86: fix kaslr and memmap collision

2017-01-11 Thread Dave Jiang
CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
However it does not take into account the memmap= parameter passed in from
the kernel cmdline. This results in the kernel sometimes being put in
the middle of memmap. Teaching kaslr to not insert the kernel in
memmap defined regions. We will support up to 4 memmap regions. Any
additional regions will cause kaslr to disable. The mem_avoid set has
been augmented to add up to 4 unusable regions of memmaps provided by the
user to exclude those regions from the set of valid address range to insert
the uncompressed kernel image. The nn@ss ranges will be skipped by the
mem_avoid set since it indicates memory useable.

Signed-off-by: Dave Jiang 
Acked-by: Kees Cook 
Acked-by: Baoquan He 
---
 arch/x86/boot/boot.h |1 
 arch/x86/boot/compressed/kaslr.c |  140 +-
 arch/x86/boot/string.c   |   13 
 3 files changed, 151 insertions(+), 3 deletions(-)

v2:
Addressed comments from Ingo.
- Handle entire list of memmaps
v3:
Fix 32bit build issue
v4:
Addressed comments from Baoquan
- Not exclude nn@ss ranges
v5:
Addressed additional comments from Baoquan
- Update commit header and various coding style changes
v6:
Addressed comments from Kees
- Only fail for physical address randomization
v7:
Addressed comments from Thomas
- Dropped unused functions
- Made address and size in memmap_avoid unsigned long long
- Style fixes

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index e5612f3..9b42b6d 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -333,6 +333,7 @@ size_t strnlen(const char *s, size_t maxlen);
 unsigned int atou(const char *s);
 unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int 
base);
 size_t strlen(const char *s);
+char *strchr(const char *s, int c);
 
 /* tty.c */
 void puts(const char *);
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index a66854d..8b7c9e7 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -11,6 +11,7 @@
  */
 #include "misc.h"
 #include "error.h"
+#include "../boot.h"
 
 #include 
 #include 
@@ -52,15 +53,22 @@ static unsigned long get_boot_seed(void)
 #include "../../lib/kaslr.c"
 
 struct mem_vector {
-   unsigned long start;
-   unsigned long size;
+   unsigned long long start;
+   unsigned long long size;
 };
 
+/* Only supporting at most 4 unusable memmap regions with kaslr */
+#define MAX_MEMMAP_REGIONS 4
+
+static bool memmap_too_large;
+
 enum mem_avoid_index {
MEM_AVOID_ZO_RANGE = 0,
MEM_AVOID_INITRD,
MEM_AVOID_CMDLINE,
MEM_AVOID_BOOTPARAMS,
+   MEM_AVOID_MEMMAP_BEGIN,
+   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
MEM_AVOID_MAX,
 };
 
@@ -77,6 +85,123 @@ static bool mem_overlaps(struct mem_vector *one, struct 
mem_vector *two)
return true;
 }
 
+/**
+ * _memparse - Parse a string with mem suffixes into a number
+ * @ptr: Where parse begins
+ * @retptr: (output) Optional pointer to next char after parse completes
+ *
+ * Parses a string into a number.  The number stored at @ptr is
+ * potentially suffixed with K, M, G, T, P, E.
+ */
+static unsigned long long _memparse(const char *ptr, char **retptr)
+{
+   char *endptr;   /* Local pointer to end of parsed string */
+
+   unsigned long long ret = simple_strtoull(ptr, &endptr, 0);
+
+   switch (*endptr) {
+   case 'E':
+   case 'e':
+   ret <<= 10;
+   case 'P':
+   case 'p':
+   ret <<= 10;
+   case 'T':
+   case 't':
+   ret <<= 10;
+   case 'G':
+   case 'g':
+   ret <<= 10;
+   case 'M':
+   case 'm':
+   ret <<= 10;
+   case 'K':
+   case 'k':
+   ret <<= 10;
+   endptr++;
+   default:
+   break;
+   }
+
+   if (retptr)
+   *retptr = endptr;
+
+   return ret;
+}
+
+static int
+parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
+{
+   char *oldp;
+
+   if (!p)
+   return -EINVAL;
+
+   /* We don't care about this option here */
+   if (!strncmp(p, "exactmap", 8))
+   return -EINVAL;
+
+   oldp = p;
+   *size = _memparse(p, &p);
+   if (p == oldp)
+   return -EINVAL;
+
+   switch (*p) {
+   case '@':
+   /* Skip this region, usable */
+   *start = 0;
+   *size = 0;
+   return 0;
+   case '#':
+   case '$':
+   case '!':
+   *start = _memparse(p + 1, &p);
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+static void mem_avoid_memmap(void)
+{
+   char arg[128];
+   int rc;
+   int i;
+   char *str;
+
+   /* See if we have any memmap areas */
+   rc = cmdline_find_option("memmap

Re: Enabling peer to peer device transactions for PCIe devices

2017-01-11 Thread Stephen Bates
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
>
>
> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>
>
>> Make a generic API for all of this and you'd have my vote..
>>
>>
>> IMHO, you must support basic pinning semantics - that is necessary to
>> support generic short lived DMA (eg filesystem, etc). That hardware can
>> clearly do that if it can support ODP.
>
> I agree completely.
>
>
> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> (ie. at least those backed with ZONE_DEVICE memory). Then
> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> (using whatever interface is most appropriate) and userspace can do what
> it pleases with them. This makes _so_ much sense and actually largely
> already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something that is upstream that enables P2P even if it is only the
minimum viable useful functionality to begin. I think aiming for the moon
(which is what HMM and things like it are) are simply going to take more
time if they ever get there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and while it is not
an in kernel API it could have been modified to be one reasonably easily.
As Logan states the driver can then choose to pass the VMAs to user-space
in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2] it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is just to use CONFIG_EXPERT and make
sure people are aware of the pitfalls if they invoke the P2P option.

Finally, as Jason noted, we could all just wait until
CAPI/OpenCAPI/CCIX/GenZ comes along. However given that these interfaces
are the remit of the CPU vendors I think it behooves us to solve this
problem before then. Also some of the above mentioned protocols are not
even switchable and may not be amenable to a P2P topology...

Stephen

[1] http://marc.info/?l=linux-mm&m=148156541804940&w=2
[2] https://community.mellanox.com/docs/DOC-1119

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm